feat: add BEAM benchmark integration with rubric-based evaluation #2480

Vasilije1990 merged 1 commit into dev from
Conversation
Add support for the BEAM long-context conversation benchmark (huggingface.co/datasets/Mohammadta/BEAM) with question-type routing and rubric-based evaluation.

New files:
- beam_adapter.py: loads conversations from HuggingFace and extracts 20 probing questions (10 types x 2) with rubrics and golden answers. Supports a max_batches param to truncate conversations for local runs.
- beam_router.py: routes questions to the appropriate retriever based on type (GraphCompletion for factual, CotRetriever for multi-hop/contradiction, SummaryRetriever for summarization), with a specialized system prompt per category.
- rubric.py: LLM-as-judge metric that evaluates each rubric criterion independently (YES/NO per item) and returns the fraction satisfied. Does not use DeepEval GEval — uses cognee's own LLM client.
- run_beam_eval.py: entry point configured for 1 conversation, the 100K split, rubric+f1 metrics, and max_batches=1 for local runs.

Modified files:
- benchmark_adapters.py: register the BEAM adapter
- deep_eval_adapter.py: add RubricMetric; pass the rubric via additional_metadata on LLMTestCase
- eval_config.py: add BEAM and beam_router as options
- run_question_answering_module.py: route beam_router to BEAMRouter
- run_corpus_builder.py: support BEAMAdapter with max_batches

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: vasilije <vas.markovic@gmail.com>
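The question-type routing the commit message describes can be sketched as a small dispatch table. This is an illustrative sketch only — the retriever labels and the default fallback are assumptions for demonstration, not cognee's actual class names or BEAMRouter's real implementation:

```python
# Hypothetical sketch of BEAM question-type routing; the mapping targets
# are stand-in labels, not cognee's actual retriever classes.
ROUTING_TABLE = {
    "factual": "graph_completion",
    "multi_hop": "cot",
    "contradiction": "cot",
    "summarization": "summary",
}

def route_question(question_type: str) -> str:
    """Map a BEAM question type to a retriever strategy.

    Unknown types fall back to graph completion (an assumed default).
    """
    return ROUTING_TABLE.get(question_type, "graph_completion")
```

A dict-based dispatch like this keeps the per-category system prompts and retrievers in one place, so adding a new question type is a one-line change.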
Please make sure all the checkboxes are checked:
Walkthrough

This PR adds comprehensive BEAM benchmark evaluation support to the framework. It introduces a dedicated router for BEAM questions, a dataset adapter that loads BEAM conversations from HuggingFace, a rubric-based evaluation metric, and an end-to-end evaluation script with interconnected corpus building, question answering, and evaluation workflows.

Changes
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs
🚥 Pre-merge checks: ✅ 1 passed | ❌ 2 failed

❌ Failed checks (2 warnings)
✅ Passed checks (1 passed)
Actionable comments posted: 5
🧹 Nitpick comments (4)
cognee/eval_framework/eval_config.py (1)
17-17: Line exceeds 100-character limit.

The inline comment on line 17 exceeds the 100-character line length guideline. Consider reformatting it as a multi-line comment above the field.
Suggested fix
```diff
- qa_engine: str = "cognee_graph_completion"  # Options: 'cognee_completion', 'cognee_graph_completion', 'cognee_graph_completion_cot', 'cognee_graph_completion_context_extension', 'beam_router'
+ # Options: 'cognee_completion', 'cognee_graph_completion',
+ # 'cognee_graph_completion_cot', 'cognee_graph_completion_context_extension', 'beam_router'
+ qa_engine: str = "cognee_graph_completion"
```

As per coding guidelines: "Maintain line length of 100 characters maximum".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cognee/eval_framework/eval_config.py` at line 17: the inline comment on the qa_engine field exceeds the 100-char limit; move the long options comment above the qa_engine declaration as a multi-line comment (or split it into multiple shorter comment lines) so the `qa_engine: str = "cognee_graph_completion"` line stays under 100 chars. Locate the qa_engine symbol in eval_config.py and replace the inline options comment with a brief inline note (if needed), with the full options listed on the lines immediately above.

cognee/eval_framework/evaluation/metrics/rubric.py (1)
150-150: Redundant ternary condition.

The `if rubric else 0.0` ternary is unreachable since we return early at lines 99-103 when `rubric` is empty. This can be simplified.

Suggested simplification
```diff
- self.score = satisfied / len(rubric) if rubric else 0.0
+ self.score = satisfied / len(rubric)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cognee/eval_framework/evaluation/metrics/rubric.py` at line 150: the assignment to self.score uses an unreachable ternary (self.score = satisfied / len(rubric) if rubric else 0.0) because the method already returns early when rubric is empty; simplify it by removing the redundant conditional and set self.score = satisfied / len(rubric) in the same location (keep the variable names satisfied and rubric and the self.score attribute intact).

cognee/eval_framework/run_beam_eval.py (1)
46-53: Add docstring to `main()` function.

As per coding guidelines, undocumented function definitions are assumed incomplete. A brief docstring describing the evaluation pipeline steps would improve clarity.

Suggested addition
Suggested addition
```diff
 async def main():
+    """Run end-to-end BEAM benchmark evaluation.
+
+    Steps: corpus build → question answering → evaluation → dashboard.
+    """
     logger.info("=== BEAM Evaluation: 1 conversation, 100K split ===")
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cognee/eval_framework/run_beam_eval.py` around lines 46-53: Add a concise docstring to the async function main() that briefly describes the BEAM evaluation pipeline and the major steps performed (e.g., building the corpus by ingesting a conversation, overriding eval_params["_beam_max_batches"] with BEAM_MAX_BATCHES for faster local runs, and invoking run_corpus_builder), placing the docstring as the first statement inside main() so tools and readers can quickly understand the function purpose and high-level flow.

cognee/eval_framework/answer_generation/beam_router.py (1)
163-166: Consider logging exceptions at WARNING or DEBUG level instead of ERROR.

Returning `"ERROR: {e}"` as `answer_text` propagates the error gracefully to the evaluation pipeline. However, logging at `logger.error` may be too severe for expected transient failures (e.g., rate limits). Consider using `logger.warning` for recoverable per-question failures, reserving `logger.error` for unexpected fatal conditions.
Verify each finding against the current code and only fix it if needed. In `@cognee/eval_framework/answer_generation/beam_router.py` around lines 163 - 166, The except block in beam_router.py currently logs per-question failures with logger.error and sets answer_text to "ERROR: {e}"; change the log level to logger.warning (or logger.debug) in the except handler where query_text, answer_text and retrieval_context are set so transient/recoverable failures are not treated as fatal—i.e., update the exception logging call that references query_text[:80] to use logger.warning and keep the existing assignment to answer_text and retrieval_context in the same except block.
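The level distinction the reviewer draws can be demonstrated with the standard logging module: a recoverable per-question failure logged at WARNING still surfaces in the logs without implying a fatal condition. This is a generic standalone sketch, not the router's actual handler:

```python
import logging

records = []

class ListHandler(logging.Handler):
    # Collect emitted records so their levels can be inspected.
    def emit(self, record):
        records.append(record)

logger = logging.getLogger("beam_router_demo")
logger.setLevel(logging.DEBUG)
logger.addHandler(ListHandler())

def answer_with_fallback(question: str) -> str:
    try:
        raise TimeoutError("rate limited")  # simulate a transient failure
    except Exception as e:
        # Recoverable per-question failure: WARNING, not ERROR.
        logger.warning("Question failed (%s...): %s", question[:80], e)
        return f"ERROR: {e}"

answer = answer_with_fallback("What did the user say about deadlines?")
```

The pipeline still receives the graceful "ERROR: ..." answer text; only the severity of the log record changes, so ERROR stays reserved for genuinely fatal conditions.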
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: b6886241-0b6a-490f-b38e-cd56d9b388b2
📒 Files selected for processing (9)
- cognee/eval_framework/answer_generation/beam_router.py
- cognee/eval_framework/answer_generation/run_question_answering_module.py
- cognee/eval_framework/benchmark_adapters/beam_adapter.py
- cognee/eval_framework/benchmark_adapters/benchmark_adapters.py
- cognee/eval_framework/corpus_builder/run_corpus_builder.py
- cognee/eval_framework/eval_config.py
- cognee/eval_framework/evaluation/deep_eval_adapter.py
- cognee/eval_framework/evaluation/metrics/rubric.py
- cognee/eval_framework/run_beam_eval.py
```python
if self.conversation_index >= len(ds):
    raise IndexError(
        f"conversation_index={self.conversation_index} out of range "
        f"(split '{self.split}' has {len(ds)} conversations)"
    )
```
Validate negative conversation_index explicitly.
Line 109 only guards the upper bound. Negative indices currently select from the end of the split, which conflicts with the documented 0-indexed behavior.
💡 Suggested fix

```diff
- if self.conversation_index >= len(ds):
+ if self.conversation_index < 0 or self.conversation_index >= len(ds):
      raise IndexError(
          f"conversation_index={self.conversation_index} out of range "
          f"(split '{self.split}' has {len(ds)} conversations)"
      )
```

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

```python
if self.conversation_index < 0 or self.conversation_index >= len(ds):
    raise IndexError(
        f"conversation_index={self.conversation_index} out of range "
        f"(split '{self.split}' has {len(ds)} conversations)"
    )
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@cognee/eval_framework/benchmark_adapters/beam_adapter.py` around lines 109 -
113, The current bounds check only rejects indices >= len(ds) and allows
negative Python-style indexing; update the bounds validation where
conversation_index is checked (the block referencing self.conversation_index,
self.split and ds) to explicitly reject negative indices by raising an
IndexError when self.conversation_index < 0, using a consistent error message
(e.g., "conversation_index={...} out of range (split '{self.split}' has
{len(ds)} conversations)"). Ensure both upper and lower bounds are validated
before proceeding.
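The pitfall behind this finding is plain Python indexing semantics: a negative index silently selects from the end of a sequence instead of raising, so an upper-bound-only guard lets it through. A minimal standalone illustration (not the adapter's actual code):

```python
conversations = ["conv0", "conv1", "conv2"]

def pick_upper_only(idx):
    # Mirrors the original guard: only rejects indices >= len.
    if idx >= len(conversations):
        raise IndexError(f"conversation_index={idx} out of range")
    return conversations[idx]

def pick_bounded(idx):
    # Fixed guard: rejects both negative and too-large indices.
    if idx < 0 or idx >= len(conversations):
        raise IndexError(f"conversation_index={idx} out of range")
    return conversations[idx]

silently_wrapped = pick_upper_only(-1)  # no error: returns the last conversation

try:
    pick_bounded(-1)
    bounded_raised = False
except IndexError:
    bounded_raised = True
```

With only the upper-bound check, `-1` quietly returns the last conversation, contradicting the documented 0-indexed behavior; the two-sided check surfaces the misuse immediately.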
```python
if self.max_batches is not None and len(chat_batches) > self.max_batches:
    logger.info(
        f"Truncating conversation from {len(chat_batches)} batches "
        f"to {self.max_batches} (max_batches)"
    )
    chat_batches = chat_batches[: self.max_batches]
corpus_text = _flatten_chat(chat_batches)
```
max_batches truncates context but not the evaluated question set.
When max_batches is set, lines 119-125 trim the corpus, but lines 144-172 still emit all probing questions. This can score the model against questions whose evidence was removed, creating systematic false negatives.
💡 Suggested fix

```diff
 chat_batches = row["chat"]
 if self.max_batches is not None and len(chat_batches) > self.max_batches:
     logger.info(
         f"Truncating conversation from {len(chat_batches)} batches "
         f"to {self.max_batches} (max_batches)"
     )
     chat_batches = chat_batches[: self.max_batches]
+available_msg_ids = {
+    msg.get("id")
+    for batch in chat_batches
+    for msg in batch
+    if isinstance(msg, dict) and msg.get("id") is not None
+}
@@
 source_ids = q.get("source_chat_ids")
+if self.max_batches is not None and source_ids:
+    referenced = self._collect_source_ids(source_ids)
+    if referenced and not referenced.issubset(available_msg_ids):
+        continue
 if source_ids and load_golden_context:
     golden = self._extract_golden_context(chat_batches, source_ids)
     if golden:
         qa_pair["golden_context"] = golden
```

```diff
 class BEAMAdapter(BaseBenchmarkAdapter):
+    @staticmethod
+    def _collect_source_ids(source_ids: Any) -> set[int]:
+        ids: set[int] = set()
+        if isinstance(source_ids, list):
+            ids.update(i for i in source_ids if isinstance(i, int))
+        elif isinstance(source_ids, dict):
+            for value in source_ids.values():
+                if isinstance(value, int):
+                    ids.add(value)
+                elif isinstance(value, list):
+                    ids.update(i for i in value if isinstance(i, int))
+        return ids
```

Also applies to: lines 144-172
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@cognee/eval_framework/benchmark_adapters/beam_adapter.py` around lines 119 -
125, The truncation currently applied to chat_batches (when self.max_batches is
set) only limits the corpus_text via _flatten_chat but does not restrict the set
of probing/evaluated questions emitted later, causing questions to reference
removed context; modify the emission logic so that the probing question list is
filtered or truncated to match the same chat_batches window (e.g., trim the
questions tied to chat_batches to self.max_batches or derive questions from the
truncated corpus_text) before the code that emits/evaluates questions, ensuring
variables like chat_batches, corpus_text, and the probing question collection
are kept in sync.
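The synchronization this fix aims for — questions may only reference messages that survived truncation — can be sketched independently of the adapter. The message and question shapes below are illustrative assumptions, not BEAM's actual schema:

```python
def truncate_and_filter(chat_batches, questions, max_batches):
    """Trim the conversation and drop questions whose evidence was removed."""
    if max_batches is not None:
        chat_batches = chat_batches[:max_batches]
    # Message ids still present after truncation.
    available_ids = {
        msg["id"] for batch in chat_batches for msg in batch if "id" in msg
    }
    # Keep a question only if every message it cites survived.
    kept = [
        q for q in questions
        if set(q.get("source_chat_ids", [])) <= available_ids
    ]
    return chat_batches, kept

batches = [
    [{"id": 1, "text": "hi"}, {"id": 2, "text": "plan A"}],
    [{"id": 3, "text": "actually, plan B"}],
]
questions = [
    {"question": "What was the first plan?", "source_chat_ids": [2]},
    {"question": "Did the plan change?", "source_chat_ids": [2, 3]},
]
_, kept = truncate_and_filter(batches, questions, max_batches=1)
```

With max_batches=1, only message ids 1 and 2 remain, so the question that also cites id 3 is dropped rather than scored against evidence the model never saw.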
```python
rubric = q.get("rubric", [])
if isinstance(rubric, str):
    rubric = [rubric]

qa_pair: Dict[str, Any] = {
    "question": q["question"],
    "answer": answer_text,
    "question_type": question_type,
    "rubric": rubric,
```
Normalize rubric to a stable list type before emitting metadata.
At lines 152-155, only string rubrics are normalized. If the dataset returns None or another type, downstream rubric scoring can receive unexpected types.
💡 Suggested fix
- rubric = q.get("rubric", [])
- if isinstance(rubric, str):
- rubric = [rubric]
+ raw_rubric = q.get("rubric", [])
+ if isinstance(raw_rubric, str):
+ rubric = [raw_rubric]
+ elif isinstance(raw_rubric, list):
+ rubric = [str(item) for item in raw_rubric if item is not None]
+ else:
+ rubric = []🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@cognee/eval_framework/benchmark_adapters/beam_adapter.py` around lines 152 -
160, The rubric variable is only normalized when it's a string, allowing None or
other types through; update the normalization before building qa_pair so rubric
is always a stable list: if rubric is None set it to [], if it's a str wrap it
in a list, and for any non-list types coerce/wrap them into a single-item list;
ensure this normalization occurs just before constructing qa_pair (the rubric
variable used in the qa_pair dict) so downstream rubric scoring always receives
a list.
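The normalization the prompt describes can be isolated into a small helper. This is a sketch of the behavior being asked for, assuming the "wrap any non-list scalar" variant from the agent prompt rather than the diff's "drop unknown types" variant:

```python
def normalize_rubric(raw):
    """Coerce whatever the dataset returns into a stable list of strings."""
    if raw is None:
        return []
    if isinstance(raw, str):
        return [raw]
    if isinstance(raw, list):
        # Drop None items; stringify the rest for a uniform element type.
        return [str(item) for item in raw if item is not None]
    # Any other scalar type: wrap it as a single-item list.
    return [str(raw)]
```

Centralizing this right before qa_pair construction means the rubric metric can rely on always receiving a (possibly empty) list of strings.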
```python
try:
    judge_response = await llm_client.acreate_structured_output(
        text_input=prompt,
        system_prompt=_JUDGE_SYSTEM_PROMPT,
        response_model=str,
    )
```
🧩 Analysis chain
🏁 Script executed:

```shell
#!/bin/bash
# Verify acreate_structured_output signature and response_model expectations
rg -n "def acreate_structured_output" --type py -A 15 | head -50
```

Repository: topoteretes/cognee
Length of output: 1802
🏁 Script executed:

```shell
#!/bin/bash
# Search for acreate_structured_output definition
find . -name "*.py" -type f | xargs grep -l "def acreate_structured_output" 2>/dev/null | head -10
```

Repository: topoteretes/cognee
Length of output: 1887
🏁 Script executed:

```shell
#!/bin/bash
# Also search for response_model parameter usage to understand expectations
rg "response_model\s*=" --type py -B 2 -A 2 | head -100
```

Repository: topoteretes/cognee
Length of output: 1887
🏁 Script executed:

```shell
#!/bin/bash
# Look for the rubric.py file to understand the context
find . -name "rubric.py" -type f 2>/dev/null
```

Repository: topoteretes/cognee
Length of output: 1887
🌐 Web query:
cognee framework acreate_structured_output response_model
💡 Result:
In the Cognee framework, acreate_structured_output is an async method in LLMGateway used to extract structured data from LLM responses into Pydantic models. It abstracts structured output backends (LiteLLM + Instructor by default, or BAML). Usage:

```python
from cognee.infrastructure.llm.LLMGateway import LLMGateway
from pydantic import BaseModel
from typing import List

class KnowledgeGraph(BaseModel):
    entities: List[str]
    relationships: List[str]

result = await LLMGateway.acreate_structured_output(
    text,                           # input text to process
    system_prompt,                  # system prompt for the LLM
    response_model=KnowledgeGraph,  # Pydantic model for structured output
)
```

Configure the backend via the env var STRUCTURED_OUTPUT_FRAMEWORK ("instructor" or "baml"). Both produce identical Pydantic-validated results via the unified interface.
Citations:
- 1: https://docs.cognee.ai/setup-configuration/structured-output-backends
- 2: https://docs.cognee.ai/how-to-guides/cognee-sdk/structured-output-overview
- 3: https://www.cognee.ai/blog/integrations/structured-outputs-with-baml-and-cognee
- 4: https://docs.cognee.ai/how-to-guides/cognee-sdk/instructor
- 5: https://docs.cognee.ai/guides/custom-tasks-pipelines
🌐 Web query:
cognee llm_client structured output Pydantic BaseModel
💡 Result:
In Cognee, structured output with a Pydantic BaseModel is handled through the LLMGateway class (from cognee.infrastructure.llm.LLMGateway), which serves as the primary interface for LLM interactions including structured outputs. There is no direct "llm_client" class exposed in the public API; instead, LLMGateway provides a unified, backend-agnostic entrypoint that routes to either LiteLLM + Instructor (default) or BAML based on configuration.

Key usage for structured output:

1. Define Pydantic models inheriting from BaseModel:

```python
from pydantic import BaseModel
from typing import List

class MiniEntity(BaseModel):
    name: str
    type: str

class MiniGraph(BaseModel):
    nodes: List[MiniEntity]
```

2. Call the async method:

```python
import asyncio
from cognee.infrastructure.llm.LLMGateway import LLMGateway

async def example():
    text = "Apple develops iPhone; Audi produces the R8."
    system_prompt = "Extract entities as nodes with name and type."
    result = await LLMGateway.acreate_structured_output(text, system_prompt, MiniGraph)
    print(result)  # validated MiniGraph instance
```

A synchronous variant create_structured_output also exists. Configuration (via .env or cognee.config):

- STRUCTURED_OUTPUT_FRAMEWORK=instructor (default, uses LiteLLM + Instructor for Pydantic coercion)
- STRUCTURED_OUTPUT_FRAMEWORK=baml (alternative DSL-based framework)
- LLM_PROVIDER=openai (or anthropic, ollama, etc.)
- LLM_MODEL=gpt-4o-mini
- LLM_API_KEY=sk-...

Both frameworks produce identical Pydantic-validated outputs via the same API. LLMGateway abstracts LiteLLM for 100+ providers with rate limiting and retries. Install extras: pip install cognee[litellm] or cognee[baml]. This is used internally for knowledge graph extraction and custom pipelines.
Citations:
- 1: https://docs.cognee.ai/how-to-guides/cognee-sdk/structured-output-overview
- 2: https://docs.cognee.ai/guides/low-level-llm
- 3: https://docs.cognee.ai/setup-configuration/structured-output-backends
- 4: https://docs.cognee.ai/how-to-guides/cognee-sdk/litellm
Critical: response_model=str is invalid for acreate_structured_output.
The acreate_structured_output method in Cognee's LLMGateway requires response_model to be a Pydantic BaseModel class, not a primitive type like str. Passing str will cause a validation error. Define a Pydantic model for the YES/NO verdict instead.
Proposed fix

```diff
+from pydantic import BaseModel
+
+class JudgeVerdict(BaseModel):
+    verdict: str  # "YES" or "NO"
+
 # In a_measure():
 try:
     judge_response = await llm_client.acreate_structured_output(
         text_input=prompt,
         system_prompt=_JUDGE_SYSTEM_PROMPT,
-        response_model=str,
+        response_model=JudgeVerdict,
     )
-    answer = str(judge_response).strip().upper()
+    answer = judge_response.verdict.strip().upper()
```
+ answer = judge_response.verdict.strip().upper()🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@cognee/eval_framework/evaluation/metrics/rubric.py` around lines 117 - 122,
The call to llm_client.acreate_structured_output is passing response_model=str
which is invalid; define a Pydantic BaseModel (e.g., VerdictModel with a field
like verdict: Literal["YES","NO"] or an Enum) and pass that class as
response_model instead, then update handling of judge_response (and any
downstream reads) to access the model field (e.g., judge_response.verdict) and
keep the rest of the call (prompt, _JUDGE_SYSTEM_PROMPT) unchanged; ensure the
new model is imported/defined in rubric.py and used where
acreate_structured_output is invoked.
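Independent of the Pydantic fix, the rubric score itself is simple arithmetic: each criterion yields a YES/NO verdict and the metric is the satisfied fraction. A standalone sketch with a tolerant verdict parser — the verdict strings here are stand-ins, since the real ones come from the LLM judge:

```python
def parse_verdict(raw: str) -> bool:
    # Tolerant of whitespace and casing; anything other than YES is unsatisfied.
    return raw.strip().upper() == "YES"

def rubric_score(verdicts):
    """Fraction of rubric criteria judged satisfied; 0.0 for an empty rubric."""
    if not verdicts:
        return 0.0
    return sum(parse_verdict(v) for v in verdicts) / len(verdicts)

score = rubric_score(["YES", " yes\n", "NO", "YES"])
```

Guarding the empty-rubric case here (rather than with a trailing ternary) keeps the final division unconditional, which is exactly the simplification the earlier nitpick asks for.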
```python
if __name__ == "__main__":
    loop = asyncio.new_event_loop()
    asyncio.set_event_loop(loop)
    try:
        loop.run_until_complete(main())
    finally:
        print("Done")
```
Missing loop.close() and consider using asyncio.run() instead.
The event loop created at line 77 is not closed in the finally block, which could lead to resource warnings. Additionally, manual loop management is unnecessary here — asyncio.run() handles loop creation, execution, and cleanup automatically.
Proposed fix

```diff
 if __name__ == "__main__":
-    loop = asyncio.new_event_loop()
-    asyncio.set_event_loop(loop)
-    try:
-        loop.run_until_complete(main())
-    finally:
-        print("Done")
+    asyncio.run(main())
+    print("Done")
```
+ print("Done")🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@cognee/eval_framework/run_beam_eval.py` around lines 76 - 82, The current
manual event loop creation (loop = asyncio.new_event_loop();
asyncio.set_event_loop(loop); loop.run_until_complete(main())) never closes the
loop — either replace the whole manual pattern with asyncio.run(main()) to let
asyncio handle creation/cleanup, or if you must keep the manual approach around
the main() call, add loop.close() in the finally block after printing (or
instead of the print) to ensure resources are released; update references to the
loop variable accordingly (main(), loop).
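The preferred pattern is straightforward: asyncio.run creates the loop, runs the coroutine to completion, and closes the loop on the way out. A minimal standalone demonstration (the coroutine body is a placeholder, not the evaluation pipeline):

```python
import asyncio

async def main() -> str:
    # Placeholder for the evaluation pipeline; here it just yields once.
    await asyncio.sleep(0)
    return "done"

# asyncio.run handles loop creation, execution, and cleanup in one call,
# replacing new_event_loop()/set_event_loop()/run_until_complete()/close().
result = asyncio.run(main())
```

Because asyncio.run also cancels leftover tasks and shuts down async generators before closing the loop, it avoids the resource warnings the manual pattern can leave behind.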
Description
Acceptance Criteria
Type of Change
Screenshots
Pre-submission Checklist
(CONTRIBUTING.md)

DCO Affirmation
I affirm that all code in every commit of this pull request conforms to the terms of the Topoteretes Developer Certificate of Origin.