
feat: add BEAM benchmark integration with rubric-based evaluation#2480

Merged
Vasilije1990 merged 1 commit into dev from feature/beam-benchmark-eval
Apr 5, 2026
Conversation

@Vasilije1990
Contributor

@Vasilije1990 Vasilije1990 commented Mar 25, 2026

Add support for the BEAM long-context conversation benchmark (huggingface.co/datasets/Mohammadta/BEAM) with question-type routing and rubric-based evaluation.

New files:

  • beam_adapter.py: loads conversations from HuggingFace, extracts 20 probing questions (10 types x 2) with rubrics and golden answers. Supports max_batches param to truncate conversations for local runs.
  • beam_router.py: routes questions to appropriate retrievers based on type (GraphCompletion for factual, CotRetriever for multi-hop/ contradiction, SummaryRetriever for summarization) with specialized system prompts per category.
  • rubric.py: LLM-as-judge metric that evaluates each rubric criterion independently (YES/NO per item), returns fraction satisfied. Does not use DeepEval GEval — uses cognee's own LLM client.
  • run_beam_eval.py: entry point configured for 1 conversation, 100K split, rubric+f1 metrics, max_batches=1 for local runs.

Modified files:

  • benchmark_adapters.py: register BEAM adapter
  • deep_eval_adapter.py: add RubricMetric, pass rubric via additional_metadata on LLMTestCase
  • eval_config.py: add BEAM and beam_router as options
  • run_question_answering_module.py: route beam_router to BEAMRouter
  • run_corpus_builder.py: support BEAMAdapter with max_batches
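
The rubric metric described above reduces to a simple rule: each criterion gets an independent YES/NO verdict, and the score is the fraction marked YES. A minimal stdlib sketch of that idea — the names score_rubric and judge_one are illustrative, not the PR's actual API:

```python
# Hypothetical sketch of the scoring rule: judge each criterion
# independently, then return the fraction marked YES.
from typing import Callable, List


def score_rubric(rubric: List[str], judge_one: Callable[[str], str]) -> float:
    """Fraction of rubric criteria the judge marks YES (0.0 for an empty rubric)."""
    if not rubric:
        return 0.0
    satisfied = sum(1 for criterion in rubric if judge_one(criterion) == "YES")
    return satisfied / len(rubric)


# Stub judge standing in for the LLM call:
verdicts = {"mentions the date": "YES", "names both speakers": "NO"}
print(score_rubric(list(verdicts), verdicts.get))  # 0.5
```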

Description

Acceptance Criteria

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Code refactoring
  • Other (please specify):

Screenshots

Pre-submission Checklist

  • I have tested my changes thoroughly before submitting this PR (See CONTRIBUTING.md)
  • This PR contains minimal changes necessary to address the issue/feature
  • My code follows the project's coding standards and style guidelines
  • I have added tests that prove my fix is effective or that my feature works
  • I have added necessary documentation (if applicable)
  • All new and existing tests pass
  • I have searched existing PRs to ensure this change hasn't been submitted already
  • I have linked any relevant issues in the description
  • My commits have clear and descriptive messages

DCO Affirmation

I affirm that all code in every commit of this pull request conforms to the terms of the Topoteretes Developer Certificate of Origin.

Summary by CodeRabbit

Release Notes

  • New Features
    • Added support for the BEAM benchmark dataset, enabling evaluation against large-scale long-context conversation corpus with multiple probing questions per conversation
    • Introduced rubric-based evaluation metric that independently assesses response quality against specified evaluation criteria
    • Added question-type routing system for intelligent selection of answer generation strategies based on question characteristics
    • Provided a dedicated BEAM evaluation pipeline script for end-to-end benchmark execution

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: vasilije <vas.markovic@gmail.com>
@pull-checklist

Please make sure all the checkboxes are checked:

  • I have tested these changes locally.
  • I have reviewed the code changes.
  • I have added end-to-end and unit tests (if applicable).
  • I have updated the documentation and README.md file (if necessary).
  • I have removed unnecessary code and debug statements.
  • PR title is clear and follows the convention.
  • I have tagged reviewers or team members for feedback.

@coderabbitai
Contributor

coderabbitai bot commented Mar 25, 2026

Walkthrough

This PR adds comprehensive BEAM benchmark evaluation support to the framework. It introduces a dedicated router for BEAM questions, a dataset adapter to load BEAM conversations from HuggingFace, a rubric-based evaluation metric, and an end-to-end evaluation script with interconnected corpus building, question answering, and evaluation workflows.

Changes

Cohort / File(s) Summary
BEAM Question Routing
cognee/eval_framework/answer_generation/beam_router.py, cognee/eval_framework/answer_generation/run_question_answering_module.py
New BEAMRouter class routes BEAM probing questions to type-specific retrievers with per-type system prompts and handles completion/context retrieval; run_question_answering conditionally imports and uses BEAMRouter when qa_engine is "beam_router".
BEAM Dataset & Corpus
cognee/eval_framework/benchmark_adapters/beam_adapter.py, cognee/eval_framework/benchmark_adapters/benchmark_adapters.py, cognee/eval_framework/corpus_builder/run_corpus_builder.py
New BEAMAdapter loads BEAM conversations from HuggingFace, extracts questions/answers from probing fields, supports golden context extraction; enum BenchmarkAdapter extended with BEAM member; run_corpus_builder conditionally instantiates BEAMAdapter with max_batches parameter.
Rubric-Based Evaluation Metric
cognee/eval_framework/evaluation/metrics/rubric.py, cognee/eval_framework/evaluation/deep_eval_adapter.py
New RubricMetric class independently judges LLM response satisfaction against rubric criteria via async LLM calls, tracks per-criterion verdicts, and computes fractional scores; deep_eval_adapter registers RubricMetric and passes rubric/question_type metadata to test cases.
Configuration & Documentation
cognee/eval_framework/eval_config.py
Updated documentation for benchmark and qa_engine fields to list "BEAM" and "beam_router" as accepted options respectively.
End-to-End Evaluation Script
cognee/eval_framework/run_beam_eval.py
New executable entry point that orchestrates BEAM evaluation via async control flow, pre-configures EvalConfig with BEAM parameters, sets truncation limit (max_batches=1), and sequentially runs corpus building, question answering, evaluation, and optional dashboard generation.
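
The type-to-retriever routing summarized above can be sketched as a plain lookup table; the retriever names follow the PR description, while the fallback default here is an assumption for illustration:

```python
# Illustrative routing table for BEAM question types; retriever names
# follow the PR text, the fallback default is an assumption.
ROUTING_TABLE = {
    "factual": "GraphCompletion",
    "multi_hop": "CotRetriever",
    "contradiction": "CotRetriever",
    "summarization": "SummaryRetriever",
}


def pick_retriever(question_type: str, default: str = "GraphCompletion") -> str:
    """Return the retriever name for a BEAM question type, with a default fallback."""
    return ROUTING_TABLE.get(question_type, default)


print(pick_retriever("multi_hop"))  # CotRetriever
print(pick_retriever("temporal"))   # GraphCompletion (fallback)
```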

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

Suggested labels

core-team

Suggested reviewers

  • hajdul88
  • lxobr
  • alekszievr
🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (2 warnings)

  • Description check — ⚠️ Warning. The description includes AI-generated summaries and violates the template requirement for human-generated descriptions. Critical template sections (Acceptance Criteria, Type of Change checkboxes, Screenshots, and Pre-submission Checklist) are incomplete or unchecked. Resolution: remove AI-generated content, provide human-written descriptions explaining your reasoning, and complete all required template sections, including Acceptance Criteria, Type of Change selection, and Pre-submission Checklist items, with proper checkbox markings.
  • Docstring Coverage — ⚠️ Warning. Docstring coverage is 52.38%, below the required 80.00% threshold. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (1 passed)

  • Title check — ✅ Passed. The PR title clearly and concisely summarizes the main change: adding BEAM benchmark integration with rubric-based evaluation, which matches the substantial additions across multiple modules.


Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 5

🧹 Nitpick comments (4)
cognee/eval_framework/eval_config.py (1)

17-17: Line exceeds 100 character limit.

The inline comment on line 17 exceeds the 100-character line length guideline. Consider reformatting as a multi-line comment above the field.

Suggested fix
+    # Options: 'cognee_completion', 'cognee_graph_completion',
+    # 'cognee_graph_completion_cot', 'cognee_graph_completion_context_extension', 'beam_router'
-    qa_engine: str = "cognee_graph_completion"  # Options: 'cognee_completion', 'cognee_graph_completion', 'cognee_graph_completion_cot', 'cognee_graph_completion_context_extension', 'beam_router'
+    qa_engine: str = "cognee_graph_completion"

As per coding guidelines: "Maintain line length of 100 characters maximum".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cognee/eval_framework/eval_config.py` at line 17, The inline comment on the
qa_engine field exceeds the 100-char limit; move the long options comment above
the qa_engine declaration as a multi-line comment (or split into multiple
shorter comment lines) so the qa_engine: str = "cognee_graph_completion" line
stays under 100 chars; locate the qa_engine symbol in eval_config.py and replace
the inline options comment with a brief inline note (if needed) and full options
listed on the lines immediately above.
cognee/eval_framework/evaluation/metrics/rubric.py (1)

150-150: Redundant ternary condition.

The if rubric else 0.0 ternary is unreachable since we return early at line 99-103 when rubric is empty. This can be simplified.

Suggested simplification
-        self.score = satisfied / len(rubric) if rubric else 0.0
+        self.score = satisfied / len(rubric)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cognee/eval_framework/evaluation/metrics/rubric.py` at line 150, The
assignment to self.score uses an unreachable ternary (self.score = satisfied /
len(rubric) if rubric else 0.0) because the method already returns early when
rubric is empty; simplify it by removing the redundant conditional and set
self.score = satisfied / len(rubric) in the same location (keep the variable
names satisfied and rubric and the self.score attribute intact).
cognee/eval_framework/run_beam_eval.py (1)

46-53: Add docstring to main() function.

As per coding guidelines, undocumented function definitions are assumed incomplete. A brief docstring describing the evaluation pipeline steps would improve clarity.

Suggested addition
 async def main():
+    """Run end-to-end BEAM benchmark evaluation.
+    
+    Steps: corpus build → question answering → evaluation → dashboard.
+    """
     logger.info("=== BEAM Evaluation: 1 conversation, 100K split ===")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cognee/eval_framework/run_beam_eval.py` around lines 46 - 53, Add a concise
docstring to the async function main() that briefly describes the BEAM
evaluation pipeline and the major steps performed (e.g., building the corpus by
ingesting a conversation, overriding eval_params["_beam_max_batches"] with
BEAM_MAX_BATCHES for faster local runs, and invoking run_corpus_builder),
placing the docstring as the first statement inside main() so tools and readers
can quickly understand the function purpose and high-level flow.
cognee/eval_framework/answer_generation/beam_router.py (1)

163-166: Consider logging exceptions at WARNING or DEBUG level instead of ERROR.

Returning "ERROR: {e}" as answer_text propagates the error gracefully to the evaluation pipeline. However, logging at logger.error may be too severe for expected transient failures (e.g., rate limits). Consider using logger.warning for recoverable per-question failures, reserving logger.error for unexpected fatal conditions.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cognee/eval_framework/answer_generation/beam_router.py` around lines 163 -
166, The except block in beam_router.py currently logs per-question failures
with logger.error and sets answer_text to "ERROR: {e}"; change the log level to
logger.warning (or logger.debug) in the except handler where query_text,
answer_text and retrieval_context are set so transient/recoverable failures are
not treated as fatal—i.e., update the exception logging call that references
query_text[:80] to use logger.warning and keep the existing assignment to
answer_text and retrieval_context in the same except block.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: b6886241-0b6a-490f-b38e-cd56d9b388b2

📥 Commits

Reviewing files that changed from the base of the PR and between 7f5db08 and 27da9ff.

📒 Files selected for processing (9)
  • cognee/eval_framework/answer_generation/beam_router.py
  • cognee/eval_framework/answer_generation/run_question_answering_module.py
  • cognee/eval_framework/benchmark_adapters/beam_adapter.py
  • cognee/eval_framework/benchmark_adapters/benchmark_adapters.py
  • cognee/eval_framework/corpus_builder/run_corpus_builder.py
  • cognee/eval_framework/eval_config.py
  • cognee/eval_framework/evaluation/deep_eval_adapter.py
  • cognee/eval_framework/evaluation/metrics/rubric.py
  • cognee/eval_framework/run_beam_eval.py

Comment on lines +109 to +113
        if self.conversation_index >= len(ds):
            raise IndexError(
                f"conversation_index={self.conversation_index} out of range "
                f"(split '{self.split}' has {len(ds)} conversations)"
            )
Contributor


⚠️ Potential issue | 🟡 Minor

Validate negative conversation_index explicitly.

Line 109 only guards the upper bound. Negative indices currently select from the end of the split, which conflicts with the documented 0-indexed behavior.

💡 Suggested fix
-        if self.conversation_index >= len(ds):
+        if self.conversation_index < 0 or self.conversation_index >= len(ds):
             raise IndexError(
                 f"conversation_index={self.conversation_index} out of range "
                 f"(split '{self.split}' has {len(ds)} conversations)"
             )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cognee/eval_framework/benchmark_adapters/beam_adapter.py` around lines 109 -
113, The current bounds check only rejects indices >= len(ds) and allows
negative Python-style indexing; update the bounds validation where
conversation_index is checked (the block referencing self.conversation_index,
self.split and ds) to explicitly reject negative indices by raising an
IndexError when self.conversation_index < 0, using a consistent error message
(e.g., "conversation_index={...} out of range (split '{self.split}' has
{len(ds)} conversations)"). Ensure both upper and lower bounds are validated
before proceeding.

Comment on lines +119 to +125
        if self.max_batches is not None and len(chat_batches) > self.max_batches:
            logger.info(
                f"Truncating conversation from {len(chat_batches)} batches "
                f"to {self.max_batches} (max_batches)"
            )
            chat_batches = chat_batches[: self.max_batches]
        corpus_text = _flatten_chat(chat_batches)
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

max_batches truncates context but not the evaluated question set.

When max_batches is set, Line 119-Line 125 trims the corpus, but Line 144-Line 172 still emits all probing questions. This can score the model against questions whose evidence was removed, creating systematic false negatives.

💡 Suggested fix
         chat_batches = row["chat"]
         if self.max_batches is not None and len(chat_batches) > self.max_batches:
             logger.info(
                 f"Truncating conversation from {len(chat_batches)} batches "
                 f"to {self.max_batches} (max_batches)"
             )
             chat_batches = chat_batches[: self.max_batches]
+        available_msg_ids = {
+            msg.get("id")
+            for batch in chat_batches
+            for msg in batch
+            if isinstance(msg, dict) and msg.get("id") is not None
+        }

@@
                 source_ids = q.get("source_chat_ids")
+                if self.max_batches is not None and source_ids:
+                    referenced = self._collect_source_ids(source_ids)
+                    if referenced and not referenced.issubset(available_msg_ids):
+                        continue
                 if source_ids and load_golden_context:
                     golden = self._extract_golden_context(chat_batches, source_ids)
                     if golden:
                         qa_pair["golden_context"] = golden
 class BEAMAdapter(BaseBenchmarkAdapter):
+    @staticmethod
+    def _collect_source_ids(source_ids: Any) -> set[int]:
+        ids: set[int] = set()
+        if isinstance(source_ids, list):
+            ids.update(i for i in source_ids if isinstance(i, int))
+        elif isinstance(source_ids, dict):
+            for value in source_ids.values():
+                if isinstance(value, int):
+                    ids.add(value)
+                elif isinstance(value, list):
+                    ids.update(i for i in value if isinstance(i, int))
+        return ids

Also applies to: 144-172

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cognee/eval_framework/benchmark_adapters/beam_adapter.py` around lines 119 -
125, The truncation currently applied to chat_batches (when self.max_batches is
set) only limits the corpus_text via _flatten_chat but does not restrict the set
of probing/evaluated questions emitted later, causing questions to reference
removed context; modify the emission logic so that the probing question list is
filtered or truncated to match the same chat_batches window (e.g., trim the
questions tied to chat_batches to self.max_batches or derive questions from the
truncated corpus_text) before the code that emits/evaluates questions, ensuring
variables like chat_batches, corpus_text, and the probing question collection
are kept in sync.

Comment on lines +152 to +160
                rubric = q.get("rubric", [])
                if isinstance(rubric, str):
                    rubric = [rubric]

                qa_pair: Dict[str, Any] = {
                    "question": q["question"],
                    "answer": answer_text,
                    "question_type": question_type,
                    "rubric": rubric,
Contributor


⚠️ Potential issue | 🟡 Minor

Normalize rubric to a stable list type before emitting metadata.

At Line 152-Line 155, only string rubrics are normalized. If the dataset returns None or another type, downstream rubric scoring can receive unexpected types.

💡 Suggested fix
-                rubric = q.get("rubric", [])
-                if isinstance(rubric, str):
-                    rubric = [rubric]
+                raw_rubric = q.get("rubric", [])
+                if isinstance(raw_rubric, str):
+                    rubric = [raw_rubric]
+                elif isinstance(raw_rubric, list):
+                    rubric = [str(item) for item in raw_rubric if item is not None]
+                else:
+                    rubric = []
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cognee/eval_framework/benchmark_adapters/beam_adapter.py` around lines 152 -
160, The rubric variable is only normalized when it's a string, allowing None or
other types through; update the normalization before building qa_pair so rubric
is always a stable list: if rubric is None set it to [], if it's a str wrap it
in a list, and for any non-list types coerce/wrap them into a single-item list;
ensure this normalization occurs just before constructing qa_pair (the rubric
variable used in the qa_pair dict) so downstream rubric scoring always receives
a list.
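
A standalone sketch of the normalization this comment asks for; the helper name normalize_rubric is illustrative, not the PR's code:

```python
# Hypothetical normalization helper: always return a list of strings,
# whatever shape the dataset's rubric field arrives in.
from typing import Any, List


def normalize_rubric(raw: Any) -> List[str]:
    """Coerce a rubric field (None, str, list, or other) to a stable list of strings."""
    if raw is None:
        return []
    if isinstance(raw, str):
        return [raw]
    if isinstance(raw, list):
        return [str(item) for item in raw if item is not None]
    return [str(raw)]


print(normalize_rubric(None))            # []
print(normalize_rubric("covers dates"))  # ['covers dates']
print(normalize_rubric(["a", None, 3]))  # ['a', '3']
```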

Comment on lines +117 to +122
            try:
                judge_response = await llm_client.acreate_structured_output(
                    text_input=prompt,
                    system_prompt=_JUDGE_SYSTEM_PROMPT,
                    response_model=str,
                )
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Verify acreate_structured_output signature and response_model expectations
rg -n "def acreate_structured_output" --type py -A 15 | head -50



🏁 Script executed:

#!/bin/bash
# Search for acreate_structured_output definition
find . -name "*.py" -type f | xargs grep -l "def acreate_structured_output" 2>/dev/null | head -10



🏁 Script executed:

#!/bin/bash
# Also search for response_model parameter usage to understand expectations
rg "response_model\s*=" --type py -B 2 -A 2 | head -100



🏁 Script executed:

#!/bin/bash
# Look for the rubric.py file to understand the context
find . -name "rubric.py" -type f 2>/dev/null



🌐 Web query:

cognee framework acreate_structured_output response_model

💡 Result:

In the Cognee framework, acreate_structured_output is an async method in LLMGateway used to extract structured data from LLM responses into Pydantic models. It abstracts structured-output backends (LiteLLM + Instructor by default, or BAML). Usage:

    from cognee.infrastructure.llm.LLMGateway import LLMGateway
    from pydantic import BaseModel
    from typing import List

    class KnowledgeGraph(BaseModel):
        entities: List[str]
        relationships: List[str]

    result = await LLMGateway.acreate_structured_output(
        text,            # input text to process
        system_prompt,   # system prompt for the LLM
        response_model=KnowledgeGraph,  # Pydantic model for structured output
    )

Configure the backend via the env var STRUCTURED_OUTPUT_FRAMEWORK ("instructor" or "baml"). Both produce identical Pydantic-validated results via the unified interface.



🌐 Web query:

cognee llm_client structured output Pydantic BaseModel

💡 Result:

In Cognee, structured output with Pydantic BaseModel is handled through the LLMGateway class (from cognee.infrastructure.llm.LLMGateway), which serves as the primary interface for LLM interactions, including structured outputs. There is no direct "llm_client" class exposed in the public API; instead, LLMGateway provides a unified, backend-agnostic entry point that routes to either LiteLLM + Instructor (default) or BAML based on configuration. Key usage for structured output:

1. Define Pydantic models inheriting from BaseModel:

    from pydantic import BaseModel
    from typing import List

    class MiniEntity(BaseModel):
        name: str
        type: str

    class MiniGraph(BaseModel):
        nodes: List[MiniEntity]

2. Call the async method:

    import asyncio
    from cognee.infrastructure.llm.LLMGateway import LLMGateway

    async def example():
        text = "Apple develops iPhone; Audi produces the R8."
        system_prompt = "Extract entities as nodes with name and type."
        result = await LLMGateway.acreate_structured_output(text, system_prompt, MiniGraph)
        print(result)  # returns a validated MiniGraph instance

A synchronous variant, create_structured_output, also exists. Configuration (via .env or cognee.config):

- STRUCTURED_OUTPUT_FRAMEWORK=instructor (default, uses LiteLLM + Instructor for Pydantic coercion)
- STRUCTURED_OUTPUT_FRAMEWORK=baml (alternative DSL-based framework)
- LLM_PROVIDER=openai (or anthropic, ollama, etc.)
- LLM_MODEL=gpt-4o-mini
- LLM_API_KEY=sk-...

Both frameworks produce identical Pydantic-validated outputs via the same API. LLMGateway abstracts LiteLLM for 100+ providers with rate limiting and retries. Install extras: pip install cognee[litellm] or cognee[baml]. This is used internally for knowledge graph extraction and custom pipelines.



Critical: response_model=str is invalid for acreate_structured_output.

The acreate_structured_output method in Cognee's LLMGateway requires response_model to be a Pydantic BaseModel class, not a primitive type like str. Passing str will cause a validation error. Define a Pydantic model for the YES/NO verdict instead.

Proposed fix
+from pydantic import BaseModel
+
+class JudgeVerdict(BaseModel):
+    verdict: str  # "YES" or "NO"
+
 # In a_measure():
             try:
                 judge_response = await llm_client.acreate_structured_output(
                     text_input=prompt,
                     system_prompt=_JUDGE_SYSTEM_PROMPT,
-                    response_model=str,
+                    response_model=JudgeVerdict,
                 )
 
-                answer = str(judge_response).strip().upper()
+                answer = judge_response.verdict.strip().upper()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cognee/eval_framework/evaluation/metrics/rubric.py` around lines 117 - 122,
The call to llm_client.acreate_structured_output is passing response_model=str
which is invalid; define a Pydantic BaseModel (e.g., VerdictModel with a field
like verdict: Literal["YES","NO"] or an Enum) and pass that class as
response_model instead, then update handling of judge_response (and any
downstream reads) to access the model field (e.g., judge_response.verdict) and
keep the rest of the call (prompt, _JUDGE_SYSTEM_PROMPT) unchanged; ensure the
new model is imported/defined in rubric.py and used where
acreate_structured_output is invoked.
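
Whatever response model is chosen, the downstream handling amounts to normalizing the judge's answer to a strict YES/NO verdict. A stdlib sketch of that step (treating anything else as "NO" is an assumption, not the PR's documented behavior):

```python
# Hypothetical verdict parser: normalize whitespace/case, reject anything
# that is not a strict YES/NO by defaulting to NO.
def parse_verdict(raw: str) -> str:
    """Normalize a judge response to 'YES' or 'NO'; unknown text counts as 'NO'."""
    verdict = raw.strip().upper()
    return verdict if verdict in ("YES", "NO") else "NO"


print(parse_verdict(" yes\n"))   # YES
print(parse_verdict("Unclear"))  # NO
```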

Comment on lines +76 to +82
if __name__ == "__main__":
    loop = asyncio.new_event_loop()
    asyncio.set_event_loop(loop)
    try:
        loop.run_until_complete(main())
    finally:
        print("Done")
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Missing loop.close() and consider using asyncio.run() instead.

The event loop created at line 77 is not closed in the finally block, which could lead to resource warnings. Additionally, manual loop management is unnecessary here — asyncio.run() handles loop creation, execution, and cleanup automatically.

Proposed fix
 if __name__ == "__main__":
-    loop = asyncio.new_event_loop()
-    asyncio.set_event_loop(loop)
-    try:
-        loop.run_until_complete(main())
-    finally:
-        print("Done")
+    asyncio.run(main())
+    print("Done")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cognee/eval_framework/run_beam_eval.py` around lines 76 - 82, The current
manual event loop creation (loop = asyncio.new_event_loop();
asyncio.set_event_loop(loop); loop.run_until_complete(main())) never closes the
loop — either replace the whole manual pattern with asyncio.run(main()) to let
asyncio handle creation/cleanup, or if you must keep the manual approach around
the main() call, add loop.close() in the finally block after printing (or
instead of the print) to ensure resources are released; update references to the
loop variable accordingly (main(), loop).
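
The recommended pattern can be seen in isolation: asyncio.run() creates the event loop, runs the coroutine to completion, and closes the loop on exit, with no manual loop management.

```python
# Minimal demonstration of the asyncio.run() pattern; the coroutine body
# stands in for the evaluation pipeline.
import asyncio


async def main() -> str:
    await asyncio.sleep(0)  # stand-in for the evaluation pipeline
    return "done"


if __name__ == "__main__":
    print(asyncio.run(main()))  # done
```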

@Vasilije1990 Vasilije1990 merged commit a1ae694 into dev Apr 5, 2026
153 of 154 checks passed
@Vasilije1990 Vasilije1990 deleted the feature/beam-benchmark-eval branch April 5, 2026 00:58