
feat: add BEAM benchmark integration with rubric-based evaluation#2480

Merged
Vasilije1990 merged 1 commit into dev from feature/beam-benchmark-eval
Apr 5, 2026
Conversation

@Vasilije1990
Contributor

@Vasilije1990 Vasilije1990 commented Mar 25, 2026

Add support for the BEAM long-context conversation benchmark (huggingface.co/datasets/Mohammadta/BEAM) with question-type routing and rubric-based evaluation.

New files:

  • beam_adapter.py: loads conversations from HuggingFace, extracts 20 probing questions (10 types x 2) with rubrics and golden answers. Supports max_batches param to truncate conversations for local runs.
  • beam_router.py: routes questions to appropriate retrievers based on type (GraphCompletion for factual, CotRetriever for multi-hop/ contradiction, SummaryRetriever for summarization) with specialized system prompts per category.
  • rubric.py: LLM-as-judge metric that evaluates each rubric criterion independently (YES/NO per item), returns fraction satisfied. Does not use DeepEval GEval — uses cognee's own LLM client.
  • run_beam_eval.py: entry point configured for 1 conversation, 100K split, rubric+f1 metrics, max_batches=1 for local runs.

Modified files:

  • benchmark_adapters.py: register BEAM adapter
  • deep_eval_adapter.py: add RubricMetric, pass rubric via additional_metadata on LLMTestCase
  • eval_config.py: add BEAM and beam_router as options
  • run_question_answering_module.py: route beam_router to BEAMRouter
  • run_corpus_builder.py: support BEAMAdapter with max_batches
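
The rubric metric described above reduces to a simple rule: each criterion gets an independent YES/NO verdict, and the score is the fraction marked YES. A minimal stdlib sketch of that idea — the names score_rubric and judge_one are illustrative, not the PR's actual API:

```python
# Hypothetical sketch of the scoring rule: judge each criterion
# independently, then return the fraction marked YES.
from typing import Callable, List


def score_rubric(rubric: List[str], judge_one: Callable[[str], str]) -> float:
    """Fraction of rubric criteria the judge marks YES (0.0 for an empty rubric)."""
    if not rubric:
        return 0.0
    satisfied = sum(1 for criterion in rubric if judge_one(criterion) == "YES")
    return satisfied / len(rubric)


# Stub judge standing in for the LLM call:
verdicts = {"mentions the date": "YES", "names both speakers": "NO"}
print(score_rubric(list(verdicts), verdicts.get))  # 0.5
```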

Description

Acceptance Criteria

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Code refactoring
  • Other (please specify):

Screenshots

Pre-submission Checklist

  • I have tested my changes thoroughly before submitting this PR (See CONTRIBUTING.md)
  • This PR contains minimal changes necessary to address the issue/feature
  • My code follows the project's coding standards and style guidelines
  • I have added tests that prove my fix is effective or that my feature works
  • I have added necessary documentation (if applicable)
  • All new and existing tests pass
  • I have searched existing PRs to ensure this change hasn't been submitted already
  • I have linked any relevant issues in the description
  • My commits have clear and descriptive messages

DCO Affirmation

I affirm that all code in every commit of this pull request conforms to the terms of the Topoteretes Developer Certificate of Origin.

Summary by CodeRabbit

Release Notes

  • New Features
    • Added support for the BEAM benchmark dataset, enabling evaluation against large-scale long-context conversation corpus with multiple probing questions per conversation
    • Introduced rubric-based evaluation metric that independently assesses response quality against specified evaluation criteria
    • Added question-type routing system for intelligent selection of answer generation strategies based on question characteristics
    • Provided a dedicated BEAM evaluation pipeline script for end-to-end benchmark execution

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: vasilije <vas.markovic@gmail.com>
@pull-checklist

Please make sure all the checkboxes are checked:

  • I have tested these changes locally.
  • I have reviewed the code changes.
  • I have added end-to-end and unit tests (if applicable).
  • I have updated the documentation and README.md file (if necessary).
  • I have removed unnecessary code and debug statements.
  • PR title is clear and follows the convention.
  • I have tagged reviewers or team members for feedback.

@coderabbitai
Contributor

coderabbitai bot commented Mar 25, 2026

Walkthrough

This PR adds comprehensive BEAM benchmark evaluation support to the framework. It introduces a dedicated router for BEAM questions, a dataset adapter to load BEAM conversations from HuggingFace, a rubric-based evaluation metric, and an end-to-end evaluation script with interconnected corpus building, question answering, and evaluation workflows.

Changes

Cohort / File(s) Summary
BEAM Question Routing
cognee/eval_framework/answer_generation/beam_router.py, cognee/eval_framework/answer_generation/run_question_answering_module.py
New BEAMRouter class routes BEAM probing questions to type-specific retrievers with per-type system prompts and handles completion/context retrieval; run_question_answering conditionally imports and uses BEAMRouter when qa_engine is "beam_router".
BEAM Dataset & Corpus
cognee/eval_framework/benchmark_adapters/beam_adapter.py, cognee/eval_framework/benchmark_adapters/benchmark_adapters.py, cognee/eval_framework/corpus_builder/run_corpus_builder.py
New BEAMAdapter loads BEAM conversations from HuggingFace, extracts questions/answers from probing fields, supports golden context extraction; enum BenchmarkAdapter extended with BEAM member; run_corpus_builder conditionally instantiates BEAMAdapter with max_batches parameter.
Rubric-Based Evaluation Metric
cognee/eval_framework/evaluation/metrics/rubric.py, cognee/eval_framework/evaluation/deep_eval_adapter.py
New RubricMetric class independently judges LLM response satisfaction against rubric criteria via async LLM calls, tracks per-criterion verdicts, and computes fractional scores; deep_eval_adapter registers RubricMetric and passes rubric/question_type metadata to test cases.
Configuration & Documentation
cognee/eval_framework/eval_config.py
Updated documentation for benchmark and qa_engine fields to list "BEAM" and "beam_router" as accepted options respectively.
End-to-End Evaluation Script
cognee/eval_framework/run_beam_eval.py
New executable entry point that orchestrates BEAM evaluation via async control flow, pre-configures EvalConfig with BEAM parameters, sets truncation limit (max_batches=1), and sequentially runs corpus building, question answering, evaluation, and optional dashboard generation.
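
The type-to-retriever routing summarized above can be sketched as a plain lookup table; the retriever names follow the PR description, while the fallback default here is an assumption for illustration:

```python
# Illustrative routing table for BEAM question types; retriever names
# follow the PR text, the fallback default is an assumption.
ROUTING_TABLE = {
    "factual": "GraphCompletion",
    "multi_hop": "CotRetriever",
    "contradiction": "CotRetriever",
    "summarization": "SummaryRetriever",
}


def pick_retriever(question_type: str, default: str = "GraphCompletion") -> str:
    """Return the retriever name for a BEAM question type, with a default fallback."""
    return ROUTING_TABLE.get(question_type, default)


print(pick_retriever("multi_hop"))  # CotRetriever
print(pick_retriever("temporal"))   # GraphCompletion (fallback)
```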

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

Suggested labels

core-team

Suggested reviewers

  • hajdul88
  • lxobr
  • alekszievr
🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (2 warnings)

  • Description check — ⚠️ Warning. The description includes AI-generated summaries and violates the template requirement for human-generated descriptions. Critical template sections (Acceptance Criteria, Type of Change checkboxes, Screenshots, and Pre-submission Checklist) are incomplete or unchecked. Resolution: remove AI-generated content, provide human-written descriptions explaining your reasoning, and complete all required template sections, including Acceptance Criteria, Type of Change selection, and Pre-submission Checklist items, with proper checkbox markings.
  • Docstring Coverage — ⚠️ Warning. Docstring coverage is 52.38%, below the required 80.00% threshold. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (1 passed)

  • Title check — ✅ Passed. The PR title clearly and concisely summarizes the main change: adding BEAM benchmark integration with rubric-based evaluation, which matches the substantial additions across multiple modules.


Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 5

🧹 Nitpick comments (4)
cognee/eval_framework/eval_config.py (1)

17-17: Line exceeds 100 character limit.

The inline comment on line 17 exceeds the 100-character line length guideline. Consider reformatting as a multi-line comment above the field.

Suggested fix
+    # Options: 'cognee_completion', 'cognee_graph_completion',
+    # 'cognee_graph_completion_cot', 'cognee_graph_completion_context_extension', 'beam_router'
-    qa_engine: str = "cognee_graph_completion"  # Options: 'cognee_completion', 'cognee_graph_completion', 'cognee_graph_completion_cot', 'cognee_graph_completion_context_extension', 'beam_router'
+    qa_engine: str = "cognee_graph_completion"

As per coding guidelines: "Maintain line length of 100 characters maximum".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cognee/eval_framework/eval_config.py` at line 17, The inline comment on the
qa_engine field exceeds the 100-char limit; move the long options comment above
the qa_engine declaration as a multi-line comment (or split into multiple
shorter comment lines) so the qa_engine: str = "cognee_graph_completion" line
stays under 100 chars; locate the qa_engine symbol in eval_config.py and replace
the inline options comment with a brief inline note (if needed) and full options
listed on the lines immediately above.
cognee/eval_framework/evaluation/metrics/rubric.py (1)

150-150: Redundant ternary condition.

The if rubric else 0.0 ternary is unreachable since we return early at line 99-103 when rubric is empty. This can be simplified.

Suggested simplification
-        self.score = satisfied / len(rubric) if rubric else 0.0
+        self.score = satisfied / len(rubric)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cognee/eval_framework/evaluation/metrics/rubric.py` at line 150, The
assignment to self.score uses an unreachable ternary (self.score = satisfied /
len(rubric) if rubric else 0.0) because the method already returns early when
rubric is empty; simplify it by removing the redundant conditional and set
self.score = satisfied / len(rubric) in the same location (keep the variable
names satisfied and rubric and the self.score attribute intact).
cognee/eval_framework/run_beam_eval.py (1)

46-53: Add docstring to main() function.

As per coding guidelines, undocumented function definitions are assumed incomplete. A brief docstring describing the evaluation pipeline steps would improve clarity.

Suggested addition
 async def main():
+    """Run end-to-end BEAM benchmark evaluation.
+    
+    Steps: corpus build → question answering → evaluation → dashboard.
+    """
     logger.info("=== BEAM Evaluation: 1 conversation, 100K split ===")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cognee/eval_framework/run_beam_eval.py` around lines 46 - 53, Add a concise
docstring to the async function main() that briefly describes the BEAM
evaluation pipeline and the major steps performed (e.g., building the corpus by
ingesting a conversation, overriding eval_params["_beam_max_batches"] with
BEAM_MAX_BATCHES for faster local runs, and invoking run_corpus_builder),
placing the docstring as the first statement inside main() so tools and readers
can quickly understand the function purpose and high-level flow.
cognee/eval_framework/answer_generation/beam_router.py (1)

163-166: Consider logging exceptions at WARNING or DEBUG level instead of ERROR.

Returning "ERROR: {e}" as answer_text propagates the error gracefully to the evaluation pipeline. However, logging at logger.error may be too severe for expected transient failures (e.g., rate limits). Consider using logger.warning for recoverable per-question failures, reserving logger.error for unexpected fatal conditions.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cognee/eval_framework/answer_generation/beam_router.py` around lines 163 -
166, The except block in beam_router.py currently logs per-question failures
with logger.error and sets answer_text to "ERROR: {e}"; change the log level to
logger.warning (or logger.debug) in the except handler where query_text,
answer_text and retrieval_context are set so transient/recoverable failures are
not treated as fatal—i.e., update the exception logging call that references
query_text[:80] to use logger.warning and keep the existing assignment to
answer_text and retrieval_context in the same except block.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: b6886241-0b6a-490f-b38e-cd56d9b388b2

📥 Commits

Reviewing files that changed from the base of the PR and between 7f5db08 and 27da9ff.

📒 Files selected for processing (9)
  • cognee/eval_framework/answer_generation/beam_router.py
  • cognee/eval_framework/answer_generation/run_question_answering_module.py
  • cognee/eval_framework/benchmark_adapters/beam_adapter.py
  • cognee/eval_framework/benchmark_adapters/benchmark_adapters.py
  • cognee/eval_framework/corpus_builder/run_corpus_builder.py
  • cognee/eval_framework/eval_config.py
  • cognee/eval_framework/evaluation/deep_eval_adapter.py
  • cognee/eval_framework/evaluation/metrics/rubric.py
  • cognee/eval_framework/run_beam_eval.py

Comment on lines +109 to +113
        if self.conversation_index >= len(ds):
            raise IndexError(
                f"conversation_index={self.conversation_index} out of range "
                f"(split '{self.split}' has {len(ds)} conversations)"
            )
Contributor


⚠️ Potential issue | 🟡 Minor

Validate negative conversation_index explicitly.

Line 109 only guards the upper bound. Negative indices currently select from the end of the split, which conflicts with the documented 0-indexed behavior.

💡 Suggested fix
-        if self.conversation_index >= len(ds):
+        if self.conversation_index < 0 or self.conversation_index >= len(ds):
             raise IndexError(
                 f"conversation_index={self.conversation_index} out of range "
                 f"(split '{self.split}' has {len(ds)} conversations)"
             )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cognee/eval_framework/benchmark_adapters/beam_adapter.py` around lines 109 -
113, The current bounds check only rejects indices >= len(ds) and allows
negative Python-style indexing; update the bounds validation where
conversation_index is checked (the block referencing self.conversation_index,
self.split and ds) to explicitly reject negative indices by raising an
IndexError when self.conversation_index < 0, using a consistent error message
(e.g., "conversation_index={...} out of range (split '{self.split}' has
{len(ds)} conversations)"). Ensure both upper and lower bounds are validated
before proceeding.

Comment on lines +119 to +125
        if self.max_batches is not None and len(chat_batches) > self.max_batches:
            logger.info(
                f"Truncating conversation from {len(chat_batches)} batches "
                f"to {self.max_batches} (max_batches)"
            )
            chat_batches = chat_batches[: self.max_batches]
        corpus_text = _flatten_chat(chat_batches)
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

max_batches truncates context but not the evaluated question set.

When max_batches is set, Line 119-Line 125 trims the corpus, but Line 144-Line 172 still emits all probing questions. This can score the model against questions whose evidence was removed, creating systematic false negatives.

💡 Suggested fix
         chat_batches = row["chat"]
         if self.max_batches is not None and len(chat_batches) > self.max_batches:
             logger.info(
                 f"Truncating conversation from {len(chat_batches)} batches "
                 f"to {self.max_batches} (max_batches)"
             )
             chat_batches = chat_batches[: self.max_batches]
+        available_msg_ids = {
+            msg.get("id")
+            for batch in chat_batches
+            for msg in batch
+            if isinstance(msg, dict) and msg.get("id") is not None
+        }

@@
                 source_ids = q.get("source_chat_ids")
+                if self.max_batches is not None and source_ids:
+                    referenced = self._collect_source_ids(source_ids)
+                    if referenced and not referenced.issubset(available_msg_ids):
+                        continue
                 if source_ids and load_golden_context:
                     golden = self._extract_golden_context(chat_batches, source_ids)
                     if golden:
                         qa_pair["golden_context"] = golden
 class BEAMAdapter(BaseBenchmarkAdapter):
+    @staticmethod
+    def _collect_source_ids(source_ids: Any) -> set[int]:
+        ids: set[int] = set()
+        if isinstance(source_ids, list):
+            ids.update(i for i in source_ids if isinstance(i, int))
+        elif isinstance(source_ids, dict):
+            for value in source_ids.values():
+                if isinstance(value, int):
+                    ids.add(value)
+                elif isinstance(value, list):
+                    ids.update(i for i in value if isinstance(i, int))
+        return ids

Also applies to: 144-172

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cognee/eval_framework/benchmark_adapters/beam_adapter.py` around lines 119 -
125, The truncation currently applied to chat_batches (when self.max_batches is
set) only limits the corpus_text via _flatten_chat but does not restrict the set
of probing/evaluated questions emitted later, causing questions to reference
removed context; modify the emission logic so that the probing question list is
filtered or truncated to match the same chat_batches window (e.g., trim the
questions tied to chat_batches to self.max_batches or derive questions from the
truncated corpus_text) before the code that emits/evaluates questions, ensuring
variables like chat_batches, corpus_text, and the probing question collection
are kept in sync.

Comment on lines +152 to +160
                rubric = q.get("rubric", [])
                if isinstance(rubric, str):
                    rubric = [rubric]

                qa_pair: Dict[str, Any] = {
                    "question": q["question"],
                    "answer": answer_text,
                    "question_type": question_type,
                    "rubric": rubric,
Contributor


⚠️ Potential issue | 🟡 Minor

Normalize rubric to a stable list type before emitting metadata.

At Line 152-Line 155, only string rubrics are normalized. If the dataset returns None or another type, downstream rubric scoring can receive unexpected types.

💡 Suggested fix
-                rubric = q.get("rubric", [])
-                if isinstance(rubric, str):
-                    rubric = [rubric]
+                raw_rubric = q.get("rubric", [])
+                if isinstance(raw_rubric, str):
+                    rubric = [raw_rubric]
+                elif isinstance(raw_rubric, list):
+                    rubric = [str(item) for item in raw_rubric if item is not None]
+                else:
+                    rubric = []
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cognee/eval_framework/benchmark_adapters/beam_adapter.py` around lines 152 -
160, The rubric variable is only normalized when it's a string, allowing None or
other types through; update the normalization before building qa_pair so rubric
is always a stable list: if rubric is None set it to [], if it's a str wrap it
in a list, and for any non-list types coerce/wrap them into a single-item list;
ensure this normalization occurs just before constructing qa_pair (the rubric
variable used in the qa_pair dict) so downstream rubric scoring always receives
a list.
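
A standalone sketch of the normalization this comment asks for; the helper name normalize_rubric is illustrative, not the PR's code:

```python
# Hypothetical normalization helper: always return a list of strings,
# whatever shape the dataset's rubric field arrives in.
from typing import Any, List


def normalize_rubric(raw: Any) -> List[str]:
    """Coerce a rubric field (None, str, list, or other) to a stable list of strings."""
    if raw is None:
        return []
    if isinstance(raw, str):
        return [raw]
    if isinstance(raw, list):
        return [str(item) for item in raw if item is not None]
    return [str(raw)]


print(normalize_rubric(None))            # []
print(normalize_rubric("covers dates"))  # ['covers dates']
print(normalize_rubric(["a", None, 3]))  # ['a', '3']
```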

Comment on lines +117 to +122
            try:
                judge_response = await llm_client.acreate_structured_output(
                    text_input=prompt,
                    system_prompt=_JUDGE_SYSTEM_PROMPT,
                    response_model=str,
                )
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Verify acreate_structured_output signature and response_model expectations
rg -n "def acreate_structured_output" --type py -A 15 | head -50



🏁 Script executed:

#!/bin/bash
# Search for acreate_structured_output definition
find . -name "*.py" -type f | xargs grep -l "def acreate_structured_output" 2>/dev/null | head -10



🏁 Script executed:

#!/bin/bash
# Also search for response_model parameter usage to understand expectations
rg "response_model\s*=" --type py -B 2 -A 2 | head -100



🏁 Script executed:

#!/bin/bash
# Look for the rubric.py file to understand the context
find . -name "rubric.py" -type f 2>/dev/null



🌐 Web query:

cognee framework acreate_structured_output response_model

💡 Result:

In the Cognee framework, acreate_structured_output is an async method in LLMGateway used to extract structured data from LLM responses into Pydantic models. It abstracts structured-output backends (LiteLLM + Instructor by default, or BAML). Usage:

    from cognee.infrastructure.llm.LLMGateway import LLMGateway
    from pydantic import BaseModel
    from typing import List

    class KnowledgeGraph(BaseModel):
        entities: List[str]
        relationships: List[str]

    result = await LLMGateway.acreate_structured_output(
        text,            # input text to process
        system_prompt,   # system prompt for the LLM
        response_model=KnowledgeGraph,  # Pydantic model for structured output
    )

Configure the backend via the env var STRUCTURED_OUTPUT_FRAMEWORK ("instructor" or "baml"). Both produce identical Pydantic-validated results via the unified interface.



🌐 Web query:

cognee llm_client structured output Pydantic BaseModel

💡 Result:

In Cognee, structured output with Pydantic BaseModel is handled through the LLMGateway class (from cognee.infrastructure.llm.LLMGateway), which serves as the primary interface for LLM interactions, including structured outputs. There is no direct "llm_client" class exposed in the public API; instead, LLMGateway provides a unified, backend-agnostic entry point that routes to either LiteLLM + Instructor (default) or BAML based on configuration. Key usage for structured output:

1. Define Pydantic models inheriting from BaseModel:

    from pydantic import BaseModel
    from typing import List

    class MiniEntity(BaseModel):
        name: str
        type: str

    class MiniGraph(BaseModel):
        nodes: List[MiniEntity]

2. Call the async method:

    import asyncio
    from cognee.infrastructure.llm.LLMGateway import LLMGateway

    async def example():
        text = "Apple develops iPhone; Audi produces the R8."
        system_prompt = "Extract entities as nodes with name and type."
        result = await LLMGateway.acreate_structured_output(text, system_prompt, MiniGraph)
        print(result)  # returns a validated MiniGraph instance

A synchronous variant, create_structured_output, also exists. Configuration (via .env or cognee.config):

- STRUCTURED_OUTPUT_FRAMEWORK=instructor (default, uses LiteLLM + Instructor for Pydantic coercion)
- STRUCTURED_OUTPUT_FRAMEWORK=baml (alternative DSL-based framework)
- LLM_PROVIDER=openai (or anthropic, ollama, etc.)
- LLM_MODEL=gpt-4o-mini
- LLM_API_KEY=sk-...

Both frameworks produce identical Pydantic-validated outputs via the same API. LLMGateway abstracts LiteLLM for 100+ providers with rate limiting and retries. Install extras: pip install cognee[litellm] or cognee[baml]. This is used internally for knowledge graph extraction and custom pipelines.



Critical: response_model=str is invalid for acreate_structured_output.

The acreate_structured_output method in Cognee's LLMGateway requires response_model to be a Pydantic BaseModel class, not a primitive type like str. Passing str will cause a validation error. Define a Pydantic model for the YES/NO verdict instead.

Proposed fix
+from pydantic import BaseModel
+
+class JudgeVerdict(BaseModel):
+    verdict: str  # "YES" or "NO"
+
 # In a_measure():
             try:
                 judge_response = await llm_client.acreate_structured_output(
                     text_input=prompt,
                     system_prompt=_JUDGE_SYSTEM_PROMPT,
-                    response_model=str,
+                    response_model=JudgeVerdict,
                 )
 
-                answer = str(judge_response).strip().upper()
+                answer = judge_response.verdict.strip().upper()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cognee/eval_framework/evaluation/metrics/rubric.py` around lines 117 - 122,
The call to llm_client.acreate_structured_output is passing response_model=str
which is invalid; define a Pydantic BaseModel (e.g., VerdictModel with a field
like verdict: Literal["YES","NO"] or an Enum) and pass that class as
response_model instead, then update handling of judge_response (and any
downstream reads) to access the model field (e.g., judge_response.verdict) and
keep the rest of the call (prompt, _JUDGE_SYSTEM_PROMPT) unchanged; ensure the
new model is imported/defined in rubric.py and used where
acreate_structured_output is invoked.
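
Whatever response model is chosen, the downstream handling amounts to normalizing the judge's answer to a strict YES/NO verdict. A stdlib sketch of that step (treating anything else as "NO" is an assumption, not the PR's documented behavior):

```python
# Hypothetical verdict parser: normalize whitespace/case, reject anything
# that is not a strict YES/NO by defaulting to NO.
def parse_verdict(raw: str) -> str:
    """Normalize a judge response to 'YES' or 'NO'; unknown text counts as 'NO'."""
    verdict = raw.strip().upper()
    return verdict if verdict in ("YES", "NO") else "NO"


print(parse_verdict(" yes\n"))   # YES
print(parse_verdict("Unclear"))  # NO
```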

Comment on lines +76 to +82
if __name__ == "__main__":
    loop = asyncio.new_event_loop()
    asyncio.set_event_loop(loop)
    try:
        loop.run_until_complete(main())
    finally:
        print("Done")
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Missing loop.close() and consider using asyncio.run() instead.

The event loop created at line 77 is not closed in the finally block, which could lead to resource warnings. Additionally, manual loop management is unnecessary here — asyncio.run() handles loop creation, execution, and cleanup automatically.

Proposed fix
 if __name__ == "__main__":
-    loop = asyncio.new_event_loop()
-    asyncio.set_event_loop(loop)
-    try:
-        loop.run_until_complete(main())
-    finally:
-        print("Done")
+    asyncio.run(main())
+    print("Done")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cognee/eval_framework/run_beam_eval.py` around lines 76 - 82, The current
manual event loop creation (loop = asyncio.new_event_loop();
asyncio.set_event_loop(loop); loop.run_until_complete(main())) never closes the
loop — either replace the whole manual pattern with asyncio.run(main()) to let
asyncio handle creation/cleanup, or if you must keep the manual approach around
the main() call, add loop.close() in the finally block after printing (or
instead of the print) to ensure resources are released; update references to the
loop variable accordingly (main(), loop).
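
The recommended pattern can be seen in isolation: asyncio.run() creates the event loop, runs the coroutine to completion, and closes the loop on exit, with no manual loop management.

```python
# Minimal demonstration of the asyncio.run() pattern; the coroutine body
# stands in for the evaluation pipeline.
import asyncio


async def main() -> str:
    await asyncio.sleep(0)  # stand-in for the evaluation pipeline
    return "done"


if __name__ == "__main__":
    print(asyncio.run(main()))  # done
```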

@Vasilije1990 Vasilije1990 merged commit a1ae694 into dev Apr 5, 2026
153 of 154 checks passed
@Vasilije1990 Vasilije1990 deleted the feature/beam-benchmark-eval branch April 5, 2026 00:58