feat: add tool-aware live-eval payloads and a deterministic tool-call accuracy scorer#1055

Merged
omeraplak merged 3 commits into main from feat/add-tool-scorers
Feb 12, 2026

Conversation

@omeraplak
Member

@omeraplak omeraplak commented Feb 12, 2026

PR Checklist

Please check if your PR fulfills the following requirements:

Bugs / Features

What is the current behavior?

What is the new behavior?

fixes (issue)

Notes for reviewers


Summary by cubic

Adds tool-aware live-eval payloads (messages, toolCalls, toolResults) and a deterministic Tool Call Accuracy scorer. Adds an onToolError hook for custom tool error serialization and improves observability.

  • New Features
    • Live-eval payload exposes messages, toolCalls, and toolResults; derives tool data from the step chain or metadata when absent.
    • Exported new types: AgentEvalToolCall and AgentEvalToolResult.
    • Added createToolCallAccuracyScorerCode for single-tool or ordered-chain validation; supports strict/lenient modes and falls back to messages/output when toolCalls are missing.
    • Observability: scorer spans include counts for messages, toolCalls, and toolResults.
    • New onToolError hook to customize tool error payloads before serialization.
    • Examples and docs updated (tool-eval agent, tool order and execution-health scorers); tests added.
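
To make the single-tool and ordered-chain modes concrete, here is a minimal, self-contained sketch of the kind of deterministic matching such a scorer could perform. It is not the library's implementation: `scoreToolCalls` and `ToolAccuracyOptions` are hypothetical names, and the exact strict/lenient semantics shown (exact sequence vs. in-order subsequence, exactly one call vs. at least one call) are assumptions for illustration.

```typescript
// Hypothetical stand-in for the scorer's matching logic; not the library's code.
type ToolAccuracyOptions = {
  expectedTool?: string;
  expectedToolOrder?: string[];
  strictMode?: boolean;
};

// Returns 1 when the observed tool names satisfy the expectation, otherwise 0.
function scoreToolCalls(actualTools: string[], options: ToolAccuracyOptions): number {
  const { expectedTool, expectedToolOrder, strictMode = false } = options;

  if (expectedToolOrder && expectedToolOrder.length > 0) {
    if (strictMode) {
      // Assumed strict semantics: the exact sequence, with no extra calls.
      const sameLength = actualTools.length === expectedToolOrder.length;
      const sameItems = expectedToolOrder.every((name, i) => actualTools[i] === name);
      return sameLength && sameItems ? 1 : 0;
    }
    // Assumed lenient semantics: expected names appear in order; extras are allowed.
    let cursor = 0;
    for (const name of actualTools) {
      if (cursor < expectedToolOrder.length && name === expectedToolOrder[cursor]) {
        cursor += 1;
      }
    }
    return cursor === expectedToolOrder.length ? 1 : 0;
  }

  if (expectedTool !== undefined) {
    const called = actualTools.includes(expectedTool);
    if (!called) return 0;
    // Assumed strict semantics: the expected tool must be the only call.
    if (strictMode && actualTools.length !== 1) return 0;
    return 1;
  }

  throw new Error("Provide expectedTool or expectedToolOrder");
}
```

Because the result depends only on the recorded tool names, the same payload always yields the same score, which is what makes this style of scorer deterministic.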

Written for commit 596c0ad. Summary will update on new commits.

Summary by CodeRabbit

  • New Features

    • Tool-aware live-eval payloads (messages, toolCalls, toolResults) and exported tool-call/result types
    • Built-in Tool Call Accuracy scorer (single-tool and ordered modes, strict/lenient)
    • Demo agent/example showcasing tool-eval flows and scorers
  • Documentation

    • Updated docs, READMEs and changelog (including onToolError hook example) with live-eval and scorer guidance
  • Tests

    • New tests covering eval payload construction and tool-call accuracy scorer behaviors

@changeset-bot

changeset-bot bot commented Feb 12, 2026

🦋 Changeset detected

Latest commit: 596c0ad

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 2 packages
| Name | Type |
| --- | --- |
| @voltagent/core | Minor |
| @voltagent/scorers | Minor |

@coderabbitai
Contributor

coderabbitai bot commented Feb 12, 2026

📝 Walkthrough

Adds tool-aware live-eval payloads (messages, toolCalls, toolResults), new public AgentEvalToolCall/AgentEvalToolResult types, normalization/extraction helpers in eval payload construction, a code-mode createToolCallAccuracyScorerCode scorer (single-tool and ordered modes, strict/lenient), tests, examples, and documentation updates.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Type Definitions & Exports: packages/core/src/agent/types.ts, packages/core/src/index.ts | Add AgentEvalToolCall and AgentEvalToolResult interfaces; extend AgentEvalPayload with messages, toolCalls, and toolResults; export new types. |
| Core Eval Logic & Tests: packages/core/src/agent/eval.ts, packages/core/src/agent/eval.spec.ts | Add normalization and extraction helpers for message chains, tool calls, and tool results; update buildEvalPayload to populate messages, toolCalls, and toolResults; comprehensive tests for metadata-driven payloads and edge cases. |
| Tool Call Accuracy Scorer & Tests: packages/scorers/src/tool-call-accuracy.ts, packages/scorers/src/tool-call-accuracy.spec.ts | Introduce createToolCallAccuracyScorerCode with payload resolution, layered tool-name extraction (toolCalls → messages → outputs), single-tool and ordered-tool modes, strict/lenient matching, evaluation metadata, reason text, and extensive unit tests. |
| Scorer Index & Docs: packages/scorers/src/index.ts, packages/scorers/README.md, website/evaluation-docs/prebuilt-scorers.md | Export new scorer and related types; update README/docs to show usage examples (expectedTool, expectedToolOrder, strictMode) and explain payload precedence and scorer patterns. |
| Examples & Demo: examples/with-live-evals/src/index.ts, examples/with-live-evals/README.md | Add demo tools (searchProducts, checkInventory), tool-eval agent and scorer examples (order + execution-health), and README section describing live-eval patterns demonstrated. |
| Website Live-Eval Docs: website/evaluation-docs/live-evaluations.md | Document new optional messages, toolCalls, and toolResults fields on AgentEvalContext and their structures for process-level scoring. |
| Changelogs: .changeset/good-bottles-wave.md, .changeset/good-bottles-hike.md | Record feature additions: tool-aware eval payloads, new exported types, and createToolCallAccuracyScorerCode with examples and tests. |
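
The layered tool-name extraction mentioned for the scorer (toolCalls → messages → outputs) can be pictured with a short sketch. The payload shape below is a deliberately simplified, hypothetical one (`SketchPayload`, `resolveToolNames`); the real AgentEvalPayload has more fields, and the actual fallback rules may differ.

```typescript
// Hypothetical, simplified shapes for illustration only.
type SketchToolCall = { toolName: string };
type SketchMessage = { type?: string; toolName?: string };
type SketchPayload = {
  toolCalls?: SketchToolCall[];
  messages?: SketchMessage[];
  output?: unknown;
};

function resolveToolNames(payload: SketchPayload): string[] {
  // Layer 1: explicit toolCalls take precedence when present.
  if (payload.toolCalls && payload.toolCalls.length > 0) {
    return payload.toolCalls.map((call) => call.toolName);
  }

  // Layer 2: fall back to tool-call entries found in the message chain.
  const fromMessages: string[] = [];
  for (const message of payload.messages ?? []) {
    if (message.type === "tool_call" && typeof message.toolName === "string") {
      fromMessages.push(message.toolName);
    }
  }
  if (fromMessages.length > 0) return fromMessages;

  // Layer 3: last resort, look for records carrying a toolName in the output.
  const fromOutput: string[] = [];
  if (Array.isArray(payload.output)) {
    for (const item of payload.output as unknown[]) {
      if (typeof item === "object" && item !== null) {
        const candidate = (item as { toolName?: unknown }).toolName;
        if (typeof candidate === "string") fromOutput.push(candidate);
      }
    }
  }
  return fromOutput;
}
```

Stopping at the first layer that yields names avoids counting the same call twice when a payload carries both toolCalls and a message chain.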

Sequence Diagram(s)

sequenceDiagram
  participant Client as Client
  participant Agent as Agent
  participant Tools as Tools
  participant Scorer as Scorer/Evaluator

  Client->>Agent: Send user query (may trigger tools)
  Agent->>Tools: Invoke tool(s) (tool_call)
  Tools-->>Agent: Return tool result(s) (tool_result)
  Agent->>Agent: Build eval payload (messages, toolCalls, toolResults)
  Agent->>Scorer: Run createToolCallAccuracyScorerCode (payload)
  Scorer-->>Agent: Return score + metadata/reason
  Agent-->>Client: Respond with result and evaluation info

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐰 I hopped through chains of message thread,

Counting calls and results ahead,
Order checked and errors sighed,
Scorers tally every stride,
Little paws, big logs — evaluation fed.

🚥 Pre-merge checks | ✅ 2 | ❌ 1
❌ Failed checks (1 warning)
| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
✅ Passed checks (2 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title 'feat: add tool-aware live-eval payloads and a deterministic tool-call accuracy scorer' accurately and comprehensively describes the main changes introduced across the PR. |
| Description check | ✅ Passed | The description is mostly complete with checked items for tests, docs, and changesets. However, required template sections 'What is the current behavior?' and 'What is the new behavior?' are empty placeholders, and related issues are not linked. |

No actionable comments were generated in the recent review. 🎉

🧹 Recent nitpick comments
.changeset/good-bottles-wave.md (2)

7-35: Consider adding brief hook signature documentation.

The example demonstrates usage but doesn't explain the hook's contract. Consider adding a brief note about:

  • The difference between originalError and error parameters
  • What happens when the hook returns undefined vs. an object
  • The expected structure of the return value

This will help users understand the API without needing to reference external documentation.


17-32: Consider improving TypeScript type safety in the example.

The example uses multiple as any type assertions (lines 18, 28, 29) and mixes properties from both error and originalError without explanation. While this may be acceptable for example code, consider:

  1. Adding a brief comment explaining when to use error vs. originalError
  2. Optionally demonstrating a type guard approach instead of as any casts

For example:

// Type guard approach (optional enhancement)
interface AxiosError extends Error {
  isAxiosError: true;
  code?: string;
  response?: { status: number };
}

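// Narrow from unknown without falling back to an `as any` cast
const isAxiosError = (err: unknown): err is AxiosError => {
  return (
    typeof err === "object" &&
    err !== null &&
    (err as { isAxiosError?: unknown }).isAxiosError === true
  );
};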

// In hook:
if (!isAxiosError(originalError)) {
  return;
}
// Now TypeScript knows originalError is AxiosError
return {
  output: {
    error: true,
    name: originalError.name,
    message: originalError.message,
    code: originalError.code,
    status: originalError.response?.status,
  },
};

Contributor

@cubic-dev-ai cubic-dev-ai bot left a comment

No issues found across 13 files

@cloudflare-workers-and-pages

cloudflare-workers-and-pages bot commented Feb 12, 2026

Deploying voltagent with Cloudflare Pages

Latest commit: 596c0ad
Status: ⚡️ Build in progress...

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@packages/scorers/src/tool-call-accuracy.ts`:
- Around line 307-341: extractToolName currently falls back to
normalizeToolTypeName when a normalized type is not "tool_call", which can
produce spurious tool names from AI SDK stream event types (e.g.,
"tool_input_start"). Update extractToolName to call normalizeToolTypeName only
after checking normalizeMessageType(value.type) and ensuring the normalizedType
indicates a real tool invocation: require normalizedType to start with "tool"
(or "tool_") and explicitly exclude known stream-event patterns such as
prefixes/suffixes like "tool_input_", "tool_output_", suffixes "_start", "_end"
(or any other SDK stream event tokens you have), so stream events won't be
converted into phantom tool names; keep references to normalizeMessageType and
normalizeToolTypeName in the new guard.
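
A minimal sketch of the guard described above. The bodies of `normalizeMessageType` and `normalizeToolTypeName` here are hypothetical stand-ins (the real helpers live in tool-call-accuracy.ts and are not shown in this review), and `extractToolNameFromType` is an illustrative name; only the prefix/suffix patterns come from the comment itself.

```typescript
// Hypothetical stand-in: trim and convert dashes to underscores.
function normalizeMessageType(type: unknown): string | undefined {
  return typeof type === "string" ? type.trim().replace(/-/g, "_") : undefined;
}

// Hypothetical stand-in: "tool_searchProducts" -> "searchProducts".
function normalizeToolTypeName(normalizedType: string): string | undefined {
  const name = normalizedType.slice("tool_".length);
  return name.length > 0 ? name : undefined;
}

// Stream-event patterns named in the review comment.
const STREAM_EVENT_PREFIXES = ["tool_input_", "tool_output_"];
const STREAM_EVENT_SUFFIXES = ["_start", "_end"];

function isStreamEventType(normalizedType: string): boolean {
  return (
    STREAM_EVENT_PREFIXES.some((prefix) => normalizedType.startsWith(prefix)) ||
    STREAM_EVENT_SUFFIXES.some((suffix) => normalizedType.endsWith(suffix))
  );
}

function extractToolNameFromType(type: unknown): string | undefined {
  const normalizedType = normalizeMessageType(type);
  // Only types in the "tool_" family can encode a tool name.
  if (!normalizedType || !normalizedType.startsWith("tool_")) return undefined;
  // Assumption: generic markers carry the tool name in a separate field.
  if (normalizedType === "tool_call" || normalizedType === "tool_result") return undefined;
  // Stream events must never be turned into phantom tool names.
  if (isStreamEventType(normalizedType)) return undefined;
  return normalizeToolTypeName(normalizedType);
}
```

One trade-off of the suffix rule: a tool whose real name ends in `_start` or `_end` would also be skipped, so an explicit list of known stream-event types may be safer if such names are in use.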
🧹 Nitpick comments (3)
packages/scorers/src/tool-call-accuracy.spec.ts (1)

174-178: Consider adding a test for empty toolCalls against an expectedTool.

The error-case test is good. You might also consider a test verifying behavior when the payload has toolCalls: [] (or no tool data at all) with an expectedTool configured — this would confirm the scorer returns score 0 when the expected tool was never called, which is a likely real-world scenario.

packages/core/src/agent/eval.ts (1)

880-890: Shallow-copy fallback in cloneUnknownArray may not preserve deep-clone semantics.

When safeStringify fails (e.g., circular references), the fallback [...value] creates a shallow copy. If downstream code mutates nested objects in the returned array, it will affect the original. Since the primary path (JSON.parse(safeStringify(...))) handles the common case and the fallback is a last-resort safety net, this is a minor concern.

Consider using structuredClone for the fallback
   try {
     return JSON.parse(safeStringify(value)) as T[];
   } catch {
-    return [...value] as T[];
+    try {
+      return structuredClone(value) as T[];
+    } catch {
+      return [...value] as T[];
+    }
   }
packages/scorers/src/tool-call-accuracy.ts (1)

453-459: isPlainRecord uses stricter prototype check than the core module's version.

This version checks Object.getPrototypeOf(value) === Object.prototype || proto === null, which correctly rejects class instances and arrays. The return type Record<string, any> is slightly less strict than using unknown, but it's consistent with the property access patterns in this file.

Nit: consider `Record<string, unknown>` for tighter type narrowing
-function isPlainRecord(value: unknown): value is Record<string, any> {
+function isPlainRecord(value: unknown): value is Record<string, unknown> {

Contributor

@coderabbitai coderabbitai bot left a comment

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
packages/core/src/agent/eval.ts (1)

202-205: ⚠️ Potential issue | 🟡 Minor

Coding guideline violation: use safeStringify instead of JSON.stringify.

Line 204 uses JSON.stringify(expected) directly. The codebase guideline requires using safeStringify from @voltagent/internal/utils for all serialization in .ts files.

Proposed fix
   const expected = metrics.combinedMetadata?.expected;
   if (expected) {
     attributes["eval.expected"] =
-      typeof expected === "string" ? expected : JSON.stringify(expected);
+      typeof expected === "string" ? expected : safeStringify(expected);
   }

As per coding guidelines: `**/*.ts`: Never use JSON.stringify; use the `safeStringify` function instead, imported from `@voltagent/internal`

🧹 Nitpick comments (5)
packages/core/src/agent/eval.ts (2)

880-890: Unchecked type cast on deserialized metadata arrays.

cloneUnknownArray<T> casts the parsed JSON to T[] without any runtime shape validation. When called with cloneUnknownArray<AgentEvalToolCall>(metadataRecord?.toolCalls), malformed objects in metadata will silently masquerade as AgentEvalToolCall items and could cause downstream scorer failures that are hard to trace.

Consider adding a minimal shape guard (e.g., checking for required fields like toolName) or at least documenting that callers must ensure the source data conforms.
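
A minimal shape guard along these lines might look like the sketch below. `MinimalToolCall`, `isMinimalToolCall`, and `cloneValidatedToolCalls` are hypothetical names, only `toolName` is treated as required (the one field named above), and returning an empty array for non-array input is an assumption rather than the behavior of cloneUnknownArray.

```typescript
// Hypothetical minimal shape: only toolName is checked at runtime.
type MinimalToolCall = { toolName: string; [key: string]: unknown };

function isMinimalToolCall(value: unknown): value is MinimalToolCall {
  return (
    typeof value === "object" &&
    value !== null &&
    !Array.isArray(value) &&
    typeof (value as Record<string, unknown>).toolName === "string"
  );
}

function cloneValidatedToolCalls(value: unknown): MinimalToolCall[] {
  if (!Array.isArray(value)) return [];
  // The codebase would use safeStringify here; JSON.stringify is used only
  // to keep this sketch self-contained.
  const cloned: unknown = JSON.parse(JSON.stringify(value));
  // Drop malformed entries instead of letting them pass as tool calls.
  return Array.isArray(cloned) ? cloned.filter(isMinimalToolCall) : [];
}
```

Filtering at the clone boundary means a scorer only ever sees entries that at least carry a string `toolName`, so failures surface where the bad metadata enters rather than deep inside scoring.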


857-878: Potential duplicate messages when oc.input overlaps with oc.conversationSteps.

buildEvalMessageChain concatenates normalizeMessageChainSource(oc.input) with normalizeConversationSteps(oc.conversationSteps). If the conversation steps already include the user's input message, the chain will contain duplicates. If this is intentional (input is always a separate prompt, not part of conversation history), a brief comment clarifying the contract would help future maintainers.

packages/scorers/src/tool-call-accuracy.ts (3)

60-154: Clean scorer factory with well-separated score and reason phases.

The buildScorer chain pattern is used correctly. The validation at lines 75–79 prevents misconfiguration. The mode determination logic is sound.

One subtle behavior: if a user provides both expectedTool and expectedToolOrder, the mode silently becomes "tool_order" and expectedTool is only included in the evaluation metadata — it doesn't influence scoring. Consider documenting this precedence, or throwing if both are supplied, to avoid confusion.
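
If the throwing variant is preferred, the check could be as small as the sketch below. `resolveMode` and `ModeOptions` are hypothetical names; "tool_order" is the mode named above, while the "single_tool" label is an assumption.

```typescript
type ModeOptions = { expectedTool?: string; expectedToolOrder?: string[] };
// "single_tool" is a hypothetical label; "tool_order" is the mode named in the review.
type ScorerMode = "single_tool" | "tool_order";

function resolveMode(options: ModeOptions): ScorerMode {
  const hasTool =
    typeof options.expectedTool === "string" && options.expectedTool.length > 0;
  const hasOrder =
    Array.isArray(options.expectedToolOrder) && options.expectedToolOrder.length > 0;

  if (hasTool && hasOrder) {
    // Fail fast instead of silently preferring expectedToolOrder.
    throw new Error("Provide either expectedTool or expectedToolOrder, not both");
  }
  if (!hasTool && !hasOrder) {
    throw new Error("Provide expectedTool or expectedToolOrder");
  }
  return hasOrder ? "tool_order" : "single_tool";
}
```

Throwing is a stricter contract than documenting precedence, and it would be a breaking change for any caller that already passes both options.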


227-268: Potential duplicate tool names from a single message in mixed-format payloads.

A message shaped like { type: "tool-call", toolName: "search", toolInvocations: [{ toolName: "search" }] } would push "search" twice into actualTools — once from the direct extraction (line 241–244) and once from toolInvocations (line 248). In strict single-tool mode, the actualTools.length === 1 check at line 107 would then fail despite the correct tool being called.

This is unlikely with real SDK payloads, but could trip up users with custom buildPayload implementations. A simple guard (e.g., early continue after the tool_call branch, or deduplication at the call site) would harden strict mode.
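
The early-exit guard could be sketched as follows, using the message shape from the example above. `MixedMessage` and `toolNamesFromMessage` are hypothetical names, and the real extraction code handles more formats than this.

```typescript
// Hypothetical message shape matching the mixed-format example above.
type MixedMessage = {
  type?: string;
  toolName?: string;
  toolInvocations?: Array<{ toolName?: string }>;
};

function toolNamesFromMessage(message: MixedMessage): string[] {
  // Direct tool-call message: take the name and stop, so the same call is
  // not counted again through toolInvocations.
  if (message.type === "tool-call" && typeof message.toolName === "string") {
    return [message.toolName];
  }

  // Otherwise collect names from any nested invocations.
  const names: string[] = [];
  for (const invocation of message.toolInvocations ?? []) {
    if (typeof invocation.toolName === "string") {
      names.push(invocation.toolName);
    }
  }
  return names;
}
```

The early return also ignores any different tools listed in `toolInvocations` on the same message; if that combination is possible, deduplicating at the call site would be the more conservative choice.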


498-503: Shallow merge silently overwrites user-provided voltAgent metadata.

mergeMetadata uses a flat spread, so the hard-coded voltAgent key from additional (line 87–93) will completely replace any voltAgent the user passes via the metadata option. For example:

createToolCallAccuracyScorerCode({
  expectedTool: "search",
  metadata: { voltAgent: { team: "platform" }, custom: true },
})
// metadata.voltAgent.team is silently dropped

A shallow merge on the nested voltAgent key would preserve both sides:

Suggested fix
 function mergeMetadata(
   base: Record<string, unknown> | null | undefined,
   additional: Record<string, unknown>,
 ): Record<string, unknown> {
-  return { ...base, ...additional };
+  const merged = { ...base, ...additional };
+  if (base && typeof base.voltAgent === "object" && base.voltAgent !== null &&
+      typeof additional.voltAgent === "object" && additional.voltAgent !== null) {
+    merged.voltAgent = { ...base.voltAgent as Record<string, unknown>, ...additional.voltAgent as Record<string, unknown> };
+  }
+  return merged;
 }

@omeraplak omeraplak merged commit 21891b4 into main Feb 12, 2026
22 of 23 checks passed
@omeraplak omeraplak deleted the feat/add-tool-scorers branch February 12, 2026 16:08