feat: add tool-aware live-eval payloads and a deterministic tool-call accuracy scorer by omeraplak · Pull Request #1055 · VoltAgent/voltagent

omeraplak · 2026-02-12T04:13:28Z

PR Checklist

Please check if your PR fulfills the following requirements:

The commit message follows our guidelines: https://voltagent.dev/docs/community/contributing/#commit-convention

Bugs / Features

Related issue(s) linked
Tests for the changes have been added
Docs have been added / updated
Changesets have been added https://voltagent.dev/docs/community/contributing/#creating-a-changeset

What is the current behavior?

What is the new behavior?

fixes (issue)

Notes for reviewers

Summary by cubic

Adds tool-aware live-eval payloads (messages, toolCalls, toolResults) and a deterministic Tool Call Accuracy scorer. Adds an onToolError hook for custom tool error serialization and improves observability.

New Features
- Live-eval payload exposes messages, toolCalls, and toolResults; derives tool data from the step chain or metadata when absent.
- Exported new types: AgentEvalToolCall and AgentEvalToolResult.
- Added createToolCallAccuracyScorerCode for single-tool or ordered-chain validation; supports strict/lenient modes and falls back to messages/output when toolCalls are missing.
- Observability: scorer spans include counts for messages, toolCalls, and toolResults.
- New onToolError hook to customize tool error payloads before serialization.
- Examples and docs updated (tool-eval agent, tool order and execution-health scorers); tests added.

Written for commit 596c0ad. Summary will update on new commits.
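To make the single-tool and ordered-chain modes concrete, here is a minimal standalone sketch of a deterministic check. It does not call `createToolCallAccuracyScorerCode`; the `ToolCallLike` shape and the function names are hypothetical. The strict single-tool rule (exactly one call, and it matches) follows the review discussion, while the lenient rules and the strict ordered rule (exact sequence) are assumptions made for illustration.

```typescript
// Hypothetical shape for illustration; the real AgentEvalToolCall may differ.
interface ToolCallLike {
  toolName: string;
}

interface AccuracyOptions {
  expectedTool?: string;
  expectedToolOrder?: string[];
  strictMode?: boolean;
}

// True when `expected` appears in `actual` in order, allowing other calls in between.
function isOrderedSubsequence(expected: string[], actual: string[]): boolean {
  let next = 0;
  for (const toolName of actual) {
    if (next < expected.length && toolName === expected[next]) next += 1;
  }
  return next === expected.length;
}

// Returns 1 for a pass and 0 for a fail; no model call is involved.
function scoreToolCalls(toolCalls: ToolCallLike[], options: AccuracyOptions): number {
  const actual = toolCalls.map((call) => call.toolName);
  const strict = options.strictMode ?? false;

  if (options.expectedToolOrder) {
    const expected = options.expectedToolOrder;
    if (strict) {
      // Assumed strict rule: the actual sequence must equal the expected one.
      const same =
        actual.length === expected.length && actual.every((tool, i) => tool === expected[i]);
      return same ? 1 : 0;
    }
    // Assumed lenient rule: the expected tools appear in order, extras allowed.
    return isOrderedSubsequence(expected, actual) ? 1 : 0;
  }

  if (options.expectedTool) {
    const expectedTool = options.expectedTool;
    if (strict) return actual.length === 1 && actual[0] === expectedTool ? 1 : 0;
    return actual.some((tool) => tool === expectedTool) ? 1 : 0;
  }

  throw new Error("Provide expectedTool or expectedToolOrder");
}
```

Because the result depends only on the recorded tool names, the same payload always produces the same score.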

Summary by CodeRabbit

New Features
- Tool-aware live-eval payloads (messages, toolCalls, toolResults) and exported tool-call/result types
- Built-in Tool Call Accuracy scorer (single-tool and ordered modes, strict/lenient)
- Demo agent/example showcasing tool-eval flows and scorers
Documentation
- Updated docs, READMEs and changelog (including onToolError hook example) with live-eval and scorer guidance
Tests
- New tests covering eval payload construction and tool-call accuracy scorer behaviors

changeset-bot · 2026-02-12T04:13:33Z

🦋 Changeset detected

Latest commit: 596c0ad

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 2 packages

Name	Type
@voltagent/core	Minor
@voltagent/scorers	Minor

coderabbitai · 2026-02-12T04:13:46Z

📝 Walkthrough

Walkthrough

Adds tool-aware live-eval payloads (messages, toolCalls, toolResults), new public AgentEvalToolCall/AgentEvalToolResult types, normalization/extraction helpers in eval payload construction, a code-mode createToolCallAccuracyScorerCode scorer (single-tool and ordered modes, strict/lenient), tests, examples, and documentation updates.

Changes

Cohort / File(s)	Summary
Type Definitions & Exports `packages/core/src/agent/types.ts`, `packages/core/src/index.ts`	Add `AgentEvalToolCall` and `AgentEvalToolResult` interfaces; extend `AgentEvalPayload` with `messages`, `toolCalls`, and `toolResults`; export new types.
Core Eval Logic & Tests `packages/core/src/agent/eval.ts`, `packages/core/src/agent/eval.spec.ts`	Add normalization and extraction helpers for message chains, tool calls, and tool results; update `buildEvalPayload` to populate `messages`, `toolCalls`, and `toolResults`; comprehensive tests for metadata-driven payloads and edge cases.
Tool Call Accuracy Scorer & Tests `packages/scorers/src/tool-call-accuracy.ts`, `packages/scorers/src/tool-call-accuracy.spec.ts`	Introduce `createToolCallAccuracyScorerCode` with payload resolution, layered tool-name extraction (toolCalls → messages → outputs), single-tool and ordered-tool modes, strict/lenient matching, evaluation metadata, reason text, and extensive unit tests.
Scorer Index & Docs `packages/scorers/src/index.ts`, `packages/scorers/README.md`, `website/evaluation-docs/prebuilt-scorers.md`	Export new scorer and related types; update README/docs to show usage examples (expectedTool, expectedToolOrder, strictMode) and explain payload precedence and scorer patterns.
Examples & Demo `examples/with-live-evals/src/index.ts`, `examples/with-live-evals/README.md`	Add demo tools (`searchProducts`, `checkInventory`), tool-eval agent and scorer examples (order + execution-health), and README section describing live-eval patterns demonstrated.
Website Live-Eval Docs `website/evaluation-docs/live-evaluations.md`	Document new optional `messages`, `toolCalls`, and `toolResults` fields on AgentEvalContext and their structures for process-level scoring.
Changelogs `.changeset/good-bottles-wave.md`, `.changeset/good-bottles-hike.md`	Record feature additions: tool-aware eval payloads, new exported types, and createToolCallAccuracyScorerCode with examples and tests.
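The layered tool-name extraction (toolCalls → messages → outputs) listed above can be sketched as a simple fallback chain. This is an illustration only: `EvalPayloadLike`, `namesFrom`, and `resolveToolNames` are hypothetical names, and the assumption that every source carries a string `toolName` field is a simplification of whatever the scorer actually inspects.

```typescript
// Hypothetical payload shape; only the three source names come from the walkthrough.
interface EvalPayloadLike {
  toolCalls?: Array<{ toolName?: unknown }>;
  messages?: unknown[];
  output?: unknown;
}

function isRecord(value: unknown): value is Record<string, unknown> {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}

// Collects string toolName fields from a list of unknown items.
function namesFrom(items: unknown[]): string[] {
  const names: string[] = [];
  for (const item of items) {
    if (!isRecord(item)) continue;
    const candidate = item["toolName"];
    if (typeof candidate === "string") names.push(candidate);
  }
  return names;
}

// Tries each source in order and returns the first one that yields tool names.
function resolveToolNames(payload: EvalPayloadLike): string[] {
  const fromCalls = namesFrom(payload.toolCalls ?? []);
  if (fromCalls.length > 0) return fromCalls;

  const fromMessages = namesFrom(payload.messages ?? []);
  if (fromMessages.length > 0) return fromMessages;

  return Array.isArray(payload.output) ? namesFrom(payload.output) : [];
}
```

Stopping at the first non-empty source keeps a call that is recorded in both `toolCalls` and `messages` from being counted twice.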

Sequence Diagram(s)

sequenceDiagram
  participant Client as Client
  participant Agent as Agent
  participant Tools as Tools
  participant Scorer as Scorer/Evaluator

  Client->>Agent: Send user query (may trigger tools)
  Agent->>Tools: Invoke tool(s) (tool_call)
  Tools-->>Agent: Return tool result(s) (tool_result)
  Agent->>Agent: Build eval payload (messages, toolCalls, toolResults)
  Agent->>Scorer: Run createToolCallAccuracyScorerCode (payload)
  Scorer-->>Agent: Return score + metadata/reason
  Agent-->>Client: Respond with result and evaluation info

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐰 I hopped through chains of message thread,

Counting calls and results ahead,
Order checked and errors sighed,
Scorers tally every stride,
Little paws, big logs — evaluation fed.

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title 'feat: add tool-aware live-eval payloads and a deterministic tool-call accuracy scorer' accurately and comprehensively describes the main changes introduced across the PR.
Description check	✅ Passed	The description is mostly complete with checked items for tests, docs, and changesets. However, required template sections 'What is the current behavior?' and 'What is the new behavior?' are empty placeholders, and related issues are not linked.

No actionable comments were generated in the recent review. 🎉

🧹 Recent nitpick comments

.changeset/good-bottles-wave.md (2)
7-35: Consider adding brief hook signature documentation.

The example demonstrates usage but doesn't explain the hook's contract. Consider adding a brief note about:

The difference between originalError and error parameters

What happens when the hook returns undefined vs. an object

The expected structure of the return value

This will help users understand the API without needing to reference external documentation.

17-32: Consider improving TypeScript type safety in the example.

The example uses multiple as any type assertions (lines 18, 28, 29) and mixes properties from both error and originalError without explanation. While this may be acceptable for example code, consider:

Adding a brief comment explaining when to use error vs. originalError

Optionally demonstrating a type guard approach instead of as any casts

For example:
// Type guard approach (optional enhancement)
interface AxiosError extends Error {
  isAxiosError: true;
  code?: string;
  response?: { status: number };
}

const isAxiosError = (err: unknown): err is AxiosError => {
  return (err as any)?.isAxiosError === true;
};

// In hook:
if (!isAxiosError(originalError)) {
  return;
}
// Now TypeScript knows originalError is AxiosError
return {
  output: {
    error: true,
    name: originalError.name,
    message: originalError.message,
    code: originalError.code,
    status: originalError.response?.status,
  },
};

cubic-dev-ai

No issues found across 13 files

cloudflare-workers-and-pages · 2026-02-12T04:18:17Z

Deploying voltagent with Cloudflare Pages

Latest commit:	`596c0ad`
Status:	⚡️ Build in progress...

coderabbitai

Actionable comments posted: 1

🤖 Fix all issues with AI agents

In `@packages/scorers/src/tool-call-accuracy.ts`:
- Around line 307-341: extractToolName currently falls back to
normalizeToolTypeName when a normalized type is not "tool_call", which can
produce spurious tool names from AI SDK stream event types (e.g.,
"tool_input_start"). Update extractToolName to call normalizeToolTypeName only
after checking normalizeMessageType(value.type) and ensuring the normalizedType
indicates a real tool invocation: require normalizedType to start with "tool"
(or "tool_") and explicitly exclude known stream-event patterns such as
prefixes/suffixes like "tool_input_", "tool_output_", suffixes "_start", "_end"
(or any other SDK stream event tokens you have), so stream events won't be
converted into phantom tool names; keep references to normalizeMessageType and
normalizeToolTypeName in the new guard.
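One possible shape for the guard described above, as a standalone sketch. `normalizeType` and `isToolInvocationType` are hypothetical stand-ins, not the file's actual `normalizeMessageType` / `normalizeToolTypeName` helpers, and the excluded patterns are only the ones named in this comment.

```typescript
// Normalizes "tool-input-start" or "Tool_Call" style strings to lowercase snake_case.
function normalizeType(rawType: string): string {
  return rawType.trim().toLowerCase().replace(/-/g, "_");
}

// Assumed stream-event markers; extend with whatever tokens your SDK emits.
const STREAM_EVENT_PREFIX = /^tool_(input|output)_/;
const STREAM_EVENT_SUFFIX = /_(start|end)$/;

// Returns true only for types that look like a real tool invocation.
function isToolInvocationType(rawType: string): boolean {
  const normalized = normalizeType(rawType);
  if (!/^tool_/.test(normalized)) return false;
  if (STREAM_EVENT_PREFIX.test(normalized)) return false;
  if (STREAM_EVENT_SUFFIX.test(normalized)) return false;
  return true;
}
```

With a guard like this in front of the name fallback, a type such as `tool_input_start` is rejected before it can be turned into a phantom tool name.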

🧹 Nitpick comments (3)

packages/scorers/src/tool-call-accuracy.spec.ts (1)

174-178: Consider adding a test for empty toolCalls against an expectedTool.

The error-case test is good. You might also consider a test verifying behavior when the payload has toolCalls: [] (or no tool data at all) with an expectedTool configured — this would confirm the scorer returns score 0 when the expected tool was never called, which is a likely real-world scenario.
packages/core/src/agent/eval.ts (1)
880-890: Shallow-copy fallback in cloneUnknownArray may not preserve deep-clone semantics.

When safeStringify fails (e.g., circular references), the fallback [...value] creates a shallow copy. If downstream code mutates nested objects in the returned array, it will affect the original. Since the primary path (JSON.parse(safeStringify(...))) handles the common case and the fallback is a last-resort safety net, this is a minor concern.
Consider using structuredClone for the fallback
   try {
     return JSON.parse(safeStringify(value)) as T[];
   } catch {
-    return [...value] as T[];
+    try {
+      return structuredClone(value) as T[];
+    } catch {
+      return [...value] as T[];
+    }
   }
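For reference, a self-contained version of the three-tier fallback. `cloneArrayWithFallback` is a hypothetical name, and `JSON.stringify` stands in for the project's `safeStringify` only so the sketch runs without the internal package; it assumes a runtime that provides the global `structuredClone`.

```typescript
// Deep-clones an array when possible, degrading gracefully for unusual values.
// JSON.stringify is a stand-in for safeStringify in this sketch.
function cloneArrayWithFallback<T>(value: T[]): T[] {
  try {
    return JSON.parse(JSON.stringify(value)) as T[];
  } catch {
    try {
      // structuredClone handles circular references that JSON cannot serialize.
      return structuredClone(value);
    } catch {
      // Last resort (e.g. a circular object that also holds a function):
      // the copy is shallow, so nested objects are shared with the original.
      return [...value];
    }
  }
}
```

Only the last tier shares nested objects with the source, which narrows the mutation risk described above to values that neither serializer can handle.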
packages/scorers/src/tool-call-accuracy.ts (1)
453-459: isPlainRecord uses stricter prototype check than the core module's version.

This version checks Object.getPrototypeOf(value) === Object.prototype || proto === null, which correctly rejects class instances and arrays. The return type Record<string, any> is slightly less strict than using unknown, but it's consistent with the property access patterns in this file.
Nit: consider `Record<string, unknown>` for tighter type narrowing
-function isPlainRecord(value: unknown): value is Record<string, any> {
+function isPlainRecord(value: unknown): value is Record<string, unknown> {
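A small sketch of what the tighter signature implies for callers. The prototype check mirrors the one described above; `readToolName` is a hypothetical caller added for illustration.

```typescript
function isPlainRecord(value: unknown): value is Record<string, unknown> {
  if (typeof value !== "object" || value === null) return false;
  const proto = Object.getPrototypeOf(value);
  // Rejects arrays and class instances; accepts literals and Object.create(null).
  return proto === Object.prototype || proto === null;
}

// With `unknown` values, each property must be narrowed before it is used.
function readToolName(value: unknown): string | undefined {
  if (!isPlainRecord(value)) return undefined;
  const candidate = value["toolName"];
  return typeof candidate === "string" ? candidate : undefined;
}
```

The cost is an explicit `typeof` check at each access site, which `Record<string, any>` would let the compiler skip.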

packages/scorers/src/tool-call-accuracy.ts

coderabbitai

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

packages/core/src/agent/eval.ts (1)
202-205: ⚠️ Potential issue | 🟡 Minor

Coding guideline violation: use safeStringify instead of JSON.stringify.

Line 204 uses JSON.stringify(expected) directly. The codebase guideline requires using safeStringify from @voltagent/internal/utils for all serialization in .ts files.
Proposed fix
   const expected = metrics.combinedMetadata?.expected;
   if (expected) {
     attributes["eval.expected"] =
-      typeof expected === "string" ? expected : JSON.stringify(expected);
+      typeof expected === "string" ? expected : safeStringify(expected);
   }
As per coding guidelines: `**/*.ts`: Never use JSON.stringify; use the `safeStringify` function instead, imported from `@voltagent/internal`

🧹 Nitpick comments (5)

packages/core/src/agent/eval.ts (2)

880-890: Unchecked type cast on deserialized metadata arrays.

cloneUnknownArray<T> casts the parsed JSON to T[] without any runtime shape validation. When called with cloneUnknownArray<AgentEvalToolCall>(metadataRecord?.toolCalls), malformed objects in metadata will silently masquerade as AgentEvalToolCall items and could cause downstream scorer failures that are hard to trace.

Consider adding a minimal shape guard (e.g., checking for required fields like toolName) or at least documenting that callers must ensure the source data conforms.
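A minimal shape guard along these lines might look like the sketch below. `ToolCallShape`, `isToolCallShape`, and `filterToolCalls` are hypothetical names, and treating `toolName` as the only required field is an assumption based on the suggestion above.

```typescript
// Hypothetical minimal shape; the real AgentEvalToolCall may carry more fields.
interface ToolCallShape {
  toolName: string;
}

function isToolCallShape(value: unknown): value is ToolCallShape {
  return (
    typeof value === "object" &&
    value !== null &&
    typeof (value as { toolName?: unknown }).toolName === "string"
  );
}

// Keeps only well-formed entries and reports how many were dropped.
function filterToolCalls(raw: unknown): { valid: ToolCallShape[]; dropped: number } {
  if (!Array.isArray(raw)) return { valid: [], dropped: 0 };
  const items: unknown[] = raw;
  const valid = items.filter(isToolCallShape);
  return { valid, dropped: items.length - valid.length };
}
```

Surfacing the `dropped` count (for example as a span attribute) would make malformed metadata visible instead of letting it fail later inside a scorer.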

857-878: Potential duplicate messages when oc.input overlaps with oc.conversationSteps.

buildEvalMessageChain concatenates normalizeMessageChainSource(oc.input) with normalizeConversationSteps(oc.conversationSteps). If the conversation steps already include the user's input message, the chain will contain duplicates. If this is intentional (input is always a separate prompt, not part of conversation history), a brief comment clarifying the contract would help future maintainers.
packages/scorers/src/tool-call-accuracy.ts (3)
60-154: Clean scorer factory with well-separated score and reason phases.

The buildScorer chain pattern is used correctly. The validation at lines 75–79 prevents misconfiguration. The mode determination logic is sound.

One subtle behavior: if a user provides both expectedTool and expectedToolOrder, the mode silently becomes "tool_order" and expectedTool is only included in the evaluation metadata — it doesn't influence scoring. Consider documenting this precedence, or throwing if both are supplied, to avoid confusion.
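If throwing is preferred over silent precedence, the check could be as small as the sketch below. `resolveMode` and the `"single_tool"` label are hypothetical; only the `"tool_order"` mode name appears in this review.

```typescript
interface ModeOptions {
  expectedTool?: string;
  expectedToolOrder?: string[];
}

type ScoringMode = "single_tool" | "tool_order";

// Rejects ambiguous or empty configuration instead of silently picking a mode.
function resolveMode(options: ModeOptions): ScoringMode {
  const hasTool = typeof options.expectedTool === "string" && options.expectedTool.length > 0;
  const hasOrder =
    Array.isArray(options.expectedToolOrder) && options.expectedToolOrder.length > 0;

  if (hasTool && hasOrder) {
    throw new Error("Pass either expectedTool or expectedToolOrder, not both");
  }
  if (!hasTool && !hasOrder) {
    throw new Error("Pass expectedTool or expectedToolOrder");
  }
  return hasOrder ? "tool_order" : "single_tool";
}
```

Failing at construction time keeps a misconfigured scorer from producing scores that look valid but ignore one of the options.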

227-268: Potential duplicate tool names from a single message in mixed-format payloads.

A message shaped like { type: "tool-call", toolName: "search", toolInvocations: [{ toolName: "search" }] } would push "search" twice into actualTools — once from the direct extraction (line 241–244) and once from toolInvocations (line 248). In strict single-tool mode, the actualTools.length === 1 check at line 107 would then fail despite the correct tool being called.

This is unlikely with real SDK payloads, but could trip up users with custom buildPayload implementations. A simple guard (e.g., early continue after the tool_call branch, or deduplication at the call site) would harden strict mode.
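The early-`continue` variant can be sketched as follows. `MessageLike` and `collectToolNames` are hypothetical, simplified stand-ins for the message handling in this file, using the mixed-format message from the example above.

```typescript
// Hypothetical, simplified message shape for illustration.
interface MessageLike {
  type?: string;
  toolName?: string;
  toolInvocations?: Array<{ toolName?: string }>;
}

function collectToolNames(messages: MessageLike[]): string[] {
  const actualTools: string[] = [];
  for (const message of messages) {
    if (message.type === "tool-call" && typeof message.toolName === "string") {
      actualTools.push(message.toolName);
      continue; // skip toolInvocations so the same call is not counted twice
    }
    for (const invocation of message.toolInvocations ?? []) {
      if (typeof invocation.toolName === "string") actualTools.push(invocation.toolName);
    }
  }
  return actualTools;
}
```

Unlike deduplicating at the call site, this keeps legitimate repeated calls that arrive in separate messages or as separate `toolInvocations` entries.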

498-503: Shallow merge silently overwrites user-provided voltAgent metadata.

mergeMetadata uses a flat spread, so the hard-coded voltAgent key from additional (line 87–93) will completely replace any voltAgent the user passes via the metadata option. For example:
createToolCallAccuracyScorerCode({
  expectedTool: "search",
  metadata: { voltAgent: { team: "platform" }, custom: true },
})
// metadata.voltAgent.team is silently dropped
A shallow merge on the nested voltAgent key would preserve both sides:
Suggested fix
 function mergeMetadata(
   base: Record<string, unknown> | null | undefined,
   additional: Record<string, unknown>,
 ): Record<string, unknown> {
-  return { ...base, ...additional };
+  const merged = { ...base, ...additional };
+  if (base && typeof base.voltAgent === "object" && base.voltAgent !== null &&
+      typeof additional.voltAgent === "object" && additional.voltAgent !== null) {
+    merged.voltAgent = { ...base.voltAgent as Record<string, unknown>, ...additional.voltAgent as Record<string, unknown> };
+  }
+  return merged;
 }

feat: add tool-aware live-eval payloads and a deterministic tool-call…

8337ae0

… accuracy scorer

cubic-dev-ai bot reviewed Feb 12, 2026

coderabbitai bot reviewed Feb 12, 2026

packages/scorers/src/tool-call-accuracy.ts (resolved)

chore: code reviews

68d0590

coderabbitai bot reviewed Feb 12, 2026

lzj960515 approved these changes Feb 12, 2026

chore: update changeset

596c0ad

omeraplak merged commit 21891b4 into main Feb 12, 2026
22 of 23 checks passed

omeraplak deleted the feat/add-tool-scorers branch February 12, 2026 16:08

voltagent-bot mentioned this pull request Feb 12, 2026

ci(changesets): version packages #1057

Open

omeraplak merged 3 commits into main from feat/add-tool-scorers

2 participants
