feat: add tool-aware live-eval payloads and a deterministic tool-call accuracy scorer #1055
🦋 Changeset detected

Latest commit: 596c0ad

The changes in this PR will be included in the next version bump. This PR includes changesets to release 2 packages.
📝 Walkthrough

Adds tool-aware live-eval payloads (`messages`, `toolCalls`, `toolResults`), new public `AgentEvalToolCall`/`AgentEvalToolResult` types, normalization/extraction helpers in eval payload construction, a code-mode `createToolCallAccuracyScorerCode` scorer (single-tool and ordered modes, strict/lenient), tests, examples, and documentation updates.
Sequence Diagram(s)

    sequenceDiagram
        participant Client as Client
        participant Agent as Agent
        participant Tools as Tools
        participant Scorer as Scorer/Evaluator
        Client->>Agent: Send user query (may trigger tools)
        Agent->>Tools: Invoke tool(s) (tool_call)
        Tools-->>Agent: Return tool result(s) (tool_result)
        Agent->>Agent: Build eval payload (messages, toolCalls, toolResults)
        Agent->>Scorer: Run createToolCallAccuracyScorerCode (payload)
        Scorer-->>Agent: Return score + metadata/reason
        Agent-->>Client: Respond with result and evaluation info
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
No actionable comments were generated in the recent review. 🎉
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@packages/scorers/src/tool-call-accuracy.ts`:
- Around line 307-341: extractToolName currently falls back to
normalizeToolTypeName when a normalized type is not "tool_call", which can
produce spurious tool names from AI SDK stream event types (e.g.,
"tool_input_start"). Update extractToolName to call normalizeToolTypeName only
after checking normalizeMessageType(value.type) and ensuring the normalizedType
indicates a real tool invocation: require normalizedType to start with "tool"
(or "tool_") and explicitly exclude known stream-event patterns such as
prefixes/suffixes like "tool_input_", "tool_output_", suffixes "_start", "_end"
(or any other SDK stream event tokens you have), so stream events won't be
converted into phantom tool names; keep references to normalizeMessageType and
normalizeToolTypeName in the new guard.
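
A minimal sketch of the guard described above, offered as an illustration rather than the actual patch. `normalizeMessageTypeSketch`, `STREAM_EVENT_PATTERNS`, and `isToolInvocationType` are hypothetical names; the stand-in normalizer only lowercases and swaps hyphens for underscores, which may differ from the real `normalizeMessageType` in `tool-call-accuracy.ts`.

```typescript
// Hypothetical stand-in for normalizeMessageType: trims, lowercases,
// and converts "-" to "_" so "tool-input-start" becomes "tool_input_start".
function normalizeMessageTypeSketch(type: unknown): string | undefined {
  return typeof type === "string"
    ? type.trim().toLowerCase().replace(/-/g, "_")
    : undefined;
}

// Stream-event shapes named in the review comment. The list is an assumption
// and would need to track whatever event types the SDK actually emits.
const STREAM_EVENT_PATTERNS: RegExp[] = [
  /^tool_input_/,
  /^tool_output_/,
  /_start$/,
  /_end$/,
];

// Returns true only when the type looks like a real tool invocation,
// i.e. it starts with "tool_" and matches no stream-event pattern.
function isToolInvocationType(type: unknown): boolean {
  const normalized = normalizeMessageTypeSketch(type);
  if (normalized === undefined || !normalized.startsWith("tool_")) {
    return false;
  }
  return !STREAM_EVENT_PATTERNS.some((pattern) => pattern.test(normalized));
}

console.log(isToolInvocationType("tool-search")); // true
console.log(isToolInvocationType("tool-input-start")); // false
```

With a guard like this, `extractToolName` would call `normalizeToolTypeName` only when `isToolInvocationType(value.type)` holds. One trade-off to keep in mind: a tool whose type genuinely ends in `_start` or `_end` would also be skipped.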
🧹 Nitpick comments (3)
packages/scorers/src/tool-call-accuracy.spec.ts (1)
174-178: Consider adding a test for empty `toolCalls` against an `expectedTool`.

The error-case test is good. You might also consider a test verifying behavior when the payload has `toolCalls: []` (or no tool data at all) with an `expectedTool` configured. This would confirm the scorer returns score `0` when the expected tool was never called, which is a likely real-world scenario.

packages/core/src/agent/eval.ts (1)
880-890: Shallow-copy fallback in `cloneUnknownArray` may not preserve deep-clone semantics.

When `safeStringify` fails (e.g., circular references), the fallback `[...value]` creates a shallow copy. If downstream code mutates nested objects in the returned array, it will affect the original. Since the primary path (`JSON.parse(safeStringify(...))`) handles the common case and the fallback is a last-resort safety net, this is a minor concern.

Consider using structuredClone for the fallback

     try {
       return JSON.parse(safeStringify(value)) as T[];
     } catch {
    -  return [...value] as T[];
    +  try {
    +    return structuredClone(value) as T[];
    +  } catch {
    +    return [...value] as T[];
    +  }
     }

packages/scorers/src/tool-call-accuracy.ts (1)
453-459: `isPlainRecord` uses a stricter prototype check than the core module's version.

This version checks `Object.getPrototypeOf(value) === Object.prototype || proto === null`, which correctly rejects class instances and arrays. The return type `Record<string, any>` is slightly less strict than using `unknown`, but it's consistent with the property access patterns in this file.

Nit: consider `Record<string, unknown>` for tighter type narrowing

    -function isPlainRecord(value: unknown): value is Record<string, any> {
    +function isPlainRecord(value: unknown): value is Record<string, unknown> {
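
To make the effect of the prototype check concrete, here is a small self-contained sketch. The function body is reconstructed from the description above and may not match the file line for line; `ToolCallRecord` and `readToolName` are hypothetical names added only for illustration.

```typescript
// Reconstructed from the review description; the real implementation may differ.
function isPlainRecord(value: unknown): value is Record<string, unknown> {
  if (typeof value !== "object" || value === null) {
    return false;
  }
  const proto = Object.getPrototypeOf(value);
  return proto === Object.prototype || proto === null;
}

// Hypothetical class used only to show that instances are rejected.
class ToolCallRecord {
  toolName: string;
  constructor(toolName: string) {
    this.toolName = toolName;
  }
}

// With Record<string, unknown>, property reads come back as `unknown`,
// so the caller has to narrow before using them.
function readToolName(value: unknown): string | undefined {
  if (!isPlainRecord(value)) {
    return undefined;
  }
  const name = value.toolName;
  return typeof name === "string" ? name : undefined;
}

console.log(isPlainRecord({ toolName: "search" })); // true
console.log(isPlainRecord(Object.create(null))); // true
console.log(isPlainRecord(["search"])); // false
console.log(isPlainRecord(new ToolCallRecord("search"))); // false
console.log(readToolName({ toolName: "search" })); // "search"
```

The `unknown` return type costs one extra `typeof` check per property read, which is the tighter narrowing the nit refers to.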
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
packages/core/src/agent/eval.ts (1)
202-205: ⚠️ Potential issue | 🟡 Minor

Coding guideline violation: use `safeStringify` instead of `JSON.stringify`.

Line 204 uses `JSON.stringify(expected)` directly. The codebase guideline requires using `safeStringify` from `@voltagent/internal/utils` for all serialization in `.ts` files.

Proposed fix

     const expected = metrics.combinedMetadata?.expected;
     if (expected) {
       attributes["eval.expected"] =
    -    typeof expected === "string" ? expected : JSON.stringify(expected);
    +    typeof expected === "string" ? expected : safeStringify(expected);
     }

As per coding guidelines:
`**/*.ts`: Never use JSON.stringify; use the `safeStringify` function instead, imported from `@voltagent/internal`
🧹 Nitpick comments (5)
packages/core/src/agent/eval.ts (2)
880-890: Unchecked type cast on deserialized metadata arrays.

`cloneUnknownArray<T>` casts the parsed JSON to `T[]` without any runtime shape validation. When called with `cloneUnknownArray<AgentEvalToolCall>(metadataRecord?.toolCalls)`, malformed objects in metadata will silently masquerade as `AgentEvalToolCall` items and could cause downstream scorer failures that are hard to trace.

Consider adding a minimal shape guard (e.g., checking for required fields like `toolName`) or at least documenting that callers must ensure the source data conforms.
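
One possible shape guard is sketched below. `ToolCallLike`, `isToolCallLike`, and `filterToolCalls` are hypothetical names, and the only field checked is `toolName`, since that is the one mentioned above; the real `AgentEvalToolCall` type may require more.

```typescript
// Hypothetical minimal shape; the real AgentEvalToolCall may carry more fields.
interface ToolCallLike {
  toolName: string;
  [key: string]: unknown;
}

// Accepts only non-array objects with a non-empty string toolName.
function isToolCallLike(value: unknown): value is ToolCallLike {
  if (typeof value !== "object" || value === null || Array.isArray(value)) {
    return false;
  }
  const candidate = (value as { toolName?: unknown }).toolName;
  return typeof candidate === "string" && candidate.length > 0;
}

// Drops malformed entries instead of casting them blindly.
// Non-array input yields an empty list.
function filterToolCalls(value: unknown): ToolCallLike[] {
  if (!Array.isArray(value)) {
    return [];
  }
  return value.filter(isToolCallLike);
}

const cleaned = filterToolCalls([
  { toolName: "search", args: { q: "weather" } },
  { name: "missing-tool-name" },
  null,
  "not-an-object",
]);
console.log(cleaned.length); // 1
```

Whether malformed entries should be dropped silently, logged, or rejected is a design choice for the maintainers; this sketch drops them so downstream scorers only ever see well-formed items.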
857-878: Potential duplicate messages when `oc.input` overlaps with `oc.conversationSteps`.

`buildEvalMessageChain` concatenates `normalizeMessageChainSource(oc.input)` with `normalizeConversationSteps(oc.conversationSteps)`. If the conversation steps already include the user's input message, the chain will contain duplicates. If this is intentional (input is always a separate prompt, not part of conversation history), a brief comment clarifying the contract would help future maintainers.

packages/scorers/src/tool-call-accuracy.ts (3)
60-154: Clean scorer factory with well-separated score and reason phases.

The `buildScorer` chain pattern is used correctly. The validation at lines 75–79 prevents misconfiguration. The mode determination logic is sound.

One subtle behavior: if a user provides both `expectedTool` and `expectedToolOrder`, the mode silently becomes `"tool_order"` and `expectedTool` is only included in the evaluation metadata; it doesn't influence scoring. Consider documenting this precedence, or throwing if both are supplied, to avoid confusion.
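
If the maintainers prefer the "throw if both are supplied" option, the check could look roughly like this. `ScorerOptionsSketch` and `resolveMode` are hypothetical names, and the `"single_tool"` mode label is an assumption; only `"tool_order"` is named in the comment above.

```typescript
// Hypothetical option shape covering only the two fields discussed above.
interface ScorerOptionsSketch {
  expectedTool?: string;
  expectedToolOrder?: string[];
}

// "single_tool" is an assumed label for the single-tool mode.
type ScorerMode = "single_tool" | "tool_order";

// Rejects ambiguous or empty configuration instead of silently picking a mode.
function resolveMode(options: ScorerOptionsSketch): ScorerMode {
  const hasTool =
    typeof options.expectedTool === "string" && options.expectedTool.length > 0;
  const hasOrder =
    Array.isArray(options.expectedToolOrder) &&
    options.expectedToolOrder.length > 0;

  if (hasTool && hasOrder) {
    throw new Error("Provide either expectedTool or expectedToolOrder, not both.");
  }
  if (!hasTool && !hasOrder) {
    throw new Error("Provide expectedTool or expectedToolOrder.");
  }
  return hasOrder ? "tool_order" : "single_tool";
}

console.log(resolveMode({ expectedTool: "search" })); // "single_tool"
console.log(resolveMode({ expectedToolOrder: ["search", "summarize"] })); // "tool_order"
```

Throwing is a breaking change for anyone already passing both options, so documenting the existing precedence may be the safer choice for this PR.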
227-268: Potential duplicate tool names from a single message in mixed-format payloads.

A message shaped like `{ type: "tool-call", toolName: "search", toolInvocations: [{ toolName: "search" }] }` would push `"search"` twice into `actualTools`: once from the direct extraction (lines 241–244) and once from `toolInvocations` (line 248). In `strict` single-tool mode, the `actualTools.length === 1` check at line 107 would then fail despite the correct tool being called.

This is unlikely with real SDK payloads, but could trip up users with custom `buildPayload` implementations. A simple guard (e.g., early `continue` after the `tool_call` branch, or deduplication at the call site) would harden strict mode.
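
The early-`continue` variant might look like the sketch below. `MessageLike` and `collectToolNames` are hypothetical and much simpler than the extraction logic in `tool-call-accuracy.ts`; the point is only that a message already counted through the direct branch does not also contribute its nested `toolInvocations`.

```typescript
// Hypothetical, simplified message shape for illustration only.
interface MessageLike {
  type?: string;
  toolName?: string;
  toolInvocations?: Array<{ toolName?: string }>;
}

function collectToolNames(messages: MessageLike[]): string[] {
  const actualTools: string[] = [];
  for (const message of messages) {
    if (message.type === "tool-call" && typeof message.toolName === "string") {
      actualTools.push(message.toolName);
      // The direct branch matched, so skip nested invocations of the same message.
      continue;
    }
    for (const invocation of message.toolInvocations ?? []) {
      if (typeof invocation.toolName === "string") {
        actualTools.push(invocation.toolName);
      }
    }
  }
  return actualTools;
}

const mixed: MessageLike[] = [
  { type: "tool-call", toolName: "search", toolInvocations: [{ toolName: "search" }] },
];
console.log(collectToolNames(mixed)); // [ 'search' ]
```

Unlike deduplicating the final list, this approach keeps legitimate repeated calls across separate messages, which matters for ordered mode.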
498-503: Shallow merge silently overwrites user-provided `voltAgent` metadata.

`mergeMetadata` uses a flat spread, so the hard-coded `voltAgent` key from `additional` (lines 87–93) will completely replace any `voltAgent` the user passes via the `metadata` option. For example:

    createToolCallAccuracyScorerCode({
      expectedTool: "search",
      metadata: { voltAgent: { team: "platform" }, custom: true },
    })
    // metadata.voltAgent.team is silently dropped

A shallow merge on the nested `voltAgent` key would preserve both sides:

Suggested fix

     function mergeMetadata(
       base: Record<string, unknown> | null | undefined,
       additional: Record<string, unknown>,
     ): Record<string, unknown> {
    -  return { ...base, ...additional };
    +  const merged = { ...base, ...additional };
    +  if (base && typeof base.voltAgent === "object" && base.voltAgent !== null &&
    +      typeof additional.voltAgent === "object" && additional.voltAgent !== null) {
    +    merged.voltAgent = { ...base.voltAgent as Record<string, unknown>, ...additional.voltAgent as Record<string, unknown> };
    +  }
    +  return merged;
     }
Summary by cubic
Adds tool-aware live-eval payloads (messages, toolCalls, toolResults) and a deterministic Tool Call Accuracy scorer. Adds an onToolError hook for custom tool error serialization and improves observability.
Written for commit 596c0ad. Summary will update on new commits.