KalmiaKalmia
Back to resources

Guide

Understanding Variant Metrics

What each metric means and how to use them to improve your agent.

Core metrics

These metrics are computed for every trace and aggregated per variant with min, max, and median distributions.

Duration

Total wall-clock time from the first span to the last. Measures end-to-end latency of the agent run.

Total tokens

Sum of all prompt and completion tokens across every LLM call in the trace.

Input tokens

Prompt tokens sent to the LLM. High input tokens may indicate large context windows or excessive conversation history.

Output tokens

Completion tokens received from the LLM. High output tokens may indicate verbose responses or unnecessary generation.

Turns

Number of LLM conversation turns in the trace. Each assistant response counts as one turn.

Agent-specific metrics

These metrics are optimized for coding agents and multi-step tool-using agents.

Tool calls

Total number of tool invocations. Broken down by tool name in the trace detail view.

Tool errors

Number of tool calls that returned an error. High error rates suggest misconfigured tools or incorrect arguments.

Retries

Number of times the agent retried the same or similar operation. Frequent retries indicate the agent is stuck.

Backtracks

Number of times the agent undid or reverted a previous action. Suggests the agent is exploring without a clear plan.

Files read

Number of unique files the agent read during the run. Useful for understanding context-gathering behavior.

Subagent spawns

Number of times the agent delegated work to a sub-agent or subprocess.

Failure tags

Traces are automatically tagged with failure patterns during preprocessing. These help you surface problems without reading every trace.

permission_deniedAgent encountered permission or access errors
file_not_foundAgent referenced files that don't exist
high_token_usageToken count exceeds normal range for the variant
excessive_retriesAgent retried the same operation too many times
hallucinationAgent generated references to non-existent code or files

Reading the comparison table

In the variant comparison table, select two variants to see delta badges. A green delta means the selected variant is better (lower latency, fewer errors). A red delta means it's worse. Use these to quickly identify which configuration performs best.

Tips