Understanding Variant Metrics
What each metric means and how to use them to improve your agent.
Core metrics
These metrics are computed for every trace and aggregated per variant as minimum, median, and maximum values.
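As a rough illustration of per-variant aggregation, the sketch below groups one metric by variant and summarizes it as min, median, and max. The record shape and the field names (`variant`, `duration_s`, `total_tokens`) are assumptions made for this example, not the product's actual schema.

```python
from statistics import median

# Hypothetical per-trace metric records; field names are assumptions.
traces = [
    {"variant": "baseline", "duration_s": 12.0, "total_tokens": 4200},
    {"variant": "baseline", "duration_s": 15.5, "total_tokens": 5100},
    {"variant": "baseline", "duration_s": 48.0, "total_tokens": 19000},
    {"variant": "candidate", "duration_s": 9.0, "total_tokens": 3900},
    {"variant": "candidate", "duration_s": 11.0, "total_tokens": 4000},
]


def aggregate_by_variant(records, metric):
    """Group one metric by variant and summarize it as min, median, and max."""
    grouped = {}
    for record in records:
        grouped.setdefault(record["variant"], []).append(record[metric])
    return {
        variant: {"min": min(values), "median": median(values), "max": max(values)}
        for variant, values in grouped.items()
    }


summary = aggregate_by_variant(traces, "duration_s")
print(summary)
```

Note how the single slow baseline trace (48.0 s) moves the maximum but leaves the median at 15.5 s.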
Duration
Total wall-clock time from the first span to the last. Measures end-to-end latency of the agent run.
Total tokens
Sum of all prompt and completion tokens across every LLM call in the trace.
Input tokens
Prompt tokens sent to the LLM. A high input token count may indicate oversized prompts or excessive conversation history.
Output tokens
Completion tokens received from the LLM. A high output token count may indicate verbose responses or unnecessary generation.
Turns
Number of LLM conversation turns in the trace. Each assistant response counts as one turn.
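The core metrics above could be derived from a trace's spans roughly as follows. This is a minimal sketch: the `Span` class and its fields are hypothetical, duration is taken as earliest span start to latest span end, and each LLM span is assumed to correspond to one assistant response.

```python
from dataclasses import dataclass


@dataclass
class Span:
    # Hypothetical span shape; the real trace format may differ.
    kind: str            # "llm" or "tool"
    start: float         # start time in seconds
    end: float           # end time in seconds
    input_tokens: int = 0
    output_tokens: int = 0


def core_metrics(spans):
    """Compute duration, token counts, and turns for one trace."""
    if not spans:
        return {"duration": 0.0, "input_tokens": 0, "output_tokens": 0,
                "total_tokens": 0, "turns": 0}
    llm_spans = [s for s in spans if s.kind == "llm"]
    input_tokens = sum(s.input_tokens for s in llm_spans)
    output_tokens = sum(s.output_tokens for s in llm_spans)
    return {
        # Assumption: wall-clock time from earliest start to latest end.
        "duration": max(s.end for s in spans) - min(s.start for s in spans),
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
        "total_tokens": input_tokens + output_tokens,
        # Assumption: one LLM span per assistant response.
        "turns": len(llm_spans),
    }


example_trace = [
    Span("llm", 0.0, 2.0, input_tokens=100, output_tokens=20),
    Span("tool", 2.0, 3.5),
    Span("llm", 3.5, 5.0, input_tokens=150, output_tokens=30),
]
metrics = core_metrics(example_trace)
print(metrics)
```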
Agent-specific metrics
These metrics are tailored to coding agents and multi-step tool-using agents.
Tool calls
Total number of tool invocations. Broken down by tool name in the trace detail view.
Tool errors
Number of tool calls that returned an error. High error rates suggest misconfigured tools or incorrect arguments.
Retries
Number of times the agent retried the same or similar operation. Frequent retries indicate the agent is stuck.
Backtracks
Number of times the agent undid or reverted a previous action. Suggests the agent is exploring without a clear plan.
Files read
Number of unique files the agent read during the run. Useful for understanding context-gathering behavior.
Subagent spawns
Number of times the agent delegated work to a sub-agent or subprocess.
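To make a few of these definitions concrete, here is a small sketch that counts tool calls, tool errors, and unique files read from a list of tool-call records. The record shape and the tool names (`read_file`, `run_tests`, `edit_file`) are invented for illustration. Retries and backtracks are left out because the rules used to detect them are not described here.

```python
from collections import Counter

# Hypothetical tool-call records; the shape and tool names are assumptions.
tool_calls = [
    {"tool": "read_file", "args": {"path": "src/app.py"}, "error": False},
    {"tool": "read_file", "args": {"path": "src/app.py"}, "error": False},
    {"tool": "read_file", "args": {"path": "tests/test_app.py"}, "error": False},
    {"tool": "run_tests", "args": {}, "error": True},
    {"tool": "edit_file", "args": {"path": "src/app.py"}, "error": False},
]


def agent_metrics(calls, read_tool="read_file"):
    """Summarize tool usage for one trace."""
    errors = sum(1 for call in calls if call["error"])
    # A set keeps each path once, so repeated reads of a file count once.
    files = {call["args"]["path"] for call in calls if call["tool"] == read_tool}
    return {
        "tool_calls": len(calls),
        "tool_calls_by_name": dict(Counter(call["tool"] for call in calls)),
        "tool_errors": errors,
        "tool_error_rate": errors / len(calls) if calls else 0.0,
        "files_read": len(files),
    }


stats = agent_metrics(tool_calls)
print(stats)
```

In this example `src/app.py` is read twice but counted once, which matches the "unique files" definition of Files read.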
Failure tags
Traces are automatically tagged with failure patterns during preprocessing. These help you surface problems without reading every trace.
Reading the comparison table
In the variant comparison table, select two variants to see delta badges. A green delta means the selected variant is better on that metric (for example, lower latency or fewer errors). A red delta means it is worse. Use these badges to quickly identify which configuration performs best.
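The badge logic can be pictured with a sketch like the one below. The function name, the `lower_is_better` flag, and the "neutral" color for a zero delta are assumptions for illustration; only the green and red meanings come from the description above.

```python
def delta_badge(baseline, selected, lower_is_better=True):
    """Return the delta between two variant values and a badge color.

    For metrics such as latency or error count, a lower value is better,
    so a negative delta is shown as green.
    """
    delta = selected - baseline
    if delta == 0:
        # Assumption: equal values get a neutral badge.
        return {"delta": delta, "color": "neutral"}
    improved = delta < 0 if lower_is_better else delta > 0
    return {"delta": delta, "color": "green" if improved else "red"}


latency_badge = delta_badge(baseline=12.0, selected=9.5)
error_badge = delta_badge(baseline=3, selected=5)
print(latency_badge, error_badge)
```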
Tips
- Metrics are recomputed from scratch every time you view an experiment. There is no caching, so you always see the latest data.
- Look at the median rather than the average for metrics like duration and tokens, because outliers can skew averages significantly.
- A high retry count combined with high token usage is a strong signal that an agent variant needs improvement.
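The following sketch illustrates two of the tips above: how one outlier distorts the mean but not the median, and one possible way to flag a variant whose retries and token usage are both elevated. The sample values, the `needs_attention` helper, and the 1.5x threshold are hypothetical choices, not product defaults.

```python
from statistics import mean, median

# Hypothetical durations in seconds; one run is a clear outlier.
durations = [11.0, 12.0, 12.5, 13.0, 95.0]
avg_duration = mean(durations)       # pulled upward by the 95.0 s run
median_duration = median(durations)  # stays near the typical run


def needs_attention(variant, baseline, factor=1.5):
    """Flag a variant whose median retries AND median tokens both exceed
    the baseline by the given factor (threshold is an assumption)."""
    return (
        variant["median_retries"] > factor * baseline["median_retries"]
        and variant["median_tokens"] > factor * baseline["median_tokens"]
    )


baseline = {"median_retries": 2, "median_tokens": 4000}
candidate = {"median_retries": 5, "median_tokens": 9000}
flagged = needs_attention(candidate, baseline)
print(avg_duration, median_duration, flagged)
```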

