Kalmia

Rapidly iterate on your agent.

Find, fix, and analyze its problematic behaviors.

Trace a7e2c841-3f19-4d6b-8a05-e91c7b2d0f38
Code Review - agent-v2.4 (gpt-4o)
Metadata: correlationId=run-20250210-review-4821, max_tokens=16000, model=gpt-4o, temperature=0
1 issue detected
Contents (108)
System: "You are Claude Code, Anth..."
User: "Please review the following..."
Assistant: "I'll review this pull request ..."
User: "Tool result"
Assistant: "Let me start by examining t..."
User: "Tool result"
Assistant: "Tool: Read"
User: "Tool result"
System

You are Claude Code, Anthropic's official CLI for Claude. You are an interactive CLI tool that helps users with software engineering tasks.

Use the instructions below and the tools available to you to assist the user. IMPORTANT: You must NEVER generate or guess URLs...

If the user asks for help or wants to give feedback inform them of the following: /help: Get help with using Claude Code...

Trace Inspection

Traces are compressed and analyzed by our internal agent before you read a single message. Get a summary of the full agent journey, key decisions, and issues up front.

User -> LLM -> Tool: Read -> LLM -> Tool: Edit -> Tool: Bash -> err -> LLM

Experiments

Compare variants side by side across prompts, harnesses, or config changes.

Success, Latency, Coverage, Efficiency: v2.3 vs. v2.4

Surface Problematic Traces

Automatically surface runs that deviate from normal patterns, so you don't have to review every trace.
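
As a rough illustration of what "deviating from normal patterns" could mean, the sketch below flags runs whose value for a single metric lies more than a chosen number of standard deviations from the mean. This is not Kalmia's actual method, which the page does not describe; the function name, the per-run metric dictionary, and the default threshold are all hypothetical.

```python
from statistics import mean, pstdev
from typing import Dict, List


def flag_outlier_runs(metric_by_run: Dict[str, float], threshold: float = 2.0) -> List[str]:
    """Return run IDs whose metric is more than `threshold` standard deviations from the mean.

    Hypothetical helper: a plain z-score check, used here only for illustration.
    """
    values = list(metric_by_run.values())
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # All runs are identical, so nothing deviates.
        return []
    return [run for run, v in metric_by_run.items() if abs(v - mu) / sigma > threshold]


# Example: nine runs made 10 tool calls each, one run made 50.
tool_calls = {f"run-{i}": 10.0 for i in range(9)}
tool_calls["run-9"] = 50.0
flagged = flag_outlier_runs(tool_calls)
```

A z-score is only one simple choice; heavily skewed metrics such as latency may call for a median-based rule instead.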


Variant Metrics

We're heavily optimized to support custom metrics for coding agents. Track tool errors, retries, subagent spins, files read, coverage, behavior patterns, and more across every variant you ship.

Tokens, Latency, Tool calls, Errors

Behavior Detection

Define behaviors in plain language and let our agent autonomously find traces with these behaviors.

retry_loop, permission_denied, successful_edit, hallucination, good_tool_use
"flag traces where the agent retries the same tool more than 3 times"
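
To make the example behavior concrete, here is a minimal sketch of how such a rule might be checked by hand. It assumes a trace can be reduced to an ordered list of tool names, and it reads "retries the same tool more than 3 times" as more than three consecutive calls to the same tool. Both the data shape and that reading are assumptions; the page does not say how Kalmia evaluates behaviors.

```python
from typing import List


def has_retry_loop(tool_calls: List[str], max_repeats: int = 3) -> bool:
    """Return True if any tool appears more than `max_repeats` times in a row.

    Hypothetical helper: `tool_calls` is the ordered list of tool names in a trace.
    """
    run_length = 0
    previous = None
    for tool in tool_calls:
        # Extend the current run if the tool repeats, otherwise start a new run.
        run_length = run_length + 1 if tool == previous else 1
        previous = tool
        if run_length > max_repeats:
            return True
    return False


# Example: four Bash calls in a row are flagged, an interleaved sequence is not.
looping = has_retry_loop(["Read", "Bash", "Bash", "Bash", "Bash"])
healthy = has_retry_loop(["Read", "Edit", "Bash", "Read", "Bash", "Edit", "Bash"])
```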

Production Monitoring

Ingest production traces in real time and catch regressions before your users do.

3 active traces | throughput: 42/min | p99: 3.2s | error rate: 1.2%

Understand your agents.

Set up in minutes.

Request Demo