The platform
Debug your agents before your users do.
Kalmia captures every LLM call, tool invocation, and decision your agent makes — then gives you the tools to compare variants, detect patterns, and ship with confidence.
How it works
Three steps from agent run to insight.
Instrument
Install the SDK and wrap your LLM client. Every call is auto-captured — no manual logging needed.
from kalmia_sdk import init_logger, wrap_anthropic, traced
import anthropic

init_logger(project_name="my-agent")
client = wrap_anthropic(anthropic.Anthropic())

@traced(name="agent-run")
def run(prompt):
    return client.messages.create(...)
Compare
Group traces into experiments and compare variants side by side — different prompts, models, or tool configurations.
# Run variant A
@traced(name="gpt4-with-rag")
def variant_a(prompt):
    ...

# Run variant B
@traced(name="claude-no-rag")
def variant_b(prompt):
    ...

# Both appear in the same experiment
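To make the grouping idea concrete, here is a minimal sketch of how a tracing decorator could record each run under its variant name. It is not Kalmia's implementation: `traced_sketch`, `RUNS`, and the record fields are hypothetical names invented for illustration, and everything stays in memory.

```python
import functools
import time
from collections import defaultdict

# Hypothetical in-memory store: variant name -> list of run records.
RUNS = defaultdict(list)

def traced_sketch(name):
    """Illustrative stand-in for a tracing decorator (not the Kalmia SDK)."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            error = None
            try:
                return fn(*args, **kwargs)
            except Exception as exc:
                error = repr(exc)
                raise
            finally:
                # Record the run whether it succeeded or failed.
                RUNS[name].append({
                    "duration_s": time.perf_counter() - start,
                    "error": error,
                })
        return wrapper
    return decorator

@traced_sketch(name="variant-a")
def variant_a(prompt):
    return prompt.upper()

@traced_sketch(name="variant-b")
def variant_b(prompt):
    return prompt.lower()

variant_a("Hello")
variant_b("Hello")
variant_b("Again")
```

After these calls, `RUNS` holds one record under `variant-a` and two under `variant-b`, which is the shape a side-by-side comparison needs.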
Analyze
Kalmia preprocesses every trace, computes metrics, and detects failure patterns automatically.
// Kalmia computes per-variant:
→ avg duration, tokens, turns
→ tool call breakdown
→ error rate, retry count
→ failure tag detection
→ behavior annotations
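The per-variant figures listed above can be approximated with a few lines of standard-library Python. This is a rough sketch under assumed inputs, not Kalmia's pipeline: the trace records, their field names, and the `summarize` helper are all hypothetical.

```python
from collections import Counter
from statistics import mean

# Hypothetical trace records; the field names are assumptions for illustration.
traces = [
    {"variant": "a", "duration_s": 12.0, "tokens": 1800, "error": False,
     "tools": ["read_file", "grep"]},
    {"variant": "a", "duration_s": 20.0, "tokens": 2600, "error": True,
     "tools": ["read_file"]},
    {"variant": "b", "duration_s": 9.0, "tokens": 1500, "error": False,
     "tools": ["grep", "grep"]},
]

def summarize(traces):
    """Group traces by variant and compute simple aggregate metrics."""
    by_variant = {}
    for t in traces:
        by_variant.setdefault(t["variant"], []).append(t)

    summary = {}
    for variant, group in by_variant.items():
        summary[variant] = {
            "runs": len(group),
            "avg_duration_s": mean(t["duration_s"] for t in group),
            "avg_tokens": mean(t["tokens"] for t in group),
            # Share of runs in this variant that ended in an error.
            "error_rate": sum(t["error"] for t in group) / len(group),
            # How often each tool was called across the variant's runs.
            "tool_calls": Counter(tool for t in group for tool in t["tools"]),
        }
    return summary

summary = summarize(traces)
```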
Features
Everything you need to understand agent behavior.
Trace Inspection
Every agent run is broken down into a timeline of LLM calls, tool invocations, and decisions. Kalmia's internal agent compresses and analyzes each trace before you read a single message, so you start with a summary of the full journey, key decisions, and issues.
Experiment Comparison
Group traces into experiments and compare variants side by side. See how different prompts, models, or configurations affect success rate, latency, token usage, and tool call patterns across every run.
Variant Metrics
Heavily optimized for coding agents. Track tool errors, retries, subagent spawns, files read, backtrack count, behavior patterns, and more — aggregated per variant with min/max/median distributions.
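As an illustration of a min/max/median distribution, the sketch below summarizes one metric per variant. The retry counts and the `distribution` helper are hypothetical, and the code only shows the arithmetic, not how Kalmia stores or computes its metrics.

```python
from statistics import median

# Hypothetical per-run retry counts, keyed by variant name.
retries_by_variant = {
    "variant-a": [0, 2, 1, 5],
    "variant-b": [0, 0, 1],
}

def distribution(values):
    """Return the min, median, and max of a non-empty list of numbers."""
    return {"min": min(values), "median": median(values), "max": max(values)}

distributions = {
    name: distribution(values) for name, values in retries_by_variant.items()
}
```

The median is less sensitive than the mean to a single runaway run, which is one reason to report it alongside the extremes.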
Behavior Detection
Define behaviors in plain language like 'agent retries the same tool 3+ times' and Kalmia's internal agent will autonomously scan your traces and tag matches with confidence scores.
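Kalmia matches behaviors from plain-language definitions, so no code is needed. Still, a deterministic check helps pin down what the example behavior means. The sketch below assumes "retries the same tool 3+ times" means three or more consecutive calls to one tool; the `repeated_tool_runs` helper and the list-of-names input are hypothetical, and it produces no confidence score.

```python
from itertools import groupby

def repeated_tool_runs(tool_calls, threshold=3):
    """Return (tool, run_length) for each run of consecutive calls
    to the same tool that is at least `threshold` calls long."""
    matches = []
    for tool, run in groupby(tool_calls):
        length = sum(1 for _ in run)
        if length >= threshold:
            matches.append((tool, length))
    return matches

# Hypothetical ordered tool-call names from a single trace.
calls = ["read_file", "run_tests", "run_tests", "run_tests",
         "edit_file", "run_tests"]
matches = repeated_tool_runs(calls)
```

Here only the three back-to-back `run_tests` calls qualify; the later, isolated `run_tests` call does not extend the run.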
Failure Tag Detection
Traces are automatically tagged with failure patterns: permission denied, file not found, high token usage, excessive retries, hallucinations. Surface problems without reading every trace.
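A simplified, rule-based version of this tagging might look like the sketch below. The patterns, thresholds, tag names, and trace fields are all assumptions made for illustration, not Kalmia's rules, and a tag such as hallucinations would need more than pattern matching.

```python
import re

# Hypothetical rules: tag name -> pattern searched in a trace's error text.
ERROR_PATTERNS = {
    "permission_denied": re.compile(r"permission denied", re.IGNORECASE),
    "file_not_found": re.compile(r"no such file|file not found", re.IGNORECASE),
}
# Hypothetical thresholds; sensible values depend on the agent and model.
TOKEN_LIMIT = 100_000
RETRY_LIMIT = 5

def failure_tags(trace):
    """Return the sorted list of failure tags that apply to one trace dict."""
    tags = set()
    for message in trace.get("errors", []):
        for tag, pattern in ERROR_PATTERNS.items():
            if pattern.search(message):
                tags.add(tag)
    if trace.get("tokens", 0) > TOKEN_LIMIT:
        tags.add("high_token_usage")
    if trace.get("retries", 0) > RETRY_LIMIT:
        tags.add("excessive_retries")
    return sorted(tags)

example = {
    "errors": ["open('/etc/shadow'): Permission denied"],
    "tokens": 150_000,
    "retries": 2,
}
tags = failure_tags(example)
```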
AI-Powered Trace Analysis
Kalmia's internal agent reads and compresses your traces before you do — surfacing a summary of the full journey, key decisions, tool usage patterns, and potential issues so you can skip straight to what matters.

