The platform

Debug your agents before your users do.

Kalmia captures every LLM call, tool invocation, and decision your agent makes — then gives you the tools to compare variants, detect patterns, and ship with confidence.

How it works

Three steps from agent run to insight.

Instrument

Install the SDK and wrap your LLM client. Every call is auto-captured — no manual logging needed.

from kalmia_sdk import init_logger, wrap_anthropic, traced
import anthropic

init_logger(project_name="my-agent")
client = wrap_anthropic(anthropic.Anthropic())

@traced(name="agent-run")
def run(prompt):
    return client.messages.create(...)

Compare

Group traces into experiments and compare variants side by side — different prompts, models, or tool configurations.

# Variant A
@traced(name="gpt4-with-rag")
def variant_a(prompt): ...

# Variant B
@traced(name="claude-no-rag")
def variant_b(prompt): ...

# Both appear in the same experiment

Analyze

Kalmia preprocesses every trace, computes metrics, and detects failure patterns automatically.

// Kalmia computes per-variant:
→ avg duration, tokens, turns
→ tool call breakdown
→ error rate, retry count
→ failure tag detection
→ behavior annotations
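
To make the per-variant numbers concrete, here is a minimal sketch of how averages and an error rate could be computed from trace records. It is an illustration only, not Kalmia's implementation: the record fields (`variant`, `duration_s`, `tokens`, `turns`, `error`) and the sample values are assumptions made up for this example.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical trace records. The field names and values are assumptions
# for illustration, not Kalmia's actual trace schema.
traces = [
    {"variant": "gpt4-with-rag", "duration_s": 12.0, "tokens": 3000, "turns": 4, "error": False},
    {"variant": "gpt4-with-rag", "duration_s": 18.0, "tokens": 5000, "turns": 6, "error": True},
    {"variant": "claude-no-rag", "duration_s": 9.0, "tokens": 2000, "turns": 3, "error": False},
]


def summarize_by_variant(traces):
    """Group traces by variant and compute simple per-variant aggregates."""
    grouped = defaultdict(list)
    for t in traces:
        grouped[t["variant"]].append(t)

    summary = {}
    for variant, runs in grouped.items():
        summary[variant] = {
            "runs": len(runs),
            "avg_duration_s": mean(r["duration_s"] for r in runs),
            "avg_tokens": mean(r["tokens"] for r in runs),
            "avg_turns": mean(r["turns"] for r in runs),
            # True counts as 1, so this is the fraction of runs that errored.
            "error_rate": sum(r["error"] for r in runs) / len(runs),
        }
    return summary


summary = summarize_by_variant(traces)
for variant, stats in summary.items():
    print(variant, stats)
```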

Features

Everything you need to understand agent behavior.

Trace Inspection

Every agent run is broken down into a timeline of LLM calls, tool invocations, and decisions. Kalmia's internal agent compresses and analyzes each trace before you read a single message, so you start with a summary of the full journey, key decisions, and issues.

Experiment Comparison

Group traces into experiments and compare variants side by side. See how different prompts, models, or configurations affect success rate, latency, token usage, and tool call patterns across every run.

Variant Metrics

Heavily optimized for coding agents. Track tool errors, retries, subagent spawns, files read, backtrack count, behavior patterns, and more — aggregated per variant with min/max/median distributions.
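
As a rough sketch of what a min/max/median distribution per variant looks like, the example below aggregates a single metric (tool errors per run). The variant names are borrowed from the Compare step; the counts are invented sample data, and the code is not taken from Kalmia.

```python
from statistics import median

# Hypothetical tool-error counts per run, keyed by variant name (made-up data).
tool_errors = {
    "gpt4-with-rag": [0, 2, 1, 5, 1],
    "claude-no-rag": [0, 0, 3, 1],
}


def distribution(values):
    """Return the min, max, and median of a non-empty list of numbers."""
    return {"min": min(values), "max": max(values), "median": median(values)}


per_variant = {name: distribution(vals) for name, vals in tool_errors.items()}
for name, dist in per_variant.items():
    print(name, dist)
```

A median is less sensitive than an average to a single outlier run, which is one reason to report it alongside the min and max.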

Behavior Detection

Define behaviors in plain language, such as 'agent retries the same tool 3+ times', and Kalmia's internal agent autonomously scans your traces and tags matches with confidence scores.
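
For intuition, the example behavior above can also be expressed as a hand-written rule. The sketch below is a deterministic stand-in, not Kalmia's agent-based detection: it treats back-to-back calls to the same tool as "retries" (an assumption), takes a plain list of tool names as input (a hypothetical format), and does not produce confidence scores.

```python
from itertools import groupby


def max_consecutive_calls(tool_calls):
    """Return (tool_name, run_length) for the longest run of back-to-back calls to one tool."""
    best = (None, 0)
    for name, run in groupby(tool_calls):
        length = sum(1 for _ in run)
        if length > best[1]:
            best = (name, length)
    return best


def retries_same_tool(tool_calls, threshold=3):
    """True if any tool is called `threshold` or more times in a row."""
    return max_consecutive_calls(tool_calls)[1] >= threshold


# Hypothetical ordered list of tool names from one agent run.
calls = ["read_file", "run_tests", "run_tests", "run_tests", "edit_file"]
print(max_consecutive_calls(calls), retries_same_tool(calls))
```

A fixed rule like this only matches the exact pattern it encodes, which is the gap that plain-language behavior definitions are meant to close.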

Failure Tag Detection

Traces are automatically tagged with failure patterns: permission denied, file not found, high token usage, excessive retries, hallucinations. Surface problems without reading every trace.
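
A minimal sketch of rule-based failure tagging is shown below, assuming a trace is a dictionary with `tool_errors`, `total_tokens`, and `retries` fields. The field names, substring patterns, and thresholds are all hypothetical and chosen for illustration; how Kalmia actually assigns tags is not described here.

```python
# Hypothetical rules: the substring patterns and thresholds are assumptions.
ERROR_PATTERNS = {
    "permission_denied": "permission denied",
    "file_not_found": "no such file or directory",
}
TOKEN_LIMIT = 50_000
RETRY_LIMIT = 3


def failure_tags(trace):
    """Return a sorted list of failure tags that apply to one trace."""
    tags = set()
    for message in trace["tool_errors"]:
        lowered = message.lower()
        for tag, pattern in ERROR_PATTERNS.items():
            if pattern in lowered:
                tags.add(tag)
    if trace["total_tokens"] > TOKEN_LIMIT:
        tags.add("high_token_usage")
    if trace["retries"] > RETRY_LIMIT:
        tags.add("excessive_retries")
    return sorted(tags)


# Hypothetical trace summary with made-up values.
trace = {
    "tool_errors": [
        "bash: ./deploy.sh: Permission denied",
        "cat: notes.txt: No such file or directory",
    ],
    "total_tokens": 72_000,
    "retries": 1,
}
print(failure_tags(trace))
```

Substring and threshold rules cover patterns such as permission errors or token overuse; a tag like hallucinations cannot be detected this way and needs a check on the content of the model's output.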

AI-Powered Trace Analysis

Kalmia's internal agent reads and compresses your traces before you do — surfacing a summary of the full journey, key decisions, tool usage patterns, and potential issues so you can skip straight to what matters.