Guide
Comparing Agent Variants
Set up A/B experiments to compare different prompts, models, or tool configurations for your agent.
What is a variant?
A variant is a specific configuration of your agent — a particular prompt, model, temperature, or set of tools. When you run the same task with different configurations, each run becomes a variant in your experiment.
Kalmia groups traces by the variant field in their metadata and computes per-variant metrics, so you can compare variants side by side.
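To make the grouping step concrete, here is a minimal sketch of how traces could be bucketed by their variant field and averaged per variant. It is not Kalmia's actual implementation: the trace shape (a dict with "metadata" and "metrics" keys), the metric names latency_ms and score, and the function per_variant_metrics are all assumptions made for illustration.

```python
from collections import defaultdict
from statistics import mean


def per_variant_metrics(traces):
    """Group trace dicts by metadata["variant"] and average each numeric metric.

    The trace layout used here is hypothetical, not Kalmia's real schema.
    """
    grouped = defaultdict(list)
    for trace in traces:
        variant = trace.get("metadata", {}).get("variant")
        if variant is None:
            continue  # untagged traces cannot be attributed to a variant
        grouped[variant].append(trace.get("metrics", {}))

    summary = {}
    for variant, metric_dicts in grouped.items():
        # Collect every metric name seen for this variant.
        names = {name for metrics in metric_dicts for name in metrics}
        summary[variant] = {
            name: mean(m[name] for m in metric_dicts if name in m)
            for name in sorted(names)
        }
        summary[variant]["trace_count"] = len(metric_dicts)
    return summary


# Illustrative data only; the metric names and values are made up.
traces = [
    {"metadata": {"variant": "gpt4o-with-rag"}, "metrics": {"latency_ms": 1200, "score": 1.0}},
    {"metadata": {"variant": "gpt4o-with-rag"}, "metrics": {"latency_ms": 1000, "score": 0.5}},
    {"metadata": {"variant": "claude-no-rag"}, "metrics": {"latency_ms": 900, "score": 0.5}},
]
summary = per_variant_metrics(traces)
```

Traces without a variant tag are skipped in this sketch, which is one reason to tag every run you want to appear in a comparison.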
Setting up an experiment
Run your agent with different configurations, tagging each run with a variant name in the metadata.
# Variant A: GPT-4o with RAG
with logger.start_span(name="agent-run") as span:
    span.log(metadata={
        "correlationId": "run-gpt4-rag-001",
        "variant": "gpt4o-with-rag",
    })

# Variant B: Claude with no RAG
with logger.start_span(name="agent-run") as span:
    span.log(metadata={
        "correlationId": "run-claude-norag-001",
        "variant": "claude-no-rag",
    })

Register the experiment
Group both variants into the same experiment.
curl -X POST /api/experiments \
  -H "Content-Type: application/json" \
  -d '{
    "name": "RAG vs no-RAG",
    "correlationIds": [
      "run-gpt4-rag-001",
      "run-claude-norag-001"
    ]
  }'

What you see in the dashboard
Kalmia preprocesses each trace and groups the traces by variant. The variant comparison table shows the per-variant metrics side by side.
Select any two variants in the comparison table to see delta badges highlighting which variant performs better for each metric.
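As a rough illustration of what a delta badge conveys, the sketch below compares two per-variant metric summaries and labels the winner for each metric. The function metric_deltas, the metric names, and the lower_is_better convention are assumptions for this example; the dashboard's actual rules for deciding which variant is "better" are not described here.

```python
def metric_deltas(baseline, candidate, lower_is_better=()):
    """Return {metric: (delta, better)} for metrics present in both variants.

    delta is candidate minus baseline. Metrics listed in lower_is_better
    (for example latency) count a negative delta as a win for the candidate.
    """
    deltas = {}
    for name in sorted(baseline.keys() & candidate.keys()):
        delta = candidate[name] - baseline[name]
        if delta == 0:
            better = "tie"
        else:
            candidate_wins = (delta < 0) if name in lower_is_better else (delta > 0)
            better = "candidate" if candidate_wins else "baseline"
        deltas[name] = (delta, better)
    return deltas


# Hypothetical per-variant averages, not real results.
baseline = {"latency_ms": 1100, "score": 0.75}
candidate = {"latency_ms": 900, "score": 0.5}
deltas = metric_deltas(baseline, candidate, lower_is_better={"latency_ms"})
```

In this made-up example the candidate is faster but scores lower, which is the kind of trade-off a per-metric comparison makes visible.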
Tips
- Run multiple traces per variant to get meaningful aggregate metrics.
- Use the same input prompts across variants for a fair comparison.
- Add more runs to an experiment at any time; metrics are recomputed each time you view the experiment.
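The first two tips can be combined into a simple run plan: every variant gets the same input prompts, repeated several times. The sketch below only builds the plan and the registration payload; it does not call the agent or the API. The helper names build_run_plan and experiment_payload and the correlationId naming scheme are hypothetical, while the payload keys name and correlationIds mirror the request body shown earlier.

```python
from itertools import product


def build_run_plan(variants, inputs, repeats=3):
    """Pair every variant with every input prompt, `repeats` times each."""
    plan = []
    for variant, (input_index, prompt), repeat in product(
        variants, enumerate(inputs), range(repeats)
    ):
        plan.append({
            "prompt": prompt,
            "metadata": {
                # Hypothetical ID scheme: any unique string per run would do.
                "correlationId": f"run-{variant}-{input_index:03d}-{repeat:02d}",
                "variant": variant,
            },
        })
    return plan


def experiment_payload(name, plan):
    """Build a JSON-serializable body listing every planned run's correlationId."""
    return {
        "name": name,
        "correlationIds": [run["metadata"]["correlationId"] for run in plan],
    }


plan = build_run_plan(
    variants=["gpt4o-with-rag", "claude-no-rag"],
    inputs=["Summarize the refund policy.", "List the supported file formats."],
    repeats=3,
)
payload = experiment_payload("RAG vs no-RAG", plan)
```

Each entry's metadata can be passed to span.log when the corresponding run executes, so the logged traces match the IDs registered with the experiment.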

