
Comparing Agent Variants

Set up A/B experiments to compare different prompts, models, or tool configurations across your agent.

What is a variant?

A variant is a specific configuration of your agent — a particular prompt, model, temperature, or set of tools. When you run the same task with different configurations, each run becomes a variant in your experiment.

Kalmia groups traces by the variant field in their metadata and computes per-variant metrics, letting you compare side by side.
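To make the grouping step concrete, here is a minimal sketch of bucketing traces by their variant field and computing a few per-variant aggregates. It is not Kalmia's implementation: the trace shape (`duration_ms`, `error`) and the helper names `group_by_variant` and `summarize` are assumptions made for illustration.

```python
from collections import defaultdict
from statistics import mean


def group_by_variant(traces):
    """Bucket trace dicts by metadata['variant'] (hypothetical trace shape)."""
    groups = defaultdict(list)
    for trace in traces:
        # Traces without a variant tag land in a catch-all bucket.
        variant = trace.get("metadata", {}).get("variant", "untagged")
        groups[variant].append(trace)
    return dict(groups)


def summarize(groups):
    """Compute simple per-variant aggregates from grouped traces."""
    summary = {}
    for variant, runs in groups.items():
        durations = [run["duration_ms"] for run in runs]
        summary[variant] = {
            "runs": len(runs),
            "avg_duration_ms": mean(durations),
            "min_duration_ms": min(durations),
            "max_duration_ms": max(durations),
            "error_rate": sum(1 for run in runs if run["error"]) / len(runs),
        }
    return summary


# Illustrative data only; the field names are assumptions.
traces = [
    {"metadata": {"variant": "gpt4o-with-rag"}, "duration_ms": 1200, "error": False},
    {"metadata": {"variant": "gpt4o-with-rag"}, "duration_ms": 1800, "error": True},
    {"metadata": {"variant": "claude-no-rag"}, "duration_ms": 900, "error": False},
]
summary = summarize(group_by_variant(traces))
```

Each key of `summary` corresponds to one variant, so two variants can be read side by side from the same dictionary.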

Setting up an experiment

Run your agent with different configurations, tagging each run with a variant name in the metadata.

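# `logger` is assumed to be your already-initialized tracing logger.

# Variant A: GPT-4o with RAG
with logger.start_span(name="agent-run") as span:
    # Tag the run so it can be grouped under this variant.
    span.log(metadata={
        "correlationId": "run-gpt4-rag-001",
        "variant": "gpt4o-with-rag",
    })
    # ... run the agent with the GPT-4o + RAG configuration here ...

# Variant B: Claude without RAG
with logger.start_span(name="agent-run") as span:
    # Same span name, different variant tag and correlation ID.
    span.log(metadata={
        "correlationId": "run-claude-norag-001",
        "variant": "claude-no-rag",
    })
    # ... run the agent with the Claude, no-RAG configuration here ...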

Register the experiment

Group both variants into the same experiment by listing the correlation IDs of their runs.

curl -X POST /api/experiments \
  -H "Content-Type: application/json" \
  -d '{
    "name": "RAG vs no-RAG",
    "correlationIds": [
      "run-gpt4-rag-001",
      "run-claude-norag-001"
    ]
  }'
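The same request can be built from Python with the standard library. This is a sketch under assumptions: `BASE_URL` is a hypothetical placeholder for wherever your Kalmia instance is served, and `build_experiment_request` is a helper name invented here. Only the path, header, and body fields come from the curl example above.

```python
import json
import urllib.request

# Hypothetical placeholder: replace with the address of your own instance.
BASE_URL = "http://localhost:3000"


def build_experiment_request(name, correlation_ids, base_url=BASE_URL):
    """Build (but do not send) the POST request that registers an experiment."""
    payload = {"name": name, "correlationIds": list(correlation_ids)}
    return urllib.request.Request(
        url=f"{base_url}/api/experiments",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_experiment_request(
    "RAG vs no-RAG",
    ["run-gpt4-rag-001", "run-claude-norag-001"],
)
# To actually send it: urllib.request.urlopen(request)
```

Building the request separately from sending it makes the payload easy to inspect before anything reaches the server.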

What you see in the dashboard

Kalmia preprocesses the traces in the experiment and groups them by variant. The variant comparison table shows:

Duration: average, min, and max execution time per variant
Token usage: input, output, and total tokens per variant
Tool calls: number and types of tool invocations
Turns: number of LLM conversation turns
Error rate: percentage of runs with failures
Retries: count of retry attempts across runs

Select any two variants in the comparison table to see delta badges highlighting which variant performs better for each metric.
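As a rough illustration of what a delta comparison involves, the sketch below subtracts one variant's metrics from another's and marks which side is lower. It assumes that lower is better for every metric, which fits duration, tokens, error rate, and retries but may not match how the dashboard ranks each one. The function name `compare_variants` and the sample numbers are made up for the example.

```python
def compare_variants(baseline, candidate):
    """Return per-metric deltas (candidate minus baseline).

    Assumption: a lower value is better for every metric compared.
    """
    report = {}
    # Only compare metrics that both variants actually report.
    for metric in sorted(baseline.keys() & candidate.keys()):
        delta = candidate[metric] - baseline[metric]
        if delta < 0:
            better = "candidate"
        elif delta > 0:
            better = "baseline"
        else:
            better = "tie"
        report[metric] = {"delta": delta, "better": better}
    return report


# Illustrative numbers only, not real measurements.
gpt4o_with_rag = {"avg_duration_ms": 1500, "total_tokens": 4200, "error_rate": 0.10}
claude_no_rag = {"avg_duration_ms": 900, "total_tokens": 4200, "error_rate": 0.25}

report = compare_variants(gpt4o_with_rag, claude_no_rag)
```

With only a handful of runs per variant, small deltas like these are likely noise, so treat the direction of a delta as a hint and not as a verdict.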

Tips

Run multiple traces per variant to get meaningful aggregate metrics.
Use the same input prompts across variants for a fair comparison.
Add more runs to an experiment at any time — metrics recompute on every view.
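The first two tips can be combined into a simple run plan: every variant gets the same list of input prompts, and every run gets its own correlation ID. The sketch below only builds the plan. The `run-<variant>-<index>` naming scheme and the helper `build_run_plan` are assumptions for illustration, not a Kalmia convention.

```python
from itertools import product


def build_run_plan(variants, prompts):
    """Pair every variant with every prompt and assign a unique correlation ID."""
    plan = []
    for variant, (index, prompt) in product(variants, enumerate(prompts, start=1)):
        plan.append({
            "prompt": prompt,
            "metadata": {
                # Hypothetical naming scheme: variant name plus a zero-padded index.
                "correlationId": f"run-{variant}-{index:03d}",
                "variant": variant,
            },
        })
    return plan


# Example prompts are placeholders; use your own task inputs.
plan = build_run_plan(
    variants=["gpt4o-with-rag", "claude-no-rag"],
    prompts=["Summarize the report", "List open issues", "Draft a reply"],
)

# Correlation IDs to list when registering the experiment.
correlation_ids = [run["metadata"]["correlationId"] for run in plan]
```

Each entry's `metadata` dictionary has the same two keys used in the tagging example, so it can be passed to the span when that run executes.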