Guide

Comparing Agent Variants

Set up A/B experiments to compare different prompts, models, or tool configurations across your agent.

What is a variant?

A variant is a specific configuration of your agent — a particular prompt, model, temperature, or set of tools. When you run the same task with different configurations, each run becomes a variant in your experiment.

Kalmia groups traces by the variant field in their metadata and computes metrics for each variant, so you can compare them side by side.
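The grouping step can be pictured with a minimal sketch. This is not Kalmia's implementation: it assumes each trace is a plain dict with a metadata dict, and the function name group_by_variant and the ID run-gpt4-rag-002 are hypothetical, introduced only for illustration.

```python
from collections import defaultdict


def group_by_variant(traces):
    """Group trace dicts by metadata["variant"]; untagged traces go under None."""
    groups = defaultdict(list)
    for trace in traces:
        variant = trace.get("metadata", {}).get("variant")
        groups[variant].append(trace)
    return dict(groups)


# Hypothetical traces: only the metadata shape mirrors the guide's examples.
traces = [
    {"metadata": {"correlationId": "run-gpt4-rag-001", "variant": "gpt4o-with-rag"}},
    {"metadata": {"correlationId": "run-claude-norag-001", "variant": "claude-no-rag"}},
    {"metadata": {"correlationId": "run-gpt4-rag-002", "variant": "gpt4o-with-rag"}},
]

groups = group_by_variant(traces)
```

Here the two runs tagged gpt4o-with-rag end up in one group and the single claude-no-rag run in another. A run without a variant tag would fall into the None group, which is one reason to tag every run consistently.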

Setting up an experiment

Run your agent with different configurations, tagging each run with a variant name in the metadata.

# `logger` is assumed to be a logger you have already initialized.

# Variant A: GPT-4o with RAG
with logger.start_span(name="agent-run") as span:
    span.log(metadata={
        "correlationId": "run-gpt4-rag-001",
        "variant": "gpt4o-with-rag",
    })

# Variant B: Claude with no RAG
with logger.start_span(name="agent-run") as span:
    span.log(metadata={
        "correlationId": "run-claude-norag-001",
        "variant": "claude-no-rag",
    })

Register the experiment

Group both variants into the same experiment by listing the correlation ID of each run.

curl -X POST /api/experiments \
  -H "Content-Type: application/json" \
  -d '{
    "name": "RAG vs no-RAG",
    "correlationIds": [
      "run-gpt4-rag-001",
      "run-claude-norag-001"
    ]
  }'
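If you register experiments from a script, you might build the same request body in Python. The sketch below only constructs the JSON. The helper name build_experiment_payload is hypothetical, and the empty-list check and de-duplication are conveniences assumed here, not documented API behavior.

```python
import json


def build_experiment_payload(name, correlation_ids):
    """Return a JSON body shaped like the curl example for POST /api/experiments."""
    if not correlation_ids:
        raise ValueError("an experiment needs at least one correlation ID")
    # Drop duplicate IDs while keeping their original order (an assumption,
    # not a documented requirement of the API).
    unique_ids = list(dict.fromkeys(correlation_ids))
    return json.dumps({"name": name, "correlationIds": unique_ids})


body = build_experiment_payload(
    "RAG vs no-RAG",
    ["run-gpt4-rag-001", "run-claude-norag-001"],
)
```

The resulting string can be sent with any HTTP client, using the same Content-Type: application/json header as the curl command.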

What you see in the dashboard

Kalmia preprocesses the traces and groups them by variant. The variant comparison table shows the following metrics:

Duration: Average, min, and max execution time per variant
Token usage: Input, output, and total tokens per variant
Tool calls: Number and types of tool invocations
Turns: Number of LLM conversation turns
Error rate: Percentage of runs with failures
Retries: Count of retry attempts across runs
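To make the metrics concrete, here is a rough sketch of how a few of them could be derived from the runs of one variant. The record fields (duration_s, input_tokens, output_tokens, failed, retries) and the sample values are hypothetical and do not reflect Kalmia's trace schema.

```python
from statistics import mean


def summarize(runs):
    """Aggregate a non-empty list of run records for a single variant."""
    durations = [r["duration_s"] for r in runs]
    failures = sum(1 for r in runs if r["failed"])
    return {
        "runs": len(runs),
        "duration_avg": mean(durations),
        "duration_min": min(durations),
        "duration_max": max(durations),
        "total_tokens": sum(r["input_tokens"] + r["output_tokens"] for r in runs),
        "error_rate_pct": 100.0 * failures / len(runs),
        "retries": sum(r["retries"] for r in runs),
    }


# Two made-up runs for one variant.
runs = [
    {"duration_s": 2.0, "input_tokens": 100, "output_tokens": 50, "failed": False, "retries": 0},
    {"duration_s": 4.0, "input_tokens": 200, "output_tokens": 50, "failed": True, "retries": 2},
]

summary = summarize(runs)
```

With these two runs the average duration is 3.0 seconds, the error rate is 50 percent, and the retry count is 2.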

Select any two variants in the comparison table to see delta badges highlighting which variant performs better for each metric.
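One way to think about a delta badge is as the difference between two per-variant values plus a verdict on which side is better. The sketch below assumes that lower is better for every metric it receives (duration, tokens, error rate, retries). That rule, the function name metric_deltas, and the sample numbers are assumptions for illustration, not a description of how Kalmia computes its badges.

```python
def metric_deltas(a, b):
    """Compare two per-variant summaries; lower values are treated as better."""
    result = {}
    # Only compare metrics present in both summaries.
    for metric in sorted(a.keys() & b.keys()):
        delta = b[metric] - a[metric]
        if delta == 0:
            better = "tie"
        elif delta < 0:
            better = "b"
        else:
            better = "a"
        result[metric] = {"delta": delta, "better": better}
    return result


# Made-up summaries for two variants.
variant_a = {"duration_avg": 3.0, "error_rate_pct": 50.0, "retries": 2}
variant_b = {"duration_avg": 2.5, "error_rate_pct": 50.0, "retries": 3}

deltas = metric_deltas(variant_a, variant_b)
```

In this example variant B is faster by 0.5 seconds, the error rates tie, and variant A needs fewer retries. A metric where higher is better would need the comparison reversed.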

Tips