Quickstart: TraceLens in 10 Minutes

Get a working evaluation pipeline running from scratch.

Want the files generated for you first? Run tracelens init ., then come back to this page when you want to understand or replace each piece by hand.

Prerequisites

  • Python 3.11+
  • uv (recommended) or pip

1. Install TraceLens

# From PyPI
uv pip install tracelens

# Or for development
git clone https://github.com/ssf0409/tracelens.git
cd tracelens
uv pip install -e ".[dev]"

2. Define Tasks

Create a file tasks.json with your evaluation tasks:

{
  "tasks": [
    {
      "task_id": "math-add",
      "name": "Simple addition",
      "input_data": {"a": 2, "b": 3},
      "category": "arithmetic",
      "tags": ["addition", "easy"],
      "metadata": {"expected": 5}
    },
    {
      "task_id": "math-multiply",
      "name": "Multiplication",
      "input_data": {"a": 7, "b": 8, "operation": "multiply"},
      "category": "arithmetic",
      "tags": ["multiplication"],
      "metadata": {"expected": 56}
    }
  ]
}

Each task needs:

  • name and input_data (required): what to test
  • metadata: any data graders need (e.g., expected answers)
  • tags, category, difficulty: for filtering and organization

See examples/tasks.json for a complete 5-task example.
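
Before handing a tasks file to the loader, a quick structural check can catch typos early. The sketch below uses only the standard library and does not call TraceLens. The helper name find_task_problems is hypothetical, and the rule that task_id values must be unique is an assumption, not something this page states.

```python
import json
from typing import Any

# Fields described as required in the list above.
REQUIRED_FIELDS = ("name", "input_data")


def find_task_problems(document: dict[str, Any]) -> list[str]:
    """Return human-readable problems found in a tasks document."""
    problems: list[str] = []
    seen_ids: set[str] = set()
    for index, task in enumerate(document.get("tasks", [])):
        label = task.get("task_id", f"tasks[{index}]")
        for field in REQUIRED_FIELDS:
            if field not in task:
                problems.append(f"{label}: missing required field '{field}'")
        task_id = task.get("task_id")
        if task_id is not None:
            # Assumption: task_id values should be unique within a file.
            if task_id in seen_ids:
                problems.append(f"{label}: duplicate task_id")
            seen_ids.add(task_id)
    return problems


# Inline sample with two deliberate mistakes in the second task.
sample = json.loads("""
{"tasks": [
  {"task_id": "math-add", "name": "Simple addition", "input_data": {"a": 2, "b": 3}},
  {"task_id": "math-add", "name": "Duplicate id without input"}
]}
""")
for problem in find_task_problems(sample):
    print(problem)
```

To check a real file, replace the inline sample with json.load on an open handle to tasks.json.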

3. Write a Simple Agent

An agent is any async function that takes input and returns output:

from typing import Any

async def math_agent(input_data: dict[str, Any]) -> dict[str, Any]:
    a = input_data["a"]
    b = input_data["b"]
    operation = input_data.get("operation", "add")

    if operation == "multiply":
        return {"answer": a * b}
    return {"answer": a + b}

Wrap it as an adapter:

from tracelens import SimpleAdapter

adapter = SimpleAdapter(math_agent)

For complex agents, subclass AgentAdapter directly — see the User Guide.
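
Because the agent is a plain coroutine, it can be exercised on its own before any adapter or runner is involved. The snippet below repeats the agent so that it runs standalone and calls it with asyncio.run. It also shows that the agent only sees input_data, so the operation key has to be present there for the multiply branch to run.

```python
import asyncio
from typing import Any


async def math_agent(input_data: dict[str, Any]) -> dict[str, Any]:
    # Same logic as the agent above, repeated so this snippet runs on its own.
    a = input_data["a"]
    b = input_data["b"]
    if input_data.get("operation", "add") == "multiply":
        return {"answer": a * b}
    return {"answer": a + b}


# Call the coroutine directly, without TraceLens.
add_result = asyncio.run(math_agent({"a": 2, "b": 3}))
multiply_result = asyncio.run(
    math_agent({"a": 7, "b": 8, "operation": "multiply"})
)
print(add_result, multiply_result)  # {'answer': 5} {'answer': 56}
```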

4. Write a Grader

A CodeGrader computes metrics and determines pass/fail:

from tracelens import CodeGrader
from tracelens.core.task import Task
from tracelens.core.transcript import Transcript


class MathGrader(CodeGrader):
    def __init__(self) -> None:
        super().__init__(grader_id="math")

    def compute_metrics(
        self, transcript: Transcript, task: Task
    ) -> dict[str, float]:
        expected = task.metadata["expected"]
        actual = transcript.final_output.get("answer")
        if actual is None:
            return {"correct": 0.0, "error": float("inf")}
        return {
            "correct": 1.0 if actual == expected else 0.0,
            "error": abs(actual - expected),
        }

    def determine_pass(
        self, metrics: dict[str, float], task: Task
    ) -> tuple[bool, float]:
        return metrics["correct"] == 1.0, metrics["correct"]

See examples/graders/ for both CodeGrader and LLMGrader examples.
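
The grader above compares answers with == and subtracts them directly, which works for integers but can misjudge floating-point results or raise on non-numeric output. One possible variation is sketched below as a standalone function that does not depend on TraceLens. The name score_answer and the tolerance defaults are illustrative assumptions. The returned dictionary uses the same "correct" and "error" keys as MathGrader, so the body could be adapted into compute_metrics.

```python
import math
from typing import Any


def score_answer(
    actual: Any, expected: float, rel_tol: float = 1e-9
) -> dict[str, float]:
    """Compute 'correct' and 'error' metrics with a numeric tolerance."""
    # Treat missing or non-numeric answers as wrong instead of raising.
    # bool is a subclass of int, so it is excluded explicitly.
    if isinstance(actual, bool) or not isinstance(actual, (int, float)):
        return {"correct": 0.0, "error": float("inf")}
    is_close = math.isclose(actual, expected, rel_tol=rel_tol, abs_tol=1e-12)
    return {
        "correct": 1.0 if is_close else 0.0,
        "error": abs(actual - expected),
    }


print(score_answer(0.1 + 0.2, 0.3)["correct"])  # 1.0, although 0.1 + 0.2 != 0.3
print(score_answer("5", 5))  # non-numeric answer is scored as wrong
```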

5. Run Evaluation

import asyncio
from tracelens import EvalSet, EvaluationRunner, RunnerConfig
from tracelens.core.task import JSONTaskLoader

# Load tasks
loader = JSONTaskLoader()
tasks = loader.load("tasks.json")
eval_set = EvalSet(name="Math Suite", tasks=tasks)

# Run with 3 attempts per task (for pass@k statistics)
config = RunnerConfig(num_runs=3, max_concurrency=5, timeout_seconds=30.0)
runner = EvaluationRunner(adapter, [MathGrader()], config)
batch = asyncio.run(runner.run(eval_set))

print(f"Pass rate: {batch.pass_rate:.1%}")

6. Generate Report

from tracelens.reporting.generator import ReportGenerator

gen = ReportGenerator(k_values=[1, 3], consistency_k_values=[2, 3])
report = gen.build_report(batch)

# Markdown report
print(gen.render_markdown(report))

# HTML dashboard
with open("report.html", "w", encoding="utf-8") as f:
    f.write(gen.render_html(report))

7. Interpret Results

The report includes two key statistics:

Metric   Measures                                               Higher k means...
------   ----------------------------------------------------   -----------------------------------
pass@k   Capability: "can the agent do this at all?"            Higher values (more chances)
pass^k   Reliability: "does the agent do this consistently?"    Lower values (harder to be perfect)

  • pass@3 = 0.95 means: "95% chance at least 1 of 3 attempts succeeds" — strong capability
  • pass^3 = 0.60 means: "60% chance all 3 attempts succeed" — moderate reliability

If pass@k is high but pass^k is low, the agent is capable but inconsistent. Increase num_runs and investigate failure patterns.
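
To make the two statistics concrete, the sketch below computes them for a single task from n recorded attempts, of which c passed. It uses a common combinatorial estimator: the chance that a random subset of k attempts contains at least one pass (pass@k) or only passes (pass^k). This page does not state which estimator TraceLens uses, so treat the formulas and the function names pass_at_k and pass_hat_k as assumptions for illustration.

```python
from math import comb


def _check(n: int, c: int, k: int) -> None:
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")


def pass_at_k(n: int, c: int, k: int) -> float:
    """Chance that at least one of k attempts drawn from n passes."""
    _check(n, c, k)
    # 1 minus the chance that all k drawn attempts are failures.
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_hat_k(n: int, c: int, k: int) -> float:
    """Chance that all k attempts drawn from n pass (pass^k)."""
    _check(n, c, k)
    return comb(c, k) / comb(n, k)


# 10 attempts, 7 passed: capable, but far from reliable at k=3.
print(f"pass@3 = {pass_at_k(10, 7, 3):.3f}")   # pass@3 = 0.992
print(f"pass^3 = {pass_hat_k(10, 7, 3):.3f}")  # pass^3 = 0.292
```

With the same 70% per-attempt success rate, pass@k rises toward 1 as k grows while pass^k falls toward 0, which matches the table above.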

8. Use the CLI

Run evaluations from the command line:

# Run and print CI summary
tracelens run \
  --eval-set tasks.json \
  --adapter myproject.adapters.MathAdapter \
  --graders myproject.graders.MathGrader \
  --num-runs 3 \
  --report report.md \
  --html-report report.html

# Generate report from saved results
tracelens report --results results.json --format html

Next Steps