Core Concepts & Glossary
Read this once and the rest of the docs click into place. TraceLens has a small vocabulary of objects that pass an agent run through a fixed pipeline: define the work, run the agent, record what happened, grade the outcome, and aggregate across repeats so you can reason about capability and reliability.
The pipeline
Every evaluation is the same left-to-right flow. The toy hello-world run and a production CI gate differ only in what plugs into each box — not in the shape.
        Task           AgentAdapter         Transcript           Grader
    ┌──────────┐      ┌────────────┐      ┌────────────┐      ┌──────────┐
    │ what the │─────▶│ how to run │─────▶│ what the   │─────▶│ did it   │
    │ agent    │      │ the agent  │      │ agent did  │      │ succeed? │
    │ must do  │      └────────────┘      │ (the run   │      └────┬─────┘
    └──────────┘                          │ record)    │           │
                                          └────────────┘           ▼
                                                              ┌──────────┐
                                                              │ Outcome  │
                                                              │ pass +   │
                                                              │ score +  │
                                                              │ feedback │
                                                              └────┬─────┘
                                                                   │  one repeat = one
                                                                   ▼
                                                              ┌──────────┐
                                                              │  Trial   │
    EvaluationRunner repeats this N times per task,           └────┬─────┘
    collecting every Trial into a ────────────────────────────────▶│
                                                          ┌────────▼────────┐
                                                          │   TrialBatch    │
                                                          │ pass@k, pass^k, │
                                                          │ error rates,    │
                                                          │ baseline check  │
                                                          └────────┬────────┘
                                                                   ▼
                                                                Report
                                                     (markdown / JSON / HTML / CI)
The four pieces you author are Task, Adapter, Grader, and the Runner config. Everything else — Transcript, Outcome, Trial, TrialBatch — TraceLens produces for you.
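To make the flow concrete, here is a minimal, self-contained sketch of the same pipeline in plain Python. It does not use the TraceLens API: every class, field, and function name below (`ToyTask`, `run_pipeline`, `echo_adapter`, and so on) is a hypothetical stand-in chosen for illustration, and the real classes may be shaped differently.

```python
from dataclasses import dataclass
from typing import Callable, List

# Toy stand-ins that mirror the vocabulary above. These are NOT the TraceLens
# classes; every name and field here is an assumption made for illustration.

@dataclass
class ToyTask:
    task_id: str
    prompt: str
    expected: str

@dataclass
class ToyTranscript:
    final_output: str

@dataclass
class ToyOutcome:
    passed: bool
    score: float
    feedback: str

@dataclass
class ToyTrial:
    task_id: str
    transcript: ToyTranscript
    outcome: ToyOutcome

def run_pipeline(
    tasks: List[ToyTask],
    adapter: Callable[[ToyTask], ToyTranscript],
    grader: Callable[[ToyTask, ToyTranscript], ToyOutcome],
    repeats: int,
) -> List[ToyTrial]:
    """Run every task `repeats` times and collect one trial per run."""
    trials: List[ToyTrial] = []
    for task in tasks:
        for _ in range(repeats):
            transcript = adapter(task)          # how to run the agent
            outcome = grader(task, transcript)  # did it succeed?
            trials.append(ToyTrial(task.task_id, transcript, outcome))
    return trials

def echo_adapter(task: ToyTask) -> ToyTranscript:
    # A fake "agent" that simply upper-cases the prompt.
    return ToyTranscript(final_output=task.prompt.upper())

def exact_match_grader(task: ToyTask, transcript: ToyTranscript) -> ToyOutcome:
    passed = transcript.final_output == task.expected
    feedback = "exact match" if passed else "output differs"
    return ToyOutcome(passed, 1.0 if passed else 0.0, feedback)

tasks = [ToyTask("t1", "hello", "HELLO"), ToyTask("t2", "bye", "goodbye")]
batch = run_pipeline(tasks, echo_adapter, exact_match_grader, repeats=3)
pass_rate = sum(t.outcome.passed for t in batch) / len(batch)  # 3 of 6 pass
```

Swapping `echo_adapter` or `exact_match_grader` for something else changes what plugs into each box while the loop in `run_pipeline` keeps the same shape, which is the point the diagram makes.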
Glossary
| Term | One-line definition | Class(es) |
|---|---|---|
| Task | A single thing the agent must do, plus its input and expectations. | Task, EvalSet (a set of tasks) |
| Adapter | The glue that knows how to invoke your agent and return a Transcript. | AgentAdapter, SimpleAdapter, HTTPAPIAdapter |
| Transcript | The record of one run: final output, intermediate steps, timing, fingerprint. | Transcript, TranscriptStep |
| Grader | Turns a Transcript into a pass/score judgement. Deterministic or LLM-as-judge. | CodeGrader, LLMGrader, CompositeGrader |
| Outcome | The result of grading one Transcript: passed, score, feedback, and error flags. | Outcome, GradeLevel |
| Trial | One Task run once and graded — a Transcript + its Outcome. | Trial, TrialStatus |
| TrialBatch | All Trials for a run, with aggregate statistics and error rates. | TrialBatch |
| Runner | Drives the whole pipeline: parallelism, repeats, timeouts, checkpointing. | EvaluationRunner, RunnerConfig |
| DecisionSpec | A reproducibility fingerprint of the exact agent config (model, prompt, tools, infra). | DecisionSpec |
| Baseline | A stored known-good result a candidate run is compared against to detect regressions. | BaselineManager, RegressionDetector |
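As an illustration of what a "reproducibility fingerprint" can look like, the sketch below hashes a canonical JSON form of an agent config. This page does not say how DecisionSpec computes its fingerprint, so treat this as one plausible approach under stated assumptions: the function name `config_fingerprint` and the config keys are hypothetical.

```python
import hashlib
import json

def config_fingerprint(config: dict) -> str:
    """Return a stable SHA-256 hex digest for a JSON-serializable config.

    Assumption: sorting keys and fixing separators gives a canonical form,
    so two configs with the same content hash identically regardless of
    key order. This is an illustrative scheme, not the TraceLens one.
    """
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

# Hypothetical config keys, mirroring the glossary (model, prompt, tools, infra).
spec_a = {"model": "m-1", "prompt": "Be brief.", "tools": ["search"], "infra": "local"}
spec_b = {"infra": "local", "tools": ["search"], "prompt": "Be brief.", "model": "m-1"}
spec_c = dict(spec_a, prompt="Be verbose.")

same = config_fingerprint(spec_a) == config_fingerprint(spec_b)     # key order is ignored
changed = config_fingerprint(spec_a) != config_fingerprint(spec_c)  # a prompt edit changes it
```

A fingerprint like this lets a stored baseline record exactly which configuration produced it, so a later comparison can tell a config change apart from a behavior change.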
Two distinctions that matter
Capability vs. reliability. A run is summarized by two metrics, not one:
pass@k (can it succeed at all in k tries?) and pass^k (does it succeed on
every one of k tries?). As k grows they move in opposite directions: pass@k
never falls and pass^k never rises. Confusing them is how a flaky agent
passes review. This is important enough to have its own page:
pass@k vs pass^k.
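A small worked example shows the two metrics pulling apart. This page does not state how TraceLens estimates them, so the sketch below assumes the commonly used combinatorial estimators over n recorded trials with c passes: pass@k is the chance that a random size-k subset contains at least one pass, and pass^k is the chance that it contains only passes. The function names are hypothetical.

```python
from math import comb

def _check(n: int, c: int, k: int) -> None:
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")

def pass_at_k(n: int, c: int, k: int) -> float:
    """Chance that at least one of k trials, drawn from n with c passes, passed."""
    _check(n, c, k)
    # comb(n - c, k) counts the subsets made up entirely of failures.
    return 1.0 - comb(n - c, k) / comb(n, k)

def pass_pow_k(n: int, c: int, k: int) -> float:
    """Chance that all k trials, drawn from n with c passes, passed."""
    _check(n, c, k)
    # comb(c, k) counts the subsets made up entirely of passes.
    return comb(c, k) / comb(n, k)

# An agent that passed 7 of 10 recorded trials:
at_1 = pass_at_k(10, 7, 1)    # 0.7
pow_1 = pass_pow_k(10, 7, 1)  # 0.7
at_3 = pass_at_k(10, 7, 3)    # 1 - 1/120, about 0.992
pow_3 = pass_pow_k(10, 7, 3)  # 35/120, about 0.292
```

At k = 1 the two numbers coincide. At k = 3 the same agent looks nearly perfect on pass@k and unreliable on pass^k, which is why a single number hides flakiness.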
Harness failure vs. agent failure. A failing grade is not the same as a
broken eval. TraceLens tracks two separate error rates on every TrialBatch:
- infra_error: the agent run itself crashed (timeout, sandbox died).
- grader_error: a grader threw while judging (the grading harness broke).
A spike in either means your evaluation is broken, not that the agent got worse — so they are reported separately from the pass rate. Never let a harness failure masquerade as a regression.
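The separation can be sketched in a few lines. The record type and the `summarize` helper below are hypothetical, not TraceLens classes, and the choice to compute the pass rate only over trials with no harness error is an assumption made for this example. The page says only that the error rates are reported separately from the pass rate.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class TrialRecord:
    # Hypothetical flat record; the real Trial/Outcome classes may differ.
    passed: bool
    infra_error: bool = False
    grader_error: bool = False

def summarize(trials: List[TrialRecord]) -> dict:
    """Report harness error rates separately from the agent's pass rate."""
    total = len(trials)
    if total == 0:
        raise ValueError("no trials to summarize")
    # Assumption: only trials without a harness error count toward pass rate.
    clean = [t for t in trials if not (t.infra_error or t.grader_error)]
    pass_rate: Optional[float] = (
        sum(t.passed for t in clean) / len(clean) if clean else None
    )
    return {
        "infra_error_rate": sum(t.infra_error for t in trials) / total,
        "grader_error_rate": sum(t.grader_error for t in trials) / total,
        "pass_rate": pass_rate,
    }

trials = (
    [TrialRecord(passed=True)] * 6
    + [TrialRecord(passed=False)] * 2
    + [TrialRecord(passed=False, infra_error=True)]
    + [TrialRecord(passed=False, grader_error=True)]
)
summary = summarize(trials)
# infra_error_rate 0.1, grader_error_rate 0.1, pass_rate 0.75 (6 of 8 clean trials)
```

If the two errored trials were counted as ordinary failures, the pass rate would read 0.6 instead of 0.75, and a sandbox outage would look like a regression in the agent.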
Where to go next
- Getting Started — see this pipeline run end to end in five minutes.
- Build Your First Eval — author your own Task, Grader, and Runner.
- pass@k vs pass^k — the capability/reliability split in depth.
- API Reference — every class above, from its docstrings.