TraceLens
Evaluation and regression-testing for AI agents. TraceLens turns agent runs into inspectable traces, graded outcomes, baseline comparisons, calibration data, and CI-ready reliability signals.
It is deliberately domain-agnostic: TraceLens evaluates evidence, while your project owns the task data, agent invocation, graders, and rollout policy.
Start here
- From a fresh checkout to your first eval run in five minutes, with no LLM keys and no config files.
- The evaluation pipeline and every object (Task, Transcript, Outcome, Trial) explained on one page.
- Capability vs. reliability, with a truth table and guidance on which metric belongs in a CI gate.
- Wire TraceLens into GitHub Actions with regression gating. Copy-paste workflow included.
- Every public API explained, with decision trees for graders, adapters, and analysis methods.
What TraceLens gives you
New to the vocabulary below? The Core Concepts & Glossary page defines every object and shows how they connect.
- Inspectable traces: every run becomes a `Transcript` you can read.
- Outcome grading: deterministic `CodeGrader`, LLM-as-judge `LLMGrader`, and `CompositeGrader`, plus declarative `BehaviorContracts`.
- Statistical rigor: `pass@k` for capability, `pass^k` for reliability, and bootstrap confidence intervals so signals aren't noise.
- Baseline regression detection: canary, capability, and experimental baselines with severity-graded CI gates.
- Harness-vs-agent separation: `infra_error` and `grader_error` rates are tracked separately, so a broken eval never looks like a failing agent.
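To make the two reliability metrics concrete, here is a minimal sketch in plain Python. It assumes the commonly used definitions: `pass@k` as the probability that at least one of `k` sampled trials passes, and `pass^k` as the probability that all `k` pass, both estimated from `c` passes in `n` recorded trials. The function names (`pass_at_k`, `pass_hat_k`, `bootstrap_mean_ci`) are hypothetical and are not part of the TraceLens API; the library's own estimators may differ.

```python
import random
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Capability: estimated chance that at least one of k trials passes."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # 1 minus the chance that all k trials drawn without replacement fail.
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_hat_k(n: int, c: int, k: int) -> float:
    """Reliability: estimated chance that all k trials pass."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # Chance that all k trials drawn without replacement are passes.
    return comb(c, k) / comb(n, k)


def bootstrap_mean_ci(values, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    n = len(values)
    means = sorted(
        sum(rng.choices(values, k=n)) / n for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lo, hi


# One task, 10 trials, 7 passes (illustrative numbers, not real results).
print(round(pass_at_k(10, 7, 3), 4))   # 0.9917
print(round(pass_hat_k(10, 7, 3), 4))  # 0.2917

# Per-trial outcomes (1 = pass, 0 = fail) for an interval on the pass rate.
print(bootstrap_mean_ci([1, 1, 0, 1, 1, 1, 0, 1, 0, 1]))
```

The same 7-of-10 record looks strong under `pass@3` and weak under `pass^3`, which is why the two metrics answer different questions: one asks whether the agent can solve the task, the other whether it does so consistently.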
See Installation for uv, extras ([llm], [http]), and development setup.
How the docs are organized
- Get Started — install, run hello-world, and the example ladder.
- Concepts — evaluation levels, the two reliability metrics, and accuracy best practices.
- Guides — end-to-end walkthroughs for baselines, human calibration, and CI.
- Reference — the auto-generated API Reference.
- Contributing — testing tiers and the release process.