TraceLens

Evaluation and regression-testing for AI agents. TraceLens turns agent runs into inspectable traces, graded outcomes, baseline comparisons, calibration data, and CI-ready reliability signals.

It is deliberately domain-agnostic: TraceLens evaluates evidence, while your project owns the task data, agent invocation, graders, and rollout policy.

Start here

Getting Started

From a fresh checkout to your first eval run in five minutes, with no LLM keys and no config files.

Core Concepts & Glossary

The evaluation pipeline and every object (Task, Transcript, Outcome, Trial) explained on one page.

pass@k vs pass^k

Capability vs reliability, with a truth table and guidance on which metric belongs in a CI gate.

CI/CD Integration

Wire TraceLens into GitHub Actions with regression gating. Copy-paste workflow included.

User Guide

Every public API explained, with decision trees for graders, adapters, and analysis methods.

What TraceLens gives you

New to the vocabulary below? The Core Concepts & Glossary page defines every object and shows how they connect.

Inspectable traces — every run becomes a Transcript you can read.
Outcome grading — deterministic CodeGrader, LLM-as-judge LLMGrader, and CompositeGrader, plus declarative BehaviorContracts.
Statistical rigor — pass@k for capability, pass^k for reliability, and bootstrap confidence intervals so signals aren't noise.
Baseline regression detection — canary, capability, and experimental baselines with severity-graded CI gates.
Harness-vs-agent separation — infra_error and grader_error rates are tracked separately, so a broken eval never looks like a failing agent.
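
To make the two reliability metrics concrete, here is a minimal standalone sketch in plain Python. It does not use the TraceLens API, and the function names (pass_at_k, pass_hat_k, bootstrap_ci) are hypothetical. It assumes independent trials and uses simple plug-in estimates from the observed success rate, which may differ from the estimators TraceLens implements: pass@k is taken as the chance that at least one of k trials succeeds, and pass^k as the chance that all k succeed.

```python
import random
from typing import Callable, Sequence, Tuple


def pass_at_k(successes: int, trials: int, k: int) -> float:
    """Capability: estimated chance that at least one of k trials succeeds."""
    p = successes / trials
    return 1.0 - (1.0 - p) ** k


def pass_hat_k(successes: int, trials: int, k: int) -> float:
    """Reliability (pass^k): estimated chance that all k trials succeed."""
    p = successes / trials
    return p ** k


def bootstrap_ci(
    outcomes: Sequence[bool],
    stat: Callable[[Sequence[bool]], float],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile bootstrap interval for `stat` over per-trial outcomes."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(outcomes)
    estimates = sorted(
        stat([rng.choice(outcomes) for _ in range(n)])
        for _ in range(n_resamples)
    )
    lo = estimates[int((alpha / 2) * n_resamples)]
    hi = estimates[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lo, hi


# Illustrative data only: 8 passing trials out of 10 for one task.
outcomes = [True] * 8 + [False] * 2
capability = pass_at_k(sum(outcomes), len(outcomes), k=3)
reliability = pass_hat_k(sum(outcomes), len(outcomes), k=3)
ci_low, ci_high = bootstrap_ci(
    outcomes, lambda xs: pass_hat_k(sum(xs), len(xs), 3)
)
```

With an 80% per-trial success rate and k=3, the capability estimate is close to 1 while the reliability estimate is only about one half. That gap is why the two metrics answer different questions, and the bootstrap interval shows how wide the uncertainty is with only ten trials.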

pip install tracelens

See Installation for uv, extras ([llm], [http]), and development setup.

How the docs are organized

Get Started — install, run hello-world, and the example ladder.
Concepts — evaluation levels, the two reliability metrics, and accuracy best practices.
Guides — end-to-end walkthroughs for baselines, human calibration, and CI.
Reference — the auto-generated API Reference.
Contributing — testing tiers and the release process.