
TraceLens

Evaluation and regression testing for AI agents. TraceLens turns agent runs into inspectable traces, graded outcomes, baseline comparisons, calibration data, and CI-ready reliability signals.

It is deliberately domain-agnostic: TraceLens evaluates evidence, while your project owns the task data, agent invocation, graders, and rollout policy.


Start here

  • Getting Started

    From a fresh checkout to your first eval run in five minutes — no LLM keys, no config files.

  • Core Concepts & Glossary

    The evaluation pipeline and every object — Task, Transcript, Outcome, Trial — explained on one page.

  • pass@k vs pass^k

    Capability vs reliability, with a truth table and which metric belongs in a CI gate.

  • CI/CD Integration

    Wire TraceLens into GitHub Actions with regression gating. Copy-paste workflow included.

  • User Guide

    Every public API explained, with decision trees for graders, adapters, and analysis methods.


What TraceLens gives you

New to the vocabulary below? The Core Concepts & Glossary page defines every object and shows how they connect.

  • Inspectable traces — every run becomes a Transcript you can read.
  • Outcome grading — deterministic CodeGrader, LLM-as-judge LLMGrader, and CompositeGrader, plus declarative BehaviorContracts.
  • Statistical rigor — pass@k for capability, pass^k for reliability, and bootstrap confidence intervals so signals aren't noise.
  • Baseline regression detection — canary, capability, and experimental baselines with severity-graded CI gates.
  • Harness-vs-agent separation — infra_error and grader_error rates are tracked separately, so a broken eval never looks like a failing agent.

pip install tracelens

See Installation for uv, extras ([llm], [http]), and development setup.
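
To make the statistical terms in the feature list concrete, here is a minimal standalone sketch of pass@k, pass^k, and a percentile bootstrap interval. It deliberately does not use the TraceLens API: the function names (pass_at_k, pass_hat_k, bootstrap_ci) are hypothetical, and the combinatorial estimators shown are one common definition of these metrics, which may differ from what TraceLens computes internally.

```python
import random
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Capability: chance that at least one of k trials passes,
    estimated from n recorded trials of which c passed."""
    if not (0 < k <= n and 0 <= c <= n):
        raise ValueError("need 0 < k <= n and 0 <= c <= n")
    # 1 minus the chance that all k drawn trials are failures.
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_hat_k(n: int, c: int, k: int) -> float:
    """Reliability (pass^k): chance that all k trials pass,
    estimated from the same n trials and c passes."""
    if not (0 < k <= n and 0 <= c <= n):
        raise ValueError("need 0 < k <= n and 0 <= c <= n")
    return comb(c, k) / comb(n, k)


def bootstrap_ci(outcomes, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the per-trial pass rate."""
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    n = len(outcomes)
    rates = sorted(
        sum(rng.choices(outcomes, k=n)) / n for _ in range(n_resamples)
    )
    lo = rates[int((alpha / 2) * n_resamples)]
    hi = rates[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


# Illustrative data only: 10 trials of one task, 7 of which passed.
outcomes = [True] * 7 + [False] * 3
n, c = len(outcomes), sum(outcomes)

print(round(pass_at_k(n, c, 3), 3))   # 0.992
print(round(pass_hat_k(n, c, 3), 3))  # 0.292
print(bootstrap_ci(outcomes))
```

With the same 7-of-10 record, pass@3 is close to 1 while pass^3 is below 0.3: the agent can usually solve the task at least once in three tries, but rarely solves it three times in a row. That gap is the capability versus reliability distinction the two metrics are meant to separate.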


How the docs are organized

  • Get Started — install, run hello-world, and the example ladder.
  • Concepts — evaluation levels, the two reliability metrics, and accuracy best practices.
  • Guides — end-to-end walkthroughs for baselines, human calibration, and CI.
  • Reference — the auto-generated API Reference.
  • Contributing — testing tiers and the release process.