Getting Started in Five Minutes

This is the fastest path from a fresh checkout to your first eval run. No LLM keys, no config files, no boilerplate. After this guide you'll understand the four-piece skeleton every TraceLens run uses, and you'll know which example to read next when you have a real agent.

1. Install

To use TraceLens as a dependency:

# Recommended: uv
uv pip install tracelens

# Or: plain pip
pip install tracelens

For local development, or to run the repository examples:

git clone https://github.com/ssf0409/tracelens.git
cd tracelens
uv pip install -e ".[dev]"

The [dev] extra pulls in pytest, ruff, mypy, and type stubs so the standard local verification commands run out of the box: pytest -q, ruff check src/ tests/, and mypy src/tracelens/.

The published package gives you the tracelens library and CLI. The checkout gives you the examples/ files used in the demo below.

2. Run hello-world

python examples/hello_world.py

You should see:

tracelens hello-world
--------------------
trials run : 9
pass rate  : 100%
report json: examples/reports/hello_world_report.json
sample md  : examples/reports/hello_world_report.md

  add-2-2                  runs=3  pass_rate=100% output='4'
  add-10-5                 runs=3  pass_rate=100% output='15'
  add-7-8                  runs=3  pass_rate=100% output='15'

Render the generated report through the CLI:

tracelens report --results examples/reports/hello_world_report.json --format markdown

Then read the checked-in sample: examples/reports/hello_world_report.md. It includes tasks, trials, pass@k, pass^k, graders, baseline comparison, regression result, and CI summary.

That's the entire framework, end to end. One small file, sub-second runtime, no external services. Open examples/hello_world.py in your editor and read it top to bottom — the comments map every line to the four-piece skeleton.

3. Scaffold Your Own Eval

When you are ready to start in your own repo, generate a runnable starter suite:

tracelens init .
tracelens run \
  --eval-set eval/tasks.json \
  --adapter eval.adapter.StarterAdapter \
  --graders eval.grader.StarterGrader \
  --output eval/results/results.json \
  --report eval/results/report.md \
  --save-trials eval/results/trials.json

The init command writes eval/tasks.json, eval/adapter.py, eval/grader.py, eval/README.md, and .github/workflows/eval.yml. The generated adapter and grader pass immediately, so the run command above succeeds as-is and you can replace one piece at a time.

Run tracelens init . --force only when you want to overwrite the generated files.

4. The Four-Piece Skeleton

Every TraceLens run combines four pieces:

Piece	What it answers	Concrete classes
Task	What does the agent need to do?	`Task`, `EvalSet`
Adapter	How do I invoke the agent?	`SimpleAdapter`, `HTTPAPIAdapter`, custom subclass of `AgentAdapter`
Grader	Did the agent get it right?	`CodeGrader` (deterministic), `LLMGrader` (judge model), or auto-generated from a `BehaviorContract`
Runner	Drive the run, parallelise, collect results.	`EvaluationRunner`, `RunnerConfig`

If you can describe each piece in a sentence for your agent, you're ready to write your own eval.
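To make the division of labour concrete, here is a minimal plain-Python sketch of the four pieces. It deliberately does not import TraceLens: ToyTask, toy_adapter, toy_grader, and toy_runner are hypothetical stand-ins invented for illustration, and the real Task, AgentAdapter, CodeGrader, and EvaluationRunner signatures may differ.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class ToyTask:
    """Task: what the agent needs to do (hypothetical, not tracelens.Task)."""
    task_id: str
    prompt: str
    expected: str


def toy_adapter(prompt: str) -> str:
    """Adapter: how the agent is invoked. This toy 'agent' adds two integers."""
    left, right = prompt.split("+")
    return str(int(left) + int(right))


def toy_grader(task: ToyTask, final_output: str) -> bool:
    """Grader: did the agent get it right? Only the final output is inspected."""
    return final_output.strip() == task.expected


def toy_runner(
    tasks: List[ToyTask],
    adapter: Callable[[str], str],
    grader: Callable[[ToyTask, str], bool],
    runs_per_task: int = 3,
) -> Dict[str, float]:
    """Runner: run each task several times and collect a pass rate per task."""
    pass_rates = {}
    for task in tasks:
        passed = sum(grader(task, adapter(task.prompt)) for _ in range(runs_per_task))
        pass_rates[task.task_id] = passed / runs_per_task
    return pass_rates


tasks = [
    ToyTask("add-2-2", "2+2", "4"),
    ToyTask("add-10-5", "10+5", "15"),
    ToyTask("add-7-8", "7+8", "15"),
]
results = toy_runner(tasks, toy_adapter, toy_grader)
print(results)
```

Swapping any one function (a different adapter, a stricter grader) leaves the other three untouched, which is the property the skeleton is meant to preserve.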

5. The Example Ladder

The examples in examples/ go from trivial to production-grade, each adding exactly one new concept. Read them in order:

Step	File	New concept
1	`hello_world.py`	The four-piece skeleton, in 50 lines, no LLM.
2	`contract_eval.py`	`BehaviorContract.to_graders()` — declare the contract once, get a full grader suite for free.
3	`http_agent_eval.py`	`HTTPAPIAdapter` for evaluating a remote agent over HTTP, plus `JsonSchemaGrader` for output-shape gating.
4	`noise_aware_regression.py`	`DecisionSpec` fingerprinting, `RegressionDetector`, and the 3 percentage-point infra-noise band — the production CI gate.
5	`llm_provider_examples.py`	LLM-as-judge: wire OpenAI/Anthropic SDKs into `LLMGrader` via `LLMProvider` (runs offline with a fake provider).
6	`human_eval_calibration.py`	Reconcile an automated grader against human grades to detect LLM-judge drift (no API keys).
7	`version_compare.py`	Compare two versions (model/prompt) with bootstrap significance + `DecisionSpec` fingerprints (no API keys).

Each example is self-contained — running it directly gives you working output. Copy the one that matches your problem and edit from there.
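The noise band from step 4 is easier to reason about with numbers. The sketch below flags a regression only when the pass-rate drop exceeds a 3 percentage-point band. The is_regression helper and its parameters are hypothetical, written only for illustration; RegressionDetector may apply additional statistics on top of a band like this.

```python
def is_regression(
    baseline_pass_rate: float,
    current_pass_rate: float,
    noise_band_pp: float = 3.0,
) -> bool:
    """Return True when the pass-rate drop exceeds the noise band.

    Rates are fractions in [0, 1]; the band is in percentage points.
    The drop is rounded so float error does not flip a result that
    sits exactly on the band edge.
    """
    drop_pp = round((baseline_pass_rate - current_pass_rate) * 100.0, 6)
    return drop_pp > noise_band_pp


print(is_regression(0.90, 0.88))  # 2-point drop, inside the band
print(is_regression(0.90, 0.85))  # 5-point drop, outside the band
```

In this sketch a drop of exactly 3 points counts as noise; whether the edge is inclusive is a design choice the real detector may make differently.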

6. Where to Go Next

Your immediate next step is to write your own eval:

tracelens init . — create a runnable starter suite in your repo.
Build Your First Eval — author your own Task, custom grader, and CLI workflow when you want to understand each file by hand.
Core Concepts & Glossary — the full pipeline and every object (Transcript, Outcome, Trial, TrialBatch) on one page.

When your agent is real and your eval set has grown, move on to:

Evaluating a Real Agent — the end-to-end walkthrough for a realistic HTTP/LLM agent: schema gate + LLM judge, reading transcripts, capability vs reliability, and a baseline.
User Guide — the decision guide: choosing the right task scope, adapter, grader, and statistics for your situation.
Supported Scenarios — which agent-evaluation problems TraceLens fits, and which first example to copy.
Evaluation Levels — function vs task vs system-level evaluation; pass@k vs pass^k semantics.
pass@k vs pass^k — capability vs reliability, with a truth table and which metric belongs in a CI gate.
Accuracy Best Practices — how to keep LLM-judge graders calibrated to humans (the difference between "we ran an eval" and "we trust this eval").
CI/CD Integration — wiring TraceLens into GitHub Actions with regression gating.
High-Stakes Autonomous Benchmark — the flagship benchmark pack that demonstrates TraceLens's infra-noise-aware regression detection on safety-critical tasks.
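The pass@k versus pass^k distinction mentioned in the list above can be sketched with combinatorial estimators. This assumes the commonly used definitions (pass@k: at least one of k sampled trials passes; pass^k: all k sampled trials pass). The function names are hypothetical, and TraceLens may compute these metrics differently.

```python
from math import comb


def _check(n: int, c: int, k: int) -> None:
    """Validate: n trials run, c of them passed, k trials sampled."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")


def pass_at_k(n: int, c: int, k: int) -> float:
    """Capability: chance that at least one of k trials (drawn from n) passed."""
    _check(n, c, k)
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_hat_k(n: int, c: int, k: int) -> float:
    """Reliability: chance that all k trials (drawn from n) passed."""
    _check(n, c, k)
    return comb(c, k) / comb(n, k)


# An agent that passed 2 of its 3 trials:
print(pass_at_k(3, 2, 3))   # 1.0
print(pass_hat_k(3, 2, 3))  # 0.0
```

The same three trials score perfectly on capability and zero on reliability, which is why the two metrics answer different questions.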

7. The 60-Second Mental Model

TraceLens is opinionated about two things, and ergonomic about everything else:

Grade outcomes, not paths. A CodeGrader looks at transcript.final_output. It doesn't care which tools the agent called or in what order — that's an implementation detail.
Reproducibility is a first-class config. Every run carries a DecisionSpec (model, prompt, tools, infra). Two runs with the same fingerprint should produce statistically similar results; when they don't, regression detection knows whether to blame the agent or the infrastructure.
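One way to picture a fingerprint is as a stable hash over the run's configuration. The sketch below hashes a canonical JSON encoding of a plain dict. The fingerprint helper and the example field values are hypothetical, and this is not necessarily how DecisionSpec derives its fingerprint.

```python
import hashlib
import json


def fingerprint(spec: dict) -> str:
    """Hash a config dict so that equal configs give equal fingerprints.

    Keys are sorted before hashing, so insertion order does not matter.
    """
    canonical = json.dumps(spec, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


# Illustrative values only; the four keys mirror the fields named above.
run_a = {"model": "m-1", "prompt": "v3", "tools": ["calc"], "infra": "ci"}
run_b = {"infra": "ci", "tools": ["calc"], "prompt": "v3", "model": "m-1"}
run_c = {**run_a, "prompt": "v4"}

print(fingerprint(run_a) == fingerprint(run_b))  # True: same config
print(fingerprint(run_a) == fingerprint(run_c))  # False: prompt changed
```

With a scheme like this, a score change between runs that share a fingerprint cannot be attributed to a config change.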

Everything else (async vs sync, local agent vs HTTP agent, code grader vs LLM judge) is a knob you can turn without rewriting your eval set.

That's it. Run python examples/hello_world.py, render the JSON report, then open the sample Markdown report.