Evaluation Recipes

TraceLens evaluates; it does not own your domain truth. The cleanest way to use it for a non-trivial system is a recipe: a thin, repeatable arrangement of the primitives TraceLens already provides, with the domain-specific parts owned by the projects that have the domain knowledge.

This pattern keeps TraceLens small. Most "can TraceLens do X?" questions are really "how do I arrange a recipe for X?" — and the answer is usually existing primitives plus a downstream grader, not a new TraceLens feature.

The three roles

Separate who computes the truth, who scores it, and who acts on it:

Role | Owns | Examples
Producer | The canonical, deterministic evidence to be judged. TraceLens reads it but never recomputes it. | A backtest result, a recorded transcript, a search-result set, a frozen request/response fixture.
Evaluator (TraceLens) | Loading evidence as tasks, running graders, comparing to a baseline, confidence intervals, and the report. | Task, CodeGrader, RegressionDetector, BaselineManager, ReportGenerator.
Consumer | Deciding what to do with the report. | A CI gate, a promotion decision, an operator dashboard.

The boundary matters: if TraceLens starts recomputing domain truth (simulating trades, re-running searches), it stops being a general evaluator and becomes a fork of the producer. Keep evidence as input.

Recipe shape

A recipe wires up these pieces. Only one of them, the report schema, lives in TraceLens itself:

Piece | Owner | Purpose
Evidence schema | Producer repo | The canonical facts TraceLens may score but not recompute.
Task mapping | Recipe / downstream | Wrap each evidence case into Task.input_data.
Deterministic grader | Recipe / downstream | Compute metrics, pass/fail, score, and reason codes.
Threshold config | Downstream repo | Promotion / regression thresholds for that project.
Report schema | TraceLens | Stable JSON fields for metrics, confidence, decision, reasons.
Rollout policy | Consumer repo | How to act on the decision.
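
As a rough illustration of the task-mapping piece, the sketch below wraps producer evidence cases so that each one lands under input_data untouched. Only the input_data name comes from this page; the "tasks" wrapper, the "id" field, the evidence keys (case_id, schema_version, query, relevant_ids), and the function names are hypothetical, and the real eval-set layout may differ.

```python
import json
from typing import Any


def evidence_to_task(case: dict[str, Any]) -> dict[str, Any]:
    """Wrap one producer evidence case as a task record.

    The evidence is copied verbatim under ``input_data`` so a grader scores
    the producer's facts instead of recomputing them.
    """
    return {
        "id": case["case_id"],  # hypothetical evidence key
        "input_data": {
            "evidence": case,  # canonical facts, unchanged
            "schema_version": case.get("schema_version", "unknown"),
        },
    }


def build_eval_set(cases: list[dict[str, Any]]) -> str:
    """Serialize all cases as one JSON document (the layout is an assumption)."""
    tasks = [evidence_to_task(case) for case in cases]
    return json.dumps({"tasks": tasks}, indent=2, sort_keys=True)


# Two made-up evidence cases for a retrieval recipe.
cases = [
    {"case_id": "q-001", "schema_version": "1",
     "query": "reset password", "relevant_ids": ["d3"]},
    {"case_id": "q-002",
     "query": "refund policy", "relevant_ids": ["d7", "d9"]},
]
eval_set = json.loads(build_eval_set(cases))
```

Recording a schema version next to the evidence is one way to let a grader refuse evidence it does not understand rather than silently misread it.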

Prefer a documented recipe plus a stable report schema over a first-class TraceLens subcommand. Add a packaged command only after two downstream projects need the same evaluator — until then, a downstream repo can wrap the generic CLI with its own friendlier command.

Generic CLI

A recipe runs on the existing CLI; the domain logic lives behind the adapter and grader paths:

tracelens run \
  --eval-set data/eval_set.json \
  --adapter downstream_eval.adapters.EvidenceReplayAdapter \
  --graders downstream_eval.graders.MyDomainGrader \
  --num-runs 1 \
  --output reports/eval_candidate_v2.json

The adapter replays producer evidence into a Transcript; the grader reads Task.input_data, computes domain metrics, and returns an Outcome. TraceLens handles the rest: baseline comparison, confidence intervals, and the report.
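
To make the grader half of that sentence concrete, here is a minimal sketch of a deterministic grader for a retrieval recipe. The Task and Outcome classes below are local stand-ins, not the real TraceLens types: only the input_data attribute is taken from this page, while the Outcome fields, the grade method name, the RecallAtKGrader class, and the relevant_ids / retrieved_ids keys are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Task:
    """Stand-in for a TraceLens task; only input_data comes from the docs."""
    input_data: dict[str, Any]


@dataclass
class Outcome:
    """Stand-in result type; these field names are assumptions."""
    passed: bool
    score: float
    metrics: dict[str, Any] = field(default_factory=dict)


class RecallAtKGrader:
    """Hypothetical deterministic grader: recall@k over frozen search results."""

    def __init__(self, k: int = 3, min_recall: float = 0.5) -> None:
        self.k = k
        self.min_recall = min_recall

    def grade(self, task: Task) -> Outcome:
        relevant = set(task.input_data["relevant_ids"])
        top_k = task.input_data["retrieved_ids"][: self.k]
        if not relevant:
            # No labels to score against: flag the case instead of guessing.
            return Outcome(False, 0.0, {"reason_codes": ["no_relevant_labels"]})
        hits = len(relevant.intersection(top_k))
        recall = hits / len(relevant)
        passed = recall >= self.min_recall
        reasons = [] if passed else ["recall_below_threshold"]
        return Outcome(
            passed,
            recall,
            {"recall_at_k": recall, "k": self.k, "reason_codes": reasons},
        )


# The grader only reads recorded evidence; it never re-runs the search.
task = Task(input_data={
    "relevant_ids": ["d1", "d2"],
    "retrieved_ids": ["d2", "d9", "d4", "d1"],
})
outcome = RecallAtKGrader(k=3).grade(task)
```

Because the grader is a pure function of the recorded evidence, running it twice on the same eval set yields the same metrics, which is what makes a single run per task sufficient.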

A three-way decision, not just pass/fail

For promotion gates, a binary pass/fail is often too blunt. A useful recipe maps its grader output to three decisions and lets the consumer act on each:

  • reject — the candidate is worse than baseline beyond the configured thresholds. Block it.
  • review_first — directionally promising but with trade-offs that need a human. The default whenever evidence is mixed or the sample is small.
  • auto_promote — broadly better, statistically defensible, with enough samples. Only here is unattended promotion appropriate.

Encode these as reason_codes in Outcome.metrics or Outcome.feedback so the consumer can act on them without parsing prose. Never auto_promote from a tiny sample; default to review_first.
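
One possible shape for the three-way mapping is sketched below. It assumes the recipe already has a baseline score, a candidate score, the lower bound of a confidence interval on their difference, and a sample count. The Thresholds class, its placeholder values, the decide function, and the reason-code strings are all hypothetical; real thresholds belong in the downstream project's config.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    """Hypothetical per-project thresholds; the values are placeholders."""
    max_regression: float = 0.02   # score drop that triggers reject
    min_improvement: float = 0.01  # score gain needed for auto_promote
    min_samples: int = 200         # below this, never auto_promote


def decide(
    baseline: float,
    candidate: float,
    ci_low: float,
    n_samples: int,
    t: Thresholds = Thresholds(),
) -> tuple[str, list[str]]:
    """Return (decision, reason_codes).

    ci_low is the lower bound of a confidence interval on
    (candidate - baseline).
    """
    delta = candidate - baseline
    if delta < -t.max_regression:
        return "reject", ["regression_beyond_threshold"]

    # Any unresolved doubt downgrades the decision to review_first.
    reasons: list[str] = []
    if n_samples < t.min_samples:
        reasons.append("sample_too_small")
    if delta < t.min_improvement:
        reasons.append("improvement_below_threshold")
    if ci_low <= 0.0:
        reasons.append("ci_includes_zero")
    if reasons:
        return "review_first", reasons

    return "auto_promote", ["improved_with_confidence"]
```

In this sketch auto_promote is the only branch that requires every check to pass, so a missing or weak signal falls through to review_first rather than to promotion.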

Worked examples

  • Strategy gates — compare candidate decisions against canonical producer evidence, then let CI or a rollout system decide whether to block, review, or promote.
  • Retrieval — compare a candidate chunking policy against a baseline using canonical search-result evidence and a deterministic relevance grader.
  • Prompt routing — compare candidate routing rules against a baseline using frozen request fixtures and outcome labels.

In every case the producer owns domain truth, TraceLens owns evaluation and reporting, and the consumer owns the rollout decision.
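
On the consumer side, a CI gate can stay very small if the report carries a machine-readable decision. The sketch below assumes the report JSON exposes a "decision" string and a "reasons" list; those exact field names, the exit-code mapping, and the gate function are illustrative assumptions, not the documented report schema.

```python
import json

# Hypothetical mapping from decision to process exit code for a CI job.
EXIT_CODES = {"auto_promote": 0, "reject": 1, "review_first": 2}


def gate(report_json: str) -> tuple[int, str]:
    """Turn a report into (exit_code, message) without parsing prose."""
    report = json.loads(report_json)
    decision = report.get("decision")
    reasons = ", ".join(report.get("reasons", [])) or "none given"
    if decision not in EXIT_CODES:
        # Fail closed: an unknown or missing decision blocks the rollout.
        return 1, f"blocked: unrecognized decision {decision!r}"
    return EXIT_CODES[decision], f"{decision}: {reasons}"


code, message = gate(
    '{"decision": "review_first", "reasons": ["sample_too_small"]}'
)
```

A wrapper script could pass the returned code to sys.exit so the pipeline distinguishes a hard block from a request for human review.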