Evaluation Recipes
TraceLens evaluates; it does not own your domain truth. The cleanest way to use it for a non-trivial system is a recipe: a thin, repeatable arrangement of the primitives TraceLens already provides, with the domain-specific parts owned by the projects that have the domain knowledge.
This pattern keeps TraceLens small. Most "can TraceLens do X?" questions are really "how do I arrange a recipe for X?" — and the answer is usually existing primitives plus a downstream grader, not a new TraceLens feature.
The three roles
Separate who computes the truth, who scores it, and who acts on it:
| Role | Owns | Examples |
|---|---|---|
| Producer | The canonical, deterministic evidence to be judged. TraceLens reads it but never recomputes it. | A backtest result, a recorded transcript, a search-result set, a frozen request/response fixture. |
| Evaluator (TraceLens) | Loading evidence as tasks, running graders, comparing to a baseline, confidence intervals, and the report. | Task, CodeGrader, RegressionDetector, BaselineManager, ReportGenerator. |
| Consumer | Deciding what to do with the report. | A CI gate, a promotion decision, an operator dashboard. |
The boundary matters: if TraceLens starts recomputing domain truth (simulating trades, re-running searches), it stops being a general evaluator and becomes a fork of the producer. Keep evidence as input.
Recipe shape
A recipe wires up these pieces. Only the report schema lives in TraceLens; everything else is owned by the producer, the recipe, or the consumer:
| Piece | Owner | Purpose |
|---|---|---|
| Evidence schema | Producer repo | The canonical facts TraceLens may score but not recompute. |
| Task mapping | Recipe / downstream | Wrap each evidence case into Task.input_data. |
| Deterministic grader | Recipe / downstream | Compute metrics, pass/fail, score, and reason codes. |
| Threshold config | Downstream repo | Promotion / regression thresholds for that project. |
| Report schema | TraceLens | Stable JSON fields for metrics, confidence, decision, reasons. |
| Rollout policy | Consumer repo | How to act on the decision. |
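The task-mapping row can be sketched as follows. This is a minimal illustration, not the TraceLens API: the `Task` dataclass is a stand-in defined here (only `Task.input_data` comes from the table above), and the evidence field names `case_id`, `candidate`, and `baseline` are assumptions.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass(frozen=True)
class Task:
    """Stand-in for the TraceLens Task; only `input_data` is from the docs."""
    task_id: str
    input_data: dict[str, Any] = field(default_factory=dict)


def evidence_to_tasks(cases: list[dict[str, Any]]) -> list[Task]:
    """Wrap each producer evidence case into a Task without recomputing it."""
    tasks = []
    for case in cases:
        # Copy the evidence as-is: the producer owns these facts.
        tasks.append(Task(task_id=str(case["case_id"]), input_data=dict(case)))
    return tasks


# Hypothetical evidence, as a producer might export it.
evidence = [
    {"case_id": "c1", "candidate": 0.62, "baseline": 0.55},
    {"case_id": "c2", "candidate": 0.40, "baseline": 0.48},
]
tasks = evidence_to_tasks(evidence)
```

The mapping does no domain computation. It only repackages what the producer already decided, which keeps the boundary between producer and evaluator intact.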
Prefer a documented recipe plus a stable report schema over a first-class TraceLens subcommand. Add a packaged command only after two downstream projects need the same evaluator — until then, a downstream repo can wrap the generic CLI with its own friendlier command.
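A downstream wrapper of that kind might look like the sketch below. It only assembles the `tracelens run` invocation documented in the Generic CLI section; the function names, the script name `eval_candidate.py`, and the `candidate` argument are hypothetical.

```python
import subprocess
import sys


def build_command(candidate: str) -> list[str]:
    """Assemble the generic `tracelens run` call for one candidate."""
    return [
        "tracelens", "run",
        "--eval-set", "data/eval_set.json",
        "--adapter", "downstream_eval.adapters.EvidenceReplayAdapter",
        "--graders", "downstream_eval.graders.MyDomainGrader",
        "--num-runs", "1",
        "--output", f"reports/eval_{candidate}.json",
    ]


def main(argv: list[str]) -> int:
    # Hypothetical usage: python eval_candidate.py candidate_v2
    if len(argv) != 2:
        print("usage: eval_candidate.py <candidate-name>", file=sys.stderr)
        return 2
    # Hand the CLI's exit code back to the caller unchanged.
    return subprocess.run(build_command(argv[1]), check=False).returncode
```

A downstream script would call `sys.exit(main(sys.argv))`. Because the wrapper adds no evaluation logic of its own, it can be dropped if a packaged command later replaces it.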
Generic CLI
A recipe runs on the existing CLI; the domain logic lives behind the adapter and grader paths:
    tracelens run \
      --eval-set data/eval_set.json \
      --adapter downstream_eval.adapters.EvidenceReplayAdapter \
      --graders downstream_eval.graders.MyDomainGrader \
      --num-runs 1 \
      --output reports/eval_candidate_v2.json
The adapter replays producer evidence into a Transcript; the grader reads
Task.input_data, computes domain metrics, and returns an Outcome. TraceLens
handles the rest.
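The grader half of that description can be sketched as below. `Task` and `Outcome` are stand-ins defined here for illustration, and their field names beyond `input_data`, `metrics`, and `feedback` are assumptions. The real TraceLens classes and the `CodeGrader` interface may differ, and the `max_regression` threshold is hypothetical.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Task:
    """Stand-in for the TraceLens Task (only `input_data` is from the docs)."""
    input_data: dict[str, Any]


@dataclass
class Outcome:
    """Stand-in for the TraceLens Outcome; field names are assumptions."""
    passed: bool
    score: float
    metrics: dict[str, Any] = field(default_factory=dict)
    feedback: str = ""


class MyDomainGrader:
    """Deterministic grader: the same evidence always yields the same Outcome."""

    def __init__(self, max_regression: float = 0.02) -> None:
        # Hypothetical threshold; in practice it comes from downstream config.
        self.max_regression = max_regression

    def grade(self, task: Task) -> Outcome:
        candidate = float(task.input_data["candidate"])
        baseline = float(task.input_data["baseline"])
        delta = candidate - baseline
        # Pass unless the candidate drops by more than the tolerated amount.
        passed = delta >= -self.max_regression
        reason_codes = [] if passed else ["regression_beyond_threshold"]
        return Outcome(
            passed=passed,
            score=candidate,
            metrics={"delta": delta, "reason_codes": reason_codes},
            feedback="ok" if passed else "candidate regressed past threshold",
        )


outcome = MyDomainGrader().grade(Task({"candidate": 0.40, "baseline": 0.48}))
```

The grader reads only what is already in the evidence. It has no randomness, no network calls, and no clock, so repeated runs over the same evidence give the same result.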
A three-way decision, not just pass/fail
For promotion gates, a binary pass/fail is often too blunt. A useful recipe maps its grader output to three decisions and lets the consumer act on each:
- reject: the candidate is worse than baseline beyond the configured thresholds. Block it.
- review_first: directionally promising but with trade-offs that need a human. The default whenever evidence is mixed or the sample is small.
- auto_promote: broadly better, statistically defensible, with enough samples. Only here is unattended promotion appropriate.
Encode these as reason_codes in Outcome.metrics/feedback so the consumer
can act on them without parsing prose. Never auto_promote from a tiny sample —
default to review_first.
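One way to express this mapping is sketched below. The threshold names and values, the reason-code strings, and the choice to reject only when the whole confidence interval lies below the tolerated regression are all assumptions made for illustration. A real recipe would take these from the downstream threshold config.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    """Hypothetical downstream threshold config; all values are illustrative."""
    max_regression: float = 0.02   # tolerated drop versus baseline
    min_improvement: float = 0.01  # lift required before promoting
    min_samples: int = 200         # below this, never auto-promote


def decide(delta: float, ci_low: float, ci_high: float, n: int,
           t: Thresholds = Thresholds()) -> tuple[str, list[str]]:
    """Map a baseline comparison to (decision, reason_codes).

    `delta` is candidate minus baseline; `ci_low` and `ci_high` bound it.
    """
    if ci_high < -t.max_regression:
        # Even the optimistic bound is worse than the allowed regression.
        return "reject", ["regression_beyond_threshold"]
    reasons = []
    if n < t.min_samples:
        reasons.append("sample_too_small")
    if ci_low <= 0.0:
        reasons.append("improvement_not_significant")
    if delta < t.min_improvement:
        reasons.append("improvement_below_threshold")
    if reasons:
        # Any doubt falls back to a human review.
        return "review_first", reasons
    return "auto_promote", ["improved_with_sufficient_evidence"]
```

For instance, `decide(0.05, 0.01, 0.09, 500)` yields `auto_promote`, while the same comparison on 50 samples yields `review_first` with the reason `sample_too_small`. The function has a single path to `auto_promote`, and it is reached only when no doubt was recorded.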
Worked examples
- Strategy gates — compare candidate decisions against canonical producer evidence, then let CI or a rollout system decide whether to block, review, or promote.
- Retrieval — compare a candidate chunking policy against a baseline using canonical search-result evidence and a deterministic relevance grader.
- Prompt routing — compare candidate routing rules against a baseline using frozen request fixtures and outcome labels.
In every case the producer owns domain truth, TraceLens owns evaluation and reporting, and the consumer owns the rollout decision.
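On the consumer side, a CI gate can be as small as the sketch below. The report key names (`decision`, `reasons`, `metrics`, `confidence`) follow the field list in the recipe table but are assumed rather than taken from the actual report schema, and the exit-code mapping is hypothetical.

```python
import json

# Hypothetical mapping from decision to CI exit code.
EXIT_CODES = {"auto_promote": 0, "review_first": 1, "reject": 2}


def gate(report_text: str) -> int:
    """Return a CI exit code for a report given as JSON text."""
    report = json.loads(report_text)
    decision = report.get("decision")
    reasons = report.get("reasons", [])
    print(f"decision={decision} reasons={','.join(reasons)}")
    # Unknown or missing decisions fall back to the review_first code,
    # so nothing is promoted unattended.
    return EXIT_CODES.get(decision, 1)


# Illustrative report; the key names are assumed, not the actual schema.
sample = json.dumps({
    "decision": "review_first",
    "reasons": ["sample_too_small"],
    "metrics": {"delta": 0.05},
    "confidence": {"low": 0.01, "high": 0.09},
})
exit_code = gate(sample)
```

The gate reads only the decision and the reason codes. It never re-derives them from the metrics, which leaves evaluation with TraceLens and the rollout policy with the consumer.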