Evaluation Accuracy Best Practices¶
You ran an eval and got a number. Can you trust it? This page is about the difference between "we ran an eval" and "we trust this eval" — sample sizes, grader calibration, and the pitfalls that produce confident-but-wrong signals.
The Accuracy Problem¶
Agent evaluations suffer from three failure modes:
- False positives — agent passes but shouldn't (grader too lenient).
- False negatives — agent fails but should pass (grader too strict or flaky).
- Non-determinism noise — same input produces different results across runs.
A bad evaluation is worse than no evaluation — it builds false confidence or blocks valid work.
Sample Size Guidelines¶
The single most common accuracy mistake is too few samples. How many runs and tasks you need depends on the scenario:
Runs Per Task¶
| Scenario | Minimum Runs | Recommended |
|---|---|---|
| Deterministic agent + CodeGrader | 1 | 1 |
| Non-deterministic agent | 3 | 5–10 |
| LLMGrader (any agent) | 3 | 5 |
| High-stakes decision | 10 | 20+ |
Rule of thumb: if your confidence-interval width is > 0.1, you need more
runs. Put a CI on every metric with estimate_metric and watch ci_width — the
mechanics are in Statistical Comparison.
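To get a feel for how interval width shrinks with more trials, here is a minimal sketch using the Wilson score interval for a pass rate. It is an illustration only: the helper names `wilson_interval` and `interval_width` are hypothetical, and the interval method used by estimate_metric may differ.

```python
import math


def wilson_interval(passes: int, runs: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a pass rate; z=1.96 targets ~95% coverage."""
    if runs <= 0 or not 0 <= passes <= runs:
        raise ValueError("need runs > 0 and 0 <= passes <= runs")
    p_hat = passes / runs
    denom = 1 + z**2 / runs
    center = (p_hat + z**2 / (2 * runs)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / runs + z**2 / (4 * runs**2)) / denom
    return max(0.0, center - half), min(1.0, center + half)


def interval_width(passes: int, runs: int) -> float:
    """Width of the interval; compare against the 0.1 rule of thumb."""
    low, high = wilson_interval(passes, runs)
    return high - low


# Same observed pass rate (80%), increasing number of trials.
for runs in (5, 20, 100, 400):
    passes = round(0.8 * runs)
    width = interval_width(passes, runs)
    verdict = "ok" if width <= 0.1 else "needs more runs"
    print(f"{passes}/{runs}: width={width:.3f} ({verdict})")
```

At an 80% pass rate the width only drops below 0.1 once the trial count reaches the hundreds, which is why pooling runs across many tasks matters as much as runs per task.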
Tasks in Eval Set¶
| Eval Set Maturity | Minimum Tasks |
|---|---|
| Initial (prototype) | 10 |
| Development | 20–50 |
| Production CI | 50+ |
Start with real failure cases from your agent's history, not synthetic tasks.
Grader Calibration¶
A grader is only as trustworthy as its agreement with reality. Deterministic and LLM graders fail differently.
CodeGrader best practices¶
CodeGraders are deterministic and reproducible. Key design considerations:
- Metric design — choose metrics that capture what matters, not what's easy to measure.
- Threshold selection — set pass thresholds from observed performance, not guesses.
- Boundary testing — test grader behavior at edge cases (empty output,
None, extreme numbers). Grade defensively:
class FinancialGrader(CodeGrader):
    def compute_metrics(self, transcript, task):
        output = transcript.final_output
        # Defensive: handle missing or malformed output
        if not output or "returns" not in output:
            return {"sharpe_ratio": 0.0, "max_drawdown": -1.0}
        ...
LLMGrader best practices¶
LLM judges are non-deterministic and prone to drift. Mitigate with:
- Deterministic decoding — judge at temperature 0 to reduce variability.
- Structured output — request JSON with specific fields, not free-form text.
- Rubric anchoring — define what each score level means concretely.
- Boundary examples — include 1–2 "definitely pass" and "definitely fail" examples in the prompt.
- Score normalization — map LLM scores (1–10) to 0–1 consistently.
def build_grading_prompt(self, transcript, task):
    return """Score this output 1-10 on completeness.
SCORING RUBRIC:
- 1-3: Missing major required elements
- 4-6: Addresses task partially, notable gaps
- 7-8: Addresses all requirements adequately
- 9-10: Exceeds requirements with exceptional detail
EXAMPLES:
- Score 2: Output says "I don't know" (missing all elements)
- Score 9: Detailed plan with specific actions, timelines, resources
Return ONLY: {"score": N, "feedback": "brief explanation"}"""
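The structured-output and score-normalization points can be sketched as a small parser for the judge's reply. This is a hedged illustration: `normalize_judge_score` is a hypothetical helper, not part of LLMGrader, and the linear mapping (1 maps to 0.0, 10 maps to 1.0) is one assumed convention among several reasonable ones.

```python
import json


def normalize_judge_score(raw_reply: str, lo: int = 1, hi: int = 10) -> float:
    """Parse a {"score": N, ...} reply and map N from [lo, hi] onto [0, 1]."""
    try:
        payload = json.loads(raw_reply)
    except json.JSONDecodeError as exc:
        raise ValueError(f"judge reply is not valid JSON: {raw_reply!r}") from exc
    if not isinstance(payload, dict) or "score" not in payload:
        raise ValueError("judge reply has no 'score' field")
    score = payload["score"]
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(score, bool) or not isinstance(score, (int, float)):
        raise ValueError(f"score must be a number, got {score!r}")
    if not lo <= score <= hi:
        raise ValueError(f"score {score} is outside [{lo}, {hi}]")
    # Assumed convention: linear rescale so the lowest score is 0 and the highest is 1.
    return (score - lo) / (hi - lo)


print(normalize_judge_score('{"score": 7, "feedback": "covers all requirements"}'))
```

Raising on malformed replies, rather than silently returning 0, keeps judge failures distinguishable from genuinely low scores.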
Gate vs. score in a CompositeGrader¶
When you combine graders, decide which ones are hard gates and which contribute
to the score (set via GraderConfig's EvalPolicy):
| Grader purpose | Policy | Behavior |
|---|---|---|
| Safety / format validation | GATE | Any violation fails the trial |
| Quality / style scoring | TRACK (or WARN) | Contributes to the weighted score |
Use GATE sparingly — it creates binary signals. See the
Grader Library for the built-in graders and their default
policies, and Evaluating a Real Agent §4 for the gate-plus-judge
pattern.
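The gate-versus-score distinction can be illustrated in plain Python. This is a conceptual sketch, not the CompositeGrader implementation: `GraderResult`, `combine`, and the string policy names are hypothetical stand-ins, and the real GraderConfig and EvalPolicy types may behave differently.

```python
from dataclasses import dataclass


@dataclass
class GraderResult:
    """Hypothetical per-grader outcome for one trial."""
    name: str
    score: float          # normalized to 0-1
    passed: bool
    policy: str           # "GATE", "TRACK", or "WARN"
    weight: float = 1.0


def combine(results: list[GraderResult]) -> dict:
    """Gates decide pass/fail; every other grader feeds the weighted score."""
    gate_failures = [r.name for r in results if r.policy == "GATE" and not r.passed]
    scored = [r for r in results if r.policy != "GATE"]
    total_weight = sum(r.weight for r in scored)
    weighted = (
        sum(r.score * r.weight for r in scored) / total_weight if total_weight else 0.0
    )
    return {"passed": not gate_failures, "score": weighted, "gate_failures": gate_failures}


trial = [
    GraderResult("json_format", score=0.0, passed=False, policy="GATE"),
    GraderResult("completeness", score=0.8, passed=True, policy="TRACK", weight=3.0),
    GraderResult("style", score=0.4, passed=True, policy="TRACK", weight=1.0),
]
print(combine(trial))
```

In this example the trial fails on the format gate even though its weighted quality score is high, which is the binary signal the warning about GATE refers to.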
Calibrating an LLM judge against humans¶
LLM graders drift; the only way to know is to compare them against human grades
on a periodic sample. TraceLens owns the selection and the agreement math
(tracelens sample / reconcile); you bring the human grades. The full
workflow, strategies, and the correlation thresholds to watch are in the
Human-Eval Calibration guide, with a runnable, API-key-free
demo in
examples/human_eval_calibration.py.
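To make "agreement" concrete, the sketch below computes a Pearson correlation between human grades and judge scores on the same transcripts. It only illustrates the kind of statistic involved: the scores are made up, `pearson` is a hypothetical helper, and the agreement measure that tracelens reconcile actually reports may be different. The 0.7 cut-off is the one this page uses for drift.

```python
import math


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation between two equally long score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one list is constant")
    return cov / math.sqrt(var_x * var_y)


# Made-up scores for six transcripts, both on a 0-1 scale.
human_scores = [0.9, 0.2, 0.7, 0.4, 1.0, 0.1]
judge_scores = [0.8, 0.3, 0.6, 0.5, 0.9, 0.2]

agreement = pearson(human_scores, judge_scores)
print(f"correlation={agreement:.2f}, recalibrate={agreement < 0.7}")
```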
Handling Non-Determinism¶
Match run counts to the scenario (at least 3 for a non-deterministic agent, 5 for CI, 10 for benchmarking), and always report both pass@k and pass^k:
gen = ReportGenerator(k_values=[1, 3, 5], consistency_k_values=[2, 3, 5])
report = gen.build_report(batch)
A high pass@k with a low pass^k means the agent is capable but inconsistent — a different problem than an agent that simply can't solve the task. See pass@k vs pass^k.
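For intuition about the two statistics, here is a minimal sketch that estimates both from the number of successful trials on one task. It uses the standard combinatorial estimators (the chance that at least one, or all, of k trials drawn from the n observed ones succeed). Whether ReportGenerator computes them this way is an assumption, and `pass_at_k` and `pass_hat_k` are hypothetical helper names.

```python
from math import comb


def _check(n: int, c: int, k: int) -> None:
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")


def pass_at_k(n: int, c: int, k: int) -> float:
    """Chance that at least one of k trials, drawn from n with c successes, passes."""
    _check(n, c, k)
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_hat_k(n: int, c: int, k: int) -> float:
    """Chance that all k trials, drawn from n with c successes, pass (pass^k)."""
    _check(n, c, k)
    return comb(c, k) / comb(n, k)


# Example: 10 trials on one task, 7 of them passed.
n, c = 10, 7
for k in (1, 3, 5):
    print(f"k={k}: pass@k={pass_at_k(n, c, k):.3f}  pass^k={pass_hat_k(n, c, k):.3f}")
```

With 7 successes out of 10, pass@3 is about 0.99 while pass^3 is about 0.29: the capable-but-inconsistent pattern described above.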
To make sure you're comparing like with like across runs, stamp each run with a
DecisionSpec so a difference is attributable to the agent/prompt/model rather
than the environment — see Reproducibility & DecisionSpec.
Baselines: setting thresholds from variance¶
Set regression thresholds from observed variance, not arbitrary numbers:
- Run your eval suite 5–10 times on the same agent version.
- Compute the standard deviation for each metric.
- Set the threshold at 2× standard deviation — this catches real regressions while tolerating natural variance.
# If observed std of pass_rate is 0.03, ~2x std ≈ 6%:
baseline.add_metric(metric_name="pass_rate", value=0.85, std=0.03, relative_threshold=0.06)
Storing, promoting, and comparing baselines (including canary baselines for safety-critical metrics that must never regress) is covered end to end in the Baseline Regression Tutorial.
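The variance-based threshold from the steps above can be sketched with the standard library. The pass rates here are made up, and `regression_threshold` and `is_regression` are hypothetical helpers. The sketch works with an absolute threshold and does not model how the baseline API interprets its own threshold parameters.

```python
import statistics


def regression_threshold(values: list[float], multiplier: float = 2.0) -> tuple[float, float, float]:
    """Return (mean, sample std, multiplier * std) for repeated runs of one metric."""
    if len(values) < 2:
        raise ValueError("need at least 2 runs to estimate variance")
    mean = statistics.fmean(values)
    std = statistics.stdev(values)  # sample standard deviation (n - 1 denominator)
    return mean, std, multiplier * std


def is_regression(new_value: float, baseline_mean: float, threshold: float) -> bool:
    """Flag a drop larger than the tolerated band below the baseline mean."""
    return new_value < baseline_mean - threshold


# Made-up pass rates from 7 runs of the same agent version.
pass_rates = [0.85, 0.88, 0.82, 0.86, 0.84, 0.87, 0.83]
mean, std, threshold = regression_threshold(pass_rates)
print(f"mean={mean:.3f} std={std:.3f} threshold={threshold:.3f}")
print("regression at 0.78?", is_regression(0.78, mean, threshold))
```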
Common Pitfalls¶
1. Grading paths, not outcomes¶
Wrong: check that the agent used specific tools in a specific order. Right: check that the final output meets requirements, regardless of how it got there.
2. Too few samples¶
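With a non-deterministic agent or an LLMGrader, running 1 trial per task gives a binary signal with no statistical power. Run at least 3; a single run is only enough for a deterministic agent graded by a CodeGrader.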
3. Stale baselines¶
Baselines established months ago may not reflect current expectations. Review stale baselines regularly rather than trusting them indefinitely.
4. LLM grader drift¶
LLM grader behavior changes when the underlying model is updated. After any model
upgrade: re-run calibration with human scores, compare old vs new grading on the
same transcripts, and update prompts if correlation drops below 0.7.
5. Ignoring pass^k¶
High pass@k can mask reliability problems. On a task with pass@5 above 0.99 but pass^5 = 0.20, the agent almost always succeeds at least once in 5 attempts, yet fails 80% of the time when all 5 attempts must succeed. For production use, pass^k often matters more.
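A short worked calculation shows how far apart the two numbers can sit. It assumes independent trials with a fixed per-trial success probability p, which is a simplification introduced here for illustration and not a claim about how the library computes its statistics.

```latex
\begin{align*}
\text{pass@}k &= 1 - (1 - p)^k, &
\text{pass}^k &= p^k \\
p^5 = 0.20 \;&\Rightarrow\; p = 0.20^{1/5} \approx 0.725 \\
\text{pass@}5 &= 1 - (1 - 0.725)^5 \approx 0.998
\end{align*}
```

Under that assumption, an agent that succeeds on roughly 72% of individual trials looks near-perfect on pass@5 while completing five out of five only one time in five.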
6. Overfitting the eval suite¶
If you only add tasks where the agent fails, the suite becomes a regression test, not a capability test. Regularly add new tasks from fresh failure cases and retire tasks that have passed consistently for months.
Evaluation Maturity Model¶
| Level | Description | Characteristics |
|---|---|---|
| 1 — Manual | Ad-hoc spot checking | No automation, no baselines |
| 2 — Basic | Automated eval suite | CodeGrader, num_runs=1, CI output |
| 3 — Statistical | Non-determinism handled | num_runs ≥ 3, pass@k + pass^k, baselines |
| 4 — Calibrated | Human-validated grading | Weekly calibration, LLMGrader correlation > 0.7 |
| 5 — Production | Full pipeline with dashboard | HTML dashboard, regression gating, DecisionSpec tracking, canary baselines |
Most teams should aim for Level 3 initially and progress to Level 4–5 as their agent matures.