Baseline Regression Tutorial
This tutorial takes you from a first passing TraceLens eval to a baseline comparison that can block CI. It uses a tiny pass-rate example so the baseline lifecycle is clear before you wire it into a real agent.
What You Will Build
By the end, you will have:
- a first passing eval run and report;
- a stored baseline in JSON;
- a candidate run compared against that baseline;
- an intentional failing candidate with expected regression output;
- a promotion path for approved improvements;
- CI wiring that blocks on moderate or severe regressions.
1. Run A First Passing Eval
Start with the repository hello-world eval:
python examples/hello_world.py
tracelens report --results examples/reports/hello_world_report.json --format markdown
Expected high-level result:
trials run : 9
pass rate : 100%
report json: examples/reports/hello_world_report.json
sample md : examples/reports/hello_world_report.md
The checked-in sample report is
examples/reports/hello_world_report.md.
It shows the pieces CI usually needs: pass rate, pass@k, pass^k, baseline
comparison, regression result, and CI summary.
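The report lists pass@k and pass^k without defining them. As a rough sketch, assuming the commonly used combinatorial estimators (TraceLens may compute them differently), pass@k is the chance that at least one of k sampled trials passes and pass^k is the chance that all k pass. The function names below are hypothetical and not part of the TraceLens API.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Chance that at least one of k trials, drawn without replacement
    from n trials of which c passed, is a pass."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k trials failed.
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_hat_k(n: int, c: int, k: int) -> float:
    """Chance that all k trials, drawn without replacement from n trials
    of which c passed, are passes (one common reading of pass^k)."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    return comb(c, k) / comb(n, k)


# Hello-world case from above: 9 trials, all passing.
print(pass_at_k(9, 9, 3), pass_hat_k(9, 9, 3))  # 1.0 1.0
```

With a 100% pass rate both values are 1.0. They diverge as soon as some trials fail, which is why a report may show both.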
2. Choose Canary Or Capability Baselines
Use the baseline type to encode how cautious promotion should be.
| Baseline type | Use it for | Example | Promotion rule |
|---|---|---|---|
| CANARY | A protected floor that should not drift automatically. | A safety task must keep safety_score >= 0.95. | Manual promotion only after review. |
| CAPABILITY | A moving benchmark for what the agent can currently do. | A math agent currently has pass_rate = 0.92. | Auto-promote only when the new run improves enough and has enough samples. |
Concrete rule of thumb: if a drop means "do not ship," use a canary. If an improvement means "record the new normal," use a capability baseline.
3. Create And Store A Baseline
Create a capability baseline for the hello-world style pass-rate metric and a protected canary for a safety score:
from tracelens.baselines import BaselineManager, PromotionPolicy

manager = BaselineManager("eval/baselines/baselines.json")

# Moving benchmark: may auto-promote when the policy below is satisfied.
manager.create_capability_baseline(
    task_id="math_agent",
    task_name="Math agent pass-rate benchmark",
    metrics={"pass_rate": 0.92},
    metric_stds={"pass_rate": 0.02},
    sample_size=20,
    promotion_policy=PromotionPolicy(
        allow_auto_promotion=True,
        min_improvement_relative=0.02,
        min_samples=20,
    ),
)

# Protected floor: promoted manually, only after review.
manager.create_canary_baseline(
    task_id="safety_contract",
    task_name="Critical safety response contract",
    metrics={"safety_score": 0.95},
    metric_stds={"safety_score": 0.01},
    sample_size=20,
    fingerprint="model-v1-prompt-a-tools-2026-05-24",
)

manager.save()
Commit the generated eval/baselines/baselines.json file. Treat it like a test
fixture: changes should be reviewed, and the PR should explain why the baseline
is being created or promoted.
4. Compare A Passing Candidate
A candidate run is just a list of metric dictionaries. A real integration would
derive these values from TrialBatch outcomes, but the comparison API does not
care where the metrics came from.
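As an illustration of where such a list might come from, the sketch below turns per-batch pass/fail outcomes into metric dictionaries. It uses plain lists of booleans rather than the real TrialBatch type, whose interface is not shown in this tutorial, and pass_rate_metrics is a hypothetical helper.

```python
from typing import Dict, List, Sequence


def pass_rate_metrics(batches: Sequence[Sequence[bool]]) -> List[Dict[str, float]]:
    """Return one {"pass_rate": ...} dict per non-empty batch of outcomes."""
    results: List[Dict[str, float]] = []
    for outcomes in batches:
        if not outcomes:
            continue  # An empty batch has no defined pass rate; skip it.
        results.append({"pass_rate": sum(outcomes) / len(outcomes)})
    return results


# Stand-in data: each inner list is one batch of trial outcomes.
batches = [
    [True, True, True, False],
    [True, True, True, True],
]
print(pass_rate_metrics(batches))
# [{'pass_rate': 0.75}, {'pass_rate': 1.0}]
```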
from tracelens.baselines import BaselineManager, RegressionDetector

manager = BaselineManager("eval/baselines/baselines.json")
baseline = manager.get_baseline("math_agent")

# Each entry holds the metrics from one candidate measurement.
candidate_results = [
    {"pass_rate": 0.93},
    {"pass_rate": 0.94},
    {"pass_rate": 0.92},
    {"pass_rate": 0.93},
]

report = RegressionDetector(min_delta_percent=5.0).compare(
    baseline,
    candidate_results,
)

print(report.to_ci_output())
print(f"should_block_ci={report.should_block_ci()}")
Expected output:
5. Run An Intentional Failing Candidate
Now make the candidate worse by more than the default moderate threshold.
from tracelens.baselines import BaselineManager, RegressionDetector, RegressionSeverity

manager = BaselineManager("eval/baselines/baselines.json")
baseline = manager.get_baseline("math_agent")

# Twenty identical measurements, all well below the 0.92 baseline.
candidate_results = [{"pass_rate": 0.84}] * 20

report = RegressionDetector(min_delta_percent=5.0).compare(
    baseline,
    candidate_results,
)

print(report.to_ci_output())
print(
    "should_block_ci=",
    report.should_block_ci(threshold=RegressionSeverity.MODERATE),
)
Expected regression output:
That is the CI gate: the CI job for a pull request can exit non-zero when
report.should_block_ci(threshold=RegressionSeverity.MODERATE) returns True.
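To see why this candidate trips the gate, it helps to work the arithmetic by hand. The sketch below assumes the detector compares the candidate mean against the baseline value as a relative percentage and that higher is better for pass_rate. That is an assumption about how min_delta_percent is applied, not a description of RegressionDetector internals, which may also use the stored standard deviations. Both function names are hypothetical.

```python
from statistics import mean
from typing import Sequence


def relative_delta_percent(baseline_value: float, candidate_values: Sequence[float]) -> float:
    """Signed change of the candidate mean relative to the baseline, in percent."""
    if baseline_value == 0:
        raise ValueError("relative delta is undefined for a zero baseline")
    return (mean(candidate_values) - baseline_value) / baseline_value * 100.0


def exceeds_min_delta(
    baseline_value: float,
    candidate_values: Sequence[float],
    min_delta_percent: float = 5.0,
    higher_is_better: bool = True,
) -> bool:
    """True when the candidate is worse than the baseline by at least min_delta_percent."""
    delta = relative_delta_percent(baseline_value, candidate_values)
    drop = -delta if higher_is_better else delta
    return drop >= min_delta_percent


delta = relative_delta_percent(0.92, [0.84] * 20)
print(f"{delta:.2f}%")  # -8.70%
print(exceeds_min_delta(0.92, [0.84] * 20))  # True
print(exceeds_min_delta(0.92, [0.93, 0.94, 0.92, 0.93]))  # False
```

Under these assumptions the failing candidate sits about 8.7% below the baseline, past the 5% setting, while the passing candidate from step 4 is slightly above it.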
6. Promote An Approved Improvement
Promotion changes the stored baseline. Do it only after deciding the new result is trustworthy.
from tracelens.baselines import BaselineManager

manager = BaselineManager("eval/baselines/baselines.json")

promoted, reason = manager.try_promote(
    task_id="math_agent",
    current_metrics={"pass_rate": 0.95},
    metric_stds={"pass_rate": 0.01},
    sample_size=25,
    fingerprint="model-v2-prompt-b-tools-2026-05-24",
)

print(promoted, reason)

# Persist the baseline file only when the promotion was accepted.
if promoted:
    manager.save()
Expected output:
For canaries, prefer manual review. CANARY baselines deliberately do not
auto-promote; use force_promote(...) only after a maintainer approves the new
floor and the associated DecisionSpec fingerprint.
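The promotion rules described in this tutorial can be sketched as a small decision function. This is a simplified model, not the logic inside try_promote: PolicySketch and can_auto_promote are hypothetical names, the reason strings are invented for illustration, and the sketch assumes a single higher-is-better metric with a positive baseline value.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class PolicySketch:
    """Hypothetical stand-in mirroring the PromotionPolicy fields used in step 3."""
    allow_auto_promotion: bool = True
    min_improvement_relative: float = 0.02
    min_samples: int = 20


def can_auto_promote(
    baseline_type: str,
    baseline_value: float,
    current_value: float,
    sample_size: int,
    policy: PolicySketch,
) -> Tuple[bool, str]:
    """Return (decision, reason) for a single higher-is-better metric."""
    if baseline_type == "CANARY":
        return False, "canary baselines require manual review"
    if not policy.allow_auto_promotion:
        return False, "auto-promotion is disabled for this baseline"
    if sample_size < policy.min_samples:
        return False, "not enough samples"
    improvement = (current_value - baseline_value) / baseline_value
    if improvement < policy.min_improvement_relative:
        return False, "improvement is below the minimum"
    return True, "improvement and sample size meet the policy"


policy = PolicySketch()
# Step 6 example: 0.92 -> 0.95 is roughly a 3.3% relative improvement on 25 samples.
print(can_auto_promote("CAPABILITY", 0.92, 0.95, 25, policy)[0])  # True
print(can_auto_promote("CANARY", 0.95, 0.99, 25, policy)[0])  # False
```

Checking the baseline type first means a canary never auto-promotes, however large the measured improvement.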
7. Wire The Comparison Into CI
In a downstream project, run your eval harness on pull requests and exit with a failure when the regression report blocks:
import sys

from tracelens.baselines import BaselineManager, RegressionDetector, RegressionSeverity

manager = BaselineManager("eval/baselines/baselines.json")
baseline = manager.get_baseline("math_agent")
candidate_results = load_candidate_results()  # Your eval harness owns this.

report = RegressionDetector(min_delta_percent=5.0).compare(
    baseline,
    candidate_results,
)

print(report.to_ci_output())

# A non-zero exit status fails the CI job.
if report.should_block_ci(threshold=RegressionSeverity.MODERATE):
    sys.exit(1)
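The load_candidate_results() call above is left to your harness. One possible shape, assuming the harness writes a JSON array of metric dictionaries to disk, is sketched below. The default path and the file layout are illustrative assumptions, not a TraceLens convention.

```python
import json
from pathlib import Path
from typing import Dict, List


def load_candidate_results(path: str = "eval/candidate_results.json") -> List[Dict[str, float]]:
    """Read a JSON array of metric dicts, e.g. [{"pass_rate": 0.93}, ...].

    The default path is hypothetical; point it at whatever your harness writes.
    """
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    if not isinstance(data, list) or not data:
        raise ValueError(f"{path} must contain a non-empty JSON array")
    results: List[Dict[str, float]] = []
    for entry in data:
        if not isinstance(entry, dict):
            raise ValueError(f"{path} entries must be JSON objects")
        # Coerce values to float so integer metrics such as 1 are handled too.
        results.append({name: float(value) for name, value in entry.items()})
    return results
```

Failing loudly on an empty or malformed file keeps a broken harness from being mistaken for a run with no regressions.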
Use the GitHub Actions template in
docs/ci-cd-integration.md for
checkout, Python setup, dependency installation, artifacts, and PR comments.
Use the sample report at
examples/reports/hello_world_report.md
as the shape to aim for when posting CI summaries: include the metric, baseline
value, current value, delta, regression result, and blocking decision.
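A summary with those fields could be rendered along the lines of the sketch below. The column layout is an assumption for illustration and is not copied from hello_world_report.md, and format_ci_summary is a hypothetical helper rather than a TraceLens function.

```python
def format_ci_summary(
    metric: str,
    baseline: float,
    current: float,
    regressed: bool,
    blocks_ci: bool,
) -> str:
    """Render a one-row Markdown table suitable for a PR comment."""
    delta = current - baseline
    row = (
        f"| {metric} | {baseline:.3f} | {current:.3f} | {delta:+.3f} "
        f"| {'yes' if regressed else 'no'} | {'yes' if blocks_ci else 'no'} |"
    )
    return "\n".join(
        [
            "| Metric | Baseline | Current | Delta | Regression | Blocks CI |",
            "|---|---|---|---|---|---|",
            row,
        ]
    )


# Values from the failing candidate in step 5.
print(format_ci_summary("pass_rate", 0.92, 0.84, regressed=True, blocks_ci=True))
```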
Baseline Review Checklist
Before committing a baseline change, confirm:
- the baseline type matches the risk: CANARY for protected floors, CAPABILITY for moving benchmarks;
- sample count is high enough for the decision;
- the metric direction is correct;
- the candidate was run under the same relevant DecisionSpec and infrastructure configuration;
- the PR explains whether the change creates, compares, or promotes a baseline;
- CI artifacts link to the report used for the decision.