Baseline Regression Tutorial

This tutorial takes you from a first passing TraceLens eval to a baseline comparison that can block CI. It uses a tiny pass-rate example so the baseline lifecycle is clear before you wire it into a real agent.

What You Will Build

By the end, you will have:

a first passing eval run and report;
a stored baseline in JSON;
a candidate run compared against that baseline;
an intentional failing candidate with expected regression output;
a promotion path for approved improvements;
CI wiring that blocks on moderate or severe regressions.

1. Run A First Passing Eval

Start with the repository hello-world eval:

python examples/hello_world.py
tracelens report --results examples/reports/hello_world_report.json --format markdown

Expected high-level result:

trials run : 9
pass rate  : 100%
report json: examples/reports/hello_world_report.json
sample md  : examples/reports/hello_world_report.md

The checked-in sample report is examples/reports/hello_world_report.md. It shows the pieces CI usually needs: pass rate, pass@k, pass^k, baseline comparison, regression result, and CI summary.
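The report lists pass rate alongside pass@k and pass^k. As a rough illustration of how these three numbers differ, here is a minimal sketch using the commonly cited combinatorial estimators. It assumes `n` trials of one task with `c` passes; the function names are hypothetical, and TraceLens may compute these metrics differently.

```python
from math import comb


def pass_rate(outcomes):
    """Fraction of trials that passed."""
    return sum(outcomes) / len(outcomes)


def pass_at_k(n, c, k):
    """Chance that at least one of k trials, drawn without replacement
    from n trials with c passes, is a pass."""
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_hat_k(n, c, k):
    """Chance that all k drawn trials pass (the stricter pass^k view)."""
    return comb(c, k) / comb(n, k)


# Hypothetical run: 9 trials, 7 of them passing.
outcomes = [True] * 7 + [False] * 2
n, c = len(outcomes), sum(outcomes)

print(f"pass rate: {pass_rate(outcomes):.3f}")
print(f"pass@3   : {pass_at_k(n, c, 3):.3f}")
print(f"pass^3   : {pass_hat_k(n, c, 3):.3f}")
```

With only two failures among nine trials, any three trials must include a pass, so pass@3 is 1.0, while pass^3 is much lower because it requires all three to pass. When every trial passes, as in the hello-world run, all three metrics are 1.0.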

2. Choose Canary Or Capability Baselines

Use the baseline type to encode how cautious promotion should be.

| Baseline type | Use it for | Example | Promotion rule |
| --- | --- | --- | --- |
| `CANARY` | A protected floor that should not drift automatically. | A safety task must keep `safety_score >= 0.95`. | Manual promotion only after review. |
| `CAPABILITY` | A moving benchmark for what the agent can currently do. | A math agent currently has `pass_rate = 0.92`. | Auto-promote only when the new run improves enough and has enough samples. |

Concrete rule of thumb: if a drop means "do not ship," use a canary. If an improvement means "record the new normal," use a capability baseline.

3. Create And Store A Baseline

Create a capability baseline for the hello-world-style pass-rate metric and a protected canary baseline for a safety score:

from tracelens.baselines import BaselineManager, PromotionPolicy

manager = BaselineManager("eval/baselines/baselines.json")

manager.create_capability_baseline(
    task_id="math_agent",
    task_name="Math agent pass-rate benchmark",
    metrics={"pass_rate": 0.92},
    metric_stds={"pass_rate": 0.02},
    sample_size=20,
    promotion_policy=PromotionPolicy(
        allow_auto_promotion=True,
        min_improvement_relative=0.02,
        min_samples=20,
    ),
)

manager.create_canary_baseline(
    task_id="safety_contract",
    task_name="Critical safety response contract",
    metrics={"safety_score": 0.95},
    metric_stds={"safety_score": 0.01},
    sample_size=20,
    fingerprint="model-v1-prompt-a-tools-2026-05-24",
)

manager.save()

Commit the generated eval/baselines/baselines.json file. Treat it like a test fixture: changes should be reviewed, and the PR should explain why the baseline is being created or promoted.

4. Compare A Passing Candidate

A candidate run is just a list of metric dictionaries. A real integration would derive these values from TrialBatch outcomes, but the comparison API does not care where the metrics came from.

from tracelens.baselines import BaselineManager, RegressionDetector

manager = BaselineManager("eval/baselines/baselines.json")
baseline = manager.get_baseline("math_agent")

candidate_results = [
    {"pass_rate": 0.93},
    {"pass_rate": 0.94},
    {"pass_rate": 0.92},
    {"pass_rate": 0.93},
]

report = RegressionDetector(min_delta_percent=5.0).compare(
    baseline,
    candidate_results,
)

print(report.to_ci_output())
print(f"should_block_ci={report.should_block_ci()}")

Expected output:

No regressions detected
should_block_ci=False
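The candidate list in this section is hand-written. As a sketch of where such a list might come from, the snippet below turns groups of pass/fail outcomes into one metric dictionary per group. It does not use the real TrialBatch API: plain lists of booleans stand in for trial outcomes, and `batches_to_candidate_results` is a hypothetical helper.

```python
def batches_to_candidate_results(batches):
    """Turn groups of pass/fail outcomes into one metric dict per group."""
    results = []
    for outcomes in batches:
        if not outcomes:
            raise ValueError("each batch needs at least one trial")
        # True counts as 1 and False as 0, so sum() is the number of passes.
        results.append({"pass_rate": sum(outcomes) / len(outcomes)})
    return results


# Hypothetical outcomes: two batches of ten trials each.
batches = [
    [True] * 9 + [False],
    [True] * 10,
]

candidate_results = batches_to_candidate_results(batches)
print(candidate_results)
```

The resulting list has the same shape as `candidate_results` above, so it could be handed to the comparison call unchanged.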

5. Run An Intentional Failing Candidate

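Now make the candidate clearly worse. Dropping from 0.92 to 0.84 is an 8.7% relative decrease, which is past the 5% minimum delta configured here and is reported as a moderate regression.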

from tracelens.baselines import BaselineManager, RegressionDetector, RegressionSeverity

manager = BaselineManager("eval/baselines/baselines.json")
baseline = manager.get_baseline("math_agent")

candidate_results = [{"pass_rate": 0.84}] * 20

report = RegressionDetector(min_delta_percent=5.0).compare(
    baseline,
    candidate_results,
)

print(report.to_ci_output())
print(
    "should_block_ci=",
    report.should_block_ci(threshold=RegressionSeverity.MODERATE),
)

Expected regression output:

REGRESSION DETECTED [MODERATE]

  pass_rate: 0.9200 -> 0.8400 (-8.7%)
should_block_ci= True

That is the CI gate: the CI job for a pull request can exit non-zero when report.should_block_ci(threshold=RegressionSeverity.MODERATE) returns True.
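The percentage in the regression output can be reproduced by hand. The sketch below assumes the detector compares the candidate mean with the stored baseline value and reports the relative change; the real RegressionDetector may also take `metric_stds` or sample size into account, and `relative_delta_percent` is a hypothetical helper.

```python
def relative_delta_percent(baseline_value, candidate_value):
    """Relative change from baseline to candidate, in percent."""
    return (candidate_value - baseline_value) / baseline_value * 100.0


baseline_value = 0.92
candidate_values = [0.84] * 20

candidate_mean = sum(candidate_values) / len(candidate_values)
delta = relative_delta_percent(baseline_value, candidate_mean)

print(f"pass_rate: {baseline_value:.4f} -> {candidate_mean:.4f} ({delta:+.1f}%)")
print(f"larger than the 5% minimum delta: {abs(delta) > 5.0}")
```

The delta is relative to the baseline, not an absolute difference: the 0.08 drop is divided by 0.92, which gives about -8.7% rather than -8.0%.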

6. Promote An Approved Improvement

Promotion changes the stored baseline. Do it only after deciding the new result is trustworthy.

from tracelens.baselines import BaselineManager

manager = BaselineManager("eval/baselines/baselines.json")

promoted, reason = manager.try_promote(
    task_id="math_agent",
    current_metrics={"pass_rate": 0.95},
    metric_stds={"pass_rate": 0.01},
    sample_size=25,
    fingerprint="model-v2-prompt-b-tools-2026-05-24",
)

print(promoted, reason)

if promoted:
    manager.save()

Expected output:

True Baseline promoted successfully
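To see why this promotion is plausible under the policy from step 3, the check can be worked through by hand. This is a sketch under the assumption that auto-promotion requires both a relative improvement of at least `min_improvement_relative` and at least `min_samples` samples; `would_auto_promote` is a hypothetical helper, not the logic inside `try_promote`.

```python
def would_auto_promote(
    baseline_value,
    current_value,
    sample_size,
    min_improvement_relative=0.02,
    min_samples=20,
):
    """Return (decision, relative_improvement) for a higher-is-better metric."""
    improvement = (current_value - baseline_value) / baseline_value
    enough_gain = improvement >= min_improvement_relative
    enough_samples = sample_size >= min_samples
    return enough_gain and enough_samples, improvement


ok, improvement = would_auto_promote(0.92, 0.95, sample_size=25)
print(ok, f"{improvement:.2%}")
```

Moving from 0.92 to 0.95 is a relative improvement of roughly 3.3%, above the 2% minimum, and 25 samples meets the 20-sample minimum. A run at 0.93, or one with only 10 samples, would not pass this sketch of the check.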

CANARY baselines deliberately do not auto-promote. Use force_promote(...) only after a maintainer has reviewed and approved both the new floor and the associated DecisionSpec fingerprint.

7. Wire The Comparison Into CI

In a downstream project, run your eval harness on pull requests and exit with a failure when the regression report blocks:

import sys

from tracelens.baselines import BaselineManager, RegressionDetector, RegressionSeverity

manager = BaselineManager("eval/baselines/baselines.json")
baseline = manager.get_baseline("math_agent")
candidate_results = load_candidate_results()  # Your eval harness owns this.

report = RegressionDetector(min_delta_percent=5.0).compare(
    baseline,
    candidate_results,
)

print(report.to_ci_output())

if report.should_block_ci(threshold=RegressionSeverity.MODERATE):
    sys.exit(1)
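The script above leaves `load_candidate_results()` to your eval harness. One possible shape, shown only as a sketch, reads a JSON file that the harness wrote earlier in the CI job. The default path and the file layout (a JSON array of metric dictionaries) are assumptions for illustration, not something TraceLens defines.

```python
import json
from pathlib import Path


def load_candidate_results(path="eval/results/candidate.json"):
    """Read a JSON array of metric dicts, e.g. [{"pass_rate": 0.93}, ...].

    The path and layout are hypothetical; adapt them to your harness.
    """
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    if not isinstance(data, list) or not data:
        raise ValueError(f"{path} must contain a non-empty JSON array")
    for row in data:
        if not isinstance(row, dict):
            raise ValueError(f"{path} must contain only JSON objects")
    return data
```

Failing loudly on an empty or malformed file matters in CI: a silent empty list could otherwise look like a run with no regressions.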

Use the GitHub Actions template in docs/ci-cd-integration.md for checkout, Python setup, dependency installation, artifacts, and PR comments.

Use the sample report at examples/reports/hello_world_report.md as the shape to aim for when posting CI summaries: include the metric, baseline value, current value, delta, regression result, and blocking decision.

Baseline Review Checklist

Before committing a baseline change, confirm:

the baseline type matches the risk: CANARY for protected floors, CAPABILITY for moving benchmarks;
sample count is high enough for the decision;
the metric direction is correct;
the candidate was run under the same DecisionSpec and infrastructure configuration as the baseline;
the PR explains whether the change creates, compares, or promotes a baseline;
CI artifacts link to the report used for the decision.
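For the sample-count item in the checklist, a quick back-of-the-envelope estimate can help. The sketch below uses the binomial standard error of a pass rate, assuming each trial is an independent pass/fail with the same underlying pass probability; this is a general statistical approximation, not a TraceLens feature.

```python
from math import sqrt


def pass_rate_standard_error(pass_rate, sample_size):
    """Binomial standard error of an observed pass rate."""
    return sqrt(pass_rate * (1.0 - pass_rate) / sample_size)


for n in (20, 100, 500):
    se = pass_rate_standard_error(0.92, n)
    print(f"n={n:>3}  standard error ~ {se:.3f}")
```

Under these assumptions, a pass rate of 0.92 measured on 20 trials has a standard error of about 0.06, so small differences between two such runs are hard to tell apart from noise. Larger sample counts shrink that uncertainty with the square root of the sample size.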