API Reference

Auto-generated from docstrings. Everything below is importable from the package root (from tracelens import ...) and covered by the stability policy — submodule paths may move, so import from the root for anything you depend on long-term.

Tip

New here? Read Core Concepts & Glossary first for the mental model, then the User Guide for guided usage. This page is the exhaustive index.

Core models

tracelens.Task

Bases: BaseModel

Represents a single evaluation task/test case.

A Task defines:

  • Input data to feed to the agent
  • Optional expected outputs for validation
  • Metadata for filtering and categorization
  • Configuration for execution

Example

    task = Task(
        name="Portfolio website decomposition",
        input_data={
            "goal": "Build a personal portfolio website",
            "user_context": {"experience": "beginner", "hours_per_week": 15},
        },
        category="programming",
        tags=["web", "beginner"],
    )

matches_filter(tags=None, categories=None, difficulties=None)

Check if task matches filter criteria.

tracelens.TaskLoader

Bases: ABC

Abstract base class for loading tasks from various sources.

load(source) abstractmethod

Load tasks from a source (file, directory, database, etc.).

save(tasks, destination) abstractmethod

Save tasks to a destination.

tracelens.JSONTaskLoader

Bases: TaskLoader

Load tasks from JSON files.

Supports:

  • Single file with {"tasks": [...]} or single task object
  • Directory of JSON files

load(source)

Load tasks from JSON file(s).

save(tasks, destination)

Save tasks to JSON file.

tracelens.EvalSet

Bases: BaseModel

Collection of tasks for evaluation.

An EvalSet groups related tasks together for batch evaluation. It also defines default configuration for all tasks in the set.

Example

    eval_set = EvalSet(
        name="Goal Decomposition Suite v1",
        tasks=tasks,
        default_num_runs=5,
        default_grader_ids=["quality", "personalization"],
    )

filter_tasks(tags=None, categories=None, difficulties=None, max_tasks=None)

Filter tasks by criteria.

filtered_eval_set(tags=None, categories=None, difficulties=None, max_tasks=None)

Return a new EvalSet with only tasks matching the filter criteria.

add_task(task)

Add a task to the set.

remove_task(task_id)

Remove a task by ID. Returns True if found and removed.

get_task(task_id)

Get a task by ID.

tracelens.Transcript

Bases: BaseModel

Complete record of agent execution for a task.

A Transcript captures everything that happened during an agent's execution, providing a complete audit trail for grading and debugging.

Example

    import uuid

    transcript = Transcript(
        transcript_id=str(uuid.uuid4()),
        task_id=task.task_id,
        agent_name="goal_decomposition",
        started_at=utc_now(),
    )
    transcript.steps.append(step)
    transcript.final_output = result
    transcript.completed_at = utc_now()

has_streaming_data property

Whether this transcript contains streaming events.

first_token_latency_ms property

Time to first token event in milliseconds, or None if no streaming.

streaming_duration_ms property

Total streaming duration from first to last event, or None if no streaming.

streaming_token_count property

Total tokens across all streaming TOKEN events.
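The streaming properties above can be pictured with a small standalone sketch. It does not import tracelens: the Event dataclass and its kind, timestamp_ms and token_count fields are hypothetical stand-ins for StreamingEvent, and the library's actual computation may differ.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Event:
    """Hypothetical stand-in for a streaming event (not the tracelens model)."""
    kind: str            # e.g. "token" or "done"
    timestamp_ms: float  # milliseconds since stream start
    token_count: int = 0


def first_token_latency_ms(events: list[Event]) -> Optional[float]:
    """Timestamp of the earliest token event, or None if there is none."""
    token_times = [e.timestamp_ms for e in events if e.kind == "token"]
    return min(token_times) if token_times else None


def streaming_duration_ms(events: list[Event]) -> Optional[float]:
    """Span from the first to the last event, or None without events."""
    if not events:
        return None
    times = [e.timestamp_ms for e in events]
    return max(times) - min(times)


def streaming_token_count(events: list[Event]) -> int:
    """Sum token counts over token events only."""
    return sum(e.token_count for e in events if e.kind == "token")


events = [
    Event("token", 120.0, 1),
    Event("token", 150.0, 2),
    Event("done", 400.0),
]
print(first_token_latency_ms(events))  # 120.0
print(streaming_duration_ms(events))   # 280.0
print(streaming_token_count(events))   # 3
```

Because timestamps are relative to stream start, the first-token latency is simply the earliest token timestamp.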

duration_ms property

Calculate execution duration in milliseconds.

total_tokens property

Calculate total token usage across all steps.

input_tokens property

Calculate total input tokens.

output_tokens property

Calculate total output tokens.

has_errors property

Check if any errors occurred during execution.

llm_calls_count property

Count LLM calls.

tool_calls_count property

Count tool calls.

add_streaming_event(event)

Append a streaming event to the transcript.

add_step(step)

Add a step to the transcript.

get_tool_calls_by_name(name)

Get all tool calls with a specific name.

get_steps_by_type(step_type)

Get all steps of a specific type.

to_dict()

Serialize to a JSON-safe dict with full round-trip fidelity.

from_dict(data) classmethod

Reconstruct a Transcript from a dict produced by to_dict().

to_summary()

Create a summary dict for reporting.

tracelens.TranscriptStep

Bases: BaseModel

A single step in agent execution.

Steps are the atomic units of a transcript. Each step represents one action the agent took (LLM call, tool call, etc.).

is_error property

Check if this step is an error.

tracelens.StreamingEvent

Bases: BaseModel

A single event in a streaming response.

Timestamps are in milliseconds since stream start for precise inter-token latency analysis.

tracelens.StreamingEventType

Bases: StrEnum

Types of streaming events for real-time token delivery.

tracelens.Outcome

Bases: BaseModel

Result of grading a trial.

An Outcome contains:

  • Primary pass/fail determination
  • Normalized score (0-1)
  • Detailed metrics dict (grader-specific)
  • Optional feedback and reasoning

Example

    outcome = Outcome(
        trial_id="...",
        grader_id="quality",
        passed=True,
        score=0.85,
        metrics={"specificity": 0.9, "personalization": 0.8},
        feedback="Tasks are specific but could use more context references",
    )

model_post_init(__context)

Set grade_level from score if not provided.

to_summary_dict()

Return a summary suitable for reporting.

to_ci_dict()

Return a compact dict for CI output.

tracelens.GradeLevel

Bases: str, Enum

Categorical grade levels for human-readable results.

from_score(score) classmethod

Convert a normalized score to a grade level.

tracelens.Trial

Bases: BaseModel

A single execution of a task.

A Trial tracks:

  • Which task is being executed
  • The run index (for pass@k with multiple runs)
  • The execution transcript
  • Grading outcomes from all graders
  • Status and timing

Example

    trial = Trial(
        task_id=task.task_id,
        run_index=0,
        total_runs=5,
    )
    trial.status = TrialStatus.RUNNING
    trial.started_at = utc_now()

    # ... execute agent ...

    trial.transcript = transcript
    trial.status = TrialStatus.COMPLETED
    trial.completed_at = utc_now()

duration_ms property

Calculate trial duration in milliseconds.

passed property

Trial passes if ALL outcomes pass.

aggregate_score property

Average score across all outcomes.

is_complete property

Check if trial has finished (successfully or not).

is_successful property

Check if trial completed without errors.

has_grader_error property

Whether any grader crashed while grading this trial.

Grader crashes are synthesized as failed outcomes so the trial stays conservative (not passed), but they must be counted separately — they measure the eval harness, not the agent.

is_infra_failure property

Whether this trial failed due to infrastructure, not the agent.

Infra failures (OOM kills, network errors, sandbox terminations) should be counted separately from task failures when interpreting scores — otherwise noisy infra inflates the apparent failure rate.

fingerprint property

Get decision spec fingerprint from transcript.

fingerprint_short property

Get short decision spec fingerprint from transcript.

add_outcome(outcome)

Add a grading outcome to this trial.

get_outcome_by_grader(grader_id)

Get outcome from a specific grader.

get_metric(metric_name)

Get a specific metric value from any outcome.

to_summary_dict()

Return a summary suitable for reporting.

to_ci_dict()

Return a compact dict for CI output.

tracelens.TrialBatch

Bases: BaseModel

Collection of trials for batch processing.

Useful for running multiple trials in parallel and aggregating results.

total_count property

Total number of trials.

completed_count property

Number of completed trials.

passed_count property

Number of passed trials.

pass_rate property

Pass rate across all trials.

infra_error_count property

Number of trials that failed due to infrastructure issues.

infra_error_rate property

Fraction of trials that hit infrastructure failures.

Report this alongside pass_rate: Anthropic's infra-noise study found that infra error rates can move by 5+ percentage points purely from resource-configuration changes. A spike in infra_error_rate between two baselines is a strong hint that the regression is noise, not a real capability drop.

total_input_tokens property

Total input tokens across all trial transcripts.

total_output_tokens property

Total output tokens across all trial transcripts.

total_tokens property

Total tokens (input + output) across all trial transcripts.

grader_error_count property

Number of trials where at least one grader crashed.

grader_error_rate property

Fraction of trials affected by grader crashes.

Report this alongside pass_rate: a spike here means the grading harness is broken, not that the agent regressed.

all_complete property

Check if all trials are complete.

add_trial(trial)

Add a trial to the batch.

get_trials_for_task(task_id)

Get all trials for a specific task.

to_dict()

Serialize to a JSON-safe dict with full round-trip fidelity.

from_dict(data) classmethod

Reconstruct a TrialBatch from a dict produced by to_dict().

get_pass_results_by_task()

Get pass/fail results grouped by task.

Returns dict mapping task_id to list of boolean pass results. Useful for computing pass@k.

tracelens.TrialStatus

Bases: str, Enum

Status of a trial.

tracelens.InfraError

Bases: Exception

Raised by adapters when a failure is known to be infrastructural.

Separating infra failures from task-level failures matters because they mean different things for evaluation scores. A pod killed for exceeding its memory limit tells you nothing about the agent's capability — but it does tell you the eval's resource configuration is too tight (see Anthropic's "Quantifying infrastructure noise in agentic coding evals", which measured infra error rates dropping from 5.8% at strict enforcement to 0.5% uncapped).

When the runner catches this exception (or other known-infra exception types like MemoryError, ConnectionError, TimeoutError from the network stack, or OSError), the trial's status is set to TrialStatus.INFRA_ERROR rather than FAILED, and the infra error rate is surfaced separately in reports so you can decide whether a regression is real or a noise artefact.

Example

    import httpx

    class MyAdapter(AgentAdapter):
        async def run(self, task):
            try:
                return await do_work(task)
            except httpx.ConnectError as exc:
                # Re-raise as InfraError so the runner records INFRA_ERROR.
                raise InfraError(f"upstream API unreachable: {exc}") from exc
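The classification rule described above can be sketched without the library. This is an illustration only: the local InfraError class stands in for tracelens.InfraError, and the returned strings are placeholders rather than the actual TrialStatus values.

```python
class InfraError(Exception):
    """Local stand-in for tracelens.InfraError so the sketch runs alone."""


def classify_failure(exc: BaseException) -> str:
    """Return "infra_error" for known-infrastructure exceptions, else "failed"."""
    # ConnectionError and the built-in TimeoutError are subclasses of OSError
    # in Python 3, so listing OSError covers them as well.
    infra_types = (InfraError, MemoryError, OSError)
    return "infra_error" if isinstance(exc, infra_types) else "failed"


print(classify_failure(ConnectionResetError("peer reset")))  # infra_error
print(classify_failure(InfraError("sandbox terminated")))    # infra_error
print(classify_failure(ValueError("agent returned junk")))   # failed
```

Keeping the two categories apart is what lets the infra error rate be reported separately from the pass rate.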

Execution

tracelens.AgentAdapter

Bases: ABC

Abstract base class for agent adapters.

Adapters bridge the evaluation runner to the agent being evaluated. Implement run() to invoke your agent and return a Transcript.

Optionally override setup() and teardown() for lifecycle management. The runner guarantees teardown is called even if run() fails.

Example

    class MyAdapter(AgentAdapter):
        async def setup(self, task: Task) -> None:
            self.db = await create_test_database()

        async def run(self, task: Task) -> Transcript:
            # Start the transcript first so its timing covers the agent call.
            transcript = self.start_transcript(task)
            result = await my_agent.invoke(task.input_data)
            transcript.final_output = result
            transcript.completed_at = utc_now()
            return transcript

        async def teardown(self, task: Task, transcript: Transcript | None) -> None:
            await self.db.cleanup()

setup(task) async

Called before run(). Override for preparation. Default: no-op.

teardown(task, transcript) async

Called after run(), even on failure. Override for cleanup. Default: no-op.
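The teardown guarantee can be illustrated with a try/finally sketch. RecordingAdapter and run_one are hypothetical names for this example, and whether the real runner also tears down after a failed setup() is not stated here, so the sketch leaves setup() outside the guarded block.

```python
import asyncio
from typing import Optional


class RecordingAdapter:
    """Hypothetical adapter that records which lifecycle hooks ran."""

    def __init__(self, fail: bool) -> None:
        self.fail = fail
        self.calls = []

    async def setup(self, task: dict) -> None:
        self.calls.append("setup")

    async def run(self, task: dict) -> str:
        self.calls.append("run")
        if self.fail:
            raise RuntimeError("agent crashed")
        return "ok"

    async def teardown(self, task: dict, transcript: Optional[str]) -> None:
        self.calls.append("teardown")


async def run_one(adapter: RecordingAdapter, task: dict) -> Optional[str]:
    """Call setup, run, teardown; teardown runs even if run() raises."""
    transcript = None
    await adapter.setup(task)
    try:
        transcript = await adapter.run(task)
    except RuntimeError:
        pass  # a real runner would record the failure on the trial
    finally:
        await adapter.teardown(task, transcript)
    return transcript


failing = RecordingAdapter(fail=True)
asyncio.run(run_one(failing, {"goal": "demo"}))
print(failing.calls)  # ['setup', 'run', 'teardown']
```

Note that teardown receives None as the transcript when run() did not return one, which matches the Transcript | None parameter in the signature above.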

run(task) abstractmethod async

Run the agent on a task and return a transcript.

start_transcript(task)

Helper to create a Transcript with timing started.

record_error(transcript, error)

Helper to record an exception in a transcript.

tracelens.SimpleAdapter

Bases: AgentAdapter

Wraps any async callable as an AgentAdapter.

Useful for testing and simple single-shot agents that take input_data and return a result.

Example

    async def my_fn(input_data: dict) -> dict:
        return {"answer": "42"}

    adapter = SimpleAdapter(my_fn)

run(task) async

Invoke the wrapped function and build a transcript.

tracelens.HTTPAPIAdapter

Bases: AgentAdapter

Adapter that invokes agents via HTTP API calls.

Supports authentication, retry with exponential backoff, and customizable request/response handling.

Example

    config = HTTPAdapterConfig(
        base_url="https://api.example.com",
        endpoint="/v1/agent/run",
        auth=AuthConfig(scheme=AuthScheme.BEARER, token="sk-..."),
    )
    adapter = HTTPAPIAdapter(config)

    # Use with EvaluationRunner

build_request_body(task)

Build the HTTP request body from a task. Override to customize.

parse_response_body(data)

Extract the agent's answer from response JSON. Override to customize.

run(task) async

Invoke the HTTP agent and return a transcript.

close() async

Close the underlying httpx client.

teardown(task, transcript) async

No-op per trial — call close() explicitly when done with all trials.

tracelens.HTTPAdapterConfig

Bases: BaseModel

Full configuration for HTTPAPIAdapter.

tracelens.AuthConfig

Bases: BaseModel

Authentication configuration for HTTP requests.

apply_to_headers(headers)

Apply auth config to a headers dict, returning the updated dict.

tracelens.AuthScheme

Bases: StrEnum

Supported authentication schemes.

tracelens.RetryConfig

Bases: BaseModel

Retry configuration with exponential backoff.

tracelens.EvaluationRunner

Runs evaluations with concurrency control and timeout enforcement.

Example

    runner = EvaluationRunner(
        adapter=my_adapter,
        graders=[quality_grader, safety_grader],
        config=RunnerConfig(num_runs=5, max_concurrency=10),
    )
    batch = await runner.run(eval_set)
    print(f"Pass rate: {batch.pass_rate:.2%}")

run(eval_set) async

Run all tasks × runs and grade results.
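Concurrency control with timeout enforcement is commonly built from a semaphore plus a per-job timeout. The sketch below shows that general pattern with asyncio only. It is not the EvaluationRunner implementation, and run_with_limits and its parameters are names made up for the example.

```python
import asyncio


async def run_with_limits(jobs, max_concurrency: int, timeout_s: float):
    """Run coroutine factories with a concurrency cap and per-job timeout.

    Returns one entry per job, in order: the result, or the string "timeout".
    """
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(job):
        async with semaphore:  # at most max_concurrency jobs in flight
            try:
                return await asyncio.wait_for(job(), timeout=timeout_s)
            except asyncio.TimeoutError:
                return "timeout"

    return await asyncio.gather(*(guarded(job) for job in jobs))


async def fast():
    await asyncio.sleep(0.01)
    return "done"


async def slow():
    await asyncio.sleep(1.0)
    return "done"


results = asyncio.run(run_with_limits([fast, slow, fast], 2, 0.2))
print(results)  # ['done', 'timeout', 'done']
```

asyncio.gather preserves input order, so results line up with the submitted jobs even though they finish at different times.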

tracelens.RunnerConfig dataclass

Configuration for the evaluation runner.

Graders — base classes

tracelens.Grader

Bases: ABC

Abstract base class for all graders.

Graders evaluate agent outputs (Transcripts) and produce Outcomes. Subclass either CodeGrader or LLMGrader for specific implementations.

Example

    class MyGrader(CodeGrader):
        def compute_metrics(self, transcript, task):
            return {"accuracy": 0.95}

        def determine_pass(self, metrics, task):
            return metrics["accuracy"] >= 0.9, metrics["accuracy"]

grader_type abstractmethod property

Return the type of this grader.

is_deterministic property

Whether this grader produces deterministic results.

requires_llm property

Whether this grader requires LLM calls.

requires_human property

Whether this grader requires human input.

role property

Role of this grader in composite scoring.

is_must_pass property

Whether this grader must pass for trial to pass.

is_score_contributor property

Whether this grader contributes to score average.

policy property

Three-way eval policy for this grader.

is_gate property

Whether this grader is a gate (fails CI on violation).

is_warn property

Whether this grader is a warning.

is_track property

Whether this grader is tracking-only.

grade(transcript, task) abstractmethod async

Grade the transcript for the given task.

Parameters:

  • transcript (Transcript, required): The agent's execution record
  • task (Task, required): The task being evaluated

Returns:

  • Outcome: An Outcome with pass/fail, score, and metrics

create_outcome(trial_id, passed, score, metrics=None, feedback=None, **kwargs)

Helper to create an Outcome with common fields.

tracelens.CodeGrader

Bases: Grader

Base class for deterministic code-based graders.

CodeGraders compute metrics from the transcript and then determine pass/fail based on those metrics. They are deterministic - the same input always produces the same output.

Use for: objective metrics (Sharpe ratio, accuracy, latency)

Example

    class FinancialGrader(CodeGrader):
        def compute_metrics(self, transcript, task):
            returns = transcript.final_output["returns"]
            return {
                "sharpe_ratio": calculate_sharpe(returns),
                "max_drawdown": calculate_max_dd(returns),
            }

        def determine_pass(self, metrics, task):
            passed = metrics["sharpe_ratio"] >= 1.0
            score = min(metrics["sharpe_ratio"] / 2.0, 1.0)
            return passed, score

compute_metrics(transcript, task) abstractmethod

Compute grading metrics from transcript.

Implement this to extract and calculate metrics from the agent's execution record.

determine_pass(metrics, task) abstractmethod

Determine pass/fail and score from metrics.

Returns:

  • tuple[bool, float]: Tuple of (passed, score) where score is 0-1 normalized.

grade(transcript, task) async

Grade by computing metrics and determining pass/fail.

tracelens.LLMGrader

Bases: Grader

Base class for LLM-as-judge graders.

LLMGraders use an LLM to evaluate agent outputs. They are non-deterministic and require careful prompt engineering.

Use for: subjective quality (specificity, personalization, clarity)

Example

    import json

    class QualityGrader(LLMGrader):
        def build_grading_prompt(self, transcript, task):
            return f'''Evaluate the quality of this output:

            Score 1-10 on: clarity, completeness, accuracy
            Return JSON: {{"score": X, "feedback": "..."}}'''

        def parse_llm_response(self, response, task):
            data = json.loads(response)
            score = data["score"] / 10.0  # normalize the 1-10 rating to 0-1
            passed = score >= 0.7
            return passed, score, {}, data["feedback"]

build_grading_prompt(transcript, task) abstractmethod

Build the prompt for LLM grading.

Implement this to create a prompt that instructs the LLM how to evaluate the agent's output.

parse_llm_response(response, task) abstractmethod

Parse LLM response into structured result.

Returns:

  • tuple[bool, float, dict[str, float], str]: Tuple of (passed, score, metrics, feedback)

grade(transcript, task) async

Grade by calling LLM and parsing response.

Honors GraderConfig: each attempt (LLM call + parse) is bounded by timeout_seconds; transient failures — including malformed responses, which a fresh LLM call often fixes — are retried up to max_retries times with exponential backoff when retry_on_error is set. NotImplementedError (no provider configured) is a setup bug, never retried.
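The retry behaviour described above follows a familiar shape: bound each attempt with a timeout, back off exponentially between attempts, and let setup bugs propagate immediately. The sketch below shows that shape in isolation. call_with_retries, base_delay_s and the other parameter names are illustrative and are not GraderConfig fields.

```python
import asyncio


async def call_with_retries(attempt, max_retries: int, timeout_s: float,
                            base_delay_s: float = 0.01):
    """Retry `attempt` with exponential backoff; never retry setup bugs.

    `attempt` is a zero-argument coroutine function standing in for
    "call the LLM and parse the response".
    """
    for retry in range(max_retries + 1):
        try:
            return await asyncio.wait_for(attempt(), timeout=timeout_s)
        except NotImplementedError:
            raise  # no provider configured: retrying cannot help
        except Exception:
            if retry == max_retries:
                raise  # retries exhausted, surface the last error
            await asyncio.sleep(base_delay_s * 2 ** retry)


calls = {"count": 0}


async def flaky():
    """Fails twice with a parse-style error, then succeeds."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise ValueError("malformed JSON from judge")
    return {"score": 8}


result = asyncio.run(call_with_retries(flaky, max_retries=3, timeout_s=1.0))
print(result, calls["count"])  # {'score': 8} 3
```

Treating a malformed response like any other transient failure means the whole attempt (call plus parse) is repeated, which is why a fresh LLM call gets the chance to produce valid output.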

tracelens.CompositeGrader

Bases: Grader

Combines multiple graders with policy-aware aggregation.

Supports three policy tiers:

  • GATE: Any failure causes the entire trial to fail (safety, constraints)
  • WARN: Failures emit warnings but don't fail by default
  • TRACK: Pure signals, never affect pass/fail

Also supports legacy GraderRole (MUST_PASS maps to GATE behavior).

The score is always a weighted average of all graders regardless of policy.

must_pass_graders property

Get all must-pass graders (legacy + gate).

score_contributor_graders property

Get all score-contributor graders (legacy).

gate_graders property

Get all gate graders.

warn_graders property

Get all warn graders.

track_graders property

Get all track graders.

grade(transcript, task) async

Grade using policy-aware aggregation.

  1. Run all graders and collect outcomes
  2. Compute weighted score from all graders
  3. Only GATE (or legacy MUST_PASS) failures cause overall failure
  4. WARN failures are recorded in feedback
  5. TRACK results are pure signals
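The five steps above can be sketched as a pure function over already-collected results. The Result dataclass, its weight field and the lowercase policy strings are assumptions made for this example. The real CompositeGrader works with Outcome objects and may weight graders differently.

```python
from dataclasses import dataclass


@dataclass
class Result:
    """Hypothetical per-grader result used only in this sketch."""
    grader_id: str
    policy: str      # "gate", "warn", or "track"
    passed: bool
    score: float     # normalized 0-1
    weight: float = 1.0


def aggregate(results: list[Result]) -> tuple[bool, float, list[str]]:
    """Return (passed, weighted_score, warnings) for a non-empty result list."""
    total_weight = sum(r.weight for r in results)
    # Every grader contributes to the score, whatever its policy.
    score = sum(r.score * r.weight for r in results) / total_weight
    # Only gate failures decide the overall verdict.
    passed = all(r.passed for r in results if r.policy == "gate")
    # Warn failures are surfaced but do not flip the verdict.
    warnings = [r.grader_id for r in results
                if r.policy == "warn" and not r.passed]
    return passed, score, warnings


results = [
    Result("schema", "gate", True, 1.0),
    Result("latency", "warn", False, 0.5),
    Result("style", "track", False, 0.0),
]
passed, score, warnings = aggregate(results)
print(passed, score, warnings)  # True 0.5 ['latency']
```

The example passes overall despite two failing graders, because neither is a gate. The score still reflects them.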

tracelens.GraderConfig

Bases: BaseModel

Configuration for a grader.

tracelens.GraderType

Bases: str, Enum

Types of graders.

tracelens.EvalPolicy

Bases: str, Enum

Three-way policy for how a grader's result affects CI.

  • GATE: Fails CI on violation. Use for hard constraints (valid JSON, no PII).
  • WARN: Warns by default; can be configured to fail CI. Use for regressions.
  • TRACK: Never fails CI, just produces signals. Use for quality tracking.

tracelens.GraderRole

Bases: str, Enum

Deprecated: use EvalPolicy instead.

Kept for backward compatibility. MUST_PASS maps to GATE, SCORE_CONTRIBUTOR maps to TRACK.

tracelens.BehaviorContract

Bases: BaseModel

Verifiable contract for agent behavior.

to_graders()

Auto-generate a grader suite from this contract.

Each non-empty contract section produces one grader with an appropriate default policy:

  • output_schema -> JsonSchemaGrader (GATE)
  • output_model -> StructuredOutputGrader (GATE)
  • tools_* -> ToolCallGrader (GATE)
  • max_latency_ms -> LatencyGrader (WARN)
  • max_tokens -> TokenBudgetGrader (WARN)
  • must_include/must_not_include -> ContainsGrader (TRACK)
  • custom_constraints -> ConstraintGrader (GATE)

Graders — built-in library

See the Grader Library guide for when to reach for each.

tracelens.JsonSchemaGrader

Bases: CodeGrader

Validate transcript.final_output against a JSON Schema.

Uses the jsonschema library for full schema validation.

Default policy: GATE (schema violations block CI).

tracelens.RegexMatchGrader

Bases: CodeGrader

Check whether str(transcript.final_output) matches each regex pattern.

Default policy: TRACK.

tracelens.ContainsGrader

Bases: CodeGrader

Check whether str(transcript.final_output) contains required strings and does not contain forbidden strings.

Default policy: TRACK.

tracelens.ConstraintGrader

Bases: CodeGrader

Evaluate a list of heterogeneous constraints against the agent output.

Supported constraint types:

  • must_include: str(output) must contain the value
  • must_not_include: str(output) must not contain the value
  • numeric_range: output[field] must be within [min, max]
  • enum: output[field] must be one of the allowed values

Default policy: GATE.

tracelens.StructuredOutputGrader

Bases: CodeGrader

Validate transcript.final_output by parsing it with a Pydantic model.

The model is loaded at grading time via tracelens.execution.registry.load_class from a dotted path such as "myproject.models.ResponseSchema".

Default policy: GATE.

tracelens.LatencyGrader

Bases: CodeGrader

Check that agent execution completes within a time budget.

Pass if transcript.duration_ms <= max_ms. Score: max(0, 1 - duration/max). Default policy: WARN.
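The pass rule and score formula can be checked with a few lines of arithmetic. budget_score is a hypothetical helper, not part of the library. The same shape applies to any usage-versus-budget check with this formula.

```python
def budget_score(used: float, budget: float) -> tuple[bool, float]:
    """Return (passed, score) for a usage value against a positive budget."""
    passed = used <= budget
    score = max(0.0, 1.0 - used / budget)
    return passed, score


print(budget_score(500, 2000))   # (True, 0.75)
print(budget_score(2000, 2000))  # (True, 0.0)
print(budget_score(3000, 2000))  # (False, 0.0)
```

One consequence of the formula: a run that lands exactly on the budget passes but scores 0.0, so the score rewards headroom rather than mere compliance.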

tracelens.TokenBudgetGrader

Bases: CodeGrader

Check that agent execution stays within a token budget.

Pass if transcript.total_tokens <= max_tokens. Score: max(0, 1 - total/max). Default policy: WARN.

tracelens.ToolCallGrader

Bases: CodeGrader

Validate tool call compliance against required/allowed/forbidden lists.

  • required_tools: all must be called at least once
  • allowed_tools: if provided, only these tools may be called (allowlist)
  • forbidden_tools: none of these may be called

Pass if all required called AND no unauthorized AND no forbidden. Default policy: GATE.
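The three rules reduce to set operations. The sketch below is a standalone illustration with a hypothetical function name. It treats allowed=None as "no allowlist", which is an assumption about how an omitted allowed_tools is handled.

```python
def tool_calls_comply(called, required=(), allowed=None, forbidden=()):
    """Return True if the called tools satisfy all three rule sets."""
    called_set = set(called)
    missing = set(required) - called_set           # required but never called
    unauthorized = (called_set - set(allowed)) if allowed is not None else set()
    banned = called_set & set(forbidden)           # forbidden but called
    return not missing and not unauthorized and not banned


print(tool_calls_comply(["search", "analyze"], required=["search"]))   # True
print(tool_calls_comply(["search"], required=["search", "analyze"]))   # False
print(tool_calls_comply(["search", "shell"], allowed=["search"]))      # False
print(tool_calls_comply(["search", "delete"], forbidden=["delete"]))   # False
```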

tracelens.TraceConsistencyGrader

Bases: CodeGrader

Check agent self-consistency in tool usage and trace patterns.

Metrics:

  • tool_error_rate: fraction of tool calls that returned errors
  • unused_tool_results: tool calls with non-None results that are not followed by any AGENT_OUTPUT step
  • phantom_calls: tools called that are not in expected_tools (if provided)

Pass if tool_error_rate < 0.5 and phantom_calls == 0. Score: 1 - tool_error_rate. Default policy: WARN.

tracelens.EventChainVerifier

Bases: CodeGrader

Verifies expected event sequences in transcripts.

Scans transcript steps to match expected events, checks ordering constraints, and scores based on how many events were found.

Example

    config = EventChainConfig(
        expected_events=[
            EventExpectation(
                event_id="search",
                match_type=EventMatchType.TOOL_NAME,
                tool_name="search",
            ),
            EventExpectation(
                event_id="analyze",
                match_type=EventMatchType.TOOL_NAME,
                tool_name="analyze",
                after=["search"],
            ),
        ],
        ordering=OrderingMode.PARTIAL,
    )
    verifier = EventChainVerifier("chain_check", config)

compute_metrics(transcript, task)

Scan transcript and match against expected events.

Uses greedy first-match: each step is matched against the first unmatched expectation it satisfies. Order expectations carefully when multiple expectations could match the same step.
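Greedy first-match is easy to see on a toy transcript. The sketch below reduces steps to tool names and expectations to (event_id, tool_name) pairs. Both shapes are simplifications invented for the example, and it leaves out the ordering and after constraints entirely.

```python
def greedy_match(steps: list[str], expectations: list[tuple[str, str]]):
    """Match steps to expectations with greedy first-match.

    `steps` are tool names in execution order; `expectations` are
    (event_id, tool_name) pairs. Returns {event_id: step_index}.
    """
    matched: dict[str, int] = {}
    for index, tool in enumerate(steps):
        for event_id, wanted in expectations:
            if event_id not in matched and tool == wanted:
                matched[event_id] = index
                break  # a step satisfies at most one expectation
    return matched


steps = ["search", "search", "analyze"]
expectations = [
    ("first_search", "search"),
    ("second_search", "search"),
    ("analyze", "analyze"),
]
print(greedy_match(steps, expectations))
# {'first_search': 0, 'second_search': 1, 'analyze': 2}
```

Because the first listed expectation claims the first qualifying step, swapping two expectations that match the same tool changes which step each one is bound to. That is why expectation order matters.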

determine_pass(metrics, task)

Determine pass/fail from event matching metrics.

tracelens.EventChainConfig

Bases: BaseModel

Configuration for EventChainVerifier.

tracelens.EventExpectation

Bases: BaseModel

An expected event in the transcript.

tracelens.EventMatchType

Bases: StrEnum

How to match a transcript step against an expectation.

tracelens.OrderingMode

Bases: StrEnum

How to enforce ordering of matched events.

Statistics

See pass@k vs pass^k and Statistical Comparison for the concepts.

tracelens.pass_at_k(n, c, k)

Calculate pass@k metric.

Estimates the probability that at least one of k samples passes, given n total samples with c correct. Uses an unbiased estimator.

Parameters:

  • n (int, required): Total number of samples
  • c (int, required): Number of correct/passing samples
  • k (int, required): Number of samples to consider

Returns:

  • float: Probability of at least one pass in k samples (0.0 to 1.0)

Example

    >>> pass_at_k(10, 7, 5)
    1.0  # Only 3 of 10 samples fail, so any 5 samples include a pass

    >>> pass_at_k(10, 1, 5)
    0.5  # 50% chance at least 1 of 5 passes
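For reference, the standard unbiased pass@k estimator is 1 - C(n - c, k) / C(n, k): one minus the probability that a random draw of k samples contains only failures. The sketch below implements that formula with the standard library. pass_at_k_sketch is a hypothetical name, and the input validation is an assumption, since how the library handles k > n is not documented here.

```python
from math import comb


def pass_at_k_sketch(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: 1 - C(n - c, k) / C(n, k)."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # math.comb returns 0 when k > n - c, so the result is exactly 1.0
    # whenever fewer than k samples failed.
    return 1.0 - comb(n - c, k) / comb(n, k)


print(pass_at_k_sketch(10, 1, 5))  # 0.5
print(pass_at_k_sketch(5, 2, 3))   # 0.9
```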

tracelens.pass_at_k_estimator(results_per_task, k)

Compute pass@k across multiple tasks.

For each task, computes pass@k using available samples. Returns the average pass@k across all tasks.

Parameters:

  • results_per_task (dict[str, list[bool]], required): Dict mapping task_id to list of pass/fail booleans
  • k (int, required): Number of samples to consider

Returns:

  • float: Average pass@k across all tasks

Example

    >>> results = {
    ...     "task1": [True, True, False, True, True],
    ...     "task2": [False, True, False, False, True],
    ... }
    >>> pass_at_k_estimator(results, k=3)
    0.9...  # High probability task1 passes, lower for task2

tracelens.PassAtKAnalyzer

Analyzer for pass@k capability metrics.

Computes pass@k for multiple k values and provides confidence intervals.

Example

    analyzer = PassAtKAnalyzer(k_values=[1, 3, 5, 10])
    results = analyzer.analyze(pass_results_by_task)
    print(results)
    # {"pass@1": 0.6, "pass@3": 0.85, "pass@5": 0.95, "pass@10": 0.99}

__init__(k_values=None)

Initialize analyzer with k values to compute.

Parameters:

  • k_values (list[int] | None, default None): List of k values for pass@k. Default: [1, 5, 10]

analyze(results_per_task)

Compute pass@k for multiple k values.

Parameters:

  • results_per_task (dict[str, list[bool]], required): Dict mapping task_id to list of pass/fail booleans

Returns:

  • dict[str, float]: Dict mapping "pass@k" to computed value

compute_confidence_interval(results_per_task, k, confidence=0.95, n_bootstrap=1000)

Compute bootstrap confidence interval for pass@k.

Uses bootstrap resampling over tasks to estimate the confidence interval for pass@k.

Parameters:

  • results_per_task (dict[str, list[bool]], required): Dict mapping task_id to list of pass/fail booleans
  • k (int, required): k value for pass@k
  • confidence (float, default 0.95): Confidence level (0.95 gives a 95% CI)
  • n_bootstrap (int, default 1000): Number of bootstrap samples

Returns:

  • tuple[float, float]: Tuple of (lower_bound, upper_bound)

analyze_with_ci(results_per_task, confidence=0.95, n_bootstrap=1000)

Compute pass@k with confidence intervals.

Parameters:

  • results_per_task (dict[str, list[bool]], required): Dict mapping task_id to list of pass/fail booleans
  • confidence (float, default 0.95): Confidence level
  • n_bootstrap (int, default 1000): Number of bootstrap samples

Returns:

  • dict[str, dict[str, float]]: Dict mapping "pass@k" to {"value": ..., "lower": ..., "upper": ...}

tracelens.pass_to_k(results, k)

Calculate pass^k (consistency) metric.

Measures the proportion of k-length windows where all samples pass. A sliding window approach is used.

Parameters:

  • results (list[bool], required): List of pass/fail booleans from multiple runs
  • k (int, required): Number of consecutive passes required

Returns:

  • float: Proportion of k-length windows where all samples pass (0.0 to 1.0)

Example

    >>> pass_to_k([True, True, True, True, True], 3)
    1.0  # All windows of 3 pass

    >>> pass_to_k([True, True, False, True, True], 3)
    0.0  # All 3 windows contain the failing run
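The sliding-window definition can be written out directly. The sketch below is a literal reading of "proportion of k-length windows where all samples pass". pass_to_k_sketch is a hypothetical name, and raising on k > len(results) is an assumption about edge-case handling.

```python
def pass_to_k_sketch(results: list[bool], k: int) -> float:
    """Fraction of length-k sliding windows in which every run passed."""
    if k < 1 or k > len(results):
        raise ValueError("need 1 <= k <= len(results)")
    windows = [results[i:i + k] for i in range(len(results) - k + 1)]
    return sum(all(w) for w in windows) / len(windows)


print(pass_to_k_sketch([True] * 5, 3))                       # 1.0
print(pass_to_k_sketch([True, True, True, False, True], 3))  # 0.3333333333333333
```

A list of n results has n - k + 1 windows, and a single failure spoils every window that overlaps it. The position of a failure therefore matters as much as the failure count.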

tracelens.pass_to_k_estimator(results_per_task, k)

Compute pass^k (consistency) across multiple tasks.

For each task with enough samples, computes pass^k. Returns the average pass^k across all eligible tasks.

Parameters:

  • results_per_task (dict[str, list[bool]], required): Dict mapping task_id to list of pass/fail booleans
  • k (int, required): Number of consecutive passes required

Returns:

  • float: Average pass^k across all tasks with >= k samples

Example

    >>> results = {
    ...     "task1": [True, True, True, True, True],
    ...     "task2": [True, True, False, True, True],
    ... }
    >>> pass_to_k_estimator(results, k=3)
    0.5  # task1: 1.0, task2: 0.0 (every window of 3 contains the failure)

tracelens.ConsistencyAnalyzer

Analyzer for pass^k consistency metrics.

Computes pass^k for multiple k values and provides reliability scoring.

Example

    analyzer = ConsistencyAnalyzer(k_values=[2, 3, 5])
    results = analyzer.analyze(pass_results_by_task)
    print(results)
    # {"pass^2": 0.8, "pass^3": 0.6, "pass^5": 0.3}

__init__(k_values=None)

Initialize analyzer with k values to compute.

Parameters:

  • k_values (list[int] | None, default None): List of k values for pass^k. Default: [2, 3, 5]

analyze(results_per_task)

Compute pass^k for multiple k values.

Parameters:

  • results_per_task (dict[str, list[bool]], required): Dict mapping task_id to list of pass/fail booleans

Returns:

  • dict[str, float]: Dict mapping "pass^k" to computed value

compute_reliability_score(results_per_task)

Compute overall reliability score.

Combines pass^k metrics weighted by k to give higher weight to longer consistent runs. A higher score indicates more reliable/consistent performance.

Parameters:

  • results_per_task (dict[str, list[bool]], required): Dict mapping task_id to list of pass/fail booleans

Returns:

  • float: Weighted reliability score (0.0 to 1.0)

compute_stability_metrics(results_per_task)

Compute additional stability metrics.

Returns:

dict[str, float] with:

  • "pass^k" values for each k
  • "reliability_score": weighted combination
  • "failure_rate": proportion of failed trials
  • "longest_streak": average longest passing streak per task

tracelens.bootstrap_ci(values, confidence=0.95, n_bootstrap=10000, statistic='mean', seed=None)

Compute bootstrap confidence interval for a statistic.

Uses percentile bootstrap method for simplicity and robustness.

Parameters:

  • values (list[float] | ndarray, required): Sample values
  • confidence (float, default 0.95): Confidence level (0.95 gives a 95% CI)
  • n_bootstrap (int, default 10000): Number of bootstrap samples
  • statistic (str, default 'mean'): "mean", "median", or "std"
  • seed (int | None, default None): Random seed for reproducibility

Returns:

  • tuple[float, float, float]: Tuple of (point_estimate, lower_bound, upper_bound)
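The percentile bootstrap itself is short enough to sketch with the standard library. The version below handles only the mean and uses a simple index-based percentile. bootstrap_ci_sketch is a hypothetical name, and the library's resampling and percentile details may differ.

```python
import random
import statistics


def bootstrap_ci_sketch(values, confidence=0.95, n_bootstrap=2000, seed=None):
    """Percentile bootstrap CI for the mean: (estimate, lower, upper)."""
    rng = random.Random(seed)
    n = len(values)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        statistics.fmean(rng.choices(values, k=n)) for _ in range(n_bootstrap)
    )
    alpha = 1.0 - confidence
    # Read the interval bounds off the sorted resample means.
    lower = means[int((alpha / 2) * (n_bootstrap - 1))]
    upper = means[int((1 - alpha / 2) * (n_bootstrap - 1))]
    return statistics.fmean(values), lower, upper


scores = [0.7, 0.8, 0.9, 0.6, 0.85, 0.75, 0.8, 0.65]
estimate, low, high = bootstrap_ci_sketch(scores, seed=0)
print(f"mean={estimate:.3f}, CI=({low:.3f}, {high:.3f})")
```

Passing a fixed seed makes the interval reproducible across runs, which matters when CI bounds are compared between baselines.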

tracelens.estimate_metric(values, confidence=0.95, n_bootstrap=10000, seed=None)

Estimate a metric with confidence interval.

Parameters:

  • values (list[float] | ndarray, required): Sample values
  • confidence (float, default 0.95): Confidence level
  • n_bootstrap (int, default 10000): Bootstrap samples
  • seed (int | None, default None): Random seed

Returns:

  • MetricEstimate: MetricEstimate with mean, std, n, and CI

tracelens.compare_metrics(baseline_values, current_values, confidence=0.95, n_bootstrap=10000, compute_p_value=False, seed=None)

Compare current metrics against baseline with statistical inference.

Computes bootstrap CI for the difference and determines statistical significance.

Parameters:

  • baseline_values (list[float] | ndarray, required): Baseline sample values
  • current_values (list[float] | ndarray, required): Current sample values
  • confidence (float, default 0.95): Confidence level for CI
  • n_bootstrap (int, default 10000): Bootstrap samples
  • compute_p_value (bool, default False): Whether to compute permutation p-value
  • seed (int | None, default None): Random seed

Returns:

  • ComparisonResult: ComparisonResult with full statistical analysis
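To make the two ingredients concrete, here is a standalone sketch of a bootstrap CI for the difference in means plus a two-sided permutation p-value. It is an illustration of the general techniques under assumed details (independent resampling of each group, index-based percentiles, add-one p-value smoothing), not the compare_metrics implementation. diff_ci_and_p is a hypothetical name.

```python
import random
import statistics


def diff_ci_and_p(baseline, current, confidence=0.95, n_resamples=2000, seed=0):
    """Return (diff, lower, upper, p_value) for mean(current) - mean(baseline)."""
    rng = random.Random(seed)
    observed = statistics.fmean(current) - statistics.fmean(baseline)

    # Bootstrap: resample each group independently, recompute the difference.
    diffs = sorted(
        statistics.fmean(rng.choices(current, k=len(current)))
        - statistics.fmean(rng.choices(baseline, k=len(baseline)))
        for _ in range(n_resamples)
    )
    alpha = 1.0 - confidence
    lower = diffs[int((alpha / 2) * (n_resamples - 1))]
    upper = diffs[int((1 - alpha / 2) * (n_resamples - 1))]

    # Permutation test: shuffle group labels and count differences at least
    # as extreme as the observed one (two-sided).
    pooled = list(baseline) + list(current)
    extreme = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)
        perm_diff = (statistics.fmean(pooled[len(baseline):])
                     - statistics.fmean(pooled[:len(baseline)]))
        if abs(perm_diff) >= abs(observed):
            extreme += 1
    p_value = (extreme + 1) / (n_resamples + 1)
    return observed, lower, upper, p_value


baseline = [0.82, 0.85, 0.80, 0.88, 0.84, 0.86, 0.83, 0.87]
current = [0.70, 0.72, 0.68, 0.75, 0.71, 0.69, 0.73, 0.74]
diff, low, high, p = diff_ci_and_p(baseline, current)
print(f"diff={diff:.3f}, CI=({low:.3f}, {high:.3f}), p={p:.4f}")
```

With a higher-is-better metric, a confidence interval that lies entirely below zero is the kind of evidence that would mark the change as a regression.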

tracelens.compare_to_baseline_summary(baseline_mean, baseline_std, baseline_n, current_mean, current_std, current_n, confidence=0.95)

Compare metrics when only summary statistics are available.

Uses Welch's t-test approximation for the CI when raw data is not available.

Parameters:

Name Type Description Default
baseline_mean float

Baseline mean

required
baseline_std float

Baseline std deviation

required
baseline_n int

Baseline sample size

required
current_mean float

Current mean

required
current_std float

Current std deviation

required
current_n int

Current sample size

required
confidence float

Confidence level

0.95

Returns:

Type Description
ComparisonResult

ComparisonResult (note: effect size may be less accurate)
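As a rough illustration of a Welch-style interval built from summary statistics only, the sketch below computes the unpooled standard error and the Welch-Satterthwaite degrees of freedom. It is an assumption-laden example, not the library's code: the helper name is hypothetical, and a normal quantile stands in for the t quantile because the standard library has no t distribution (a t quantile, for example from SciPy, would give a slightly wider interval for small samples).

```python
import math
from statistics import NormalDist


def welch_diff_ci(baseline_mean, baseline_std, baseline_n,
                  current_mean, current_std, current_n, confidence=0.95):
    """CI for (current - baseline) from summary statistics (illustrative).

    Requires both sample sizes to be at least 2 and at least one non-zero std.
    """
    var_b = baseline_std ** 2 / baseline_n
    var_c = current_std ** 2 / current_n
    se = math.sqrt(var_b + var_c)
    # Welch-Satterthwaite degrees of freedom for unequal variances.
    df = (var_b + var_c) ** 2 / (
        var_b ** 2 / (baseline_n - 1) + var_c ** 2 / (current_n - 1)
    )
    # Assumption: normal critical value used in place of the t critical value.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    delta = current_mean - baseline_mean
    return delta, delta - z * se, delta + z * se, df


delta, lower, upper, df = welch_diff_ci(0.80, 0.05, 30, 0.74, 0.06, 30)
print(f"delta={delta:.3f}, CI=({lower:.3f}, {upper:.3f}), df={df:.1f}")
```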

tracelens.MetricEstimate dataclass

A metric estimate with uncertainty quantification.

Stores the point estimate along with confidence bounds and sample statistics for research-grade reporting.

se property

Standard error of the mean.

ci_width property

Width of the confidence interval.

to_dict()

Convert to dictionary for serialization.
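The se and ci_width properties follow directly from the stored fields. A minimal stand-in dataclass, assuming field names `ci_lower` and `ci_upper` (the actual MetricEstimate field names are not shown on this page), might look like this:

```python
import math
from dataclasses import asdict, dataclass


@dataclass
class MetricEstimateSketch:
    """Hypothetical stand-in; field names are assumptions, not the tracelens schema."""
    mean: float
    std: float
    n: int
    ci_lower: float
    ci_upper: float

    @property
    def se(self) -> float:
        # Standard error of the mean: sample std divided by sqrt(n).
        return self.std / math.sqrt(self.n) if self.n > 0 else float("nan")

    @property
    def ci_width(self) -> float:
        return self.ci_upper - self.ci_lower

    def to_dict(self) -> dict:
        # Include the derived values so serialized reports carry them too.
        return {**asdict(self), "se": self.se, "ci_width": self.ci_width}


estimate = MetricEstimateSketch(mean=0.70, std=0.04, n=16, ci_lower=0.68, ci_upper=0.72)
print(estimate.to_dict())
```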

tracelens.ComparisonResult dataclass

Result of comparing two metric estimates.

Contains statistical test results for determining if the difference is significant.

is_regression property

Check if this represents a statistically significant regression.

A regression occurs when the difference is significantly negative (for higher-is-better metrics).

is_improvement property

Check if this represents a statistically significant improvement.

effect_magnitude property

Classify effect size magnitude (Cohen's conventions).

to_dict()

Convert to dictionary for serialization.
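For readers unfamiliar with Cohen's conventions, the sketch below computes Cohen's d with a pooled standard deviation and buckets it with the conventional 0.2 / 0.5 / 0.8 cutoffs. The function names and the label strings are assumptions for this example; the labels effect_magnitude actually returns may differ.

```python
import math


def cohens_d(baseline_mean, baseline_std, baseline_n,
             current_mean, current_std, current_n):
    """Standardized mean difference using the pooled standard deviation."""
    pooled_var = (
        (baseline_n - 1) * baseline_std ** 2 + (current_n - 1) * current_std ** 2
    ) / (baseline_n + current_n - 2)
    return (current_mean - baseline_mean) / math.sqrt(pooled_var)


def effect_magnitude(d):
    """Bucket |d| with Cohen's conventional cutoffs (label strings are assumed)."""
    size = abs(d)
    if size < 0.2:
        return "negligible"
    if size < 0.5:
        return "small"
    if size < 0.8:
        return "medium"
    return "large"


d = cohens_d(0.80, 0.05, 30, 0.77, 0.05, 30)
print(f"d={d:.2f}, magnitude={effect_magnitude(d)}")
```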

tracelens.LatencyAnalyzer

Analyzes streaming latency from transcript events.

analyze(transcript)

Compute latency metrics for a single transcript.

analyze_batch(transcripts)

Compute aggregate latency metrics across multiple transcripts.

tracelens.LatencyMetrics

Bases: BaseModel

Latency metrics for a single transcript's streaming data.

tracelens.AggregateLatencyMetrics

Bases: BaseModel

Aggregated latency metrics across multiple transcripts.

Baselines & regression detection

See the Baseline Regression Tutorial.

tracelens.BaselineManager

Manages baseline storage and retrieval.

Baselines are stored in a JSON file and can be versioned with git for tracking changes over time.

Example

manager = BaselineManager("baselines/baselines.json")

# Get existing baseline
baseline = manager.get_baseline("btc_backtest")

# Update baseline
manager.update_baseline("btc_backtest", {"sharpe_ratio": 1.5})

# Save changes
manager.save()

__init__(baselines_path)

Initialize the baseline manager.

Parameters:

Name Type Description Default
baselines_path str | Path

Path to the baselines JSON file

required

save()

Save baselines to JSON file.

get_baseline(task_id)

Get baseline for a task.

Parameters:

Name Type Description Default
task_id str

The task identifier

required

Returns:

Type Description
TaskBaseline | None

TaskBaseline if found, None otherwise

set_baseline(baseline)

Set baseline for a task.

Parameters:

Name Type Description Default
baseline TaskBaseline

The task baseline to store

required

update_baseline(task_id, metrics, metric_stds=None, sample_size=1, keep_thresholds=True)

Update or create a baseline with new metric values.

Parameters:

Name Type Description Default
task_id str

The task identifier

required
metrics dict[str, float]

Dict of metric_name -> value

required
metric_stds dict[str, float] | None

Optional dict of metric_name -> std deviation

None
sample_size int

Number of samples used to compute metrics

1
keep_thresholds bool

Keep existing thresholds when updating

True

Returns:

Type Description
TaskBaseline

The updated TaskBaseline

list_tasks()

List all task IDs with baselines.

compare_to_baseline(task_id, current_metrics)

Compare current metrics to baseline.

Parameters:

Name Type Description Default
task_id str

The task identifier

required
current_metrics dict[str, float]

Dict of metric_name -> current value

required

Returns:

Type Description
dict[str, Any]

Dict of metric_name -> comparison dict with:

  • baseline: baseline value
  • current: current value
  • delta: absolute difference
  • relative_change: relative difference
  • regression: True if regression detected
  • z_score: standard score if std available
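The shape of one per-metric entry can be sketched as follows. This is an illustration under stated assumptions: the helper name, the 5% relative threshold, and the regression rule are hypothetical, and the library's actual thresholds come from the stored MetricBaseline.

```python
def compare_metric(baseline, current, std=None,
                   relative_threshold=0.05, higher_is_better=True):
    """Build one comparison entry; the regression rule here is an assumption."""
    delta = current - baseline
    relative_change = delta / abs(baseline) if baseline != 0 else None
    # Orient the change so that a negative value always means "got worse".
    direction = 1.0 if higher_is_better else -1.0
    if relative_change is not None:
        regression = direction * relative_change < -relative_threshold
    else:
        regression = direction * delta < 0
    # A z-score is only meaningful when a non-zero baseline std is available.
    z_score = delta / std if std else None
    return {
        "baseline": baseline,
        "current": current,
        "delta": delta,
        "relative_change": relative_change,
        "regression": regression,
        "z_score": z_score,
    }


entry = compare_metric(baseline=1.5, current=1.2, std=0.1)
print(entry)
```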

create_canary_baseline(task_id, metrics, fingerprint, metric_stds=None, sample_size=1, task_name=None)

Create a protected canary baseline.

Canary baselines never auto-update and are tied to a specific DecisionSpec fingerprint. Use for safety-critical metrics.

Parameters:

Name Type Description Default
task_id str

The task identifier

required
metrics dict[str, float]

Dict of metric_name -> value

required
fingerprint str

DecisionSpec fingerprint (required for canary)

required
metric_stds dict[str, float] | None

Optional dict of metric_name -> std deviation

None
sample_size int

Number of samples used to compute metrics

1
task_name str | None

Optional human-readable name

None

Returns:

Type Description
TaskBaseline

The created canary baseline

Raises:

Type Description
ValueError

If fingerprint is not provided

create_capability_baseline(task_id, metrics, metric_stds=None, sample_size=1, task_name=None, promotion_policy=None, fingerprint=None)

Create a capability baseline that can auto-update.

Capability baselines track current agent capability and can be promoted when performance improves.

Parameters:

Name Type Description Default
task_id str

The task identifier

required
metrics dict[str, float]

Dict of metric_name -> value

required
metric_stds dict[str, float] | None

Optional dict of metric_name -> std deviation

None
sample_size int

Number of samples used to compute metrics

1
task_name str | None

Optional human-readable name

None
promotion_policy PromotionPolicy | None

Custom promotion policy (default allows auto-promotion)

None
fingerprint str | None

Optional DecisionSpec fingerprint

None

Returns:

Type Description
TaskBaseline

The created capability baseline

try_promote(task_id, current_metrics, metric_stds=None, sample_size=1, fingerprint=None)

Try to promote a baseline if criteria are met.

This method checks if the baseline can be promoted and performs the promotion if allowed.

Parameters:

Name Type Description Default
task_id str

The task identifier

required
current_metrics dict[str, float]

New metric values

required
metric_stds dict[str, float] | None

Optional standard deviations

None
sample_size int

Number of samples

1
fingerprint str | None

New fingerprint for the promoted baseline

None

Returns:

Type Description
tuple[bool, str]

Tuple of (was_promoted, reason)

force_promote(task_id, current_metrics, metric_stds=None, sample_size=1, reason='manual', fingerprint=None)

Force promote a baseline, bypassing policy checks.

Use this for manual promotions or emergency updates. Even canary baselines can be force-promoted.

Parameters:

Name Type Description Default
task_id str

The task identifier

required
current_metrics dict[str, float]

New metric values

required
metric_stds dict[str, float] | None

Optional standard deviations

None
sample_size int

Number of samples

1
reason str

Reason for the forced promotion

'manual'
fingerprint str | None

New fingerprint for the promoted baseline

None

Returns:

Type Description
TaskBaseline

The promoted baseline

Raises:

Type Description
ValueError

If no baseline exists

list_canary_baselines()

List all canary (protected) baseline task IDs.

list_capability_baselines()

List all capability baseline task IDs.

list_stale_baselines()

List all stale baseline task IDs.

get_baseline_summary(task_id)

Get a summary of a baseline for reporting.

compare_to_baseline_with_ci(task_id, current_values, confidence=0.95, n_bootstrap=10000)

Compare current metrics to baseline with bootstrap CI.

Uses statistical inference to determine if differences are significant. This is the research-grade comparison method.

Parameters:

Name Type Description Default
task_id str

The task identifier

required
current_values dict[str, list[float]]

Dict of metric_name -> list of sample values

required
confidence float

Confidence level (default 0.95 for 95% CI)

0.95
n_bootstrap int

Number of bootstrap samples

10000

Returns:

Type Description
dict[str, Any]

Dict of metric_name -> comparison result with:

  • baseline: MetricEstimate for baseline
  • current: MetricEstimate for current
  • delta: point estimate of difference
  • ci_lower, ci_upper: CI bounds for difference
  • is_significant: True if CI doesn't include 0
  • is_regression: True if significant decline
  • cohens_d: Effect size

tracelens.BaselineType

Bases: str, Enum

Type of baseline determining update semantics.

CANARY: Protected baseline that never auto-updates.
  • Represents the absolute performance floor
  • Only manually updated after explicit approval
  • Tied to a specific DecisionSpec fingerprint
  • Use for safety-critical or business-critical metrics

CAPABILITY: Baseline that can auto-update on improvements.
  • Tracks the agent's current capability ceiling
  • Auto-updates when performance improves significantly
  • Maintains version history for rollback
  • Use for tracking progress and catching regressions

Baseline for experimental features.
  • Loose thresholds, high variance expected
  • Auto-updates more aggressively
  • Use during active development

tracelens.PromotionPolicy

Bases: BaseModel

Policy for automatic baseline promotion.

Controls when and how baselines can be automatically updated.

tracelens.MetricBaseline

Bases: BaseModel

Baseline for a single metric.

Stores the expected value, standard deviation, and thresholds for regression detection.

tracelens.TaskBaseline

Bases: BaseModel

Baseline for a complete task.

Groups metric baselines together with metadata about when the baseline was created.

Supports two types of baselines:
  • CANARY: Protected, never auto-updates (safety floor)
  • CAPABILITY: Can auto-update on improvements

Example

# Create a canary baseline (protected)
baseline = TaskBaseline(
    task_id="safety_check",
    baseline_type=BaselineType.CANARY,
    fingerprint="abc123def456",  # Tied to specific config
)

# Create a capability baseline (can auto-update)
baseline = TaskBaseline(
    task_id="quality_score",
    baseline_type=BaselineType.CAPABILITY,
    promotion_policy=PromotionPolicy(min_improvement_relative=0.05),
)

is_canary property

Check if this is a canary (protected) baseline.

allows_auto_promotion property

Check if this baseline allows automatic promotion.

is_stale property

Check if baseline is stale based on max_age_days policy.

in_cooldown property

Check if baseline is in promotion cooldown period.

add_metric(metric_name, value, std=0.0, sample_size=1, absolute_threshold=None, relative_threshold=None, higher_is_better=True)

Add or update a metric baseline.

get_metric(metric_name)

Get a specific metric baseline.

can_promote(current_metrics, sample_size=1)

Check if baseline can be promoted with new metrics.

Parameters:

Name Type Description Default
current_metrics dict[str, float]

New metric values

required
sample_size int

Number of samples used to compute metrics

1

Returns:

Type Description
tuple[bool, str]

Tuple of (can_promote, reason)
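A promotion check of this kind can be pictured as a series of gates that each return a reason on failure. The sketch below is hypothetical: only min_improvement_relative appears on this page, so `min_sample_size`, the order of the checks, and the higher-is-better assumption are all invented for illustration.

```python
def can_promote(baseline_metrics, current_metrics, *, is_canary=False,
                min_improvement_relative=0.05, min_sample_size=1, sample_size=1):
    """Return (allowed, reason); checks and their order are assumptions."""
    if is_canary:
        return False, "canary baselines never auto-promote"
    if sample_size < min_sample_size:
        return False, f"sample_size {sample_size} below minimum {min_sample_size}"
    for name, old in baseline_metrics.items():
        new = current_metrics.get(name)
        if new is None:
            return False, f"missing metric: {name}"
        if old == 0:
            return False, f"cannot compute relative improvement for {name}"
        # Assumes higher is better for every metric.
        improvement = (new - old) / abs(old)
        if improvement < min_improvement_relative:
            return False, (
                f"{name} improved {improvement:.1%}, "
                f"needs {min_improvement_relative:.0%}"
            )
    return True, "all metrics meet the improvement threshold"


allowed, reason = can_promote({"quality": 0.80}, {"quality": 0.86})
print(allowed, reason)
```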

promote(current_metrics, metric_stds=None, sample_size=1, reason='auto', fingerprint=None)

Promote baseline to new values.

Archives current version and updates metrics.

Parameters:

Name Type Description Default
current_metrics dict[str, float]

New metric values

required
metric_stds dict[str, float] | None

Optional standard deviations

None
sample_size int

Number of samples

1
reason str

Reason for promotion

'auto'
fingerprint str | None

Optional new fingerprint

None

tracelens.RegressionDetector

Detects regressions between baseline and current results.

Uses statistical tests to determine if observed differences are significant.

Example

import sys

detector = RegressionDetector(significance_level=0.05)
report = detector.compare(baseline, current_results)

if report.should_block_ci():
    sys.exit(1)

__init__(significance_level=0.05, min_delta_percent=5.0, noise_band_absolute=DEFAULT_NOISE_BAND_ABSOLUTE, noise_band_aware=True)

Initialize the detector.

Parameters:

Name Type Description Default
significance_level float

P-value threshold for significance

0.05
min_delta_percent float

Minimum percentage change to consider

5.0
noise_band_absolute float

Absolute delta below which a regression on a pass-rate-style metric (0-1 scale) is considered "within the infra-noise band" when the baseline and current infra configs don't match. Defaults to 0.03 (3pp), following Anthropic's infra-noise study.

DEFAULT_NOISE_BAND_ABSOLUTE
noise_band_aware bool

If True, compare_with_specs() will mark sub-noise-band regressions as within_noise_band when infra configs differ. Set to False to disable the downgrade (always treat every delta as real).

True

compare(baseline, current_results)

Compare current results against baseline.

Parameters:

Name Type Description Default
baseline TaskBaseline

The baseline to compare against

required
current_results list[dict[str, Any]]

List of result dicts, each with metric values

required

Returns:

Type Description
RegressionReport

RegressionReport with detected regressions

compare_multiple(baselines, current_results)

Compare multiple tasks against their baselines.

Parameters:

Name Type Description Default
baselines dict[str, TaskBaseline]

Dict of task_id -> TaskBaseline

required
current_results dict[str, list[dict[str, Any]]]

Dict of task_id -> list of result dicts

required

Returns:

Type Description
dict[str, RegressionReport]

Dict of task_id -> RegressionReport

compare_with_specs(baseline, current_results, baseline_spec=None, current_spec=None)

Compare with DecisionSpec awareness for infra-noise detection.

Wraps compare() and additionally:

  1. Diffs the two DecisionSpecs' infra sections and records any mismatch in report.infra_config_mismatch and report.infra_config_diff.
  2. For each detected regression, if the absolute delta falls within noise_band_absolute (default 3pp) AND the infra configs don't match, set the regression's within_noise_band flag to True. Those regressions still show up in the report but are excluded from blocking_regressions so a default should_block_ci() call won't gate a merge on ambiguous noise.

When either spec is omitted, this degrades to ordinary compare() behavior with infra_config_mismatch=False.

Parameters:

Name Type Description Default
baseline TaskBaseline

TaskBaseline to compare against.

required
current_results list[dict[str, Any]]

Current run's metric values.

required
baseline_spec DecisionSpec | None

DecisionSpec captured when the baseline was recorded. Optional but enables infra-noise reasoning.

None
current_spec DecisionSpec | None

DecisionSpec for the current run. Optional but enables infra-noise reasoning.

None

Returns:

Type Description
RegressionReport

RegressionReport with infra_config_mismatch, infra_config_diff, and per-regression within_noise_band annotations populated.
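The noise-band downgrade described above can be sketched with plain dataclasses. `RegressionSketch` and both helper functions are hypothetical stand-ins, and the strict "below the band" comparison is an assumption; only the 0.03 default and the within_noise_band flag come from this page.

```python
from dataclasses import dataclass

NOISE_BAND_ABSOLUTE = 0.03  # 3pp default stated in the documentation above


@dataclass
class RegressionSketch:
    """Hypothetical stand-in for a detected regression."""
    metric: str
    delta: float  # current - baseline on a 0-1 scale
    within_noise_band: bool = False


def annotate_noise_band(regressions, infra_config_mismatch,
                        noise_band=NOISE_BAND_ABSOLUTE):
    """Flag sub-noise-band regressions, but only when the infra configs differ."""
    for reg in regressions:
        reg.within_noise_band = infra_config_mismatch and abs(reg.delta) < noise_band
    return regressions


def blocking(regressions):
    """Regressions that still count toward blocking after the downgrade."""
    return [r for r in regressions if not r.within_noise_band]


regs = [RegressionSketch("pass_rate", -0.02), RegressionSketch("pass_rate_hard", -0.08)]
annotate_noise_band(regs, infra_config_mismatch=True)
blocking_metrics = [r.metric for r in blocking(regs)]
print(blocking_metrics)
```

With matching infra configs (infra_config_mismatch=False) nothing is downgraded, so both regressions would remain blocking.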

tracelens.RegressionReport

Bases: BaseModel

Complete regression analysis report.

blocking_regressions property

Regressions that should actually block CI.

Excludes any regression marked within_noise_band — those are within the infra-noise uncertainty and shouldn't gate merges until the eval configuration is matched.

should_block_ci(threshold=RegressionSeverity.MODERATE, ignore_noise_band=True)

Determine if CI should be blocked based on severity.

Parameters:

Name Type Description Default
threshold RegressionSeverity

Minimum severity to block. Default: MODERATE

MODERATE
ignore_noise_band bool

If True (default), within-noise-band regressions don't count — a 2pp drop under a mismatched infra config is ambiguous and shouldn't auto-gate merges per Anthropic's infra-noise guidance. Pass False to treat every regression as blocking regardless of noise.

True

Returns:

Type Description
bool

True if CI should be blocked

to_ci_output()

Generate CI-friendly output.

tracelens.RegressionSeverity

Bases: str, Enum

Severity levels for regressions.

tracelens.MetricRegression

Bases: BaseModel

Detected regression in a specific metric.

Reproducibility (DecisionSpec)

See Reproducibility & DecisionSpec.

tracelens.DecisionSpec

Bases: BaseModel

Complete specification of all decision-affecting parameters.

A DecisionSpec is an immutable fingerprint of everything that could affect agent behavior. Two runs with the same DecisionSpec fingerprint should produce statistically similar results (given the same task input).

Example

spec = DecisionSpec(
    model=ModelConfig(
        provider="anthropic",
        model_id="claude-3-opus-20240229",
        temperature=0.7,
    ),
    prompts=PromptSpec.from_prompts(
        system_prompt="You are a helpful assistant...",
        prompt_template="Given {context}, do {task}...",
    ),
    tools=[
        ToolSpec(name="search", version="1.0"),
        ToolSpec(name="calculator", version="2.1"),
    ],
    agent=AgentSpec(
        agent_name="goal_decomposition",
        agent_version="1.0.0",
    ),
    environment=EnvironmentSpec(
        git_commit="abc123",
        framework_version="0.1.0",
    ),
)
print(spec.fingerprint)  # "a1b2c3d4..."

fingerprint property

Compute SHA-256 fingerprint of the decision spec.

This fingerprint uniquely identifies the configuration. Two runs with the same fingerprint should produce statistically similar results.

fingerprint_short property

Short version of fingerprint (first 12 characters).
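One common way to obtain a stable SHA-256 fingerprint from nested configuration is to hash a canonical JSON encoding. The sketch below assumes that approach; the serialization tracelens actually uses is not documented on this page, and the `fingerprint` helper and sample dicts are hypothetical.

```python
import hashlib
import json


def fingerprint(hash_dict):
    """SHA-256 over a canonical JSON encoding (assumed serialization)."""
    # sort_keys makes the digest independent of dict insertion order.
    canonical = json.dumps(hash_dict, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


spec_a = {"model": {"provider": "anthropic", "temperature": 0.7}, "tools": ["search"]}
spec_b = {"tools": ["search"], "model": {"temperature": 0.7, "provider": "anthropic"}}
spec_c = {"model": {"provider": "anthropic", "temperature": 0.2}, "tools": ["search"]}

fp_a, fp_b, fp_c = fingerprint(spec_a), fingerprint(spec_b), fingerprint(spec_c)
fp_short = fp_a[:12]  # first 12 characters, as in fingerprint_short
print(fp_short, fp_a == fp_b, fp_a == fp_c)
```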

is_compatible_with(other)

Check if two specs are compatible for comparison.

Two specs are compatible if they have the same model and agent, even if prompts or environment differ. This is useful for comparing prompt changes while keeping other factors constant.

diff(other)

Compare two specs and return differences.

Returns dict mapping field names to (self_value, other_value) tuples for fields that differ.
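The return shape can be illustrated on flat dictionaries. This is a simplified sketch: `diff_fields` is a hypothetical helper, and the real method operates on nested spec models whose field naming is not shown here.

```python
_MISSING = object()  # sentinel so a field present on only one side is still reported


def diff_fields(self_fields, other_fields):
    """Map each differing field name to a (self_value, other_value) tuple."""
    differences = {}
    for name in sorted(set(self_fields) | set(other_fields)):
        mine = self_fields.get(name, _MISSING)
        theirs = other_fields.get(name, _MISSING)
        if mine != theirs:
            differences[name] = (
                None if mine is _MISSING else mine,
                None if theirs is _MISSING else theirs,
            )
    return differences


current = {"model_id": "m-1", "temperature": 0.7, "git_commit": "abc123"}
other = {"model_id": "m-1", "temperature": 0.2, "git_commit": "def456"}
changes = diff_fields(current, other)
print(changes)
```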

to_summary()

Create human-readable summary.

tracelens.ModelConfig

Bases: BaseModel

Configuration for the LLM model used.

Captures provider, model version, and all decoding parameters that could affect output.

to_hash_dict()

Return dict of fields that affect output (for hashing).

tracelens.PromptSpec

Bases: BaseModel

Specification of prompts used in the agent.

Stores hashes of prompt templates for traceability without storing the full prompts (which may be long or sensitive).

from_prompts(system_prompt=None, prompt_template=None, prompt_version=None, store_full_prompts=False) classmethod

Create PromptSpec from actual prompts.

Parameters:

Name Type Description Default
system_prompt str | None

The system prompt text

None
prompt_template str | None

The prompt template text

None
prompt_version str | None

Optional version identifier

None
store_full_prompts bool

Whether to store full prompts (default False)

False

to_hash_dict()

Return dict of fields for hashing.

tracelens.ToolSpec

Bases: BaseModel

Specification of a tool available to the agent.

Captures tool identity and version for reproducibility.

to_hash_dict()

Return dict for hashing.

tracelens.AgentSpec

Bases: BaseModel

Specification of the agent being evaluated.

Captures agent identity, version, and graph structure.

to_hash_dict()

Return dict for hashing.

tracelens.InfraConfig

Bases: BaseModel

Infrastructure / runtime-environment configuration.

Agentic evals are end-to-end system tests: the runtime environment is part of the problem-solving process. Resource limits, time budgets, and concurrency all influence what strategies an agent can use, which means infrastructure configuration is a first-class experimental variable — not passive scaffolding.

Anthropic's "Quantifying infrastructure noise in agentic coding evals" (Feb 2026) showed that infrastructure config alone can swing Terminal-Bench 2.0 scores by ~6 percentage points, often more than the leaderboard gap between frontier models. Their recommendations are baked into the shape of this spec:

  1. Record both a guaranteed allocation and a separate hard kill threshold, per task (see cpu_guaranteed / cpu_hard_limit and memory_guaranteed_mb / memory_hard_limit_mb). Pinning them to the same value leaves zero headroom for transient spikes and produces spurious infra failures.
  2. Capture the sandboxing provider, because enforcement semantics differ across runtimes (Kubernetes vs. Docker vs. Fly.io vs. bare containers).
  3. Keep observational fields (hostname, container_id, wall_clock_start_utc) out of the fingerprint so that two runs with identical resource configs on different hosts produce the same fingerprint.

See: https://www.anthropic.com/engineering/infrastructure-noise

to_hash_dict()

Return dict of behavior-affecting fields for hashing.

Observational fields (hostname, container_id, wall_clock_start_utc) are intentionally excluded: two runs with identical resource configs on different hosts should produce the same fingerprint.
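The effect of excluding observational fields can be shown with a small stand-in class. The resource and observational field names below are taken from this page; `sandbox_provider`, the class name, and the `digest` helper are assumptions for illustration.

```python
import hashlib
import json
from dataclasses import asdict, dataclass

OBSERVATIONAL_FIELDS = {"hostname", "container_id", "wall_clock_start_utc"}


@dataclass(frozen=True)
class InfraConfigSketch:
    """Hypothetical stand-in for InfraConfig."""
    cpu_guaranteed: float
    cpu_hard_limit: float
    memory_guaranteed_mb: int
    memory_hard_limit_mb: int
    sandbox_provider: str  # assumed field name
    hostname: str = ""
    container_id: str = ""
    wall_clock_start_utc: str = ""

    def to_hash_dict(self):
        # Keep only the fields that can change agent behavior.
        return {k: v for k, v in asdict(self).items() if k not in OBSERVATIONAL_FIELDS}


def digest(config):
    payload = json.dumps(config.to_hash_dict(), sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


run_a = InfraConfigSketch(2.0, 4.0, 4096, 8192, "docker", hostname="node-1")
run_b = InfraConfigSketch(2.0, 4.0, 4096, 8192, "docker", hostname="node-7")
run_c = InfraConfigSketch(2.0, 2.0, 4096, 4096, "docker", hostname="node-1")
print(digest(run_a) == digest(run_b), digest(run_a) == digest(run_c))
```

Here run_a and run_b differ only by host and share a digest, while run_c pins the hard limits to the guaranteed allocation and therefore hashes differently.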

tracelens.EnvironmentSpec

Bases: BaseModel

Specification of the execution environment.

Captures build/deployment information for traceability.

to_hash_dict()

Return dict for hashing.

LLM judge providers

tracelens.LLMProvider

Bases: ABC

Abstract base class for LLM providers.

complete(prompt, **kwargs) abstractmethod async

Send a prompt to the LLM and return the text response.

tracelens.InMemoryProvider

Bases: LLMProvider

Testing provider that returns canned responses.

Cycles through the provided responses list and records all prompts.

tracelens.create_provider(model_or_alias, **kwargs)

Create an LLM provider from an alias.

Parameters:

Name Type Description Default
model_or_alias str

Currently only "in-memory" is supported. For real provider calls, subclass LLMProvider directly — tracelens no longer ships a built-in LiteLLM/OpenAI/Anthropic wrapper (see module docstring for the canonical pattern).

required
**kwargs Any

Passed to the provider constructor. For "in-memory", use responses=[...] to seed canned responses.

{}

Returns:

Type Description
LLMProvider

An LLMProvider instance.

Raises:

Type Description
ValueError

If model_or_alias is anything other than "in-memory". The error message points at the subclassing pattern so callers know what to do next.
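A subclass-based provider might look roughly like the sketch below. It defines its own abstract base so the example runs without tracelens installed; `LLMProviderSketch` and `CannedProvider` are hypothetical names, and this is not the canonical pattern from the module docstring, which is not reproduced on this page. A real provider would call its SDK inside complete() instead of returning canned text.

```python
import asyncio
from abc import ABC, abstractmethod


class LLMProviderSketch(ABC):
    """Stand-in for the abstract interface described above."""

    @abstractmethod
    async def complete(self, prompt: str, **kwargs) -> str:
        ...


class CannedProvider(LLMProviderSketch):
    """Cycles through canned responses and records every prompt it receives."""

    def __init__(self, responses):
        if not responses:
            raise ValueError("responses must be non-empty")
        self.responses = list(responses)
        self.prompts = []

    async def complete(self, prompt: str, **kwargs) -> str:
        # Pick the next response in round-robin order, then record the prompt.
        response = self.responses[len(self.prompts) % len(self.responses)]
        self.prompts.append(prompt)
        return response


async def _demo():
    provider = CannedProvider(["PASS", "FAIL"])
    outputs = [await provider.complete(f"grade #{i}") for i in range(3)]
    return outputs, provider.prompts


outputs, prompts = asyncio.run(_demo())
print(outputs)
```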

Reporting

tracelens.ReportGenerator

Generates evaluation reports from TrialBatch results.

Example

gen = ReportGenerator()
report = gen.build_report(batch)
print(gen.render_markdown(report))

build_report(batch, baseline_manager=None)

Build a ReportData from a TrialBatch.

render_markdown(report)

Render a human-readable markdown report.

render_ci_summary(report)

Render a compact CI-friendly summary.

render_html(report)

Render a self-contained HTML dashboard with inline CSS and SVG charts.

tracelens.ReportData dataclass

Complete evaluation report.

tracelens.TaskSummary dataclass

Per-task summary statistics.