API Reference¶

Auto-generated from docstrings. Everything below is importable from the package root (`from tracelens import ...`) and covered by the stability policy. Submodule paths may move, so import from the root for anything you depend on long-term.

Tip

New here? Read Core Concepts & Glossary first for the mental model, then the User Guide for guided usage. This page is the exhaustive index.

Core models¶

`tracelens.Task` ¶

Bases: BaseModel

Represents a single evaluation task/test case.

A Task defines:

- Input data to feed to the agent
- Optional expected outputs for validation
- Metadata for filtering and categorization
- Configuration for execution

Example

    task = Task(
        name="Portfolio website decomposition",
        input_data={
            "goal": "Build a personal portfolio website",
            "user_context": {"experience": "beginner", "hours_per_week": 15},
        },
        category="programming",
        tags=["web", "beginner"],
    )

`matches_filter(tags=None, categories=None, difficulties=None)` ¶

Check if task matches filter criteria.

`tracelens.TaskLoader` ¶

Bases: ABC

Abstract base class for loading tasks from various sources.

`load(source)` `abstractmethod` ¶

Load tasks from a source (file, directory, database, etc.).

`save(tasks, destination)` `abstractmethod` ¶

Save tasks to a destination.

`tracelens.JSONTaskLoader` ¶

Bases: TaskLoader

Load tasks from JSON files.

Supports:

- A single file containing {"tasks": [...]} or a single task object
- A directory of JSON files
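The two accepted layouts can be sketched with the standard library alone. This is a minimal illustration, not the loader's actual code: `load_task_dicts` is a hypothetical helper that returns plain dicts rather than `Task` objects, and the non-recursive, sorted directory scan is an assumption.

```python
import json
from pathlib import Path


def load_task_dicts(source: str) -> list[dict]:
    """Hypothetical helper: read raw task dicts from a JSON file or a directory of them."""
    path = Path(source)
    # Assumption: directories are scanned non-recursively, in sorted filename order.
    files = sorted(path.glob("*.json")) if path.is_dir() else [path]
    tasks: list[dict] = []
    for file in files:
        data = json.loads(file.read_text(encoding="utf-8"))
        if isinstance(data, dict) and "tasks" in data:
            tasks.extend(data["tasks"])  # {"tasks": [...]} wrapper
        else:
            tasks.append(data)  # a single task object
    return tasks
```

Checking for the `"tasks"` key first keeps the wrapper and single-object forms distinguishable without a separate flag.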

`load(source)` ¶

Load tasks from JSON file(s).

`save(tasks, destination)` ¶

Save tasks to JSON file.

`tracelens.EvalSet` ¶

Bases: BaseModel

Collection of tasks for evaluation.

An EvalSet groups related tasks together for batch evaluation. It also defines default configuration for all tasks in the set.

Example

    eval_set = EvalSet(
        name="Goal Decomposition Suite v1",
        tasks=tasks,
        default_num_runs=5,
        default_grader_ids=["quality", "personalization"],
    )

`filter_tasks(tags=None, categories=None, difficulties=None, max_tasks=None)` ¶

Filter tasks by criteria.

`filtered_eval_set(tags=None, categories=None, difficulties=None, max_tasks=None)` ¶

Return a new EvalSet with only tasks matching the filter criteria.

`add_task(task)` ¶

Add a task to the set.

`remove_task(task_id)` ¶

Remove a task by ID. Returns True if found and removed.

`get_task(task_id)` ¶

Get a task by ID.

`tracelens.Transcript` ¶

Bases: BaseModel

Complete record of agent execution for a task.

A Transcript captures everything that happened during an agent's execution, providing a complete audit trail for grading and debugging.

Example

    transcript = Transcript(
        transcript_id=str(uuid.uuid4()),
        task_id=task.task_id,
        agent_name="goal_decomposition",
        started_at=utc_now(),
    )
    transcript.steps.append(step)
    transcript.final_output = result
    transcript.completed_at = utc_now()

`has_streaming_data` `property` ¶

Whether this transcript contains streaming events.

`first_token_latency_ms` `property` ¶

Time to first token event in milliseconds, or None if no streaming.

`streaming_duration_ms` `property` ¶

Total streaming duration from first to last event, or None if no streaming.

`streaming_token_count` `property` ¶

Total tokens across all streaming TOKEN events.
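How the three streaming properties above could be derived from event timestamps is sketched below. The `Event` tuple, the `"token"` type label, and the assumption that events arrive sorted by timestamp are all illustrative; the real `StreamingEvent` fields may differ.

```python
from typing import NamedTuple, Optional


class Event(NamedTuple):
    """Hypothetical stand-in for a streaming event."""
    event_type: str      # assumed labels, e.g. "start", "token", "done"
    timestamp_ms: float  # milliseconds since stream start
    token_count: int = 0


def first_token_latency_ms(events: list[Event]) -> Optional[float]:
    """Timestamp of the first token event, or None if there is none."""
    for event in events:
        if event.event_type == "token":
            return event.timestamp_ms
    return None


def streaming_duration_ms(events: list[Event]) -> Optional[float]:
    """Span from the first to the last event, or None without streaming data."""
    if not events:
        return None
    return events[-1].timestamp_ms - events[0].timestamp_ms


def streaming_token_count(events: list[Event]) -> int:
    """Sum of token counts over token events only."""
    return sum(e.token_count for e in events if e.event_type == "token")


events = [
    Event("start", 0.0),
    Event("token", 120.0, 1),
    Event("token", 150.0, 2),
    Event("done", 400.0),
]
print(first_token_latency_ms(events), streaming_duration_ms(events), streaming_token_count(events))
```

Because timestamps are relative to stream start, the first-token latency is simply the first token event's timestamp.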

`duration_ms` `property` ¶

Calculate execution duration in milliseconds.

`total_tokens` `property` ¶

Calculate total token usage across all steps.

`input_tokens` `property` ¶

Calculate total input tokens.

`output_tokens` `property` ¶

Calculate total output tokens.

`has_errors` `property` ¶

Check if any errors occurred during execution.

`llm_calls_count` `property` ¶

Count LLM calls.

`tool_calls_count` `property` ¶

Count tool calls.

`add_streaming_event(event)` ¶

Append a streaming event to the transcript.

`add_step(step)` ¶

Add a step to the transcript.

`get_tool_calls_by_name(name)` ¶

Get all tool calls with a specific name.

`get_steps_by_type(step_type)` ¶

Get all steps of a specific type.

`to_dict()` ¶

Serialize to a JSON-safe dict with full round-trip fidelity.

`from_dict(data)` `classmethod` ¶

Reconstruct a Transcript from a dict produced by to_dict().

`to_summary()` ¶

Create a summary dict for reporting.

`tracelens.TranscriptStep` ¶

Bases: BaseModel

A single step in agent execution.

Steps are the atomic units of a transcript. Each step represents one action the agent took (LLM call, tool call, etc.).

`is_error` `property` ¶

Check if this step is an error.

`tracelens.StreamingEvent` ¶

Bases: BaseModel

A single event in a streaming response.

Timestamps are in milliseconds since stream start for precise inter-token latency analysis.

`tracelens.StreamingEventType` ¶

Bases: StrEnum

Types of streaming events for real-time token delivery.

`tracelens.Outcome` ¶

Bases: BaseModel

Result of grading a trial.

An Outcome contains:

- Primary pass/fail determination
- Normalized score (0-1)
- Detailed metrics dict (grader-specific)
- Optional feedback and reasoning

Example

    outcome = Outcome(
        trial_id="...",
        grader_id="quality",
        passed=True,
        score=0.85,
        metrics={"specificity": 0.9, "personalization": 0.8},
        feedback="Tasks are specific but could use more context references",
    )

`model_post_init(__context)` ¶

Set grade_level from score if not provided.

`to_summary_dict()` ¶

Return a summary suitable for reporting.

`to_ci_dict()` ¶

Return a compact dict for CI output.

`tracelens.GradeLevel` ¶

Bases: str, Enum

Categorical grade levels for human-readable results.

`from_score(score)` `classmethod` ¶

Convert a normalized score to a grade level.

`tracelens.Trial` ¶

Bases: BaseModel

A single execution of a task.

A Trial tracks:

- Which task is being executed
- The run index (for pass@k with multiple runs)
- The execution transcript
- Grading outcomes from all graders
- Status and timing

Example

    trial = Trial(
        task_id=task.task_id,
        run_index=0,
        total_runs=5,
    )
    trial.status = TrialStatus.RUNNING
    trial.started_at = utc_now()

    # ... execute agent ...

    trial.transcript = transcript
    trial.status = TrialStatus.COMPLETED
    trial.completed_at = utc_now()

`duration_ms` `property` ¶

Calculate trial duration in milliseconds.

`passed` `property` ¶

Trial passes if ALL outcomes pass.

`aggregate_score` `property` ¶

Average score across all outcomes.

`is_complete` `property` ¶

Check if trial has finished (successfully or not).

`is_successful` `property` ¶

Check if trial completed without errors.

`has_grader_error` `property` ¶

Whether any grader crashed while grading this trial.

Grader crashes are synthesized as failed outcomes so the trial stays conservative (not passed), but they must be counted separately — they measure the eval harness, not the agent.

`is_infra_failure` `property` ¶

Whether this trial failed due to infrastructure, not the agent.

Infra failures (OOM kills, network errors, sandbox terminations) should be counted separately from task failures when interpreting scores — otherwise noisy infra inflates the apparent failure rate.

`fingerprint` `property` ¶

Get decision spec fingerprint from transcript.

`fingerprint_short` `property` ¶

Get short decision spec fingerprint from transcript.

`add_outcome(outcome)` ¶

Add a grading outcome to this trial.

`get_outcome_by_grader(grader_id)` ¶

Get outcome from a specific grader.

`get_metric(metric_name)` ¶

Get a specific metric value from any outcome.

`to_summary_dict()` ¶

Return a summary suitable for reporting.

`to_ci_dict()` ¶

Return a compact dict for CI output.

`tracelens.TrialBatch` ¶

Bases: BaseModel

Collection of trials for batch processing.

Useful for running multiple trials in parallel and aggregating results.

`total_count` `property` ¶

Total number of trials.

`completed_count` `property` ¶

Number of completed trials.

`passed_count` `property` ¶

Number of passed trials.

`pass_rate` `property` ¶

Pass rate across all trials.

`infra_error_count` `property` ¶

Number of trials that failed due to infrastructure issues.

`infra_error_rate` `property` ¶

Fraction of trials that hit infrastructure failures.

Report this alongside pass_rate: Anthropic's infra-noise study found that infra error rates can move by 5+ percentage points purely from resource-configuration changes. A spike in infra_error_rate between two baselines is a strong hint that the regression is noise, not a real capability drop.
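A small sketch of reporting the two rates side by side follows. The status strings and trial counts are made up for illustration, and the "excluding infra" figure is an extra diagnostic introduced here, not a documented `TrialBatch` property.

```python
from collections import Counter

# Hypothetical status labels; the real values live on TrialStatus.
statuses = ["completed"] * 88 + ["failed"] * 6 + ["infra_error"] * 6
passed = [True] * 80 + [False] * 20  # per-trial pass flags, aligned with statuses

counts = Counter(statuses)
total = len(statuses)
pass_rate = sum(passed) / total
infra_error_rate = counts["infra_error"] / total

# Pass rate among trials that actually exercised the agent.
non_infra = total - counts["infra_error"]
pass_rate_excluding_infra = sum(passed) / non_infra

print(
    f"pass_rate={pass_rate:.3f} "
    f"infra_error_rate={infra_error_rate:.3f} "
    f"excluding_infra={pass_rate_excluding_infra:.3f}"
)
```

If two baselines differ mostly in `infra_error_rate` while the infra-excluded pass rate stays flat, the apparent regression is more likely noise than a capability change.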

`total_input_tokens` `property` ¶

Total input tokens across all trial transcripts.

`total_output_tokens` `property` ¶

Total output tokens across all trial transcripts.

`total_tokens` `property` ¶

Total tokens (input + output) across all trial transcripts.

`grader_error_count` `property` ¶

Number of trials where at least one grader crashed.

`grader_error_rate` `property` ¶

Fraction of trials affected by grader crashes.

Report this alongside pass_rate: a spike here means the grading harness is broken, not that the agent regressed.

`all_complete` `property` ¶

Check if all trials are complete.

`add_trial(trial)` ¶

Add a trial to the batch.

`get_trials_for_task(task_id)` ¶

Get all trials for a specific task.

`to_dict()` ¶

Serialize to a JSON-safe dict with full round-trip fidelity.

`from_dict(data)` `classmethod` ¶

Reconstruct a TrialBatch from a dict produced by to_dict().

`get_pass_results_by_task()` ¶

Get pass/fail results grouped by task.

Returns dict mapping task_id to list of boolean pass results. Useful for computing pass@k.

`tracelens.TrialStatus` ¶

Bases: str, Enum

Status of a trial.

`tracelens.InfraError` ¶

Bases: Exception

Raised by adapters when a failure is known to be infrastructural.

Separating infra failures from task-level failures matters because they mean different things for evaluation scores. A pod killed for exceeding its memory limit tells you nothing about the agent's capability — but it does tell you the eval's resource configuration is too tight (see Anthropic's "Quantifying infrastructure noise in agentic coding evals", which measured infra error rates dropping from 5.8% at strict enforcement to 0.5% uncapped).

When the runner catches this exception (or other known-infra exception types like MemoryError, ConnectionError, TimeoutError from the network stack, or OSError), the trial's status is set to TrialStatus.INFRA_ERROR rather than FAILED, and the infra error rate is surfaced separately in reports so you can decide whether a regression is real or a noise artefact.

Example

    class MyAdapter(AgentAdapter):
        async def run(self, task):
            try:
                return await do_work(task)
            except httpx.ConnectError as exc:
                raise InfraError(f"upstream API unreachable: {exc}") from exc
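The runner-side classification described above might look roughly like the sketch below. `classify_failure` and the status strings are hypothetical, and the local `InfraError` class only stands in for `tracelens.InfraError` so the snippet runs on its own. One simplification: every `TimeoutError` is treated as infrastructural here, whereas the text limits that to timeouts from the network stack.

```python
class InfraError(Exception):
    """Local stand-in for tracelens.InfraError so the sketch is self-contained."""


# Exception types the text lists as known-infra. ConnectionError and the
# built-in TimeoutError are OSError subclasses, so OSError already covers them.
INFRA_TYPES = (InfraError, MemoryError, OSError)


def classify_failure(exc: BaseException) -> str:
    """Hypothetical helper: map an exception to a trial status label."""
    return "infra_error" if isinstance(exc, INFRA_TYPES) else "failed"


print(classify_failure(ConnectionResetError("peer reset")))      # infra_error
print(classify_failure(ValueError("agent returned bad JSON")))   # failed
```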

Execution¶

`tracelens.AgentAdapter` ¶

Bases: ABC

Abstract base class for agent adapters.

Adapters bridge the evaluation runner to the agent being evaluated. Implement run() to invoke your agent and return a Transcript.

Optionally override setup() and teardown() for lifecycle management. The runner guarantees teardown is called even if run() fails.

Example

    class MyAdapter(AgentAdapter):
        async def setup(self, task: Task) -> None:
            self.db = await create_test_database()

        async def run(self, task: Task) -> Transcript:
            # Start the transcript first so its timing covers the agent call.
            transcript = self.start_transcript(task)
            result = await my_agent.invoke(task.input_data)
            transcript.final_output = result
            transcript.completed_at = utc_now()
            return transcript

        async def teardown(self, task: Task, transcript: Transcript | None) -> None:
            await self.db.cleanup()

`setup(task)` `async` ¶

Called before run(). Override for preparation. Default: no-op.

`teardown(task, transcript)` `async` ¶

Called after run(), even on failure. Override for cleanup. Default: no-op.

`run(task)` `abstractmethod` `async` ¶

Run the agent on a task and return a transcript.

`start_transcript(task)` ¶

Helper to create a Transcript with timing started.

`record_error(transcript, error)` ¶

Helper to record an exception in a transcript.

`tracelens.SimpleAdapter` ¶

Bases: AgentAdapter

Wraps any async callable as an AgentAdapter.

Useful for testing and simple single-shot agents that take input_data and return a result.

Example

    async def my_fn(input_data: dict) -> dict:
        return {"answer": "42"}

    adapter = SimpleAdapter(my_fn)

`run(task)` `async` ¶

Invoke the wrapped function and build a transcript.

`tracelens.HTTPAPIAdapter` ¶

Bases: AgentAdapter

Adapter that invokes agents via HTTP API calls.

Supports authentication, retry with exponential backoff, and customizable request/response handling.

Example

    config = HTTPAdapterConfig(
        base_url="https://api.example.com",
        endpoint="/v1/agent/run",
        auth=AuthConfig(scheme=AuthScheme.BEARER, token="sk-..."),
    )
    adapter = HTTPAPIAdapter(config)

    # Use with EvaluationRunner

`build_request_body(task)` ¶

Build the HTTP request body from a task. Override to customize.

`parse_response_body(data)` ¶

Extract the agent's answer from response JSON. Override to customize.

`run(task)` `async` ¶

Invoke the HTTP agent and return a transcript.

`close()` `async` ¶

Close the underlying httpx client.

`teardown(task, transcript)` `async` ¶

No-op per trial — call close() explicitly when done with all trials.

`tracelens.HTTPAdapterConfig` ¶

Bases: BaseModel

Full configuration for HTTPAPIAdapter.

`tracelens.AuthConfig` ¶

Bases: BaseModel

Authentication configuration for HTTP requests.

`apply_to_headers(headers)` ¶

Apply auth config to a headers dict, returning the updated dict.

`tracelens.AuthScheme` ¶

Bases: StrEnum

Supported authentication schemes.

`tracelens.RetryConfig` ¶

Bases: BaseModel

Retry configuration with exponential backoff.

`tracelens.EvaluationRunner` ¶

Runs evaluations with concurrency control and timeout enforcement.

Example

    runner = EvaluationRunner(
        adapter=my_adapter,
        graders=[quality_grader, safety_grader],
        config=RunnerConfig(num_runs=5, max_concurrency=10),
    )
    batch = await runner.run(eval_set)
    print(f"Pass rate: {batch.pass_rate:.2%}")

`run(eval_set)` `async` ¶

Run all tasks × runs and grade results.

`tracelens.RunnerConfig` `dataclass` ¶

Configuration for the evaluation runner.

Graders — base classes¶

`tracelens.Grader` ¶

Bases: ABC

Abstract base class for all graders.

Graders evaluate agent outputs (Transcripts) and produce Outcomes. Subclass either CodeGrader or LLMGrader for specific implementations.

Example

    class MyGrader(CodeGrader):
        def compute_metrics(self, transcript, task):
            return {"accuracy": 0.95}

        def determine_pass(self, metrics, task):
            return metrics["accuracy"] >= 0.9, metrics["accuracy"]

`grader_type` `abstractmethod` `property` ¶

Return the type of this grader.

`is_deterministic` `property` ¶

Whether this grader produces deterministic results.

`requires_llm` `property` ¶

Whether this grader requires LLM calls.

`requires_human` `property` ¶

Whether this grader requires human input.

`role` `property` ¶

Role of this grader in composite scoring.

`is_must_pass` `property` ¶

Whether this grader must pass for trial to pass.

`is_score_contributor` `property` ¶

Whether this grader contributes to score average.

`policy` `property` ¶

Three-way eval policy for this grader.

`is_gate` `property` ¶

Whether this grader is a gate (fails CI on violation).

`is_warn` `property` ¶

Whether this grader is a warning.

`is_track` `property` ¶

Whether this grader is tracking-only.

`grade(transcript, task)` `abstractmethod` `async` ¶

Grade the transcript for the given task.

Parameters:

Name	Type	Description	Default
`transcript`	`Transcript`	The agent's execution record	required
`task`	`Task`	The task being evaluated	required

Returns:

Type	Description
`Outcome`	An Outcome with pass/fail, score, and metrics

`create_outcome(trial_id, passed, score, metrics=None, feedback=None, **kwargs)` ¶

Helper to create an Outcome with common fields.

`tracelens.CodeGrader` ¶

Bases: Grader

Base class for deterministic code-based graders.

CodeGraders compute metrics from the transcript and then determine pass/fail based on those metrics. They are deterministic - the same input always produces the same output.

Use for: objective metrics (Sharpe ratio, accuracy, latency)

Example

    class FinancialGrader(CodeGrader):
        def compute_metrics(self, transcript, task):
            returns = transcript.final_output["returns"]
            return {
                "sharpe_ratio": calculate_sharpe(returns),
                "max_drawdown": calculate_max_dd(returns),
            }

        def determine_pass(self, metrics, task):
            passed = metrics["sharpe_ratio"] >= 1.0
            score = min(metrics["sharpe_ratio"] / 2.0, 1.0)
            return passed, score

`compute_metrics(transcript, task)` `abstractmethod` ¶

Compute grading metrics from transcript.

Implement this to extract and calculate metrics from the agent's execution record.

`determine_pass(metrics, task)` `abstractmethod` ¶

Determine pass/fail and score from metrics.

Returns:

Type	Description
`tuple[bool, float]`	Tuple of (passed, score) where score is 0-1 normalized.

`grade(transcript, task)` `async` ¶

Grade by computing metrics and determining pass/fail.

`tracelens.LLMGrader` ¶

Bases: Grader

Base class for LLM-as-judge graders.

LLMGraders use an LLM to evaluate agent outputs. They are non-deterministic and require careful prompt engineering.

Use for: subjective quality (specificity, personalization, clarity)

Example

    import json

    class QualityGrader(LLMGrader):
        def build_grading_prompt(self, transcript, task):
            return f'''Evaluate the quality of this output:

            Score 1-10 on: clarity, completeness, accuracy
            Return JSON: {{"score": X, "feedback": "..."}}'''

        def parse_llm_response(self, response, task):
            data = json.loads(response)
            score = data["score"] / 10.0
            passed = score >= 0.7
            return passed, score, {}, data["feedback"]

`build_grading_prompt(transcript, task)` `abstractmethod` ¶

Build the prompt for LLM grading.

Implement this to create a prompt that instructs the LLM how to evaluate the agent's output.

`parse_llm_response(response, task)` `abstractmethod` ¶

Parse LLM response into structured result.

Returns:

Type	Description
`tuple[bool, float, dict[str, float], str]`	Tuple of (passed, score, metrics, feedback)

`grade(transcript, task)` `async` ¶

Grade by calling LLM and parsing response.

Honors GraderConfig: each attempt (LLM call + parse) is bounded by timeout_seconds; transient failures — including malformed responses, which a fresh LLM call often fixes — are retried up to max_retries times with exponential backoff when retry_on_error is set. NotImplementedError (no provider configured) is a setup bug, never retried.
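The retry behaviour described above can be sketched with `asyncio` alone. The function name `call_with_retries` and the `base_delay` parameter are hypothetical, and the doubling schedule is just one common form of exponential backoff; the grader's real delays are not documented here.

```python
import asyncio
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")


async def call_with_retries(
    attempt: Callable[[], Awaitable[T]],
    timeout_seconds: float,
    max_retries: int,
    base_delay: float = 0.5,  # hypothetical backoff base
) -> T:
    """Run `attempt`, bounding each try by a timeout and retrying with backoff."""
    for retry in range(max_retries + 1):
        try:
            # Each attempt (call + parse) gets its own timeout budget.
            return await asyncio.wait_for(attempt(), timeout=timeout_seconds)
        except NotImplementedError:
            raise  # setup bug: retrying cannot help
        except Exception:
            if retry == max_retries:
                raise  # retries exhausted, surface the last error
            await asyncio.sleep(base_delay * 2**retry)  # base, 2x base, 4x base, ...
    raise AssertionError("unreachable when max_retries >= 0")
```

Treating a malformed response like any other transient failure means the parse step must live inside `attempt`, so that a retry issues a fresh LLM call rather than re-parsing the same text.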

`tracelens.CompositeGrader` ¶

Bases: Grader

Combines multiple graders with policy-aware aggregation.

Supports three policy tiers:

- GATE: Any failure causes the entire trial to fail (safety, constraints)
- WARN: Failures emit warnings but don't fail by default
- TRACK: Pure signals, never affect pass/fail

Also supports legacy GraderRole (MUST_PASS maps to GATE behavior).

The score is always a weighted average of all graders regardless of policy.

`must_pass_graders` `property` ¶

Get all must-pass graders (legacy + gate).

`score_contributor_graders` `property` ¶

Get all score-contributor graders (legacy).

`gate_graders` `property` ¶

Get all gate graders.

`warn_graders` `property` ¶

Get all warn graders.

`track_graders` `property` ¶

Get all track graders.

`grade(transcript, task)` `async` ¶

Grade using policy-aware aggregation.

1. Run all graders and collect outcomes
2. Compute weighted score from all graders
3. Only GATE (or legacy MUST_PASS) failures cause overall failure
4. WARN failures are recorded in feedback
5. TRACK results are pure signals
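The aggregation steps above can be sketched as follows. `GradedResult`, its fields, and the lowercase policy strings are stand-ins invented for this illustration; the real grader works with `Outcome` and `EvalPolicy` objects.

```python
from dataclasses import dataclass


@dataclass
class GradedResult:
    """Hypothetical stand-in for one sub-grader's outcome."""
    grader_id: str
    policy: str   # "gate", "warn", or "track"
    passed: bool
    score: float  # normalized 0-1
    weight: float = 1.0


def aggregate(results: list[GradedResult]) -> tuple[bool, float, list[str]]:
    """Return (passed, weighted_score, warnings) under the three-tier policy."""
    total_weight = sum(r.weight for r in results)
    # The score averages every grader, regardless of policy.
    score = sum(r.score * r.weight for r in results) / total_weight if total_weight else 0.0
    # Only gate failures can fail the trial.
    passed = all(r.passed for r in results if r.policy == "gate")
    # Warn failures are surfaced but do not change pass/fail.
    warnings = [r.grader_id for r in results if r.policy == "warn" and not r.passed]
    return passed, score, warnings


results = [
    GradedResult("schema", "gate", True, 1.0),
    GradedResult("latency", "warn", False, 0.4),
    GradedResult("style", "track", False, 0.7),
]
print(aggregate(results))
```

Here the trial passes because its only gate passed, even though the warn and track graders both failed; their scores still pull the weighted average down.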

`tracelens.GraderConfig` ¶

Bases: BaseModel

Configuration for a grader.

`tracelens.GraderType` ¶

Bases: str, Enum

Types of graders.

`tracelens.EvalPolicy` ¶

Bases: str, Enum

Three-way policy for how a grader's result affects CI.

- GATE: Fails CI on violation. Use for hard constraints (valid JSON, no PII).
- WARN: Warning by default, configurable CI fail. Use for regressions.
- TRACK: Never fails CI, just produces signals. Use for quality tracking.

`tracelens.GraderRole` ¶

Bases: str, Enum

Deprecated: use EvalPolicy instead.

Kept for backward compatibility. MUST_PASS maps to GATE, SCORE_CONTRIBUTOR maps to TRACK.

`tracelens.BehaviorContract` ¶

Bases: BaseModel

Verifiable contract for agent behavior.

`to_graders()` ¶

Auto-generate a grader suite from this contract.

Each non-empty contract section produces one grader with an appropriate default policy:

- output_schema -> JsonSchemaGrader (GATE)
- output_model -> StructuredOutputGrader (GATE)
- tools_* -> ToolCallGrader (GATE)
- max_latency_ms -> LatencyGrader (WARN)
- max_tokens -> TokenBudgetGrader (WARN)
- must_include/must_not_include -> ContainsGrader (TRACK)
- custom_constraints -> ConstraintGrader (GATE)

Graders — built-in library¶

See the Grader Library guide for when to reach for each.

`tracelens.JsonSchemaGrader` ¶

Bases: CodeGrader

Validate transcript.final_output against a JSON Schema.

Uses the jsonschema library for full schema validation.

Default policy: GATE (schema violations block CI).

`tracelens.RegexMatchGrader` ¶

Bases: CodeGrader

Check whether str(transcript.final_output) matches each regex pattern.

Default policy: TRACK.

`tracelens.ContainsGrader` ¶

Bases: CodeGrader

Check whether str(transcript.final_output) contains required strings and does not contain forbidden strings.

Default policy: TRACK.

`tracelens.ConstraintGrader` ¶

Bases: CodeGrader

Evaluate a list of heterogeneous constraints against the agent output.

Supported constraint types:

- must_include: str(output) must contain the value
- must_not_include: str(output) must not contain the value
- numeric_range: output[field] must be within [min, max]
- enum: output[field] must be one of the allowed values

Default policy: GATE.

`tracelens.StructuredOutputGrader` ¶

Bases: CodeGrader

Validate transcript.final_output by parsing it with a Pydantic model.

The model is loaded at grading time via tracelens.execution.registry.load_class from a dotted path such as "myproject.models.ResponseSchema".

Default policy: GATE.

`tracelens.LatencyGrader` ¶

Bases: CodeGrader

Check that agent execution completes within a time budget.

Pass if transcript.duration_ms <= max_ms. Score: max(0, 1 - duration/max). Default policy: WARN.

`tracelens.TokenBudgetGrader` ¶

Bases: CodeGrader

Check that agent execution stays within a token budget.

Pass if transcript.total_tokens <= max_tokens. Score: max(0, 1 - total/max). Default policy: WARN.
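The pass rule and score formula stated for TokenBudgetGrader (and, in the same form, for LatencyGrader) can be worked through directly. `budget_result` is a hypothetical helper name used only for this illustration.

```python
def budget_result(used: float, budget: float) -> tuple[bool, float]:
    """Pass if used <= budget; score is max(0, 1 - used / budget)."""
    passed = used <= budget
    score = max(0.0, 1.0 - used / budget)
    return passed, score


print(budget_result(500, 2000))   # (True, 0.75)
print(budget_result(2000, 2000))  # (True, 0.0)
print(budget_result(3000, 2000))  # (False, 0.0)
```

Note that a run landing exactly on the budget still passes but scores 0, so the score rewards headroom rather than mere compliance.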

`tracelens.ToolCallGrader` ¶

Bases: CodeGrader

Validate tool call compliance against required/allowed/forbidden lists.

- required_tools: all must be called at least once
- allowed_tools: if provided, only these tools may be called (allowlist)
- forbidden_tools: none of these may be called

Pass if all required called AND no unauthorized AND no forbidden. Default policy: GATE.
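The three-list rule can be sketched with set operations. `check_tool_calls` and the shape of its result dict are hypothetical; only the pass condition itself comes from the description above.

```python
from typing import Iterable, Optional


def check_tool_calls(
    called: Iterable[str],
    required: Iterable[str] = (),
    allowed: Optional[Iterable[str]] = None,
    forbidden: Iterable[str] = (),
) -> dict:
    """Hypothetical helper mirroring the required/allowed/forbidden rules."""
    called_set = set(called)
    missing = set(required) - called_set
    # The allowlist only applies when one is provided.
    unauthorized = called_set - set(allowed) if allowed is not None else set()
    used_forbidden = called_set & set(forbidden)
    return {
        "missing": sorted(missing),
        "unauthorized": sorted(unauthorized),
        "forbidden": sorted(used_forbidden),
        "passed": not (missing or unauthorized or used_forbidden),
    }


report = check_tool_calls(
    ["search", "delete_file"],
    required=["search"],
    forbidden=["delete_file"],
)
print(report)
```

Distinguishing `allowed=None` (no allowlist) from `allowed=[]` (nothing permitted) matters: the second would flag every call as unauthorized.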

`tracelens.TraceConsistencyGrader` ¶

Bases: CodeGrader

Check agent self-consistency in tool usage and trace patterns.

Metrics:

- tool_error_rate: fraction of tool calls that returned errors
- unused_tool_results: tool calls with non-None results that are not followed by any AGENT_OUTPUT step
- phantom_calls: tools called that are not in expected_tools (if provided)

Pass if tool_error_rate < 0.5 and phantom_calls == 0. Score: 1 - tool_error_rate. Default policy: WARN.

`tracelens.EventChainVerifier` ¶

Bases: CodeGrader

Verifies expected event sequences in transcripts.

Scans transcript steps to match expected events, checks ordering constraints, and scores based on how many events were found.

Example

    config = EventChainConfig(
        expected_events=[
            EventExpectation(
                event_id="search",
                match_type=EventMatchType.TOOL_NAME,
                tool_name="search",
            ),
            EventExpectation(
                event_id="analyze",
                match_type=EventMatchType.TOOL_NAME,
                tool_name="analyze",
                after=["search"],
            ),
        ],
        ordering=OrderingMode.PARTIAL,
    )
    verifier = EventChainVerifier("chain_check", config)

`compute_metrics(transcript, task)` ¶

Scan transcript and match against expected events.

Uses greedy first-match: each step is matched against the first unmatched expectation it satisfies. Order expectations carefully when multiple expectations could match the same step.
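Why expectation order matters under greedy first-match is easier to see in a toy version. The sketch below reduces steps to tool names and expectations to hypothetical `(event_id, tool_name)` pairs, and leaves out the ordering checks that the real verifier applies afterwards.

```python
def greedy_match(step_tools: list[str], expectations: list[tuple[str, str]]) -> dict[str, int]:
    """Map event_id -> index of the step that satisfied it (greedy first-match)."""
    matched: dict[str, int] = {}
    for index, tool in enumerate(step_tools):
        for event_id, expected_tool in expectations:
            if event_id not in matched and expected_tool == tool:
                matched[event_id] = index
                break  # a step is consumed by at most one expectation
    return matched


steps = ["search", "analyze", "search"]
expectations = [
    ("initial_search", "search"),
    ("followup_search", "search"),
    ("analyze", "analyze"),
]
print(greedy_match(steps, expectations))
```

Both search expectations could match step 0, and the one listed first wins; swapping their order in the list would swap which step each one claims.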

`determine_pass(metrics, task)` ¶

Determine pass/fail from event matching metrics.

`tracelens.EventChainConfig` ¶

Bases: BaseModel

Configuration for EventChainVerifier.

`tracelens.EventExpectation` ¶

Bases: BaseModel

An expected event in the transcript.

`tracelens.EventMatchType` ¶

Bases: StrEnum

How to match a transcript step against an expectation.

`tracelens.OrderingMode` ¶

Bases: StrEnum

How to enforce ordering of matched events.

Statistics¶

See pass@k vs pass^k and Statistical Comparison for the concepts.

`tracelens.pass_at_k(n, c, k)` ¶

Calculate pass@k metric.

Estimates the probability that at least one of k samples passes, given n total samples with c correct. Uses an unbiased estimator.

Parameters:

Name	Type	Description	Default
`n`	`int`	Total number of samples	required
`c`	`int`	Number of correct/passing samples	required
`k`	`int`	Number of samples to consider	required

Returns:

Type	Description
`float`	Probability of at least one pass in k samples (0.0 to 1.0)

Example

    >>> pass_at_k(10, 7, 5)
    1.0  # Only 3 of 10 samples fail, so any 5 samples include a pass
    >>> pass_at_k(10, 1, 5)
    0.5  # 50% chance at least 1 of 5 passes
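A minimal sketch of the estimator, assuming "unbiased estimator" refers to the standard combinatorial form 1 - C(n - c, k) / C(n, k). The function name and the input validation are illustrative; how the library handles k > n is not documented here.

```python
from math import comb


def pass_at_k_sketch(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: 1 - C(n - c, k) / C(n, k)."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # math.comb returns 0 when k > n - c, so the estimate is exactly 1.0
    # whenever there are fewer than k failing samples.
    return 1.0 - comb(n - c, k) / comb(n, k)


print(pass_at_k_sketch(10, 1, 5))  # 0.5
```

The ratio is the probability that a random draw of k samples contains only failures, so subtracting it from 1 gives the chance of at least one pass.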

`tracelens.pass_at_k_estimator(results_per_task, k)` ¶

Compute pass@k across multiple tasks.

For each task, computes pass@k using available samples. Returns the average pass@k across all tasks.

Parameters:

Name	Type	Description	Default
`results_per_task`	`dict[str, list[bool]]`	Dict mapping task_id to list of pass/fail booleans	required
`k`	`int`	Number of samples to consider	required

Returns:

Type	Description
`float`	Average pass@k across all tasks

Example

    >>> results = {
    ...     "task1": [True, True, False, True, True],
    ...     "task2": [False, True, False, False, True],
    ... }
    >>> pass_at_k_estimator(results, k=3)
    0.9...  # High probability task1 passes, lower for task2

`tracelens.PassAtKAnalyzer` ¶

Analyzer for pass@k capability metrics.

Computes pass@k for multiple k values and provides confidence intervals.

Example

    analyzer = PassAtKAnalyzer(k_values=[1, 3, 5, 10])
    results = analyzer.analyze(pass_results_by_task)
    print(results)
    # {"pass@1": 0.6, "pass@3": 0.85, "pass@5": 0.95, "pass@10": 0.99}

`__init__(k_values=None)` ¶

Initialize analyzer with k values to compute.

Parameters:

Name	Type	Description	Default
`k_values`	`list[int] \| None`	List of k values for pass@k. Default: [1, 5, 10]	`None`

`analyze(results_per_task)` ¶

Compute pass@k for multiple k values.

Parameters:

Name	Type	Description	Default
`results_per_task`	`dict[str, list[bool]]`	Dict mapping task_id to list of pass/fail booleans	required

Returns:

Type	Description
`dict[str, float]`	Dict mapping "pass@k" to computed value

`compute_confidence_interval(results_per_task, k, confidence=0.95, n_bootstrap=1000)` ¶

Compute bootstrap confidence interval for pass@k.

Uses bootstrap resampling over tasks to estimate the confidence interval for pass@k.

Parameters:

Name	Type	Description	Default
`results_per_task`	`dict[str, list[bool]]`	Dict mapping task_id to list of pass/fail booleans	required
`k`	`int`	k value for pass@k	required
`confidence`	`float`	Confidence level (default 0.95 for 95% CI)	`0.95`
`n_bootstrap`	`int`	Number of bootstrap samples	`1000`

Returns:

Type	Description
`tuple[float, float]`	Tuple of (lower_bound, upper_bound)
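Resampling over tasks, as described above, might be sketched like this with the standard library. The function names, the fixed default seed, and the percentile indexing are assumptions, and the sketch requires every task to have at least k runs.

```python
import random
from math import comb


def mean_pass_at_k(runs_per_task: list[list[bool]], k: int) -> float:
    """Average per-task pass@k using the standard combinatorial estimator."""
    values = []
    for runs in runs_per_task:
        n, c = len(runs), sum(runs)
        values.append(1.0 - comb(n - c, k) / comb(n, k))  # assumes n >= k
    return sum(values) / len(values)


def bootstrap_pass_at_k_ci(results_per_task, k, confidence=0.95, n_bootstrap=1000, seed=0):
    """Percentile bootstrap CI, resampling whole tasks with replacement."""
    rng = random.Random(seed)
    runs = list(results_per_task.values())
    estimates = sorted(
        mean_pass_at_k(rng.choices(runs, k=len(runs)), k) for _ in range(n_bootstrap)
    )
    alpha = 1.0 - confidence
    lower = estimates[int((alpha / 2) * (n_bootstrap - 1))]
    upper = estimates[int((1 - alpha / 2) * (n_bootstrap - 1))]
    return lower, upper
```

Resampling tasks rather than individual runs keeps each task's runs together, so the interval reflects uncertainty about which tasks are in the suite.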

`analyze_with_ci(results_per_task, confidence=0.95, n_bootstrap=1000)` ¶

Compute pass@k with confidence intervals.

Parameters:

Name	Type	Description	Default
`results_per_task`	`dict[str, list[bool]]`	Dict mapping task_id to list of pass/fail booleans	required
`confidence`	`float`	Confidence level (default 0.95)	`0.95`
`n_bootstrap`	`int`	Number of bootstrap samples	`1000`

Returns:

Type	Description
`dict[str, dict[str, float]]`	Dict mapping "pass@k" to {"value": ..., "lower": ..., "upper": ...}

`tracelens.pass_to_k(results, k)` ¶

Calculate pass^k (consistency) metric.

Measures the proportion of k-length windows where all samples pass. A sliding window approach is used.

Parameters:

Name	Type	Description	Default
`results`	`list[bool]`	List of pass/fail booleans from multiple runs	required
`k`	`int`	Number of consecutive passes required	required

Returns:

Type	Description
`float`	Proportion of k-length windows where all samples pass (0.0 to 1.0)

Example

    >>> pass_to_k([True, True, True, True, True], 3)
    1.0  # All windows of 3 pass
    >>> pass_to_k([True, True, False, True, True], 3)
    0.0  # The single failure falls inside all 3 windows
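The sliding-window definition can be sketched in a few lines. The function name is illustrative, and returning 0.0 when there are fewer than k results is an assumption rather than documented behaviour.

```python
def pass_to_k_sketch(results: list[bool], k: int) -> float:
    """Fraction of length-k sliding windows in which every run passed."""
    if k < 1:
        raise ValueError("k must be >= 1")
    n_windows = len(results) - k + 1
    if n_windows <= 0:
        return 0.0  # assumption: too few samples yields 0.0
    hits = sum(all(results[i:i + k]) for i in range(n_windows))
    return hits / n_windows


# Windows: [T, T, T], [T, T, F], [T, F, T] -> one of three is all-pass.
print(pass_to_k_sketch([True, True, True, False, True], 3))
```

Because windows overlap, one failure can spoil up to k of them, which makes pass^k considerably stricter than the raw pass rate.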

`tracelens.pass_to_k_estimator(results_per_task, k)` ¶

Compute pass^k (consistency) across multiple tasks.

For each task with enough samples, computes pass^k. Returns the average pass^k across all eligible tasks.

Parameters:

Name	Type	Description	Default
`results_per_task`	`dict[str, list[bool]]`	Dict mapping task_id to list of pass/fail booleans	required
`k`	`int`	Number of consecutive passes required	required

Returns:

Type	Description
`float`	Average pass^k across all tasks with >= k samples

Example

    >>> results = {
    ...     "task1": [True, True, True, True, True],
    ...     "task2": [True, True, False, True, True],
    ... }
    >>> pass_to_k_estimator(results, k=3)
    0.5  # task1: 1.0, task2: 0.0 (its one failure sits in every window of 3)

`tracelens.ConsistencyAnalyzer` ¶

Analyzer for pass^k consistency metrics.

Computes pass^k for multiple k values and provides reliability scoring.

Example

    analyzer = ConsistencyAnalyzer(k_values=[2, 3, 5])
    results = analyzer.analyze(pass_results_by_task)
    print(results)
    # {"pass^2": 0.8, "pass^3": 0.6, "pass^5": 0.3}

`__init__(k_values=None)` ¶

Initialize analyzer with k values to compute.

Parameters:

Name	Type	Description	Default
`k_values`	`list[int] \| None`	List of k values for pass^k. Default: [2, 3, 5]	`None`

`analyze(results_per_task)` ¶

Compute pass^k for multiple k values.

Parameters:

Name	Type	Description	Default
`results_per_task`	`dict[str, list[bool]]`	Dict mapping task_id to list of pass/fail booleans	required

Returns:

Type	Description
`dict[str, float]`	Dict mapping "pass^k" to computed value

`compute_reliability_score(results_per_task)` ¶

Compute overall reliability score.

Combines pass^k metrics weighted by k to give higher weight to longer consistent runs. A higher score indicates more reliable/consistent performance.

Parameters:

Name	Type	Description	Default
`results_per_task`	`dict[str, list[bool]]`	Dict mapping task_id to list of pass/fail booleans	required

Returns:

Type	Description
`float`	Weighted reliability score (0.0 to 1.0)
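One plausible reading of "weighted by k" is a weighted mean with each pass^k value weighted by its k. The sketch below works through that reading; the exact weighting used by the library is an assumption here, and `reliability_score` is a hypothetical standalone function.

```python
def reliability_score(pass_to_k_by_k: dict[int, float]) -> float:
    """Weighted mean of pass^k values, weighting each by its k (assumed scheme)."""
    total_weight = sum(pass_to_k_by_k)  # sum of the k values (the dict keys)
    if total_weight == 0:
        return 0.0
    return sum(k * value for k, value in pass_to_k_by_k.items()) / total_weight


# (2 * 0.8 + 3 * 0.6 + 5 * 0.3) / (2 + 3 + 5) = 4.9 / 10
print(round(reliability_score({2: 0.8, 3: 0.6, 5: 0.3}), 2))  # 0.49
```

Under this scheme the pass^5 value counts 2.5 times as much as pass^2, so long consistent runs dominate the score.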

`compute_stability_metrics(results_per_task)` ¶

Compute additional stability metrics.

Returns:

Type	Description
`dict[str, float]`	Dict with: "pass^k" values for each k; "reliability_score": weighted combination; "failure_rate": proportion of failed trials; "longest_streak": average longest passing streak per task

`tracelens.bootstrap_ci(values, confidence=0.95, n_bootstrap=10000, statistic='mean', seed=None)` ¶

Compute bootstrap confidence interval for a statistic.

Uses percentile bootstrap method for simplicity and robustness.

Parameters:

Name	Type	Description	Default
`values`	`list[float] \| ndarray`	Sample values	required
`confidence`	`float`	Confidence level (default 0.95 for 95% CI)	`0.95`
`n_bootstrap`	`int`	Number of bootstrap samples	`10000`
`statistic`	`str`	"mean", "median", or "std"	`'mean'`
`seed`	`int \| None`	Random seed for reproducibility	`None`

Returns:

Type	Description
`tuple[float, float, float]`	Tuple of (point_estimate, lower_bound, upper_bound)
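The percentile bootstrap for the mean can be sketched with the standard library. `percentile_bootstrap` is an illustrative name, the sample values are made up, and the index-based percentile lookup is one simple convention among several.

```python
import random
import statistics


def percentile_bootstrap(values, confidence=0.95, n_bootstrap=10_000, seed=None):
    """Return (point_estimate, lower, upper) for the mean via the percentile bootstrap."""
    rng = random.Random(seed)
    n = len(values)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        statistics.fmean(rng.choices(values, k=n)) for _ in range(n_bootstrap)
    )
    alpha = 1.0 - confidence
    lower = means[int((alpha / 2) * (n_bootstrap - 1))]
    upper = means[int((1 - alpha / 2) * (n_bootstrap - 1))]
    return statistics.fmean(values), lower, upper


point, lower, upper = percentile_bootstrap([0.62, 0.70, 0.75, 0.81, 0.66, 0.78], seed=42)
print(f"{point:.3f} [{lower:.3f}, {upper:.3f}]")
```

Passing a fixed `seed` makes the interval reproducible across runs, which is useful when the bounds feed a CI gate.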

`tracelens.estimate_metric(values, confidence=0.95, n_bootstrap=10000, seed=None)` ¶

Estimate a metric with confidence interval.

Parameters:

Name	Type	Description	Default
`values`	`list[float] \| ndarray`	Sample values	required
`confidence`	`float`	Confidence level	`0.95`
`n_bootstrap`	`int`	Bootstrap samples	`10000`
`seed`	`int \| None`	Random seed	`None`

Returns:

Type	Description
`MetricEstimate`	MetricEstimate with mean, std, n, and CI

`tracelens.compare_metrics(baseline_values, current_values, confidence=0.95, n_bootstrap=10000, compute_p_value=False, seed=None)` ¶

Compare current metrics against baseline with statistical inference.

Computes bootstrap CI for the difference and determines statistical significance.

Parameters:

Name	Type	Description	Default
`baseline_values`	`list[float] \| ndarray`	Baseline sample values	required
`current_values`	`list[float] \| ndarray`	Current sample values	required
`confidence`	`float`	Confidence level for CI	`0.95`
`n_bootstrap`	`int`	Bootstrap samples	`10000`
`compute_p_value`	`bool`	Whether to compute permutation p-value	`False`
`seed`	`int \| None`	Random seed	`None`

Returns:

Type	Description
`ComparisonResult`	ComparisonResult with full statistical analysis
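
The two ingredients named above, a bootstrap CI for the difference and an optional permutation p-value, can be sketched with the standard library as follows. This is an illustration under assumptions (difference of means, two-sided test, hypothetical function and key names), not the code behind `compare_metrics`.

```python
import random
import statistics


def diff_ci_and_p(baseline, current, confidence=0.95, n_resamples=2000, seed=None):
    rng = random.Random(seed)
    observed = statistics.fmean(current) - statistics.fmean(baseline)

    # Bootstrap: resample each group independently and record the mean difference.
    diffs = sorted(
        statistics.fmean(rng.choices(current, k=len(current)))
        - statistics.fmean(rng.choices(baseline, k=len(baseline)))
        for _ in range(n_resamples)
    )
    alpha = 1.0 - confidence
    ci_lower = diffs[int((alpha / 2) * (n_resamples - 1))]
    ci_upper = diffs[int((1 - alpha / 2) * (n_resamples - 1))]

    # Permutation test: shuffle group labels and count differences at least as extreme.
    pooled = list(baseline) + list(current)
    split = len(baseline)
    extreme = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)
        permuted = statistics.fmean(pooled[split:]) - statistics.fmean(pooled[:split])
        if abs(permuted) >= abs(observed) - 1e-12:
            extreme += 1

    return {
        "delta": observed,
        "ci_lower": ci_lower,
        "ci_upper": ci_upper,
        "is_significant": ci_lower > 0 or ci_upper < 0,  # CI excludes zero
        "p_value": (extreme + 1) / (n_resamples + 1),
    }


baseline_runs = [0.80, 0.82, 0.79, 0.81, 0.83, 0.80]
current_runs = [0.60, 0.62, 0.58, 0.61, 0.59, 0.63]
outcome = diff_ci_and_p(baseline_runs, current_runs, seed=1)
print(outcome)
```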

`tracelens.compare_to_baseline_summary(baseline_mean, baseline_std, baseline_n, current_mean, current_std, current_n, confidence=0.95)` ¶

Compare metrics when only summary statistics are available.

Uses Welch's t-test approximation for the CI when raw data is not available.

Parameters:

Name	Type	Description	Default
`baseline_mean`	`float`	Baseline mean	required
`baseline_std`	`float`	Baseline std deviation	required
`baseline_n`	`int`	Baseline sample size	required
`current_mean`	`float`	Current mean	required
`current_std`	`float`	Current std deviation	required
`current_n`	`int`	Current sample size	required
`confidence`	`float`	Confidence level	`0.95`

Returns:

Type	Description
`ComparisonResult`	ComparisonResult (note: effect size may be less accurate)
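
A sketch of what a summary-statistics comparison involves: the Welch standard error and the Welch-Satterthwaite degrees of freedom are computed from the six inputs. To stay within the standard library, this sketch substitutes a normal quantile for the Student-t critical value, so its interval is slightly too narrow for small samples; that substitution, the function name, and the returned keys are assumptions for illustration only.

```python
import math
from statistics import NormalDist


def welch_summary_ci(b_mean, b_std, b_n, c_mean, c_std, c_n, confidence=0.95):
    """CI for (current - baseline) from summary statistics. Requires n >= 2 per group."""
    delta = c_mean - b_mean
    var_b, var_c = b_std**2 / b_n, c_std**2 / c_n
    if var_b + var_c == 0:
        raise ValueError("both standard deviations are zero; the interval is degenerate")
    se = math.sqrt(var_b + var_c)
    # Welch-Satterthwaite approximation to the degrees of freedom.
    df = (var_b + var_c) ** 2 / (var_b**2 / (b_n - 1) + var_c**2 / (c_n - 1))
    # Normal quantile used in place of the t quantile (assumption, see above).
    crit = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return {
        "delta": delta,
        "se": se,
        "df": df,
        "ci_lower": delta - crit * se,
        "ci_upper": delta + crit * se,
    }


summary = welch_summary_ci(0.80, 0.05, 30, 0.75, 0.05, 30)
print(summary)
```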

`tracelens.MetricEstimate` `dataclass` ¶

A metric estimate with uncertainty quantification.

Stores the point estimate along with confidence bounds and sample statistics for research-grade reporting.

`se` `property` ¶

Standard error of the mean.

`ci_width` `property` ¶

Width of the confidence interval.

`to_dict()` ¶

Convert to dictionary for serialization.

`tracelens.ComparisonResult` `dataclass` ¶

Result of comparing two metric estimates.

Contains statistical test results for determining if the difference is significant.

`is_regression` `property` ¶

Check if this represents a statistically significant regression.

A regression occurs when the difference is significantly negative (for higher-is-better metrics).

`is_improvement` `property` ¶

Check if this represents a statistically significant improvement.

`effect_magnitude` `property` ¶

Classify effect size magnitude (Cohen's conventions).
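
For reference, Cohen's conventional cut-offs are 0.2 (small), 0.5 (medium), and 0.8 (large). The sketch below computes a pooled-standard-deviation Cohen's d and buckets it accordingly. The label strings, including "negligible" for values under 0.2, and both function names are assumptions; the property may return different labels.

```python
import math
import statistics


def cohens_d(baseline, current):
    """Standardized mean difference using the pooled sample standard deviation."""
    n_b, n_c = len(baseline), len(current)
    pooled_var = (
        (n_b - 1) * statistics.variance(baseline)
        + (n_c - 1) * statistics.variance(current)
    ) / (n_b + n_c - 2)
    return (statistics.fmean(current) - statistics.fmean(baseline)) / math.sqrt(pooled_var)


def magnitude(d: float) -> str:
    """Bucket |d| by Cohen's conventional thresholds."""
    size = abs(d)
    if size < 0.2:
        return "negligible"
    if size < 0.5:
        return "small"
    if size < 0.8:
        return "medium"
    return "large"


d = cohens_d([1, 2, 3], [2, 3, 4])
print(d, magnitude(d))
```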

`to_dict()` ¶

Convert to dictionary for serialization.

`tracelens.LatencyAnalyzer` ¶

Analyzes streaming latency from transcript events.

`analyze(transcript)` ¶

Compute latency metrics for a single transcript.

`analyze_batch(transcripts)` ¶

Compute aggregate latency metrics across multiple transcripts.

`tracelens.LatencyMetrics` ¶

Bases: BaseModel

Latency metrics for a single transcript's streaming data.

`tracelens.AggregateLatencyMetrics` ¶

Bases: BaseModel

Aggregated latency metrics across multiple transcripts.

Baselines & regression detection¶

See the Baseline Regression Tutorial.

`tracelens.BaselineManager` ¶

Manages baseline storage and retrieval.

Baselines are stored in a JSON file and can be versioned with git for tracking changes over time.

Example

    manager = BaselineManager("baselines/baselines.json")

    # Get existing baseline
    baseline = manager.get_baseline("btc_backtest")

    # Update baseline
    manager.update_baseline("btc_backtest", {"sharpe_ratio": 1.5})

    # Save changes
    manager.save()

`__init__(baselines_path)` ¶

Initialize the baseline manager.

Parameters:

Name	Type	Description	Default
`baselines_path`	`str \| Path`	Path to the baselines JSON file	required

`save()` ¶

Save baselines to JSON file.

`get_baseline(task_id)` ¶

Get baseline for a task.

Parameters:

Name	Type	Description	Default
`task_id`	`str`	The task identifier	required

Returns:

Type	Description
`TaskBaseline \| None`	TaskBaseline if found, None otherwise

`set_baseline(baseline)` ¶

Set baseline for a task.

Parameters:

Name	Type	Description	Default
`baseline`	`TaskBaseline`	The task baseline to store	required

`update_baseline(task_id, metrics, metric_stds=None, sample_size=1, keep_thresholds=True)` ¶

Update or create a baseline with new metric values.

Parameters:

Name	Type	Description	Default
`task_id`	`str`	The task identifier	required
`metrics`	`dict[str, float]`	Dict of metric_name -> value	required
`metric_stds`	`dict[str, float] \| None`	Optional dict of metric_name -> std deviation	`None`
`sample_size`	`int`	Number of samples used to compute metrics	`1`
`keep_thresholds`	`bool`	Keep existing thresholds when updating	`True`

Returns:

Type	Description
`TaskBaseline`	The updated TaskBaseline

`list_tasks()` ¶

List all task IDs with baselines.

`compare_to_baseline(task_id, current_metrics)` ¶

Compare current metrics to baseline.

Parameters:

Name	Type	Description	Default
`task_id`	`str`	The task identifier	required
`current_metrics`	`dict[str, float]`	Dict of metric_name -> current value	required

Returns:

Type	Description
`dict[str, Any]`	Dict of metric_name -> comparison dict with: baseline (baseline value); current (current value); delta (absolute difference); relative_change (relative difference); regression (True if regression detected); z_score (standard score if std available)
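
To make the returned fields concrete, here is a sketch of a single-metric point comparison. The field names come from the table above, and `relative_threshold` and `higher_is_better` mirror parameters of `TaskBaseline.add_metric`; the regression rule itself (relative worsening beyond the threshold) and the function name are assumptions, not the library's documented logic.

```python
def compare_point(baseline, current, std=None, relative_threshold=0.05, higher_is_better=True):
    delta = current - baseline
    relative_change = delta / abs(baseline) if baseline != 0 else None
    z_score = delta / std if std else None  # only defined when a non-zero std is known

    # Signed so that a positive value means "got worse" for either metric direction.
    worsening = -delta if higher_is_better else delta
    if relative_change is None:
        regression = worsening > 0
    else:
        regression = worsening / abs(baseline) > relative_threshold

    return {
        "baseline": baseline,
        "current": current,
        "delta": delta,
        "relative_change": relative_change,
        "regression": regression,
        "z_score": z_score,
    }


comparison = compare_point(1.5, 1.2, std=0.1)
print(comparison)
```

Here a drop from 1.5 to 1.2 is a 20% relative decline, roughly three baseline standard deviations, so the sketch flags it as a regression.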

`create_canary_baseline(task_id, metrics, fingerprint, metric_stds=None, sample_size=1, task_name=None)` ¶

Create a protected canary baseline.

Canary baselines never auto-update and are tied to a specific DecisionSpec fingerprint. Use for safety-critical metrics.

Parameters:

Name	Type	Description	Default
`task_id`	`str`	The task identifier	required
`metrics`	`dict[str, float]`	Dict of metric_name -> value	required
`fingerprint`	`str`	DecisionSpec fingerprint (required for canary)	required
`metric_stds`	`dict[str, float] \| None`	Optional dict of metric_name -> std deviation	`None`
`sample_size`	`int`	Number of samples used to compute metrics	`1`
`task_name`	`str \| None`	Optional human-readable name	`None`

Returns:

Type	Description
`TaskBaseline`	The created canary baseline

Raises:

Type	Description
`ValueError`	If fingerprint is not provided

`create_capability_baseline(task_id, metrics, metric_stds=None, sample_size=1, task_name=None, promotion_policy=None, fingerprint=None)` ¶

Create a capability baseline that can auto-update.

Capability baselines track current agent capability and can be promoted when performance improves.

Parameters:

Name	Type	Description	Default
`task_id`	`str`	The task identifier	required
`metrics`	`dict[str, float]`	Dict of metric_name -> value	required
`metric_stds`	`dict[str, float] \| None`	Optional dict of metric_name -> std deviation	`None`
`sample_size`	`int`	Number of samples used to compute metrics	`1`
`task_name`	`str \| None`	Optional human-readable name	`None`
`promotion_policy`	`PromotionPolicy \| None`	Custom promotion policy (default allows auto-promotion)	`None`
`fingerprint`	`str \| None`	Optional DecisionSpec fingerprint	`None`

Returns:

Type	Description
`TaskBaseline`	The created capability baseline

`try_promote(task_id, current_metrics, metric_stds=None, sample_size=1, fingerprint=None)` ¶

Try to promote a baseline if criteria are met.

This method checks if the baseline can be promoted and performs the promotion if allowed.

Parameters:

Name	Type	Description	Default
`task_id`	`str`	The task identifier	required
`current_metrics`	`dict[str, float]`	New metric values	required
`metric_stds`	`dict[str, float] \| None`	Optional standard deviations	`None`
`sample_size`	`int`	Number of samples	`1`
`fingerprint`	`str \| None`	New fingerprint for the promoted baseline	`None`

Returns:

Type	Description
`tuple[bool, str]`	Tuple of (was_promoted, reason)
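
The `(was_promoted, reason)` contract can be pictured with the sketch below, which gates promotion on baseline type, sample size, and relative improvement. Only `min_improvement_relative` appears elsewhere on this page; `PolicySketch`, `min_sample_size`, the reason strings, and the higher-is-better assumption are all hypothetical and do not describe the real `PromotionPolicy`.

```python
from dataclasses import dataclass


@dataclass
class PolicySketch:
    """Hypothetical stand-in for PromotionPolicy."""
    min_improvement_relative: float = 0.05
    min_sample_size: int = 1  # assumed field, not documented on this page


def can_promote_sketch(is_canary, policy, baseline_metrics, current_metrics, sample_size=1):
    """Return (allowed, reason); assumes every metric is higher-is-better."""
    if is_canary:
        return False, "canary baselines never auto-update"
    if sample_size < policy.min_sample_size:
        return False, f"sample_size {sample_size} below minimum {policy.min_sample_size}"
    for name, base in baseline_metrics.items():
        if name not in current_metrics:
            return False, f"metric '{name}' missing from current results"
        if base == 0:
            return False, f"metric '{name}' has a zero baseline; relative gain undefined"
        gain = (current_metrics[name] - base) / abs(base)
        if gain < policy.min_improvement_relative:
            return False, f"metric '{name}' improved {gain:.1%}, below threshold"
    return True, "all metrics cleared the improvement threshold"


policy = PolicySketch(min_improvement_relative=0.05)
print(can_promote_sketch(False, policy, {"quality": 0.80}, {"quality": 0.90}))
print(can_promote_sketch(False, policy, {"quality": 0.80}, {"quality": 0.82}))
```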

`force_promote(task_id, current_metrics, metric_stds=None, sample_size=1, reason='manual', fingerprint=None)` ¶

Force promote a baseline, bypassing policy checks.

Use this for manual promotions or emergency updates. Even canary baselines can be force-promoted.

Parameters:

Name	Type	Description	Default
`task_id`	`str`	The task identifier	required
`current_metrics`	`dict[str, float]`	New metric values	required
`metric_stds`	`dict[str, float] \| None`	Optional standard deviations	`None`
`sample_size`	`int`	Number of samples	`1`
`reason`	`str`	Reason for the forced promotion	`'manual'`
`fingerprint`	`str \| None`	New fingerprint for the promoted baseline	`None`

Returns:

Type	Description
`TaskBaseline`	The promoted baseline

Raises:

Type	Description
`ValueError`	If no baseline exists

`list_canary_baselines()` ¶

List all canary (protected) baseline task IDs.

`list_capability_baselines()` ¶

List all capability baseline task IDs.

`list_stale_baselines()` ¶

List all stale baseline task IDs.

`get_baseline_summary(task_id)` ¶

Get a summary of a baseline for reporting.

`compare_to_baseline_with_ci(task_id, current_values, confidence=0.95, n_bootstrap=10000)` ¶

Compare current metrics to baseline with bootstrap CI.

Uses statistical inference to determine if differences are significant. This is the research-grade comparison method.

Parameters:

Name	Type	Description	Default
`task_id`	`str`	The task identifier	required
`current_values`	`dict[str, list[float]]`	Dict of metric_name -> list of sample values	required
`confidence`	`float`	Confidence level (default 0.95 for 95% CI)	`0.95`
`n_bootstrap`	`int`	Number of bootstrap samples	`10000`

Returns:

Type	Description
`dict[str, Any]`	Dict of metric_name -> comparison result with: baseline (MetricEstimate for baseline); current (MetricEstimate for current); delta (point estimate of difference); ci_lower, ci_upper (CI bounds for difference); is_significant (True if CI doesn't include 0); is_regression (True if significant decline); cohens_d (effect size)

`tracelens.BaselineType` ¶

Bases: str, Enum

Type of baseline determining update semantics.

`CANARY`: Protected baseline that never auto-updates.

- Represents the absolute performance floor
- Only manually updated after explicit approval
- Tied to a specific DecisionSpec fingerprint
- Use for safety-critical or business-critical metrics

`CAPABILITY`: Baseline that can auto-update on improvements.

- Tracks the agent's current capability ceiling
- Auto-updates when performance improves significantly
- Maintains version history for rollback
- Use for tracking progress and catching regressions

Baseline for experimental features.

- Loose thresholds, high variance expected
- Auto-updates more aggressively
- Use during active development

`tracelens.PromotionPolicy` ¶

Bases: BaseModel

Policy for automatic baseline promotion.

Controls when and how baselines can be automatically updated.

`tracelens.MetricBaseline` ¶

Bases: BaseModel

Baseline for a single metric.

Stores the expected value, standard deviation, and thresholds for regression detection.

`tracelens.TaskBaseline` ¶

Bases: BaseModel

Baseline for a complete task.

Groups metric baselines together with metadata about when the baseline was created.

Supports two types of baselines:

- CANARY: Protected, never auto-updates (safety floor)
- CAPABILITY: Can auto-update on improvements

Example

    # Create a canary baseline (protected)
    baseline = TaskBaseline(
        task_id="safety_check",
        baseline_type=BaselineType.CANARY,
        fingerprint="abc123def456",  # Tied to specific config
    )

    # Create a capability baseline (can auto-update)
    baseline = TaskBaseline(
        task_id="quality_score",
        baseline_type=BaselineType.CAPABILITY,
        promotion_policy=PromotionPolicy(min_improvement_relative=0.05),
    )

`is_canary` `property` ¶

Check if this is a canary (protected) baseline.

`allows_auto_promotion` `property` ¶

Check if this baseline allows automatic promotion.

`is_stale` `property` ¶

Check if baseline is stale based on max_age_days policy.

`in_cooldown` `property` ¶

Check if baseline is in promotion cooldown period.
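
Both `is_stale` and `in_cooldown` come down to timestamp arithmetic. The sketch below shows one way such checks could be written. `max_age_days` is named in the `is_stale` description; `created_at`, `last_promoted_at`, and `cooldown_hours` are hypothetical names chosen for the example.

```python
from datetime import datetime, timedelta, timezone


def is_stale(created_at, max_age_days, now=None):
    """True when the baseline is older than max_age_days (None disables the check)."""
    now = now or datetime.now(timezone.utc)
    return max_age_days is not None and now - created_at > timedelta(days=max_age_days)


def in_cooldown(last_promoted_at, cooldown_hours, now=None):
    """True when the last promotion happened less than cooldown_hours ago."""
    now = now or datetime.now(timezone.utc)
    return last_promoted_at is not None and now - last_promoted_at < timedelta(hours=cooldown_hours)


now = datetime(2025, 3, 1, tzinfo=timezone.utc)
created = datetime(2025, 1, 1, tzinfo=timezone.utc)
promoted = datetime(2025, 2, 28, 12, 0, tzinfo=timezone.utc)

print(is_stale(created, max_age_days=30, now=now))
print(in_cooldown(promoted, cooldown_hours=24, now=now))
```

Passing `now` explicitly keeps the checks deterministic in tests; timezone-aware datetimes avoid mixing naive and aware values.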

`add_metric(metric_name, value, std=0.0, sample_size=1, absolute_threshold=None, relative_threshold=None, higher_is_better=True)` ¶

Add or update a metric baseline.

`get_metric(metric_name)` ¶

Get a specific metric baseline.

`can_promote(current_metrics, sample_size=1)` ¶

Check if baseline can be promoted with new metrics.

Parameters:

Name	Type	Description	Default
`current_metrics`	`dict[str, float]`	New metric values	required
`sample_size`	`int`	Number of samples used to compute metrics	`1`

Returns:

Type	Description
`tuple[bool, str]`	Tuple of (can_promote, reason)

`promote(current_metrics, metric_stds=None, sample_size=1, reason='auto', fingerprint=None)` ¶

Promote baseline to new values.

Archives current version and updates metrics.

Parameters:

Name	Type	Description	Default
`current_metrics`	`dict[str, float]`	New metric values	required
`metric_stds`	`dict[str, float] \| None`	Optional standard deviations	`None`
`sample_size`	`int`	Number of samples	`1`
`reason`	`str`	Reason for promotion	`'auto'`
`fingerprint`	`str \| None`	Optional new fingerprint	`None`

`tracelens.RegressionDetector` ¶

Detects regressions between baseline and current results.

Uses statistical tests to determine if observed differences are significant.

Example

    import sys

    detector = RegressionDetector(significance_level=0.05)
    report = detector.compare(baseline, current_results)

    if report.should_block_ci():
        sys.exit(1)

`__init__(significance_level=0.05, min_delta_percent=5.0, noise_band_absolute=DEFAULT_NOISE_BAND_ABSOLUTE, noise_band_aware=True)` ¶

Initialize the detector.

Parameters:

Name	Type	Description	Default
`significance_level`	`float`	P-value threshold for significance	`0.05`
`min_delta_percent`	`float`	Minimum percentage change to consider	`5.0`
`noise_band_absolute`	`float`	Absolute delta below which a regression on a pass-rate-style metric (0-1 scale) is considered "within the infra-noise band" when the baseline and current infra configs don't match. Defaults to 0.03 (3pp), following Anthropic's infra-noise study.	`DEFAULT_NOISE_BAND_ABSOLUTE`
`noise_band_aware`	`bool`	If True, compare_with_specs() will mark sub-noise-band regressions as `within_noise_band` when infra configs differ. Set to False to disable the downgrade (always treat every delta as real).	`True`

`compare(baseline, current_results)` ¶

Compare current results against baseline.

Parameters:

Name	Type	Description	Default
`baseline`	`TaskBaseline`	The baseline to compare against	required
`current_results`	`list[dict[str, Any]]`	List of result dicts, each with metric values	required

Returns:

Type	Description
`RegressionReport`	RegressionReport with detected regressions

`compare_multiple(baselines, current_results)` ¶

Compare multiple tasks against their baselines.

Parameters:

Name	Type	Description	Default
`baselines`	`dict[str, TaskBaseline]`	Dict of task_id -> TaskBaseline	required
`current_results`	`dict[str, list[dict[str, Any]]]`	Dict of task_id -> list of result dicts	required

Returns:

Type	Description
`dict[str, RegressionReport]`	Dict of task_id -> RegressionReport

`compare_with_specs(baseline, current_results, baseline_spec=None, current_spec=None)` ¶

Compare with DecisionSpec awareness for infra-noise detection.

Wraps compare() and additionally:

- Diffs the two DecisionSpecs' infra sections and records any mismatch in report.infra_config_mismatch and report.infra_config_diff.
- For each detected regression, if the absolute delta falls within noise_band_absolute (default 3pp) AND the infra configs don't match, sets the regression's within_noise_band flag to True. Those regressions still show up in the report but are excluded from blocking_regressions, so a default should_block_ci() call won't gate a merge on ambiguous noise.

When either spec is omitted, this degrades to ordinary compare() behavior with infra_config_mismatch=False.

Parameters:

Name	Type	Description	Default
`baseline`	`TaskBaseline`	TaskBaseline to compare against.	required
`current_results`	`list[dict[str, Any]]`	Current run's metric values.	required
`baseline_spec`	`DecisionSpec \| None`	DecisionSpec captured when the baseline was recorded. Optional but enables infra-noise reasoning.	`None`
`current_spec`	`DecisionSpec \| None`	DecisionSpec for the current run. Optional but enables infra-noise reasoning.	`None`

Returns:

Type	Description
`RegressionReport`	RegressionReport with `infra_config_mismatch`, `infra_config_diff`, and per-regression `within_noise_band` annotations populated.
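
The noise-band rule described for `compare_with_specs` can be sketched as follows: a regression is flagged only when the infra configs differ and its absolute delta sits below the band, and flagged regressions are then left out of the blocking set. `RegressionSketch` and both helper functions are hypothetical stand-ins; the 0.03 default is the value stated for `noise_band_absolute`.

```python
from dataclasses import dataclass

NOISE_BAND_ABSOLUTE = 0.03  # 3pp, the documented default


@dataclass
class RegressionSketch:
    """Hypothetical stand-in for a detected regression."""
    metric: str
    delta: float
    within_noise_band: bool = False


def annotate_noise_band(regressions, infra_config_mismatch, noise_band_absolute=NOISE_BAND_ABSOLUTE):
    """Flag sub-band regressions, but only when the infra configs do not match."""
    for reg in regressions:
        reg.within_noise_band = infra_config_mismatch and abs(reg.delta) < noise_band_absolute
    return regressions


def blocking(regressions):
    """Regressions that would still gate a merge."""
    return [reg for reg in regressions if not reg.within_noise_band]


found = [RegressionSketch("pass_rate", -0.02), RegressionSketch("pass_rate_hard", -0.08)]
annotate_noise_band(found, infra_config_mismatch=True)
print([reg.metric for reg in blocking(found)])  # prints ['pass_rate_hard']
```

With matching infra configs nothing is flagged, so both regressions would block.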

`tracelens.RegressionReport` ¶

Bases: BaseModel

Complete regression analysis report.

`blocking_regressions` `property` ¶

Regressions that should actually block CI.

Excludes any regression marked within_noise_band — those are within the infra-noise uncertainty and shouldn't gate merges until the eval configuration is matched.

`should_block_ci(threshold=RegressionSeverity.MODERATE, ignore_noise_band=True)` ¶

Determine if CI should be blocked based on severity.

Parameters:

Name	Type	Description	Default
`threshold`	`RegressionSeverity`	Minimum severity to block. Default: MODERATE	`MODERATE`
`ignore_noise_band`	`bool`	If True (default), within-noise-band regressions don't count — a 2pp drop under a mismatched infra config is ambiguous and shouldn't auto-gate merges per Anthropic's infra-noise guidance. Pass False to treat every regression as blocking regardless of noise.	`True`

Returns:

Type	Description
`bool`	True if CI should be blocked

`to_ci_output()` ¶

Generate CI-friendly output.

`tracelens.RegressionSeverity` ¶

Bases: str, Enum

Severity levels for regressions.

`tracelens.MetricRegression` ¶

Bases: BaseModel

Detected regression in a specific metric.

Reproducibility (DecisionSpec)¶

See Reproducibility & DecisionSpec.

`tracelens.DecisionSpec` ¶

Bases: BaseModel

Complete specification of all decision-affecting parameters.

A DecisionSpec is an immutable fingerprint of everything that could affect agent behavior. Two runs with the same DecisionSpec fingerprint should produce statistically similar results (given the same task input).

Example

    spec = DecisionSpec(
        model=ModelConfig(
            provider="anthropic",
            model_id="claude-3-opus-20240229",
            temperature=0.7,
        ),
        prompts=PromptSpec.from_prompts(
            system_prompt="You are a helpful assistant...",
            prompt_template="Given {context}, do {task}...",
        ),
        tools=[
            ToolSpec(name="search", version="1.0"),
            ToolSpec(name="calculator", version="2.1"),
        ],
        agent=AgentSpec(
            agent_name="goal_decomposition",
            agent_version="1.0.0",
        ),
        environment=EnvironmentSpec(
            git_commit="abc123",
            framework_version="0.1.0",
        ),
    )
    print(spec.fingerprint)  # "a1b2c3d4..."

`fingerprint` `property` ¶

Compute SHA-256 fingerprint of the decision spec.

This fingerprint uniquely identifies the configuration. Two runs with the same fingerprint should produce statistically similar results.

`fingerprint_short` `property` ¶

Short version of fingerprint (first 12 characters).
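
How the spec is serialized before hashing is not documented on this page. One common approach, shown here purely as an assumption, is to dump the hash dict as canonical JSON (sorted keys, fixed separators) and hash the UTF-8 bytes with SHA-256. The function name and sample dicts are hypothetical.

```python
import hashlib
import json


def fingerprint_sketch(hash_dict: dict) -> str:
    """SHA-256 hex digest of a canonical JSON encoding of the dict."""
    canonical = json.dumps(hash_dict, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


spec_a = {"model": {"provider": "anthropic", "temperature": 0.7}, "agent": "goal_decomposition"}
spec_b = {"agent": "goal_decomposition", "model": {"temperature": 0.7, "provider": "anthropic"}}
spec_c = {"agent": "goal_decomposition", "model": {"temperature": 0.2, "provider": "anthropic"}}

full = fingerprint_sketch(spec_a)
short = full[:12]  # same length as fingerprint_short
print(short)
```

Sorting keys makes the digest independent of dict insertion order, so `spec_a` and `spec_b` hash identically while the temperature change in `spec_c` produces a different digest.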

`is_compatible_with(other)` ¶

Check if two specs are compatible for comparison.

Two specs are compatible if they have the same model and agent, even if prompts or environment differ. This is useful for comparing prompt changes while keeping other factors constant.

`diff(other)` ¶

Compare two specs and return differences.

Returns dict mapping field names to (self_value, other_value) tuples for fields that differ.
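
The return shape can be illustrated on plain nested dicts. In this sketch, nested fields are reported under dotted paths; that path format, the function name, and the use of dicts in place of spec objects are assumptions, since the page only states that the result maps field names to `(self_value, other_value)` tuples.

```python
def diff_sketch(left: dict, right: dict, prefix: str = "") -> dict:
    """Map each differing field to a (left_value, right_value) tuple."""
    changes = {}
    for key in sorted(left.keys() | right.keys()):
        path = f"{prefix}{key}"
        a, b = left.get(key), right.get(key)
        if isinstance(a, dict) and isinstance(b, dict):
            changes.update(diff_sketch(a, b, prefix=f"{path}."))  # recurse into sections
        elif a != b:
            changes[path] = (a, b)
    return changes


old = {"model": {"model_id": "m-1", "temperature": 0.7}, "agent": {"agent_version": "1.0.0"}}
new = {"model": {"model_id": "m-1", "temperature": 0.2}, "agent": {"agent_version": "1.0.0"}}
print(diff_sketch(old, new))  # prints {'model.temperature': (0.7, 0.2)}
```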

`to_summary()` ¶

Create human-readable summary.

`tracelens.ModelConfig` ¶

Bases: BaseModel

Configuration for the LLM model used.

Captures provider, model version, and all decoding parameters that could affect output.

`to_hash_dict()` ¶

Return dict of fields that affect output (for hashing).

`tracelens.PromptSpec` ¶

Bases: BaseModel

Specification of prompts used in the agent.

Stores hashes of prompt templates for traceability without storing the full prompts (which may be long or sensitive).

`from_prompts(system_prompt=None, prompt_template=None, prompt_version=None, store_full_prompts=False)` `classmethod` ¶

Create PromptSpec from actual prompts.

Parameters:

Name	Type	Description	Default
`system_prompt`	`str \| None`	The system prompt text	`None`
`prompt_template`	`str \| None`	The prompt template text	`None`
`prompt_version`	`str \| None`	Optional version identifier	`None`
`store_full_prompts`	`bool`	Whether to store full prompts (default False)	`False`

`to_hash_dict()` ¶

Return dict of fields for hashing.

`tracelens.ToolSpec` ¶

Bases: BaseModel

Specification of a tool available to the agent.

Captures tool identity and version for reproducibility.

`to_hash_dict()` ¶

Return dict for hashing.

`tracelens.AgentSpec` ¶

Bases: BaseModel

Specification of the agent being evaluated.

Captures agent identity, version, and graph structure.

`to_hash_dict()` ¶

Return dict for hashing.

`tracelens.InfraConfig` ¶

Bases: BaseModel

Infrastructure / runtime-environment configuration.

Agentic evals are end-to-end system tests: the runtime environment is part of the problem-solving process. Resource limits, time budgets, and concurrency all influence what strategies an agent can use, which means infrastructure configuration is a first-class experimental variable — not passive scaffolding.

Anthropic's "Quantifying infrastructure noise in agentic coding evals" (Feb 2026) showed that infrastructure config alone can swing Terminal-Bench 2.0 scores by ~6 percentage points, often more than the leaderboard gap between frontier models. Their recommendations are baked into the shape of this spec:

- Record both a guaranteed allocation and a separate hard kill threshold, per task (see cpu_guaranteed / cpu_hard_limit and memory_guaranteed_mb / memory_hard_limit_mb). Pinning them to the same value leaves zero headroom for transient spikes and produces spurious infra failures.
- Capture the sandboxing provider, because enforcement semantics differ across runtimes (Kubernetes vs. Docker vs. Fly.io vs. bare containers).
- Keep observational fields (hostname, container_id, wall_clock_start_utc) out of the fingerprint so two runs with identical resource configs on different hosts collide to the same fingerprint.

See: https://www.anthropic.com/engineering/infrastructure-noise

`to_hash_dict()` ¶

Return dict of behavior-affecting fields for hashing.

Observational fields (hostname, container_id, wall_clock_start_utc) are intentionally excluded: two runs with identical resource configs on different hosts should collide to the same fingerprint.
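
A small sketch of the exclusion rule, using a plain dict in place of the model: the three observational fields are dropped before hashing, so the host a run landed on does not change the digest. The field names are the ones listed above; the function names, sample values, and the JSON-plus-SHA-256 hashing step are assumptions.

```python
import hashlib
import json

OBSERVATIONAL_FIELDS = {"hostname", "container_id", "wall_clock_start_utc"}


def infra_hash_dict(config: dict) -> dict:
    """Keep only behavior-affecting fields."""
    return {key: value for key, value in config.items() if key not in OBSERVATIONAL_FIELDS}


def infra_digest(config: dict) -> str:
    canonical = json.dumps(infra_hash_dict(config), sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


limits = {
    "cpu_guaranteed": 2.0,
    "cpu_hard_limit": 4.0,
    "memory_guaranteed_mb": 4096,
    "memory_hard_limit_mb": 8192,
}
run_1 = {**limits, "hostname": "worker-a", "container_id": "c-001"}
run_2 = {**limits, "hostname": "worker-b", "container_id": "c-002"}
run_3 = {**run_1, "cpu_hard_limit": 2.0}  # hard limit pinned to the guarantee

print(infra_digest(run_1) == infra_digest(run_2))  # prints True
print(infra_digest(run_1) == infra_digest(run_3))  # prints False
```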

`tracelens.EnvironmentSpec` ¶

Bases: BaseModel

Specification of the execution environment.

Captures build/deployment information for traceability.

`to_hash_dict()` ¶

Return dict for hashing.

LLM judge providers¶

`tracelens.LLMProvider` ¶

Bases: ABC

Abstract base class for LLM providers.

`complete(prompt, **kwargs)` `abstractmethod` `async` ¶

Send a prompt to the LLM and return the text response.

`tracelens.InMemoryProvider` ¶

Bases: LLMProvider

Testing provider that returns canned responses.

Cycles through the provided responses list and records all prompts.
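
The behavior described here (cycle through canned responses, record every prompt) is easy to picture with a standalone sketch. `ProviderBase` below is a local stand-in for `LLMProvider` so the example runs without the package, and `CannedProvider` with its `responses` and `prompts` attributes is hypothetical; only the `complete(prompt, **kwargs)` async signature is taken from this page.

```python
import asyncio
from abc import ABC, abstractmethod
from typing import Any


class ProviderBase(ABC):
    """Local stand-in mirroring the documented async complete() signature."""

    @abstractmethod
    async def complete(self, prompt: str, **kwargs: Any) -> str:
        """Send a prompt and return the text response."""


class CannedProvider(ProviderBase):
    """Returns canned responses in order, wrapping around, and records prompts."""

    def __init__(self, responses: list[str]) -> None:
        if not responses:
            raise ValueError("at least one canned response is required")
        self.responses = list(responses)
        self.prompts: list[str] = []

    async def complete(self, prompt: str, **kwargs: Any) -> str:
        reply = self.responses[len(self.prompts) % len(self.responses)]
        self.prompts.append(prompt)
        return reply


async def demo() -> list[str]:
    provider = CannedProvider(["PASS", "FAIL"])
    return [await provider.complete(f"grade trial {i}") for i in range(3)]


print(asyncio.run(demo()))  # prints ['PASS', 'FAIL', 'PASS']
```

The same subclassing shape applies when wrapping a real model client: implement `complete` and return the response text.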

`tracelens.create_provider(model_or_alias, **kwargs)` ¶

Create an LLM provider from an alias.

Parameters:

Name	Type	Description	Default
`model_or_alias`	`str`	Currently only `"in-memory"` is supported. For real provider calls, subclass `LLMProvider` directly — tracelens no longer ships a built-in LiteLLM/OpenAI/Anthropic wrapper (see module docstring for the canonical pattern).	required
`**kwargs`	`Any`	Passed to the provider constructor. For `"in-memory"`, use `responses=[...]` to seed canned responses.	`{}`

Returns:

Type	Description
`LLMProvider`	An LLMProvider instance.

Raises:

Type	Description
`ValueError`	If `model_or_alias` is anything other than `"in-memory"`. The error message points at the subclassing pattern so callers know what to do next.

Reporting¶

`tracelens.ReportGenerator` ¶

Generates evaluation reports from TrialBatch results.

Example

    gen = ReportGenerator()
    report = gen.build_report(batch)
    print(gen.render_markdown(report))

`build_report(batch, baseline_manager=None)` ¶

Build a ReportData from a TrialBatch.

`render_markdown(report)` ¶

Render a human-readable markdown report.

`render_ci_summary(report)` ¶

Render a compact CI-friendly summary.

`render_html(report)` ¶

Render a self-contained HTML dashboard with inline CSS and SVG charts.

`tracelens.ReportData` `dataclass` ¶

Complete evaluation report.

`tracelens.TaskSummary` `dataclass` ¶

Per-task summary statistics.
