API Reference
Auto-generated from docstrings. Everything below is importable from the package
root (from tracelens import ...) and covered by the stability policy — submodule
paths may move, so import from the root for anything you depend on long-term.
Tip
New here? Read Core Concepts & Glossary first for the mental model, then the User Guide for guided usage. This page is the exhaustive index.
Core models
tracelens.Task
Bases: BaseModel
Represents a single evaluation task/test case.
A Task defines:
- Input data to feed to the agent
- Optional expected outputs for validation
- Metadata for filtering and categorization
- Configuration for execution
Example
    task = Task(
        name="Portfolio website decomposition",
        input_data={
            "goal": "Build a personal portfolio website",
            "user_context": {"experience": "beginner", "hours_per_week": 15},
        },
        category="programming",
        tags=["web", "beginner"],
    )
matches_filter(tags=None, categories=None, difficulties=None)
Check if task matches filter criteria.
tracelens.TaskLoader
tracelens.JSONTaskLoader
Bases: TaskLoader
Load tasks from JSON files.
Supports:
- Single file with {"tasks": [...]} or a single task object
- Directory of JSON files
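The two file shapes listed above can be told apart with a small check. This is only a sketch of the idea, not JSONTaskLoader's implementation: `load_task_dicts` is a hypothetical helper, it returns raw dicts rather than Task objects, and the non-recursive `*.json` glob and sorted file order are assumptions.

```python
import json
from pathlib import Path
from typing import Union


def load_task_dicts(path: Union[str, Path]) -> list[dict]:
    """Collect raw task dicts from one JSON file or a directory of JSON files."""
    root = Path(path)
    # Assumption: directories are scanned non-recursively, in sorted order.
    files = sorted(root.glob("*.json")) if root.is_dir() else [root]
    tasks: list[dict] = []
    for file in files:
        data = json.loads(file.read_text(encoding="utf-8"))
        if isinstance(data, dict) and isinstance(data.get("tasks"), list):
            tasks.extend(data["tasks"])  # {"tasks": [...]} wrapper
        else:
            tasks.append(data)  # a single task object
    return tasks
```

Each returned dict would still need to be validated into a Task before use.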
tracelens.EvalSet
Bases: BaseModel
Collection of tasks for evaluation.
An EvalSet groups related tasks together for batch evaluation. It also defines default configuration for all tasks in the set.
Example
    eval_set = EvalSet(
        name="Goal Decomposition Suite v1",
        tasks=tasks,
        default_num_runs=5,
        default_grader_ids=["quality", "personalization"],
    )
filter_tasks(tags=None, categories=None, difficulties=None, max_tasks=None)
Filter tasks by criteria.
filtered_eval_set(tags=None, categories=None, difficulties=None, max_tasks=None)
Return a new EvalSet with only tasks matching the filter criteria.
add_task(task)
Add a task to the set.
remove_task(task_id)
Remove a task by ID. Returns True if found and removed.
get_task(task_id)
Get a task by ID.
tracelens.Transcript
Bases: BaseModel
Complete record of agent execution for a task.
A Transcript captures everything that happened during an agent's execution, providing a complete audit trail for grading and debugging.
Example
    transcript = Transcript(
        transcript_id=str(uuid.uuid4()),
        task_id=task.task_id,
        agent_name="goal_decomposition",
        started_at=utc_now(),
    )
    transcript.steps.append(step)
    transcript.final_output = result
    transcript.completed_at = utc_now()
has_streaming_data
property
Whether this transcript contains streaming events.
first_token_latency_ms
property
Time to first token event in milliseconds, or None if no streaming.
streaming_duration_ms
property
Total streaming duration from first to last event, or None if no streaming.
streaming_token_count
property
Total tokens across all streaming TOKEN events.
duration_ms
property
Calculate execution duration in milliseconds.
total_tokens
property
Calculate total token usage across all steps.
input_tokens
property
Calculate total input tokens.
output_tokens
property
Calculate total output tokens.
has_errors
property
Check if any errors occurred during execution.
llm_calls_count
property
Count LLM calls.
tool_calls_count
property
Count tool calls.
add_streaming_event(event)
Append a streaming event to the transcript.
add_step(step)
Add a step to the transcript.
get_tool_calls_by_name(name)
Get all tool calls with a specific name.
get_steps_by_type(step_type)
Get all steps of a specific type.
to_dict()
Serialize to a JSON-safe dict with full round-trip fidelity.
from_dict(data)
classmethod
Reconstruct a Transcript from a dict produced by to_dict().
to_summary()
Create a summary dict for reporting.
tracelens.TranscriptStep
Bases: BaseModel
A single step in agent execution.
Steps are the atomic units of a transcript. Each step represents one action the agent took (LLM call, tool call, etc.).
is_error
property
Check if this step is an error.
tracelens.StreamingEvent
Bases: BaseModel
A single event in a streaming response.
Timestamps are in milliseconds since stream start for precise inter-token latency analysis.
tracelens.StreamingEventType
Bases: StrEnum
Types of streaming events for real-time token delivery.
tracelens.Outcome
Bases: BaseModel
Result of grading a trial.
An Outcome contains:
- Primary pass/fail determination
- Normalized score (0-1)
- Detailed metrics dict (grader-specific)
- Optional feedback and reasoning
Example
    outcome = Outcome(
        trial_id="...",
        grader_id="quality",
        passed=True,
        score=0.85,
        metrics={"specificity": 0.9, "personalization": 0.8},
        feedback="Tasks are specific but could use more context references",
    )
tracelens.GradeLevel
Bases: str, Enum
Categorical grade levels for human-readable results.
from_score(score)
classmethod
Convert a normalized score to a grade level.
tracelens.Trial
Bases: BaseModel
A single execution of a task.
A Trial tracks:
- Which task is being executed
- The run index (for pass@k with multiple runs)
- The execution transcript
- Grading outcomes from all graders
- Status and timing
Example
    trial = Trial(
        task_id=task.task_id,
        run_index=0,
        total_runs=5,
    )
    trial.status = TrialStatus.RUNNING
    trial.started_at = utc_now()
    # ... execute agent ...
    trial.transcript = transcript
    trial.status = TrialStatus.COMPLETED
    trial.completed_at = utc_now()
duration_ms
property
Calculate trial duration in milliseconds.
passed
property
Trial passes if ALL outcomes pass.
aggregate_score
property
Average score across all outcomes.
is_complete
property
Check if trial has finished (successfully or not).
is_successful
property
Check if trial completed without errors.
has_grader_error
property
Whether any grader crashed while grading this trial.
Grader crashes are synthesized as failed outcomes so the trial stays conservative (not passed), but they must be counted separately because they measure the eval harness, not the agent.
is_infra_failure
property
Whether this trial failed due to infrastructure, not the agent.
Infra failures (OOM kills, network errors, sandbox terminations) should be counted separately from task failures when interpreting scores; otherwise noisy infra inflates the apparent failure rate.
fingerprint
property
Get decision spec fingerprint from transcript.
fingerprint_short
property
Get short decision spec fingerprint from transcript.
add_outcome(outcome)
Add a grading outcome to this trial.
get_outcome_by_grader(grader_id)
Get outcome from a specific grader.
get_metric(metric_name)
Get a specific metric value from any outcome.
to_summary_dict()
Return a summary suitable for reporting.
to_ci_dict()
Return a compact dict for CI output.
tracelens.TrialBatch
Bases: BaseModel
Collection of trials for batch processing.
Useful for running multiple trials in parallel and aggregating results.
total_count
property
Total number of trials.
completed_count
property
Number of completed trials.
passed_count
property
Number of passed trials.
pass_rate
property
Pass rate across all trials.
infra_error_count
property
Number of trials that failed due to infrastructure issues.
infra_error_rate
property
Fraction of trials that hit infrastructure failures.
Report this alongside pass_rate: Anthropic's infra-noise study
found that infra error rates can move by 5+ percentage points
purely from resource-configuration changes. A spike in
infra_error_rate between two baselines is a strong hint that
the regression is noise, not a real capability drop.
total_input_tokens
property
Total input tokens across all trial transcripts.
total_output_tokens
property
Total output tokens across all trial transcripts.
total_tokens
property
Total tokens (input + output) across all trial transcripts.
grader_error_count
property
Number of trials where at least one grader crashed.
grader_error_rate
property
Fraction of trials affected by grader crashes.
Report this alongside pass_rate: a spike here means the
grading harness is broken, not that the agent regressed.
all_complete
property
Check if all trials are complete.
add_trial(trial)
Add a trial to the batch.
get_trials_for_task(task_id)
Get all trials for a specific task.
to_dict()
Serialize to a JSON-safe dict with full round-trip fidelity.
from_dict(data)
classmethod
Reconstruct a TrialBatch from a dict produced by to_dict().
get_pass_results_by_task()
Get pass/fail results grouped by task.
Returns a dict mapping task_id to a list of boolean pass results. Useful for computing pass@k.
tracelens.TrialStatus
Bases: str, Enum
Status of a trial.
tracelens.InfraError
Bases: Exception
Raised by adapters when a failure is known to be infrastructural.
Separating infra failures from task-level failures matters because they mean different things for evaluation scores. A pod killed for exceeding its memory limit tells you nothing about the agent's capability — but it does tell you the eval's resource configuration is too tight (see Anthropic's "Quantifying infrastructure noise in agentic coding evals", which measured infra error rates dropping from 5.8% at strict enforcement to 0.5% uncapped).
When the runner catches this exception (or other known-infra exception
types like MemoryError, ConnectionError, TimeoutError from
the network stack, or OSError), the trial's status is set to
TrialStatus.INFRA_ERROR rather than FAILED, and the infra error
rate is surfaced separately in reports so you can decide whether a
regression is real or a noise artefact.
Example
    class MyAdapter(AgentAdapter):
        async def run(self, task):
            try:
                return await do_work(task)
            except httpx.ConnectError as exc:
                raise InfraError(f"upstream API unreachable: {exc}") from exc
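The classification rule described above can be pictured roughly as follows. This is a sketch, not the runner's code: `InfraError` is re-declared locally so the snippet runs on its own, and `classify_failure` and the status strings are hypothetical names chosen for illustration.

```python
class InfraError(Exception):
    """Local stand-in for tracelens.InfraError so the sketch is self-contained."""


# Exception types the text lists as known-infra. In Python 3, ConnectionError
# and TimeoutError are subclasses of OSError, so OSError alone would cover
# them; they are spelled out here to mirror the prose.
INFRA_EXCEPTIONS = (InfraError, MemoryError, ConnectionError, TimeoutError, OSError)


def classify_failure(exc: BaseException) -> str:
    """Return a hypothetical status label for an exception raised during a trial."""
    if isinstance(exc, INFRA_EXCEPTIONS):
        return "infra_error"
    return "failed"
```

Because OSError is a broad base class, file-system errors raised by the agent itself would also be counted as infrastructure under this rule, which is worth keeping in mind when reading the infra error rate.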
Execution
tracelens.AgentAdapter
Bases: ABC
Abstract base class for agent adapters.
Adapters bridge the evaluation runner to the agent being evaluated.
Implement run() to invoke your agent and return a Transcript.
Optionally override setup() and teardown() for lifecycle management.
The runner guarantees teardown is called even if run() fails.
Example
    class MyAdapter(AgentAdapter):
        async def setup(self, task: Task) -> None:
            self.db = await create_test_database()

        async def run(self, task: Task) -> Transcript:
            # Start timing before invoking the agent so duration_ms covers the call.
            transcript = self.start_transcript(task)
            result = await my_agent.invoke(task.input_data)
            transcript.final_output = result
            transcript.completed_at = utc_now()
            return transcript

        async def teardown(self, task: Task, transcript: Transcript | None) -> None:
            await self.db.cleanup()
setup(task)
async
Called before run(). Override for preparation. Default: no-op.
teardown(task, transcript)
async
Called after run(), even on failure. Override for cleanup. Default: no-op.
run(task)
abstractmethod
async
Run the agent on a task and return a transcript.
start_transcript(task)
Helper to create a Transcript with timing started.
record_error(transcript, error)
Helper to record an exception in a transcript.
tracelens.SimpleAdapter
Bases: AgentAdapter
Wraps any async callable as an AgentAdapter.
Useful for testing and simple single-shot agents that take input_data and return a result.
Example
    async def my_fn(input_data: dict) -> dict:
        return {"answer": "42"}

    adapter = SimpleAdapter(my_fn)
run(task)
async
Invoke the wrapped function and build a transcript.
tracelens.HTTPAPIAdapter
Bases: AgentAdapter
Adapter that invokes agents via HTTP API calls.
Supports authentication, retry with exponential backoff, and customizable request/response handling.
Example
    config = HTTPAdapterConfig(
        base_url="https://api.example.com",
        endpoint="/v1/agent/run",
        auth=AuthConfig(scheme=AuthScheme.BEARER, token="sk-..."),
    )
    adapter = HTTPAPIAdapter(config)
    # Use with EvaluationRunner
build_request_body(task)
Build the HTTP request body from a task. Override to customize.
parse_response_body(data)
Extract the agent's answer from response JSON. Override to customize.
run(task)
async
Invoke the HTTP agent and return a transcript.
close()
async
Close the underlying httpx client.
teardown(task, transcript)
async
No-op per trial; call close() explicitly when done with all trials.
tracelens.HTTPAdapterConfig
Bases: BaseModel
Full configuration for HTTPAPIAdapter.
tracelens.AuthConfig
Bases: BaseModel
Authentication configuration for HTTP requests.
apply_to_headers(headers)
Apply auth config to a headers dict, returning the updated dict.
tracelens.AuthScheme
Bases: StrEnum
Supported authentication schemes.
tracelens.RetryConfig
Bases: BaseModel
Retry configuration with exponential backoff.
tracelens.EvaluationRunner
Runs evaluations with concurrency control and timeout enforcement.
Example
    runner = EvaluationRunner(
        adapter=my_adapter,
        graders=[quality_grader, safety_grader],
        config=RunnerConfig(num_runs=5, max_concurrency=10),
    )
    batch = await runner.run(eval_set)
    print(f"Pass rate: {batch.pass_rate:.2%}")
run(eval_set)
async
Run all tasks × runs and grade results.
tracelens.RunnerConfig
dataclass
Configuration for the evaluation runner.
Graders: base classes
tracelens.Grader
Bases: ABC
Abstract base class for all graders.
Graders evaluate agent outputs (Transcripts) and produce Outcomes. Subclass either CodeGrader or LLMGrader for specific implementations.
Example
    class MyGrader(CodeGrader):
        def compute_metrics(self, transcript, task):
            return {"accuracy": 0.95}

        def determine_pass(self, metrics, task):
            return metrics["accuracy"] >= 0.9, metrics["accuracy"]
grader_type
abstractmethod
property
Return the type of this grader.
is_deterministic
property
Whether this grader produces deterministic results.
requires_llm
property
Whether this grader requires LLM calls.
requires_human
property
Whether this grader requires human input.
role
property
Role of this grader in composite scoring.
is_must_pass
property
Whether this grader must pass for the trial to pass.
is_score_contributor
property
Whether this grader contributes to the score average.
policy
property
Three-way eval policy for this grader.
is_gate
property
Whether this grader is a gate (fails CI on violation).
is_warn
property
Whether this grader is a warning.
is_track
property
Whether this grader is tracking-only.
grade(transcript, task)
abstractmethod
async
Grade the transcript for the given task.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| transcript | Transcript | The agent's execution record | required |
| task | Task | The task being evaluated | required |
Returns:
| Type | Description |
|---|---|
| Outcome | An Outcome with pass/fail, score, and metrics |
create_outcome(trial_id, passed, score, metrics=None, feedback=None, **kwargs)
Helper to create an Outcome with common fields.
tracelens.CodeGrader
Bases: Grader
Base class for deterministic code-based graders.
CodeGraders compute metrics from the transcript and then determine pass/fail based on those metrics. They are deterministic: the same input always produces the same output.
Use for: objective metrics (Sharpe ratio, accuracy, latency)
Example
    class FinancialGrader(CodeGrader):
        def compute_metrics(self, transcript, task):
            returns = transcript.final_output["returns"]
            return {
                "sharpe_ratio": calculate_sharpe(returns),
                "max_drawdown": calculate_max_dd(returns),
            }

        def determine_pass(self, metrics, task):
            passed = metrics["sharpe_ratio"] >= 1.0
            # Clamp to [0, 1]: a negative Sharpe ratio must not yield a negative score.
            score = max(0.0, min(metrics["sharpe_ratio"] / 2.0, 1.0))
            return passed, score
compute_metrics(transcript, task)
abstractmethod
Compute grading metrics from transcript.
Implement this to extract and calculate metrics from the agent's execution record.
determine_pass(metrics, task)
abstractmethod
Determine pass/fail and score from metrics.
Returns:
| Type | Description |
|---|---|
| tuple[bool, float] | Tuple of (passed, score) where score is 0-1 normalized. |
grade(transcript, task)
async
Grade by computing metrics and determining pass/fail.
tracelens.LLMGrader
Bases: Grader
Base class for LLM-as-judge graders.
LLMGraders use an LLM to evaluate agent outputs. They are non-deterministic and require careful prompt engineering.
Use for: subjective quality (specificity, personalization, clarity)
Example
    import json

    class QualityGrader(LLMGrader):
        def build_grading_prompt(self, transcript, task):
            return f'''Evaluate the quality of this output:
    {transcript.final_output}

    Score 1-10 on: clarity, completeness, accuracy
    Return JSON: {{"score": X, "feedback": "..."}}'''

        def parse_llm_response(self, response, task):
            data = json.loads(response)
            score = data["score"] / 10.0
            passed = score >= 0.7
            return passed, score, {}, data["feedback"]
build_grading_prompt(transcript, task)
abstractmethod
Build the prompt for LLM grading.
Implement this to create a prompt that instructs the LLM how to evaluate the agent's output.
parse_llm_response(response, task)
abstractmethod
Parse LLM response into structured result.
Returns:
| Type | Description |
|---|---|
| tuple[bool, float, dict[str, float], str] | Tuple of (passed, score, metrics, feedback) |
grade(transcript, task)
async
Grade by calling LLM and parsing response.
Honors GraderConfig: each attempt (LLM call + parse) is bounded
by timeout_seconds; transient failures — including malformed
responses, which a fresh LLM call often fixes — are retried up to
max_retries times with exponential backoff when
retry_on_error is set. NotImplementedError (no provider
configured) is a setup bug, never retried.
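The retry behaviour described above might look roughly like this. It is a sketch under assumptions rather than LLMGrader's actual code: `call_with_retries` is a hypothetical helper, the parameter names borrow from the GraderConfig fields mentioned in the text, and the `base_delay` value and doubling factor are illustrative guesses.

```python
import asyncio
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")


async def call_with_retries(
    attempt: Callable[[], Awaitable[T]],
    *,
    timeout_seconds: float,
    max_retries: int,
    retry_on_error: bool = True,
    base_delay: float = 0.5,  # hypothetical starting delay
) -> T:
    """Run `attempt` with a per-attempt timeout and exponential backoff."""
    retries_left = max_retries if retry_on_error else 0
    delay = base_delay
    while True:
        try:
            # Each attempt (LLM call + parse) gets its own time budget.
            return await asyncio.wait_for(attempt(), timeout=timeout_seconds)
        except NotImplementedError:
            # Setup bug (no provider configured): retrying cannot help.
            raise
        except Exception:
            if retries_left == 0:
                raise
            retries_left -= 1
            await asyncio.sleep(delay)
            delay *= 2  # exponential backoff
```

Timeouts raised by `asyncio.wait_for` fall into the generic `except Exception` branch, so a slow attempt is retried like any other transient failure.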
tracelens.CompositeGrader
Bases: Grader
Combines multiple graders with policy-aware aggregation.
Supports three policy tiers:
- GATE: Any failure causes the entire trial to fail (safety, constraints)
- WARN: Failures emit warnings but don't fail by default
- TRACK: Pure signals, never affect pass/fail
Also supports legacy GraderRole (MUST_PASS maps to GATE behavior).
The score is always a weighted average of all graders regardless of policy.
must_pass_graders
property
Get all must-pass graders (legacy + gate).
score_contributor_graders
property
Get all score-contributor graders (legacy).
gate_graders
property
Get all gate graders.
warn_graders
property
Get all warn graders.
track_graders
property
Get all track graders.
grade(transcript, task)
async
Grade using policy-aware aggregation.
- Run all graders and collect outcomes
- Compute weighted score from all graders
- Only GATE (or legacy MUST_PASS) failures cause overall failure
- WARN failures are recorded in feedback
- TRACK results are pure signals
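The aggregation steps above can be sketched in a few lines. `GraderResult` and `aggregate` are hypothetical stand-ins, and the per-grader `weight` field is an assumption about how the weighted average is parameterized; the real CompositeGrader works on Outcome objects.

```python
from dataclasses import dataclass


@dataclass
class GraderResult:
    """Hypothetical stand-in for one sub-grader's outcome."""
    policy: str  # "gate", "warn", or "track"
    passed: bool
    score: float  # normalized to 0-1
    weight: float = 1.0


def aggregate(results: list[GraderResult]) -> tuple[bool, float, list[str]]:
    """Combine sub-grader results into (passed, score, warnings)."""
    total_weight = sum(r.weight for r in results)
    # Every grader contributes to the weighted score, whatever its policy.
    score = sum(r.score * r.weight for r in results) / total_weight if total_weight else 0.0
    # Only gate failures flip the overall verdict.
    passed = all(r.passed for r in results if r.policy == "gate")
    # Warn failures are surfaced as feedback but do not fail the trial.
    warnings = [
        f"warn grader {i} failed"
        for i, r in enumerate(results)
        if r.policy == "warn" and not r.passed
    ]
    return passed, score, warnings
```

With one passing gate (score 1.0), one failing warn (0.4), and one failing track (0.1), all at equal weight, the result passes with a score of 0.5 and a single warning.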
tracelens.GraderConfig
Bases: BaseModel
Configuration for a grader.
tracelens.GraderType
Bases: str, Enum
Types of graders.
tracelens.EvalPolicy
Bases: str, Enum
Three-way policy for how a grader's result affects CI.
- GATE: Fails CI on violation. Use for hard constraints (valid JSON, no PII).
- WARN: Warns by default; can be configured to fail CI. Use for regressions.
- TRACK: Never fails CI, just produces signals. Use for quality tracking.
tracelens.GraderRole
Bases: str, Enum
Deprecated: use EvalPolicy instead.
Kept for backward compatibility. MUST_PASS maps to GATE, SCORE_CONTRIBUTOR maps to TRACK.
tracelens.BehaviorContract
Bases: BaseModel
Verifiable contract for agent behavior.
to_graders()
Auto-generate a grader suite from this contract.
Each non-empty contract section produces one grader with an appropriate default policy:
- output_schema -> JsonSchemaGrader (GATE)
- output_model -> StructuredOutputGrader (GATE)
- tools_* -> ToolCallGrader (GATE)
- max_latency_ms -> LatencyGrader (WARN)
- max_tokens -> TokenBudgetGrader (WARN)
- must_include/must_not_include -> ContainsGrader (TRACK)
- custom_constraints -> ConstraintGrader (GATE)
Graders: built-in library
See the Grader Library guide for when to reach for each.
tracelens.JsonSchemaGrader
Bases: CodeGrader
Validate transcript.final_output against a JSON Schema.
Uses the jsonschema library for full schema validation.
Default policy: GATE (schema violations block CI).
tracelens.RegexMatchGrader
Bases: CodeGrader
Check whether str(transcript.final_output) matches each regex pattern.
Default policy: TRACK.
tracelens.ContainsGrader
Bases: CodeGrader
Check whether str(transcript.final_output) contains required strings
and does not contain forbidden strings.
Default policy: TRACK.
tracelens.ConstraintGrader
Bases: CodeGrader
Evaluate a list of heterogeneous constraints against the agent output.
Supported constraint types:
- must_include: str(output) must contain the value
- must_not_include: str(output) must not contain the value
- numeric_range: output[field] must be within [min, max]
- enum: output[field] must be one of the allowed values
Default policy: GATE.
tracelens.StructuredOutputGrader
Bases: CodeGrader
Validate transcript.final_output by parsing it with a Pydantic model.
The model is loaded at grading time via tracelens.execution.registry.load_class
from a dotted path such as "myproject.models.ResponseSchema".
Default policy: GATE.
tracelens.LatencyGrader
Bases: CodeGrader
Check that agent execution completes within a time budget.
Pass if transcript.duration_ms <= max_ms.
Score: max(0, 1 - duration_ms / max_ms).
Default policy: WARN.
tracelens.TokenBudgetGrader
Bases: CodeGrader
Check that agent execution stays within a token budget.
Pass if transcript.total_tokens <= max_tokens.
Score: max(0, 1 - total_tokens / max_tokens).
Default policy: WARN.
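LatencyGrader and TokenBudgetGrader share the same pass and score rule, which can be written out as a small function. `budget_check` is a hypothetical helper used only to illustrate the formula, and rejecting a non-positive budget is an assumption of this sketch.

```python
def budget_check(used: float, budget: float) -> tuple[bool, float]:
    """Return (passed, score) for a usage value measured against a budget."""
    if budget <= 0:
        # Assumption: a zero or negative budget is treated as a config error.
        raise ValueError("budget must be positive")
    passed = used <= budget
    # Linear score: 1.0 at zero usage, 0.0 at or beyond the budget.
    score = max(0.0, 1.0 - used / budget)
    return passed, score
```

Note the edge case this makes visible: a run that lands exactly on the budget passes, but with a score of 0.0, so pass/fail and score carry different information.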
tracelens.ToolCallGrader
Bases: CodeGrader
Validate tool call compliance against required/allowed/forbidden lists.
- required_tools: all must be called at least once
- allowed_tools: if provided, only these tools may be called (allowlist)
- forbidden_tools: none of these may be called
Pass if every required tool was called, no tool outside the allowlist was called, and no forbidden tool was called.
Default policy: GATE.
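The three checks combine as plain set operations. The sketch below is illustrative only: `check_tool_calls` and the keys of the returned dict are hypothetical, and the real grader reads tool calls from a Transcript rather than a list of names.

```python
from typing import Optional


def check_tool_calls(
    called: list[str],
    required_tools: Optional[list[str]] = None,
    allowed_tools: Optional[list[str]] = None,
    forbidden_tools: Optional[list[str]] = None,
) -> dict:
    """Evaluate a list of called tool names against the three lists."""
    called_set = set(called)
    missing = sorted(set(required_tools or []) - called_set)
    # The allowlist only applies when one is provided.
    unauthorized = sorted(called_set - set(allowed_tools)) if allowed_tools is not None else []
    forbidden_hit = sorted(called_set & set(forbidden_tools or []))
    return {
        "missing_required": missing,
        "unauthorized": unauthorized,
        "forbidden_called": forbidden_hit,
        "passed": not (missing or unauthorized or forbidden_hit),
    }
```

Distinguishing `allowed_tools=None` (no allowlist) from an empty list (nothing allowed) is a design choice of this sketch.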
tracelens.TraceConsistencyGrader
Bases: CodeGrader
Check agent self-consistency in tool usage and trace patterns.
Metrics:
- tool_error_rate: fraction of tool calls that returned errors
- unused_tool_results: tool calls with non-None results that are not followed by any AGENT_OUTPUT step
- phantom_calls: tools called that are not in expected_tools (if provided)
Pass if tool_error_rate < 0.5 and phantom_calls == 0.
Score: 1 - tool_error_rate.
Default policy: WARN.
tracelens.EventChainVerifier
Bases: CodeGrader
Verifies expected event sequences in transcripts.
Scans transcript steps to match expected events, checks ordering constraints, and scores based on how many events were found.
Example
    config = EventChainConfig(
        expected_events=[
            EventExpectation(
                event_id="search",
                match_type=EventMatchType.TOOL_NAME,
                tool_name="search",
            ),
            EventExpectation(
                event_id="analyze",
                match_type=EventMatchType.TOOL_NAME,
                tool_name="analyze",
                after=["search"],
            ),
        ],
        ordering=OrderingMode.PARTIAL,
    )
    verifier = EventChainVerifier("chain_check", config)
compute_metrics(transcript, task)
Scan transcript and match against expected events.
Uses greedy first-match: each step is matched against the first unmatched expectation it satisfies. Order expectations carefully when multiple expectations could match the same step.
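To see why expectation order matters under greedy first-match, here is a simplified sketch. `Expectation`, `greedy_match`, and `ordering_violations` are hypothetical names; only tool-name matching is modelled, and the way `after` is checked is an assumption rather than the verifier's documented behaviour.

```python
from dataclasses import dataclass, field


@dataclass
class Expectation:
    """Simplified stand-in for EventExpectation (tool-name matching only)."""
    event_id: str
    tool_name: str
    after: list[str] = field(default_factory=list)


def greedy_match(step_tools: list[str], expectations: list[Expectation]) -> dict[str, int]:
    """Map event_id to the index of the step that matched it."""
    matched: dict[str, int] = {}
    for index, tool in enumerate(step_tools):
        for exp in expectations:
            if exp.event_id not in matched and exp.tool_name == tool:
                matched[exp.event_id] = index
                break  # the step is consumed by the first expectation it satisfies
    return matched


def ordering_violations(matched: dict[str, int], expectations: list[Expectation]) -> list[str]:
    """Return event_ids that did not occur strictly after all of their prerequisites."""
    violations = []
    for exp in expectations:
        if exp.event_id not in matched:
            continue
        for prerequisite in exp.after:
            if prerequisite not in matched or matched[prerequisite] >= matched[exp.event_id]:
                violations.append(exp.event_id)
                break
    return violations
```

For steps `search, analyze, search` and two expectations that both match `search`, listing the unconstrained one first lets the later `search` step satisfy the expectation that must come after `analyze`. Listing them the other way round makes the first step consume the constrained expectation, which then fails its ordering check.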
determine_pass(metrics, task)
Determine pass/fail from event matching metrics.
tracelens.EventChainConfig
Bases: BaseModel
Configuration for EventChainVerifier.
tracelens.EventExpectation
Bases: BaseModel
An expected event in the transcript.
tracelens.EventMatchType
Bases: StrEnum
How to match a transcript step against an expectation.
tracelens.OrderingMode
Bases: StrEnum
How to enforce ordering of matched events.
Statistics
See pass@k vs pass^k and Statistical Comparison for the concepts.
tracelens.pass_at_k(n, c, k)
Calculate pass@k metric.
Estimates the probability that at least one of k samples passes, given n total samples with c correct. Uses an unbiased estimator.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| n | int | Total number of samples | required |
| c | int | Number of correct/passing samples | required |
| k | int | Number of samples to consider | required |
Returns:
| Type | Description |
|---|---|
| float | Probability of at least one pass in k samples (0.0 to 1.0) |
Example
    >>> pass_at_k(10, 7, 3)
    0.9916...  # Very likely at least 1 of 3 passes
    >>> pass_at_k(10, 1, 5)
    0.5  # 50% chance at least 1 of 5 passes
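For reference, the commonly used combinatorial form of the unbiased estimator is 1 - C(n - c, k) / C(n, k). The sketch below implements that form; whether tracelens computes it exactly this way, and how it treats k > n, are assumptions, and `pass_at_k_sketch` is a hypothetical name.

```python
from math import comb


def pass_at_k_sketch(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: 1 - C(n - c, k) / C(n, k)."""
    if not 0 <= c <= n or not 1 <= k <= n:
        # Assumption: out-of-range inputs are rejected rather than clamped.
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # math.comb returns 0 when k > n - c, so the estimate is exactly 1.0
    # whenever there are fewer than k failing samples.
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With n=10, c=1, k=5 this gives 1 - 126/252 = 0.5, matching the second example above.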
tracelens.pass_at_k_estimator(results_per_task, k)
Compute pass@k across multiple tasks.
For each task, computes pass@k using available samples. Returns the average pass@k across all tasks.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| results_per_task | dict[str, list[bool]] | Dict mapping task_id to list of pass/fail booleans | required |
| k | int | Number of samples to consider | required |
Returns:
| Type | Description |
|---|---|
| float | Average pass@k across all tasks |
Example
    >>> results = {
    ...     "task1": [True, True, False, True, True],
    ...     "task2": [False, True, False, False, True],
    ... }
    >>> pass_at_k_estimator(results, k=3)
    0.95  # task1: 1.0, task2: 0.9
tracelens.PassAtKAnalyzer
Analyzer for pass@k capability metrics.
Computes pass@k for multiple k values and provides confidence intervals.
Example
    analyzer = PassAtKAnalyzer(k_values=[1, 3, 5, 10])
    results = analyzer.analyze(pass_results_by_task)
    print(results)
    # {"pass@1": 0.6, "pass@3": 0.85, "pass@5": 0.95, "pass@10": 0.99}
__init__(k_values=None)
Initialize analyzer with k values to compute.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| k_values | list[int] \| None | List of k values for pass@k. Default: [1, 5, 10] | None |
analyze(results_per_task)
Compute pass@k for multiple k values.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| results_per_task | dict[str, list[bool]] | Dict mapping task_id to list of pass/fail booleans | required |
Returns:
| Type | Description |
|---|---|
| dict[str, float] | Dict mapping "pass@k" to computed value |
compute_confidence_interval(results_per_task, k, confidence=0.95, n_bootstrap=1000)
Compute bootstrap confidence interval for pass@k.
Uses bootstrap resampling over tasks to estimate the confidence interval for pass@k.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| results_per_task | dict[str, list[bool]] | Dict mapping task_id to list of pass/fail booleans | required |
| k | int | k value for pass@k | required |
| confidence | float | Confidence level (default 0.95 for 95% CI) | 0.95 |
| n_bootstrap | int | Number of bootstrap samples | 1000 |
Returns:
| Type | Description |
|---|---|
| tuple[float, float] | Tuple of (lower_bound, upper_bound) |
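Resampling over tasks means drawing whole tasks with replacement and recomputing the metric on each draw. The sketch below shows that idea with a percentile interval. `bootstrap_ci_over_tasks` and `mean_pass_rate` are hypothetical names, the percentile indexing is one simple convention among several, and a per-task pass rate stands in for pass@k to keep the example short.

```python
import random
from typing import Callable, Optional, Sequence


def bootstrap_ci_over_tasks(
    per_task_results: Sequence[list[bool]],
    metric: Callable[[Sequence[list[bool]]], float],
    confidence: float = 0.95,
    n_bootstrap: int = 1000,
    seed: Optional[int] = None,
) -> tuple[float, float]:
    """Percentile bootstrap interval, resampling whole tasks with replacement."""
    rng = random.Random(seed)
    tasks = list(per_task_results)
    estimates = sorted(
        metric(rng.choices(tasks, k=len(tasks))) for _ in range(n_bootstrap)
    )
    alpha = 1.0 - confidence
    lower_index = int((alpha / 2) * (n_bootstrap - 1))
    upper_index = int((1 - alpha / 2) * (n_bootstrap - 1))
    return estimates[lower_index], estimates[upper_index]


def mean_pass_rate(per_task_results: Sequence[list[bool]]) -> float:
    """Stand-in metric: average per-task pass rate."""
    return sum(sum(r) / len(r) for r in per_task_results) / len(per_task_results)
```

Resampling tasks rather than individual runs keeps each task's runs together, so the interval reflects variation across tasks.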
analyze_with_ci(results_per_task, confidence=0.95, n_bootstrap=1000)
Compute pass@k with confidence intervals.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| results_per_task | dict[str, list[bool]] | Dict mapping task_id to list of pass/fail booleans | required |
| confidence | float | Confidence level (default 0.95) | 0.95 |
| n_bootstrap | int | Number of bootstrap samples | 1000 |
Returns:
| Type | Description |
|---|---|
| dict[str, dict[str, float]] | Dict mapping "pass@k" to {"value": ..., "lower": ..., "upper": ...} |
tracelens.pass_to_k(results, k)
Calculate pass^k (consistency) metric.
Measures the proportion of k-length windows where all samples pass. A sliding window approach is used.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| results | list[bool] | List of pass/fail booleans from multiple runs | required |
| k | int | Number of consecutive passes required | required |
Returns:
| Type | Description |
|---|---|
| float | Proportion of k-length windows where all samples pass (0.0 to 1.0) |
Example
    >>> pass_to_k([True, True, True, True, True], 3)
    1.0  # All windows of 3 pass
    >>> pass_to_k([True, True, True, False, True], 3)
    0.333...  # Only 1 of 3 windows passes
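The sliding-window definition can be written out directly. This sketch follows the description above and is not the library's code: `pass_to_k_sketch` is a hypothetical name, and returning 0.0 when there are fewer than k samples is an assumption.

```python
def pass_to_k_sketch(results: list[bool], k: int) -> float:
    """Fraction of length-k sliding windows in which every run passed."""
    if k < 1:
        raise ValueError("k must be at least 1")
    n_windows = len(results) - k + 1
    if n_windows <= 0:
        # Assumption: too few samples yields 0.0 rather than an error.
        return 0.0
    passing = sum(all(results[i:i + k]) for i in range(n_windows))
    return passing / n_windows
```

With five runs and k=3 there are three windows (runs 1-3, 2-4, and 3-5). A single failure in run 4 breaks the last two of them, leaving 1 passing window out of 3.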
tracelens.pass_to_k_estimator(results_per_task, k)
Compute pass^k (consistency) across multiple tasks.
For each task with enough samples, computes pass^k. Returns the average pass^k across all eligible tasks.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| results_per_task | dict[str, list[bool]] | Dict mapping task_id to list of pass/fail booleans | required |
| k | int | Number of consecutive passes required | required |
Returns:
| Type | Description |
|---|---|
| float | Average pass^k across all tasks with >= k samples |
Example
    >>> results = {
    ...     "task1": [True, True, True, True, True],
    ...     "task2": [True, True, True, False, True],
    ... }
    >>> pass_to_k_estimator(results, k=3)
    0.666...  # task1: 1.0, task2: 0.333
tracelens.ConsistencyAnalyzer
Analyzer for pass^k consistency metrics.
Computes pass^k for multiple k values and provides reliability scoring.
Example
    analyzer = ConsistencyAnalyzer(k_values=[2, 3, 5])
    results = analyzer.analyze(pass_results_by_task)
    print(results)
    # {"pass^2": 0.8, "pass^3": 0.6, "pass^5": 0.3}
__init__(k_values=None)
Initialize analyzer with k values to compute.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| k_values | list[int] \| None | List of k values for pass^k. Default: [2, 3, 5] | None |
analyze(results_per_task)
Compute pass^k for multiple k values.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| results_per_task | dict[str, list[bool]] | Dict mapping task_id to list of pass/fail booleans | required |
Returns:
| Type | Description |
|---|---|
| dict[str, float] | Dict mapping "pass^k" to computed value |
compute_reliability_score(results_per_task)
Compute overall reliability score.
Combines pass^k metrics weighted by k to give higher weight to longer consistent runs. A higher score indicates more reliable/consistent performance.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| results_per_task | dict[str, list[bool]] | Dict mapping task_id to list of pass/fail booleans | required |
Returns:
| Type | Description |
|---|---|
| float | Weighted reliability score (0.0 to 1.0) |
compute_stability_metrics(results_per_task)
Compute additional stability metrics.
Returns:
| Type | Description |
|---|---|
| dict[str, float] | Dict mapping each stability metric name to its value |
tracelens.bootstrap_ci(values, confidence=0.95, n_bootstrap=10000, statistic='mean', seed=None)
Compute bootstrap confidence interval for a statistic.
Uses percentile bootstrap method for simplicity and robustness.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| values | list[float] \| ndarray | Sample values | required |
| confidence | float | Confidence level (default 0.95 for 95% CI) | 0.95 |
| n_bootstrap | int | Number of bootstrap samples | 10000 |
| statistic | str | "mean", "median", or "std" | 'mean' |
| seed | int \| None | Random seed for reproducibility | None |
Returns:
| Type | Description |
|---|---|
| tuple[float, float, float] | Tuple of (point_estimate, lower_bound, upper_bound) |
tracelens.estimate_metric(values, confidence=0.95, n_bootstrap=10000, seed=None)
Estimate a metric with confidence interval.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| values | list[float] \| ndarray | Sample values | required |
| confidence | float | Confidence level | 0.95 |
| n_bootstrap | int | Bootstrap samples | 10000 |
| seed | int \| None | Random seed | None |
Returns:
| Type | Description |
|---|---|
| MetricEstimate | MetricEstimate with mean, std, n, and CI |
tracelens.compare_metrics(baseline_values, current_values, confidence=0.95, n_bootstrap=10000, compute_p_value=False, seed=None)
Compare current metrics against baseline with statistical inference.
Computes bootstrap CI for the difference and determines statistical significance.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| baseline_values | list[float] \| ndarray | Baseline sample values | required |
| current_values | list[float] \| ndarray | Current sample values | required |
| confidence | float | Confidence level for CI | 0.95 |
| n_bootstrap | int | Bootstrap samples | 10000 |
| compute_p_value | bool | Whether to compute permutation p-value | False |
| seed | int \| None | Random seed | None |
Returns:
| Type | Description |
|---|---|
| ComparisonResult | ComparisonResult with full statistical analysis |
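A bootstrap interval for the difference in means can be sketched with the standard library alone. This illustrates the general technique, not compare_metrics itself: `bootstrap_mean_difference` is a hypothetical name, each group is resampled independently, and "significant" is taken to mean that the percentile interval excludes zero, which is an assumption about the decision rule.

```python
import random
from statistics import fmean
from typing import Optional


def bootstrap_mean_difference(
    baseline: list[float],
    current: list[float],
    confidence: float = 0.95,
    n_bootstrap: int = 10_000,
    seed: Optional[int] = None,
) -> tuple[float, float, float, bool]:
    """Return (difference, lower, upper, significant) for mean(current) - mean(baseline)."""
    rng = random.Random(seed)
    diffs = sorted(
        fmean(rng.choices(current, k=len(current)))
        - fmean(rng.choices(baseline, k=len(baseline)))
        for _ in range(n_bootstrap)
    )
    alpha = 1.0 - confidence
    lower = diffs[int((alpha / 2) * (n_bootstrap - 1))]
    upper = diffs[int((1 - alpha / 2) * (n_bootstrap - 1))]
    # Treat the change as significant when the interval excludes zero.
    significant = lower > 0 or upper < 0
    return fmean(current) - fmean(baseline), lower, upper, significant
```

For a higher-is-better metric, an interval that lies entirely below zero would correspond to a regression, and one entirely above zero to an improvement.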
tracelens.compare_to_baseline_summary(baseline_mean, baseline_std, baseline_n, current_mean, current_std, current_n, confidence=0.95)
¶
Compare metrics when only summary statistics are available.
Uses Welch's t-test approximation for the CI when raw data is not available.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `baseline_mean` | `float` | Baseline mean | required |
| `baseline_std` | `float` | Baseline std deviation | required |
| `baseline_n` | `int` | Baseline sample size | required |
| `current_mean` | `float` | Current mean | required |
| `current_std` | `float` | Current std deviation | required |
| `current_n` | `int` | Current sample size | required |
| `confidence` | `float` | Confidence level | `0.95` |
Returns:
| Type | Description |
|---|---|
| `ComparisonResult` | ComparisonResult (note: effect size may be less accurate) |
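A rough sketch of a Welch-style interval built from summary statistics only is shown below. It uses the Welch standard error and the Welch-Satterthwaite degrees of freedom, but, because the standard library has no Student's t quantile, it substitutes a normal quantile for the critical value. That substitution and the function name are assumptions of this example, not a description of what tracelens does.

```python
import math
from statistics import NormalDist


def welch_difference_ci(baseline_mean, baseline_std, baseline_n,
                        current_mean, current_std, current_n, confidence=0.95):
    """Return (difference, lower, upper, degrees_of_freedom).

    Hypothetical helper; requires both sample sizes to be at least 2.
    """
    var_b = baseline_std ** 2 / baseline_n
    var_c = current_std ** 2 / current_n
    standard_error = math.sqrt(var_b + var_c)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    dof = (var_b + var_c) ** 2 / (
        var_b ** 2 / (baseline_n - 1) + var_c ** 2 / (current_n - 1)
    )
    # Assumption: normal quantile stands in for the t quantile (fine for large dof).
    critical = NormalDist().inv_cdf(0.5 + confidence / 2)
    difference = current_mean - baseline_mean
    margin = critical * standard_error
    return difference, difference - margin, difference + margin, dof


diff, lower, upper, dof = welch_difference_ci(0.80, 0.05, 20, 0.74, 0.06, 20)
```

With small samples the normal quantile makes the interval slightly too narrow, so a real implementation would look up the t quantile at `dof` degrees of freedom instead.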
tracelens.MetricEstimate
dataclass
¶
A metric estimate with uncertainty quantification.
Stores the point estimate along with confidence bounds and sample statistics for research-grade reporting.
tracelens.ComparisonResult
dataclass
¶
Result of comparing two metric estimates.
Contains statistical test results for determining if the difference is significant.
is_regression
property
¶
Check if this represents a statistically significant regression.
A regression occurs when the difference is significantly negative (for higher-is-better metrics).
is_improvement
property
¶
Check if this represents a statistically significant improvement.
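One plausible reading of `is_regression` and `is_improvement`, assuming the result carries a confidence interval for current minus baseline on a higher-is-better metric, is that the whole interval must sit on one side of zero. The class and field names below (`DiffInterval`, `ci_lower`, `ci_upper`) are hypothetical and only illustrate that rule.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DiffInterval:
    """Hypothetical stand-in: CI for (current - baseline), higher is better."""

    ci_lower: float
    ci_upper: float

    @property
    def is_regression(self) -> bool:
        # Entire interval below zero: the drop is unlikely to be noise.
        return self.ci_upper < 0

    @property
    def is_improvement(self) -> bool:
        # Entire interval above zero: the gain is unlikely to be noise.
        return self.ci_lower > 0


clear_drop = DiffInterval(ci_lower=-0.09, ci_upper=-0.02)
inconclusive = DiffInterval(ci_lower=-0.03, ci_upper=0.04)
```

An interval that straddles zero is neither a regression nor an improvement under this rule.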
effect_magnitude
property
¶
Classify effect size magnitude (Cohen's conventions).
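Cohen's conventional cut-offs for the standardized mean difference are 0.2 (small), 0.5 (medium), and 0.8 (large). The sketch below computes Cohen's d with a pooled standard deviation and buckets it; the function names and the exact label strings (including "negligible" below 0.2) are assumptions, since the labels tracelens returns are not listed here.

```python
import math
import statistics


def cohens_d(baseline, current):
    """Standardized mean difference using the pooled standard deviation."""
    n_b, n_c = len(baseline), len(current)
    pooled_variance = (
        (n_b - 1) * statistics.variance(baseline)
        + (n_c - 1) * statistics.variance(current)
    ) / (n_b + n_c - 2)
    return (statistics.fmean(current) - statistics.fmean(baseline)) / math.sqrt(
        pooled_variance
    )


def classify_effect(d):
    """Bucket |d| by Cohen's conventional thresholds (label strings are assumed)."""
    size = abs(d)
    if size < 0.2:
        return "negligible"
    if size < 0.5:
        return "small"
    if size < 0.8:
        return "medium"
    return "large"


d = cohens_d([1.0, 2.0, 3.0], [2.0, 3.0, 4.0])
```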
to_dict()
¶
Convert to dictionary for serialization.
tracelens.LatencyAnalyzer
¶
tracelens.LatencyMetrics
¶
Bases: BaseModel
Latency metrics for a single transcript's streaming data.
tracelens.AggregateLatencyMetrics
¶
Bases: BaseModel
Aggregated latency metrics across multiple transcripts.
Baselines & regression detection¶
See the Baseline Regression Tutorial.
tracelens.BaselineManager
¶
Manages baseline storage and retrieval.
Baselines are stored in a JSON file and can be versioned with git for tracking changes over time.
Example

    from tracelens import BaselineManager

    manager = BaselineManager("baselines/baselines.json")

    # Get existing baseline
    baseline = manager.get_baseline("btc_backtest")

    # Update baseline
    manager.update_baseline("btc_backtest", {"sharpe_ratio": 1.5})

    # Save changes
    manager.save()
__init__(baselines_path)
¶
Initialize the baseline manager.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `baselines_path` | `str \| Path` | Path to the baselines JSON file | required |
save()
¶
Save baselines to JSON file.
get_baseline(task_id)
¶
Get baseline for a task.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `task_id` | `str` | The task identifier | required |
Returns:
| Type | Description |
|---|---|
| `TaskBaseline \| None` | TaskBaseline if found, None otherwise |
set_baseline(baseline)
¶
Set baseline for a task.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `baseline` | `TaskBaseline` | The task baseline to store | required |
update_baseline(task_id, metrics, metric_stds=None, sample_size=1, keep_thresholds=True)
¶
Update or create a baseline with new metric values.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `task_id` | `str` | The task identifier | required |
| `metrics` | `dict[str, float]` | Dict of metric_name -> value | required |
| `metric_stds` | `dict[str, float] \| None` | Optional dict of metric_name -> std deviation | `None` |
| `sample_size` | `int` | Number of samples used to compute metrics | `1` |
| `keep_thresholds` | `bool` | Keep existing thresholds when updating | `True` |
Returns:
| Type | Description |
|---|---|
| `TaskBaseline` | The updated TaskBaseline |
list_tasks()
¶
List all task IDs with baselines.
compare_to_baseline(task_id, current_metrics)
¶
Compare current metrics to baseline.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `task_id` | `str` | The task identifier | required |
| `current_metrics` | `dict[str, float]` | Dict of metric_name -> current value | required |
Returns:
| Type | Description |
|---|---|
| `dict[str, Any]` | Dict of metric_name -> comparison dict |
create_canary_baseline(task_id, metrics, fingerprint, metric_stds=None, sample_size=1, task_name=None)
¶
Create a protected canary baseline.
Canary baselines never auto-update and are tied to a specific DecisionSpec fingerprint. Use for safety-critical metrics.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `task_id` | `str` | The task identifier | required |
| `metrics` | `dict[str, float]` | Dict of metric_name -> value | required |
| `fingerprint` | `str` | DecisionSpec fingerprint (required for canary) | required |
| `metric_stds` | `dict[str, float] \| None` | Optional dict of metric_name -> std deviation | `None` |
| `sample_size` | `int` | Number of samples used to compute metrics | `1` |
| `task_name` | `str \| None` | Optional human-readable name | `None` |
Returns:
| Type | Description |
|---|---|
| `TaskBaseline` | The created canary baseline |
Raises:
| Type | Description |
|---|---|
| `ValueError` | If fingerprint is not provided |
create_capability_baseline(task_id, metrics, metric_stds=None, sample_size=1, task_name=None, promotion_policy=None, fingerprint=None)
¶
Create a capability baseline that can auto-update.
Capability baselines track current agent capability and can be promoted when performance improves.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `task_id` | `str` | The task identifier | required |
| `metrics` | `dict[str, float]` | Dict of metric_name -> value | required |
| `metric_stds` | `dict[str, float] \| None` | Optional dict of metric_name -> std deviation | `None` |
| `sample_size` | `int` | Number of samples used to compute metrics | `1` |
| `task_name` | `str \| None` | Optional human-readable name | `None` |
| `promotion_policy` | `PromotionPolicy \| None` | Custom promotion policy (default allows auto-promotion) | `None` |
| `fingerprint` | `str \| None` | Optional DecisionSpec fingerprint | `None` |
Returns:
| Type | Description |
|---|---|
| `TaskBaseline` | The created capability baseline |
try_promote(task_id, current_metrics, metric_stds=None, sample_size=1, fingerprint=None)
¶
Try to promote a baseline if criteria are met.
This method checks if the baseline can be promoted and performs the promotion if allowed.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `task_id` | `str` | The task identifier | required |
| `current_metrics` | `dict[str, float]` | New metric values | required |
| `metric_stds` | `dict[str, float] \| None` | Optional standard deviations | `None` |
| `sample_size` | `int` | Number of samples | `1` |
| `fingerprint` | `str \| None` | New fingerprint for the promoted baseline | `None` |
Returns:
| Type | Description |
|---|---|
| `tuple[bool, str]` | Tuple of (was_promoted, reason) |
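The check-then-promote flow can be pictured as follows for a single higher-is-better metric. This is a simplified sketch under stated assumptions: the function name, the string baseline types, and the reason messages are hypothetical, and only `min_improvement_relative` is taken from the `PromotionPolicy` usage shown elsewhere on this page.

```python
def try_promote_sketch(baseline_type, baseline_value, current_value,
                       min_improvement_relative=0.05):
    """Return (was_promoted, reason) for one higher-is-better metric.

    Hypothetical helper for illustration; not part of tracelens.
    """
    if baseline_type == "canary":
        # Canary baselines are protected and never auto-update.
        return False, "canary baselines never auto-update"
    if baseline_value <= 0:
        return False, "relative improvement is undefined for a non-positive baseline"
    relative_gain = (current_value - baseline_value) / baseline_value
    if relative_gain < min_improvement_relative:
        return False, (
            f"improvement {relative_gain:.1%} is below the required "
            f"{min_improvement_relative:.1%}"
        )
    return True, f"promoted after {relative_gain:.1%} improvement"


canary_result = try_promote_sketch("canary", 0.80, 0.95)
small_gain = try_promote_sketch("capability", 0.80, 0.82)
large_gain = try_promote_sketch("capability", 0.80, 0.88)
```

Returning the reason alongside the boolean lets CI logs explain why a baseline was left unchanged.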
force_promote(task_id, current_metrics, metric_stds=None, sample_size=1, reason='manual', fingerprint=None)
¶
Force promote a baseline, bypassing policy checks.
Use this for manual promotions or emergency updates. Even canary baselines can be force-promoted.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `task_id` | `str` | The task identifier | required |
| `current_metrics` | `dict[str, float]` | New metric values | required |
| `metric_stds` | `dict[str, float] \| None` | Optional standard deviations | `None` |
| `sample_size` | `int` | Number of samples | `1` |
| `reason` | `str` | Reason for the forced promotion | `'manual'` |
| `fingerprint` | `str \| None` | New fingerprint for the promoted baseline | `None` |
Returns:
| Type | Description |
|---|---|
| `TaskBaseline` | The promoted baseline |
Raises:
| Type | Description |
|---|---|
| `ValueError` | If no baseline exists |
list_canary_baselines()
¶
List all canary (protected) baseline task IDs.
list_capability_baselines()
¶
List all capability baseline task IDs.
list_stale_baselines()
¶
List all stale baseline task IDs.
get_baseline_summary(task_id)
¶
Get a summary of a baseline for reporting.
compare_to_baseline_with_ci(task_id, current_values, confidence=0.95, n_bootstrap=10000)
¶
Compare current metrics to baseline with bootstrap CI.
Uses statistical inference to determine if differences are significant. This is the research-grade comparison method.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `task_id` | `str` | The task identifier | required |
| `current_values` | `dict[str, list[float]]` | Dict of metric_name -> list of sample values | required |
| `confidence` | `float` | Confidence level (default 0.95 for 95% CI) | `0.95` |
| `n_bootstrap` | `int` | Number of bootstrap samples | `10000` |
Returns:
| Type | Description |
|---|---|
| `dict[str, Any]` | Dict of metric_name -> comparison result |
tracelens.BaselineType
¶
Bases: str, Enum
Type of baseline determining update semantics.
CANARY: Protected baseline that never auto-updates.
- Represents the absolute performance floor
- Only manually updated after explicit approval
- Tied to a specific DecisionSpec fingerprint
- Use for safety-critical or business-critical metrics
CAPABILITY: Baseline that can auto-update on improvements.
- Tracks the agent's current capability ceiling
- Auto-updates when performance improves significantly
- Maintains version history for rollback
- Use for tracking progress and catching regressions
Baseline for experimental features.
- Loose thresholds, high variance expected
- Auto-updates more aggressively
- Use during active development
tracelens.PromotionPolicy
¶
Bases: BaseModel
Policy for automatic baseline promotion.
Controls when and how baselines can be automatically updated.
tracelens.MetricBaseline
¶
Bases: BaseModel
Baseline for a single metric.
Stores the expected value, standard deviation, and thresholds for regression detection.
tracelens.TaskBaseline
¶
Bases: BaseModel
Baseline for a complete task.
Groups metric baselines together with metadata about when the baseline was created.
Supports two types of baselines:
- CANARY: Protected, never auto-updates (safety floor)
- CAPABILITY: Can auto-update on improvements
Example

    from tracelens import BaselineType, PromotionPolicy, TaskBaseline

    # Create a canary baseline (protected)
    baseline = TaskBaseline(
        task_id="safety_check",
        baseline_type=BaselineType.CANARY,
        fingerprint="abc123def456",  # Tied to specific config
    )

    # Create a capability baseline (can auto-update)
    baseline = TaskBaseline(
        task_id="quality_score",
        baseline_type=BaselineType.CAPABILITY,
        promotion_policy=PromotionPolicy(min_improvement_relative=0.05),
    )
is_canary
property
¶
Check if this is a canary (protected) baseline.
allows_auto_promotion
property
¶
Check if this baseline allows automatic promotion.
is_stale
property
¶
Check if baseline is stale based on max_age_days policy.
in_cooldown
property
¶
Check if baseline is in promotion cooldown period.
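The two time-based checks reduce to simple timestamp arithmetic. In the sketch below, `max_age_days` comes from the description above, while the function signatures, `cooldown_hours`, and `last_promoted_at` are hypothetical names chosen for the example.

```python
from datetime import datetime, timedelta, timezone


def is_stale(created_at, max_age_days, now):
    """True when the baseline is older than max_age_days (None disables the check)."""
    if max_age_days is None:
        return False
    return now - created_at > timedelta(days=max_age_days)


def in_cooldown(last_promoted_at, cooldown_hours, now):
    """True when the last promotion happened less than cooldown_hours ago."""
    if last_promoted_at is None:
        return False  # never promoted, so nothing to cool down from
    return now - last_promoted_at < timedelta(hours=cooldown_hours)


now = datetime(2025, 1, 31, tzinfo=timezone.utc)
created = datetime(2025, 1, 1, tzinfo=timezone.utc)
stale_after_two_weeks = is_stale(created, max_age_days=14, now=now)
cooling_down = in_cooldown(now - timedelta(hours=2), cooldown_hours=24, now=now)
```

Passing `now` explicitly, rather than reading the clock inside the function, keeps both checks deterministic in tests.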
add_metric(metric_name, value, std=0.0, sample_size=1, absolute_threshold=None, relative_threshold=None, higher_is_better=True)
¶
Add or update a metric baseline.
get_metric(metric_name)
¶
Get a specific metric baseline.
can_promote(current_metrics, sample_size=1)
¶
Check if baseline can be promoted with new metrics.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `current_metrics` | `dict[str, float]` | New metric values | required |
| `sample_size` | `int` | Number of samples used to compute metrics | `1` |
Returns:
| Type | Description |
|---|---|
| `tuple[bool, str]` | Tuple of (can_promote, reason) |
promote(current_metrics, metric_stds=None, sample_size=1, reason='auto', fingerprint=None)
¶
Promote baseline to new values.
Archives current version and updates metrics.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `current_metrics` | `dict[str, float]` | New metric values | required |
| `metric_stds` | `dict[str, float] \| None` | Optional standard deviations | `None` |
| `sample_size` | `int` | Number of samples | `1` |
| `reason` | `str` | Reason for promotion | `'auto'` |
| `fingerprint` | `str \| None` | Optional new fingerprint | `None` |
tracelens.RegressionDetector
¶
Detects regressions between baseline and current results.
Uses statistical tests to determine if observed differences are significant.
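Example

    import sys

    from tracelens import RegressionDetector

    # `baseline` is a TaskBaseline; `current_results` is a list of result dicts
    detector = RegressionDetector(significance_level=0.05)
    report = detector.compare(baseline, current_results)

    if report.should_block_ci():
        sys.exit(1)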
__init__(significance_level=0.05, min_delta_percent=5.0, noise_band_absolute=DEFAULT_NOISE_BAND_ABSOLUTE, noise_band_aware=True)
¶
Initialize the detector.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `significance_level` | `float` | P-value threshold for significance | `0.05` |
| `min_delta_percent` | `float` | Minimum percentage change to consider | `5.0` |
| `noise_band_absolute` | `float` | Absolute delta below which a regression on a pass-rate-style metric (0-1 scale) is considered "within the infra-noise band" when the baseline and current infra configs don't match. Defaults to 0.03 (3pp), following Anthropic's infra-noise study. | `DEFAULT_NOISE_BAND_ABSOLUTE` |
| `noise_band_aware` | `bool` | If True, `compare_with_specs()` will mark sub-noise-band regressions as `within_noise_band`. | `True` |
compare(baseline, current_results)
¶
Compare current results against baseline.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `baseline` | `TaskBaseline` | The baseline to compare against | required |
| `current_results` | `list[dict[str, Any]]` | List of result dicts, each with metric values | required |
Returns:
| Type | Description |
|---|---|
| `RegressionReport` | RegressionReport with detected regressions |
compare_multiple(baselines, current_results)
¶
Compare multiple tasks against their baselines.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `baselines` | `dict[str, TaskBaseline]` | Dict of task_id -> TaskBaseline | required |
| `current_results` | `dict[str, list[dict[str, Any]]]` | Dict of task_id -> list of result dicts | required |
Returns:
| Type | Description |
|---|---|
| `dict[str, RegressionReport]` | Dict of task_id -> RegressionReport |
compare_with_specs(baseline, current_results, baseline_spec=None, current_spec=None)
¶
Compare with DecisionSpec awareness for infra-noise detection.
Wraps compare() and additionally:
- Diffs the two DecisionSpecs' `infra` sections and records any mismatch in `report.infra_config_mismatch` and `report.infra_config_diff`.
- For each detected regression, if the absolute delta falls within `noise_band_absolute` (default 3pp) AND the infra configs don't match, sets the regression's `within_noise_band` flag to True. Those regressions still show up in the report but are excluded from `blocking_regressions`, so a default `should_block_ci()` call won't gate a merge on ambiguous noise.

When either spec is omitted, this degrades to ordinary `compare()` behavior with `infra_config_mismatch=False`.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `baseline` | `TaskBaseline` | TaskBaseline to compare against. | required |
| `current_results` | `list[dict[str, Any]]` | Current run's metric values. | required |
| `baseline_spec` | `DecisionSpec \| None` | DecisionSpec captured when the baseline was recorded. Optional but enables infra-noise reasoning. | `None` |
| `current_spec` | `DecisionSpec \| None` | DecisionSpec for the current run. Optional but enables infra-noise reasoning. | `None` |
Returns:
| Type | Description |
|---|---|
| `RegressionReport` | RegressionReport with the infra-config mismatch and noise-band annotations described above |
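The noise-band rule for `compare_with_specs()` comes down to a two-part condition: the delta is small and the infra configs differ. The sketch below illustrates it on plain tuples. The helper names and the inclusive comparison (`<=`) are assumptions; the 0.03 default mirrors the documented 3pp band.

```python
def within_noise_band(delta, infra_mismatch, noise_band_absolute=0.03):
    """True when a drop is small enough to be infra noise AND the infra configs differ."""
    return infra_mismatch and abs(delta) <= noise_band_absolute


def blocking(regressions, infra_mismatch):
    """Keep only the (metric, delta) regressions that should still gate a merge."""
    return [
        (metric, delta)
        for metric, delta in regressions
        if not within_noise_band(delta, infra_mismatch)
    ]


regressions = [("pass_rate", -0.02), ("tool_accuracy", -0.08)]
blocking_when_mismatched = blocking(regressions, infra_mismatch=True)
blocking_when_matched = blocking(regressions, infra_mismatch=False)
```

When the infra configs match, nothing is excused: a 2pp drop is treated like any other regression.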
tracelens.RegressionReport
¶
Bases: BaseModel
Complete regression analysis report.
blocking_regressions
property
¶
Regressions that should actually block CI.
Excludes any regression marked within_noise_band — those are
within the infra-noise uncertainty and shouldn't gate merges
until the eval configuration is matched.
should_block_ci(threshold=RegressionSeverity.MODERATE, ignore_noise_band=True)
¶
Determine if CI should be blocked based on severity.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `threshold` | `RegressionSeverity` | Minimum severity to block. Default: MODERATE | `MODERATE` |
| `ignore_noise_band` | `bool` | If True (default), within-noise-band regressions don't count: a 2pp drop under a mismatched infra config is ambiguous and shouldn't auto-gate merges, per Anthropic's infra-noise guidance. Pass False to treat every regression as blocking regardless of noise. | `True` |
Returns:
| Type | Description |
|---|---|
| `bool` | True if CI should be blocked |
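The interaction between `threshold` and `ignore_noise_band` can be sketched as below. The `Severity` enum is a stand-in: only `MODERATE` is named in this reference, so `MINOR`, `SEVERE`, and the integer ordering are assumptions made so severities can be compared.

```python
from enum import IntEnum


class Severity(IntEnum):
    """Hypothetical ordering; only MODERATE is named in the reference."""

    MINOR = 1
    MODERATE = 2
    SEVERE = 3


def should_block_ci(regressions, threshold=Severity.MODERATE, ignore_noise_band=True):
    """regressions: iterable of (severity, within_noise_band) pairs."""
    for severity, within_noise_band in regressions:
        if ignore_noise_band and within_noise_band:
            continue  # ambiguous under mismatched infra, so do not gate on it
        if severity >= threshold:
            return True
    return False


found = [(Severity.SEVERE, True), (Severity.MINOR, False)]
default_decision = should_block_ci(found)
strict_decision = should_block_ci(found, ignore_noise_band=False)
```

In this example the only severe regression sits inside the noise band, so the default call does not block, while the strict call does.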
to_ci_output()
¶
Generate CI-friendly output.
tracelens.RegressionSeverity
¶
Bases: str, Enum
Severity levels for regressions.
tracelens.MetricRegression
¶
Bases: BaseModel
Detected regression in a specific metric.
Reproducibility (DecisionSpec)¶
See Reproducibility & DecisionSpec.
tracelens.DecisionSpec
¶
Bases: BaseModel
Complete specification of all decision-affecting parameters.
A DecisionSpec is an immutable fingerprint of everything that could affect agent behavior. Two runs with the same DecisionSpec fingerprint should produce statistically similar results (given the same task input).
Example

    from tracelens import (
        AgentSpec,
        DecisionSpec,
        EnvironmentSpec,
        ModelConfig,
        PromptSpec,
        ToolSpec,
    )

    spec = DecisionSpec(
        model=ModelConfig(
            provider="anthropic",
            model_id="claude-3-opus-20240229",
            temperature=0.7,
        ),
        prompts=PromptSpec.from_prompts(
            system_prompt="You are a helpful assistant...",
            prompt_template="Given {context}, do {task}...",
        ),
        tools=[
            ToolSpec(name="search", version="1.0"),
            ToolSpec(name="calculator", version="2.1"),
        ],
        agent=AgentSpec(
            agent_name="goal_decomposition",
            agent_version="1.0.0",
        ),
        environment=EnvironmentSpec(
            git_commit="abc123",
            framework_version="0.1.0",
        ),
    )
    print(spec.fingerprint)  # "a1b2c3d4..."
fingerprint
property
¶
Compute SHA-256 fingerprint of the decision spec.
This fingerprint uniquely identifies the configuration. Two runs with the same fingerprint should produce statistically similar results.
fingerprint_short
property
¶
Short version of fingerprint (first 12 characters).
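A common way to build such a fingerprint is to hash a canonical JSON encoding, so that key order does not affect the digest. The sketch below shows that idea on plain dicts; the canonicalization choices (`sort_keys`, compact separators) and the `fingerprint` helper are assumptions, not the documented tracelens algorithm.

```python
import hashlib
import json


def fingerprint(spec: dict) -> str:
    """SHA-256 hex digest of a canonical JSON encoding of the spec."""
    canonical = json.dumps(spec, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


spec_a = {
    "model": {"model_id": "claude-3-opus-20240229", "temperature": 0.7},
    "agent": {"agent_name": "goal_decomposition", "agent_version": "1.0.0"},
}
# Same content with different key order: canonicalization makes the digests match.
spec_b = {
    "agent": {"agent_version": "1.0.0", "agent_name": "goal_decomposition"},
    "model": {"temperature": 0.7, "model_id": "claude-3-opus-20240229"},
}
# Changing a decision-affecting value changes the digest.
spec_c = {**spec_a, "model": {**spec_a["model"], "temperature": 0.2}}

fingerprint_short = fingerprint(spec_a)[:12]
```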
is_compatible_with(other)
¶
Check if two specs are compatible for comparison.
Two specs are compatible if they have the same model and agent, even if prompts or environment differ. This is useful for comparing prompt changes while keeping other factors constant.
diff(other)
¶
Compare two specs and return differences.
Returns dict mapping field names to (self_value, other_value) tuples for fields that differ.
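The return shape described above can be illustrated on flat dicts. The `diff_specs` function is a hypothetical stand-in that works on plain dictionaries rather than `DecisionSpec` objects.

```python
def diff_specs(self_spec: dict, other_spec: dict) -> dict:
    """Map each differing field name to a (self_value, other_value) tuple."""
    differences = {}
    for field in sorted(set(self_spec) | set(other_spec)):
        mine, theirs = self_spec.get(field), other_spec.get(field)
        if mine != theirs:
            differences[field] = (mine, theirs)
    return differences


baseline_spec = {"model_id": "claude-3-opus-20240229", "temperature": 0.7}
current_spec = {"model_id": "claude-3-opus-20240229", "temperature": 0.2}
changes = diff_specs(baseline_spec, current_spec)
```

An empty result means the two specs agree on every field that was compared.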
to_summary()
¶
Create human-readable summary.
tracelens.ModelConfig
¶
Bases: BaseModel
Configuration for the LLM model used.
Captures provider, model version, and all decoding parameters that could affect output.
to_hash_dict()
¶
Return dict of fields that affect output (for hashing).
tracelens.PromptSpec
¶
Bases: BaseModel
Specification of prompts used in the agent.
Stores hashes of prompt templates for traceability without storing the full prompts (which may be long or sensitive).
from_prompts(system_prompt=None, prompt_template=None, prompt_version=None, store_full_prompts=False)
classmethod
¶
Create PromptSpec from actual prompts.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `system_prompt` | `str \| None` | The system prompt text | `None` |
| `prompt_template` | `str \| None` | The prompt template text | `None` |
| `prompt_version` | `str \| None` | Optional version identifier | `None` |
| `store_full_prompts` | `bool` | Whether to store full prompts (default False) | `False` |
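The idea of storing prompt hashes instead of prompt text can be sketched as follows. The dictionary keys (`system_prompt_hash`, `prompt_template_hash`) and the helper names are hypothetical; only the `store_full_prompts` switch and its default come from the signature above.

```python
import hashlib


def hash_prompt(text):
    """SHA-256 hex digest of the prompt text, or None when no prompt is given."""
    if text is None:
        return None
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def prompt_spec_from_prompts(system_prompt=None, prompt_template=None,
                             store_full_prompts=False):
    """Build a dict holding prompt hashes and, optionally, the full prompt text."""
    spec = {
        "system_prompt_hash": hash_prompt(system_prompt),
        "prompt_template_hash": hash_prompt(prompt_template),
    }
    if store_full_prompts:
        # Opt-in only: prompts may be long or sensitive.
        spec["system_prompt"] = system_prompt
        spec["prompt_template"] = prompt_template
    return spec


hashed_only = prompt_spec_from_prompts(
    system_prompt="You are a helpful assistant...",
    prompt_template="Given {context}, do {task}...",
)
with_text = prompt_spec_from_prompts(
    system_prompt="You are a helpful assistant...",
    store_full_prompts=True,
)
```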
to_hash_dict()
¶
Return dict of fields for hashing.
tracelens.ToolSpec
¶
Bases: BaseModel
Specification of a tool available to the agent.
Captures tool identity and version for reproducibility.
to_hash_dict()
¶
Return dict for hashing.
tracelens.AgentSpec
¶
Bases: BaseModel
Specification of the agent being evaluated.
Captures agent identity, version, and graph structure.
to_hash_dict()
¶
Return dict for hashing.
tracelens.InfraConfig
¶
Bases: BaseModel
Infrastructure / runtime-environment configuration.
Agentic evals are end-to-end system tests: the runtime environment is part of the problem-solving process. Resource limits, time budgets, and concurrency all influence what strategies an agent can use, which means infrastructure configuration is a first-class experimental variable — not passive scaffolding.
Anthropic's "Quantifying infrastructure noise in agentic coding evals" (Feb 2026) showed that infrastructure config alone can swing Terminal-Bench 2.0 scores by ~6 percentage points, often more than the leaderboard gap between frontier models. Their recommendations are baked into the shape of this spec:
- Record both a guaranteed allocation and a separate hard kill threshold, per task (see `cpu_guaranteed`/`cpu_hard_limit` and `memory_guaranteed_mb`/`memory_hard_limit_mb`). Pinning them to the same value leaves zero headroom for transient spikes and produces spurious infra failures.
- Capture the sandboxing provider, because enforcement semantics differ across runtimes (Kubernetes vs. Docker vs. Fly.io vs. bare containers).
- Keep observational fields (`hostname`, `container_id`, `wall_clock_start_utc`) out of the fingerprint so two runs with identical resource configs on different hosts collide to the same fingerprint.
See: https://www.anthropic.com/engineering/infrastructure-noise
to_hash_dict()
¶
Return dict of behavior-affecting fields for hashing.
Observational fields (hostname, container_id,
wall_clock_start_utc) are intentionally excluded: two runs
with identical resource configs on different hosts should
collide to the same fingerprint.
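The exclusion of observational fields can be demonstrated on plain dicts. The field names are the ones listed above; the values, the `infra_fingerprint` helper, and the JSON-plus-SHA-256 hashing are assumptions for illustration.

```python
import hashlib
import json

# Fields that describe where or when a run happened, not how it was configured.
OBSERVATIONAL_FIELDS = {"hostname", "container_id", "wall_clock_start_utc"}


def to_hash_dict(infra: dict) -> dict:
    """Keep only behavior-affecting fields."""
    return {key: value for key, value in infra.items() if key not in OBSERVATIONAL_FIELDS}


def infra_fingerprint(infra: dict) -> str:
    payload = json.dumps(to_hash_dict(infra), sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


run_a = {
    "cpu_guaranteed": 2,
    "cpu_hard_limit": 4,
    "memory_guaranteed_mb": 4096,
    "memory_hard_limit_mb": 8192,
    "hostname": "worker-1",
}
run_b = {**run_a, "hostname": "worker-7"}  # same resources, different host
run_c = {**run_a, "cpu_hard_limit": 2}  # hard limit pinned to the guarantee
```

Here `run_a` and `run_b` hash identically, while `run_c` differs because a behavior-affecting limit changed.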
tracelens.EnvironmentSpec
¶
Bases: BaseModel
Specification of the execution environment.
Captures build/deployment information for traceability.
to_hash_dict()
¶
Return dict for hashing.
LLM judge providers¶
tracelens.LLMProvider
¶
Bases: ABC
Abstract base class for LLM providers.
complete(prompt, **kwargs)
abstractmethod
async
¶
Send a prompt to the LLM and return the text response.
tracelens.InMemoryProvider
¶
Bases: LLMProvider
Testing provider that returns canned responses.
Cycles through the provided responses list and records all prompts.
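A canned-response provider of this kind is easy to picture with a small stand-alone sketch. The classes below (`Provider`, `CannedProvider`) are hypothetical stand-ins for `LLMProvider` and `InMemoryProvider`; the attribute names and the empty-list check are assumptions.

```python
import asyncio
from abc import ABC, abstractmethod


class Provider(ABC):
    """Hypothetical stand-in for an abstract async LLM provider."""

    @abstractmethod
    async def complete(self, prompt: str, **kwargs) -> str:
        """Send a prompt and return the text response."""


class CannedProvider(Provider):
    """Cycles through fixed responses and records every prompt it receives."""

    def __init__(self, responses):
        if not responses:
            raise ValueError("responses must not be empty")
        self.responses = list(responses)
        self.prompts = []

    async def complete(self, prompt: str, **kwargs) -> str:
        # The number of prompts seen so far selects the next canned reply.
        reply = self.responses[len(self.prompts) % len(self.responses)]
        self.prompts.append(prompt)
        return reply


async def demo():
    provider = CannedProvider(["PASS", "FAIL"])
    replies = [await provider.complete(f"grade #{i}") for i in range(3)]
    return replies, provider.prompts


replies, recorded_prompts = asyncio.run(demo())
```

Recording prompts lets a test assert on what the judge was asked, not only on what it answered.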
tracelens.create_provider(model_or_alias, **kwargs)
¶
Create an LLM provider from an alias.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `model_or_alias` | `str` | Currently only | required |
| `**kwargs` | `Any` | Passed to the provider constructor. For | `{}` |
Returns:
| Type | Description |
|---|---|
| `LLMProvider` | An LLMProvider instance. |
Raises:
| Type | Description |
|---|---|
| `ValueError` | If |
Reporting¶
tracelens.ReportGenerator
¶
Generates evaluation reports from TrialBatch results.
Example

    from tracelens import ReportGenerator

    # `batch` is a TrialBatch produced by an evaluation run
    gen = ReportGenerator()
    report = gen.build_report(batch)
    print(gen.render_markdown(report))
build_report(batch, baseline_manager=None)
¶
Build a ReportData from a TrialBatch.
render_markdown(report)
¶
Render a human-readable markdown report.
render_ci_summary(report)
¶
Render a compact CI-friendly summary.
render_html(report)
¶
Render a self-contained HTML dashboard with inline CSS and SVG charts.
tracelens.ReportData
dataclass
¶
Complete evaluation report.
tracelens.TaskSummary
dataclass
¶
Per-task summary statistics.