Reproducibility & DecisionSpec

Two evaluation runs with the "same" setup can still diverge. A different model snapshot, a tweaked system prompt, a tool that quietly bumped its schema, or even a tighter memory budget on the CI runner can all move your pass rate. When that happens, the question is always the same: did the agent regress, or did the environment change underneath it? If you never recorded the configuration that produced a result, you can't answer it.

DecisionSpec is TraceLens's answer. It records the model, prompts, tools, agent, infrastructure, and environment that produced a run, and reduces them to a single content hash — the fingerprint. Identical configurations produce identical fingerprints; any behavior-affecting change produces a different one. Baselines stamped with a fingerprint become attributable: a regression carries the exact config that produced it.

from tracelens import DecisionSpec, ModelConfig, PromptSpec, ToolSpec, AgentSpec

spec = DecisionSpec(
    model=ModelConfig(provider="anthropic", model_id="claude-3-opus-20240229", temperature=0.7),
    prompts=PromptSpec.from_prompts(system_prompt="You are a helpful assistant..."),
    tools=[ToolSpec(name="search", version="1.0")],
    agent=AgentSpec(agent_name="goal_decomposition", agent_version="1.0.0"),
)
print(spec.fingerprint)        # full SHA-256 hex digest
print(spec.fingerprint_short)  # first 12 chars, e.g. "a1b2c3d4e5f6"

Every sub-config is exported from the package root (from tracelens import DecisionSpec, ModelConfig, ...), and every field below is optional unless the table marks it required.

The sub-configs

A DecisionSpec composes six sub-configs plus two top-level fields (global_seed: int | None and extra: dict[str, Any]). All of them are optional, so fill in only what is meaningful for your setup. The more you populate, the more precisely two runs can be told apart.

ModelConfig

What model produced the output, and with what decoding parameters.

Field	Type	Notes
`provider`	`str`	Required. e.g. `"anthropic"`, `"openai"`, `"google"`
`model_id`	`str`	Required. e.g. `"claude-3-opus-20240229"`
`model_version`	`str | None`	Specific snapshot if available
`temperature`	`float | None`	`0.0` = greedy decoding; hosted models may still vary slightly between calls
`top_p`	`float | None`	Nucleus sampling
`top_k`	`int | None`	Top-k sampling
`max_tokens`	`int | None`	Max tokens to generate
`seed`	`int | None`	Provider seed, if supported
`stop_sequences`	`list[str] | None`	Sorted before hashing (order-independent)
`extra_params`	`dict[str, Any]`	Any additional model-specific params

PromptSpec

What prompts the agent ran. By default PromptSpec stores only SHA-256 hashes of prompt text, not the prompts themselves — so a fingerprint is traceable without committing long or sensitive prompts to your baseline files.

Field	Type	Notes
`system_prompt_hash`	`str | None`	SHA-256 of the system prompt
`prompt_template_hash`	`str | None`	SHA-256 of the main template
`prompt_version`	`str | None`	Version identifier for the prompt set
`system_prompt`	`str | None`	Full text (optional, debugging only)
`prompt_template`	`str | None`	Full text (optional, debugging only)

Use the from_prompts(...) classmethod rather than hashing by hand — it computes the hashes for you:

PromptSpec.from_prompts(
    system_prompt="You are a helpful assistant...",
    prompt_template="Given {context}, do {task}...",
    prompt_version="v3",
    store_full_prompts=False,  # default: store hashes only
)

Only system_prompt_hash, prompt_template_hash, and prompt_version enter the fingerprint. The optional full-text fields are for debugging and never affect the hash.
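To make the hashing step concrete, here is a minimal sketch of what hash-only prompt storage amounts to. It assumes the hash is SHA-256 over the UTF-8 bytes of the raw prompt text; whether TraceLens normalizes whitespace or encoding first is not stated here, and `hash_prompt` is a hypothetical helper, not part of the TraceLens API.

```python
import hashlib


def hash_prompt(text: str) -> str:
    """Return the SHA-256 hex digest of a prompt (hypothetical helper).

    Assumption: the raw text is hashed as UTF-8 with no normalization.
    """
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


terse = hash_prompt("Be terse.")
verbose = hash_prompt("Be verbose.")

# The digest is a fixed-length hex string, so the prompt text itself never
# has to be stored, yet any edit to the prompt changes the digest.
print(len(terse), terse != verbose)
```

In practice, prefer PromptSpec.from_prompts(...); the sketch only shows why a small prompt edit is enough to change the stored hash.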

ToolSpec

A single tool available to the agent. Pass a list of these as DecisionSpec.tools. Tools are sorted by name before hashing, so declaration order doesn't matter.

Field	Type	Notes
`name`	`str`	Required. Tool name
`version`	`str | None`	Tool version
`description_hash`	`str | None`	Hash of the tool description (affects LLM tool selection)
`schema_hash`	`str | None`	Hash of the tool input/output schema
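One way to produce a `schema_hash`, and to see why declaration order is irrelevant, is sketched below. It assumes the schema is a JSON-serializable dict hashed in canonical form (sorted keys). `hash_schema` and `canonical_tools` are hypothetical helpers, tools are modeled as plain dicts, and TraceLens's own canonicalization may differ.

```python
import hashlib
import json
from typing import Any


def hash_schema(schema: dict[str, Any]) -> str:
    """Hash a JSON-serializable schema in canonical form (hypothetical helper)."""
    canonical = json.dumps(schema, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def canonical_tools(tools: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Sort tool records by name so declaration order cannot affect a hash."""
    return sorted(tools, key=lambda tool: tool["name"])


search_schema = {"type": "object", "properties": {"query": {"type": "string"}}}
search = {"name": "search", "version": "1.0", "schema_hash": hash_schema(search_schema)}
fetch = {"name": "fetch", "version": "2.1", "schema_hash": None}

# Both declaration orders reduce to the same canonical list.
print(canonical_tools([search, fetch]) == canonical_tools([fetch, search]))
```

Hashing the schema rather than storing it keeps baseline files small while still changing the fingerprint when a tool quietly changes its input or output shape.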

AgentSpec

Which agent was evaluated, and which version of its wiring.

Field	Type	Notes
`agent_name`	`str`	Required. Name of the agent
`agent_version`	`str | None`	Agent version
`agent_graph_hash`	`str | None`	Hash of agent graph structure (multi-agent systems)
`config_hash`	`str | None`	Hash of agent configuration

InfraConfig

The runtime environment: resource limits, time budgets, concurrency, and platform. Agentic evals are end-to-end system tests, so infrastructure is a first-class experimental variable, not passive scaffolding. For each resource, record both a guaranteed allocation and a separate hard kill threshold, and capture the sandbox provider, because enforcement semantics differ between providers.

Behavior-affecting fields (included in the fingerprint):

Field	Type	Notes
`cpu_guaranteed`	`float | None`	Guaranteed CPU, in cores
`cpu_hard_limit`	`float | None`	CPU kill threshold, in cores
`memory_guaranteed_mb`	`int | None`	Guaranteed memory (MB)
`memory_hard_limit_mb`	`int | None`	OOM threshold (MB)
`time_budget_seconds`	`float | None`	Per-task wall-clock budget
`concurrency_level`	`int | None`	Max concurrent trials
`runtime_platform`	`str | None`	e.g. `"kubernetes"`, `"docker"`, `"local"`
`sandbox_provider`	`str | None`	Sandboxing provider
`harness_version`	`str | None`	Version of the eval harness

Observational fields (deliberately excluded from the fingerprint):

Field	Type	Notes
`hostname`	`str | None`	Host running the eval
`container_id`	`str | None`	Container / pod ID
`wall_clock_start_utc`	`datetime | None`	UTC start time

The observational fields are excluded so that two runs with identical resource configs on different hosts produce the same fingerprint; see "The fingerprint" below.

EnvironmentSpec

Build and deployment provenance for traceability.

Field	Type	In fingerprint?
`git_commit`	`str | None`	Yes
`git_branch`	`str | None`	No
`build_id`	`str | None`	Yes
`runner_version`	`str | None`	Yes
`framework_version`	`str | None` (TraceLens version)	Yes
`python_version`	`str | None`	No

git_branch and python_version are recorded for auditing but do not enter the fingerprint.

The fingerprint

DecisionSpec.fingerprint is the full SHA-256 hex digest; fingerprint_short is its first 12 characters. Both are computed deterministically: each sub-config contributes a hash dict, the combined dict is serialized with sort_keys=True, and that is hashed. The guiding rule:

Config goes in. Observations stay out. Anything that changes what the agent does belongs in the fingerprint. Anything that merely records where or when it ran does not — so the same configuration on a different machine yields the same fingerprint.

That is why InfraConfig.hostname, container_id, and wall_clock_start_utc, plus EnvironmentSpec.git_branch and python_version, are excluded.
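The procedure described above can be sketched with the standard library. This is an illustration under stated assumptions, not the TraceLens implementation: sub-configs are modeled as plain dicts, `infra_hash_dict` and `fingerprint` are hypothetical names, and the JSON options beyond `sort_keys=True` are a guess.

```python
import hashlib
import json
from typing import Any

# Observational InfraConfig fields, taken from the tables above.
OBSERVATIONAL_INFRA_FIELDS = {"hostname", "container_id", "wall_clock_start_utc"}


def infra_hash_dict(infra: dict[str, Any]) -> dict[str, Any]:
    """Keep only behavior-affecting fields; drop the observational ones."""
    return {k: v for k, v in infra.items() if k not in OBSERVATIONAL_INFRA_FIELDS}


def fingerprint(hash_dicts: dict[str, dict[str, Any]]) -> str:
    """Serialize the combined hash dict canonically, then SHA-256 it."""
    canonical = json.dumps(hash_dicts, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


cfg = {"memory_hard_limit_mb": 2048, "runtime_platform": "kubernetes"}
run_a = fingerprint({"infra": infra_hash_dict({**cfg, "hostname": "node-7"})})
run_b = fingerprint({"infra": infra_hash_dict({**cfg, "hostname": "node-12"})})
run_c = fingerprint({"infra": infra_hash_dict({**cfg, "memory_hard_limit_mb": 512})})

print(run_a == run_b)   # only the hostname differs
print(run_a == run_c)   # the memory limit differs
print(run_a[:12])       # the short form is a 12-character prefix
```

Sorting keys before hashing is what makes the digest independent of the order in which fields were set; without it, two logically identical configs could serialize differently.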

Different config → different fingerprint:

from tracelens import DecisionSpec, PromptSpec

a = DecisionSpec(prompts=PromptSpec.from_prompts(system_prompt="Be terse."))
b = DecisionSpec(prompts=PromptSpec.from_prompts(system_prompt="Be verbose."))
assert a.fingerprint != b.fingerprint   # the prompt changed

Same config, different host → same fingerprint:

from tracelens import DecisionSpec, InfraConfig

cfg = dict(memory_hard_limit_mb=2048, runtime_platform="kubernetes")
run_a = DecisionSpec(infra=InfraConfig(**cfg, hostname="node-7"))
run_b = DecisionSpec(infra=InfraConfig(**cfg, hostname="node-12"))
assert run_a.fingerprint == run_b.fingerprint   # only the hostname differs

One compatibility detail: infra is only mixed into the hash when it is explicitly set, so fingerprints recorded before InfraConfig existed stay stable.

Two helpers complement the fingerprint: spec.diff(other) returns the field-level differences as {field: (self_value, other_value)}, and spec.is_compatible_with(other) returns True when two specs share the same model provider/model_id and agent name — useful for comparing a prompt change while holding the model fixed.
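The semantics of the two helpers can be sketched on flat dicts. This is an assumption-laden illustration: real specs are nested objects, and `diff_fields` and `is_compatible` are hypothetical stand-ins rather than the TraceLens methods.

```python
from typing import Any


def diff_fields(a: dict[str, Any], b: dict[str, Any]) -> dict[str, tuple[Any, Any]]:
    """Return {field: (a_value, b_value)} for every field that differs."""
    changed: dict[str, tuple[Any, Any]] = {}
    for key in sorted(a.keys() | b.keys()):
        left, right = a.get(key), b.get(key)
        if left != right:
            changed[key] = (left, right)
    return changed


def is_compatible(a: dict[str, Any], b: dict[str, Any]) -> bool:
    """True when provider, model_id, and agent_name all match."""
    keys = ("provider", "model_id", "agent_name")
    return all(a.get(k) == b.get(k) for k in keys)


old = {"provider": "anthropic", "model_id": "claude-3-opus-20240229",
       "agent_name": "goal_decomposition", "prompt_version": "v2"}
new = {**old, "prompt_version": "v3"}

print(diff_fields(old, new))    # only prompt_version differs
print(is_compatible(old, new))  # same model and agent, so comparable
```

A typical use is to check compatibility first and only then interpret a pass-rate delta, with the diff telling you which field to attribute it to.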

Stamping a run

You rarely build fingerprints by hand. Pass a decision_spec to the EvaluationRunner and it stamps the spec onto every transcript that doesn't already carry one:

from tracelens import EvaluationRunner, RunnerConfig, SimpleAdapter, DecisionSpec, ModelConfig

runner = EvaluationRunner(
    adapter=SimpleAdapter(my_agent),
    graders=[...],
    config=RunnerConfig(num_runs=5),
    decision_spec=DecisionSpec(
        model=ModelConfig(provider="anthropic", model_id="claude-3-opus-20240229"),
    ),
)
# runner.run is awaited, so call it from an async context (a notebook cell or an `async def`).
batch = await runner.run(eval_set)

If an adapter already attached its own decision_spec to a transcript, the runner leaves it untouched; otherwise it fills in the runner-level spec. The result: every trial in the batch records the configuration that produced it, so any baseline you promote from this batch carries its fingerprint.
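The fill-in rule is simple enough to sketch directly. The `Transcript` class and `stamp` function below are hypothetical stand-ins (a plain string takes the place of a full DecisionSpec); they only illustrate the "adapter spec wins, runner spec fills gaps" behavior described above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Transcript:
    """Stand-in for a trial transcript (hypothetical, not the TraceLens type)."""
    trial_id: int
    decision_spec: Optional[str] = None  # a label stands in for a real spec


def stamp(transcripts: list[Transcript], runner_spec: str) -> None:
    """Fill in the runner-level spec only where none is attached yet."""
    for transcript in transcripts:
        if transcript.decision_spec is None:
            transcript.decision_spec = runner_spec


batch = [Transcript(1), Transcript(2, decision_spec="adapter-spec"), Transcript(3)]
stamp(batch, runner_spec="runner-spec")
print([t.decision_spec for t in batch])
```

Letting the adapter's spec take precedence means an adapter that knows more about its own configuration is never overwritten by the coarser runner-level default.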

Infra-noise-aware regression

Because InfraConfig is part of the fingerprint, the regression detector can tell "the agent/model/prompt changed" apart from "only the infrastructure changed." This matters: resource-config changes alone can swing agentic-eval scores by several percentage points — often more than the gap between frontier models.

RegressionDetector.compare_with_specs(...) takes both the baseline and current specs. When their InfraConfig differs, it sets report.infra_config_mismatch (with the field-level report.infra_config_diff) and marks sub-noise-band regressions as within_noise_band=True, so default CI gates don't block on a delta that is plausibly just infra noise:

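# Assumes RegressionDetector is imported and that baseline, current_rate,
# baseline_spec, and current_spec are already defined.
report = RegressionDetector(min_delta_percent=1.0).compare_with_specs(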
    baseline,
    current_results=[{"pass_rate": current_rate}] * 6,
    baseline_spec=baseline_spec,   # DecisionSpec(infra=InfraConfig(memory_hard_limit_mb=2048, ...))
    current_spec=current_spec,     # DecisionSpec(infra=InfraConfig(memory_hard_limit_mb=512,  ...))
)

print(report.infra_config_mismatch)            # True — memory budget changed
for key, (was, now) in report.infra_config_diff.items():
    print(f"  {key}: {was!r} -> {now!r}")       # memory_hard_limit_mb: 2048 -> 512
print(len(report.blocking_regressions))        # excludes within_noise_band ones
print(report.should_block_ci())

report.blocking_regressions filters out anything flagged within_noise_band, and should_block_ci() keys off that filtered set. The full runnable example — including an InfraError-raising adapter that simulates an OOM kill so infra failures are classified separately from task failures — is at examples/noise_aware_regression.py.
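The gating logic can be approximated as follows. This is a sketch, not the detector's code: the `Regression` class, `flag_noise`, and the `noise_band_percent` value of 3.0 are all hypothetical, since how TraceLens sizes the noise band is not documented here.

```python
from dataclasses import dataclass


@dataclass
class Regression:
    metric: str
    delta_percent: float          # negative = worse than baseline
    within_noise_band: bool = False


def flag_noise(regressions, infra_mismatch: bool, noise_band_percent: float) -> None:
    """Mark small deltas as noise, but only when the infra config differs."""
    for reg in regressions:
        reg.within_noise_band = infra_mismatch and abs(reg.delta_percent) < noise_band_percent


def blocking(regressions):
    """Regressions that should still gate CI."""
    return [reg for reg in regressions if not reg.within_noise_band]


def should_block_ci(regressions) -> bool:
    return bool(blocking(regressions))


found = [Regression("pass_rate", -2.0), Regression("tool_accuracy", -8.0)]
flag_noise(found, infra_mismatch=True, noise_band_percent=3.0)
print([reg.metric for reg in blocking(found)], should_block_ci(found))
```

Note that a large delta still blocks even when the infrastructure changed; the noise band only excuses deltas small enough to be plausibly explained by the infra difference.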