Supported Evaluation Scenarios

TraceLens is built for agent evaluation workflows where you need to inspect what happened, score the outcome, compare it against a baseline, and gate CI when reliability regresses.

If you are comparing TraceLens to hosted tracing, prompt-management, benchmark, or RAG-eval tools, read TraceLens vs Adjacent Tools after this page.

Good Fits

Task-Level Agent Evaluation

Use TraceLens when one task maps to one agent run:

Goal decomposition.
Code generation or code review.
Research synthesis.
Customer-support response generation.
Planning, routing, or tool-selection decisions.

Start with Task, an AgentAdapter, one or more graders, and EvaluationRunner.
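
The way those four pieces fit together can be sketched without the package installed. This is a minimal illustration of the pattern, not the TraceLens API: the `Task` fields, `UppercaseAdapter`, `exact_match_grader`, and `run_evaluation` are hypothetical stand-ins defined here, and the real classes may take different arguments.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


# Hypothetical stand-ins for illustration; not the TraceLens classes.
@dataclass
class Task:
    task_id: str
    prompt: str
    expected: str


class UppercaseAdapter:
    """Wraps the system under test; here a toy agent that uppercases the prompt."""

    def run(self, task: Task) -> str:
        return task.prompt.upper()


Grader = Callable[[Task, str], bool]


def exact_match_grader(task: Task, output: str) -> bool:
    """Pass when the agent output equals the expected answer."""
    return output == task.expected


def run_evaluation(tasks: List[Task], adapter, graders: List[Grader]) -> Dict[str, bool]:
    """One agent run per task; a task passes only if every grader passes."""
    results = {}
    for task in tasks:
        output = adapter.run(task)
        results[task.task_id] = all(grader(task, output) for grader in graders)
    return results


tasks = [
    Task("t1", "hello", "HELLO"),
    Task("t2", "world", "World"),  # deliberately fails the exact match
]
results = run_evaluation(tasks, UppercaseAdapter(), [exact_match_grader])
pass_rate = sum(results.values()) / len(results)
```

The adapter is the only piece that knows how to call the agent, so swapping in a different agent should not require touching the tasks or the graders.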

Regression Testing For Agents

Use baselines when you have a known-good agent or prompt and want to prevent silent quality loss:

Compare pass rate, quality score, safety score, or latency to a baseline.
Keep canary baselines protected.
Auto-promote capability baselines when a new version improves enough.
Fail CI on moderate or severe regressions.

Start with BaselineManager, RegressionDetector, and docs/ci-cd-integration.md.
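
The comparison and the CI gate described above can be sketched as follows. The severity names follow the list, but the threshold values and both function names are assumptions made for this example, not TraceLens defaults or the `RegressionDetector` interface.

```python
def classify_regression(baseline: float, candidate: float,
                        moderate: float = 0.05, severe: float = 0.15) -> str:
    """Classify a drop in a higher-is-better metric such as pass rate.

    The thresholds are illustrative assumptions, not TraceLens defaults.
    """
    drop = baseline - candidate
    if drop >= severe:
        return "severe"
    if drop >= moderate:
        return "moderate"
    if drop > 0:
        return "minor"
    return "none"


def ci_exit_code(severity: str) -> int:
    """Return a non-zero exit code so CI fails on moderate or severe regressions."""
    return 1 if severity in {"moderate", "severe"} else 0


# Example: the baseline pass rate was 0.90 and the candidate scored 0.80.
severity = classify_regression(baseline=0.90, candidate=0.80)
exit_code = ci_exit_code(severity)
```

For a lower-is-better metric such as latency, the sign of the drop would need to be flipped before classifying.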

Inspectable Trace Review

Use transcripts when debugging why a run passed or failed:

Final output.
Intermediate transcript steps.
Timing.
Decision fingerprint.
Grader feedback.

Start with Transcript, TranscriptStep, and the report generator.
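
To make the list above concrete, here is a rough sketch of a record that holds those five items. The `Step` and `RunRecord` classes are hypothetical and are not the real `Transcript` or `TranscriptStep`. Treating the decision fingerprint as a hash of the ordered step names is also an assumption; the package may compute it differently.

```python
import hashlib
import json
from dataclasses import dataclass, field


# Hypothetical shapes for illustration; the real classes may carry other fields.
@dataclass
class Step:
    name: str
    detail: str
    duration_ms: float


@dataclass
class RunRecord:
    final_output: str
    steps: list = field(default_factory=list)
    grader_feedback: dict = field(default_factory=dict)

    def total_ms(self) -> float:
        """Sum the per-step timings."""
        return sum(step.duration_ms for step in self.steps)

    def fingerprint(self) -> str:
        """Hash the ordered step names so runs with the same decision path compare equal."""
        payload = json.dumps([step.name for step in self.steps]).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()[:12]


run_a = RunRecord("42", [Step("plan", "split goal", 120.0), Step("answer", "compute", 30.0)])
run_b = RunRecord("42", [Step("plan", "split goal", 95.0), Step("answer", "compute", 41.0)])
run_c = RunRecord("41", [Step("answer", "guess", 12.0)], {"exact_match": "expected 42"})

same_path = run_a.fingerprint() == run_b.fingerprint()
different_path = run_a.fingerprint() != run_c.fingerprint()
```

When a metric moves, comparing fingerprints first is a cheap way to tell whether the agent took a different path or took the same path and produced a different answer.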

Infrastructure-Noise-Aware Evals

Use DecisionSpec.InfraConfig when resource differences can change the result:

Memory limits.
CPU limits.
Timeout budgets.
Concurrency.
Harness or sandbox provider.

TraceLens can mark a small regression as within the infrastructure noise band when the agent configuration is unchanged but the infrastructure configuration differs from the baseline run.

Start with examples/noise_aware_regression.py and benchmarks/high-stakes-autonomous/.
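
The noise-band rule can be illustrated with a small decision function. This is a sketch under stated assumptions: `InfraSnapshot` is a hypothetical stand-in for `DecisionSpec.InfraConfig`, the field names simply mirror the list above, and the 0.03 band is an arbitrary example value rather than a TraceLens default.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InfraSnapshot:
    # Hypothetical fields mirroring the resource list above.
    memory_mb: int
    cpu_cores: float
    timeout_s: int
    concurrency: int
    provider: str


def explain_drop(baseline_score: float, candidate_score: float, agent_changed: bool,
                 baseline_infra: InfraSnapshot, candidate_infra: InfraSnapshot,
                 noise_band: float = 0.03) -> str:
    """Label a score change, attributing small drops to infra only when the agent is unchanged."""
    drop = baseline_score - candidate_score
    if drop <= 0:
        return "no_regression"
    infra_changed = baseline_infra != candidate_infra
    if not agent_changed and infra_changed and drop <= noise_band:
        return "within_infra_noise"
    return "regression"


base = InfraSnapshot(4096, 2.0, 300, 4, "docker")
smaller = InfraSnapshot(2048, 2.0, 300, 4, "docker")

label = explain_drop(0.90, 0.88, agent_changed=False,
                     baseline_infra=base, candidate_infra=smaller)
```

The same 0.02 drop is reported as a regression if the agent configuration changed or if the infrastructure is identical, because in those cases the resource difference cannot explain it.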

HTTP Agent Evaluation

Use HTTPAPIAdapter when the system under test is already exposed as a service:

Local FastAPI or Flask apps.
Hosted agent APIs.
Internal platform endpoints.
Agent wrappers that already speak JSON.

Start with examples/http_agent_eval.py and install tracelens[http].
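
For orientation, the request and response handling that an HTTP adapter has to perform looks roughly like this, using only the standard library. It is not the `HTTPAPIAdapter` implementation: the function names, the `{"input": ...}` payload, the `"output"` response key, and the URL in the usage line are all assumptions chosen for the example.

```python
import json
import urllib.request


def build_request(url: str, task_input: str) -> urllib.request.Request:
    """Build a JSON POST; the {"input": ...} payload shape is an assumption."""
    body = json.dumps({"input": task_input}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def parse_response(raw: bytes) -> str:
    """Extract the agent's answer; the "output" key is an assumption."""
    return json.loads(raw.decode("utf-8"))["output"]


def call_agent(url: str, task_input: str, timeout_s: float = 30.0) -> str:
    """Send one task to the service and return its answer."""
    request = build_request(url, task_input)
    with urllib.request.urlopen(request, timeout=timeout_s) as response:
        return parse_response(response.read())


# Hypothetical local endpoint; nothing is sent until call_agent is invoked.
example_request = build_request("http://localhost:8000/agent", "Summarize the ticket")
```

Keeping request building and response parsing separate from the network call makes both halves testable without a running service.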

Contract-Based Evaluation

Use behavior contracts when your expected behavior is easy to declare:

Required output schema.
Must-include or must-not-include fields.
Safety constraints.
Quality warnings.
Tracking-only metrics.

Start with examples/contract_eval.py.
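
A declared contract of this kind boils down to a few checks against the agent's output. The sketch below uses a plain dictionary as the contract. The key names (`must_include`, `must_not_include`, `warn_if_answer_shorter_than`) and the `check_contract` function are invented for illustration and are not the TraceLens contract format.

```python
def check_contract(output: dict, contract: dict) -> dict:
    """Return blocking violations and non-blocking warnings for one output."""
    violations = []
    warnings = []
    for name in contract.get("must_include", []):
        if name not in output:
            violations.append(f"missing field: {name}")
    for name in contract.get("must_not_include", []):
        if name in output:
            violations.append(f"forbidden field: {name}")
    min_len = contract.get("warn_if_answer_shorter_than", 0)
    if len(str(output.get("answer", ""))) < min_len:
        warnings.append("answer is shorter than expected")
    return {"passed": not violations, "violations": violations, "warnings": warnings}


# Hypothetical contract: two required fields, one forbidden field, one quality warning.
contract = {
    "must_include": ["answer", "sources"],
    "must_not_include": ["api_key"],
    "warn_if_answer_shorter_than": 10,
}

good = check_contract({"answer": "Refund issued today.", "sources": ["kb-12"]}, contract)
bad = check_contract({"answer": "ok", "api_key": "secret"}, contract)
```

Warnings are reported but do not affect `passed`, which mirrors the split between hard constraints and quality warnings in the list above.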

Less Good Fits

TraceLens is not trying to be:

A model benchmark leaderboard.
A full observability backend.
A prompt management platform.
A replacement for human review.
A hosted evaluation service.

You can still export TraceLens outputs into those systems, but the core package focuses on local, inspectable, CI-ready evaluation.

Choosing A First Example

| You have... | Start with |
| --- | --- |
| A Python function or local agent | `examples/hello_world.py` |
| A JSON HTTP endpoint | `examples/http_agent_eval.py` |
| A strict behavioral contract | `examples/contract_eval.py` |
| A baseline and new candidate run | `examples/noise_aware_regression.py` |
| A safety/resource-sensitive workflow | `benchmarks/high-stakes-autonomous/` |

Minimum Production Setup

For a real project, aim for:

20-50 real tasks collected from actual failures or important workflows.
At least one must-pass grader for safety or format constraints.
At least one score-contributor grader for quality.
num_runs > 1 for non-deterministic agents, so that a single lucky or unlucky run does not decide the result.
A canary baseline for critical behavior.
CI that fails on blocking regressions.
A habit of reading transcripts when a metric changes.
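
The interaction between must-pass graders, score-contributor graders, and repeated runs can be sketched as a small aggregation. The `summarize_runs` function, the tuple layout, and the 0.7 quality threshold are assumptions made for this example and do not describe how TraceLens aggregates results.

```python
from statistics import mean


def summarize_runs(runs, pass_threshold: float = 0.7) -> dict:
    """Aggregate repeated runs of one task.

    Each run is a (must_pass_ok, quality_score) pair. A run passes only when
    the must-pass grader succeeds and the quality score clears the threshold.
    """
    passed = [ok and score >= pass_threshold for ok, score in runs]
    return {
        "num_runs": len(runs),
        "pass_rate": sum(passed) / len(runs),
        "mean_quality": mean(score for _, score in runs),
        "all_passed": all(passed),
    }


# Four runs of the same task: one fails on quality, one fails the must-pass check.
runs = [(True, 0.9), (True, 0.6), (False, 0.95), (True, 0.8)]
summary = summarize_runs(runs)
```

Note that the third run has the highest quality score yet still fails, because a must-pass grader overrides any score. That is the reason to keep safety and format checks separate from quality scoring.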

Ready to start?

If TraceLens fits, the fastest path is Getting Started (5 min). If you'd rather understand the moving parts first, read Core Concepts & Glossary.