Supported Evaluation Scenarios

TraceLens is built for agent evaluation workflows where you need to inspect what happened, score the outcome, compare it against a baseline, and gate CI when reliability regresses.

If you are comparing TraceLens to hosted tracing, prompt-management, benchmark, or RAG-eval tools, read TraceLens vs Adjacent Tools after this page.

Good Fits

Task-Level Agent Evaluation

Use TraceLens when one task maps to one agent run, for example:

  • Goal decomposition.
  • Code generation or code review.
  • Research synthesis.
  • Customer-support response generation.
  • Planning, routing, or tool-selection decisions.

Start with Task, an AgentAdapter, one or more graders, and EvaluationRunner.
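
The pieces named above fit together roughly as follows. This is a minimal, self-contained sketch that does not import TraceLens: `SketchTask`, `upper_agent`, `contains_grader`, and `run_tasks` are hypothetical stand-ins for Task, an AgentAdapter, a grader, and EvaluationRunner, and the real classes may take different arguments.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable


# Hypothetical stand-ins for illustration; these are NOT TraceLens classes.
@dataclass
class SketchTask:
    task_id: str
    prompt: str
    expected_substring: str


@dataclass
class SketchResult:
    task_id: str
    output: str
    passed: bool


def upper_agent(prompt: str) -> str:
    """Toy agent: one prompt in, one output out (one task maps to one run)."""
    return prompt.upper()


def contains_grader(task: SketchTask, output: str) -> bool:
    """Toy grader: pass when the expected substring appears in the output."""
    return task.expected_substring in output


def run_tasks(
    tasks: list[SketchTask],
    agent: Callable[[str], str],
    grader: Callable[[SketchTask, str], bool],
) -> list[SketchResult]:
    """Run every task once and grade the output."""
    results = []
    for task in tasks:
        output = agent(task.prompt)
        results.append(SketchResult(task.task_id, output, grader(task, output)))
    return results


tasks = [
    SketchTask("t1", "plan a release", "RELEASE"),
    SketchTask("t2", "review this diff", "APPROVED"),
]
results = run_tasks(tasks, upper_agent, contains_grader)
pass_rate = sum(r.passed for r in results) / len(results)
print(f"pass rate: {pass_rate:.2f}")
```

Keeping the agent and the grader as plain callables makes it easy to swap either one without touching the task list.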

Regression Testing For Agents

Use baselines when you have a known-good agent or prompt and want to prevent silent quality loss:

  • Compare pass rate, quality score, safety score, or latency to a baseline.
  • Keep canary baselines protected.
  • Auto-promote capability baselines when a new version improves enough.
  • Fail CI on moderate or severe regressions.

Start with BaselineManager, RegressionDetector, and docs/ci-cd-integration.md.
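
The baseline comparison and CI gate described above can be sketched as follows. The function names and the 0.02 / 0.05 / 0.10 thresholds are assumptions chosen for illustration; they are not RegressionDetector's actual interface or defaults.

```python
def classify_regression(
    baseline: float,
    candidate: float,
    minor: float = 0.02,      # hypothetical threshold
    moderate: float = 0.05,   # hypothetical threshold
    severe: float = 0.10,     # hypothetical threshold
) -> str:
    """Label the drop from baseline to candidate for a higher-is-better metric."""
    drop = baseline - candidate
    if drop >= severe:
        return "severe"
    if drop >= moderate:
        return "moderate"
    if drop >= minor:
        return "minor"
    return "none"  # includes improvements, where drop is negative


def should_fail_ci(severity: str) -> bool:
    """Gate CI on moderate or severe regressions only."""
    return severity in {"moderate", "severe"}


severity = classify_regression(baseline=0.90, candidate=0.84)
print(severity, should_fail_ci(severity))
```

For lower-is-better metrics such as latency, the same logic applies with the sign of the drop reversed.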

Inspectable Trace Review

Use transcripts when debugging why a run passed or failed. They let you inspect:

  • Final output.
  • Intermediate transcript steps.
  • Timing.
  • Decision fingerprint.
  • Grader feedback.

Start with Transcript, TranscriptStep, and the report generator.

Infrastructure-Noise-Aware Evals

Use DecisionSpec.InfraConfig when resource differences can change the result:

  • Memory limits.
  • CPU limits.
  • Timeout budgets.
  • Concurrency.
  • Harness or sandbox provider.

TraceLens can mark a small regression as within the infrastructure noise band when the agent config is unchanged but the infrastructure config has changed.

Start with examples/noise_aware_regression.py and benchmarks/high-stakes-autonomous/.
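
The noise-band rule can be illustrated with a small standalone check. `RunConfig`, `within_noise_band`, the config keys, and the 0.03 band width are hypothetical; this sketch does not use DecisionSpec.InfraConfig and only mirrors the rule as stated on this page.

```python
from dataclasses import dataclass


@dataclass
class RunConfig:
    agent_config: dict   # e.g. model and prompt version (hypothetical keys)
    infra_config: dict   # e.g. memory, CPU, timeout (hypothetical keys)


def within_noise_band(
    baseline_cfg: RunConfig,
    candidate_cfg: RunConfig,
    baseline_score: float,
    candidate_score: float,
    noise_band: float = 0.03,  # assumed band width, not a TraceLens default
) -> bool:
    """True when a small drop is plausibly explained by an infra change alone."""
    agent_same = baseline_cfg.agent_config == candidate_cfg.agent_config
    infra_changed = baseline_cfg.infra_config != candidate_cfg.infra_config
    drop = baseline_score - candidate_score
    return agent_same and infra_changed and 0 < drop <= noise_band


base = RunConfig({"prompt": "v3"}, {"memory_mb": 4096, "timeout_s": 60})
cand = RunConfig({"prompt": "v3"}, {"memory_mb": 2048, "timeout_s": 60})
print(within_noise_band(base, cand, baseline_score=0.90, candidate_score=0.88))
```

If the agent config also changed, the drop cannot be attributed to infrastructure and the check returns False.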

HTTP Agent Evaluation

Use HTTPAPIAdapter when the system under test is already exposed as a service:

  • Local FastAPI or Flask apps.
  • Hosted agent APIs.
  • Internal platform endpoints.
  • Agent wrappers that already speak JSON.

Start with examples/http_agent_eval.py and install tracelens[http].
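
To make the service-under-test idea concrete, here is a standard-library sketch of posting one task to a JSON endpoint and reading the reply. It does not use HTTPAPIAdapter; the `/run` path, the `{"input": ...}` request shape, and the `{"output": ...}` response shape are assumptions, and the toy server exists only so the example runs locally.

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


def call_agent(url, task_input, timeout_s=10.0, opener=None):
    """POST a task to a JSON agent endpoint and return the decoded reply."""
    body = json.dumps({"input": task_input}).encode("utf-8")
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}, method="POST"
    )
    open_fn = opener.open if opener is not None else urllib.request.urlopen
    with open_fn(request, timeout=timeout_s) as response:
        return json.loads(response.read().decode("utf-8"))


class ReverseAgentHandler(BaseHTTPRequestHandler):
    """Toy agent service (hypothetical): replies with the input reversed."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length))
        reply = json.dumps({"output": payload["input"][::-1]}).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(reply)))
        self.end_headers()
        self.wfile.write(reply)

    def log_message(self, format, *args):
        pass  # keep the demo quiet


def demo():
    """Start the toy server on a free port, call it once, then shut it down."""
    server = HTTPServer(("127.0.0.1", 0), ReverseAgentHandler)
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    # Bypass any environment proxy settings for the loopback call.
    local_opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        url = f"http://127.0.0.1:{server.server_address[1]}/run"
        return call_agent(url, "abc", opener=local_opener)
    finally:
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    print(demo())
```

An explicit timeout matters here because a hung endpoint would otherwise stall the whole evaluation run.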

Contract-Based Evaluation

Use behavior contracts when your expected behavior is easy to declare:

  • Required output schema.
  • Must-include or must-not-include fields.
  • Safety constraints.
  • Quality warnings.
  • Tracking-only metrics.

Start with examples/contract_eval.py.
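
A declared contract of this kind can be pictured as plain data plus a checker. The contract keys (`required_fields`, `forbidden_fields`, `forbidden_phrases`) and the `check_contract` function are hypothetical and are not the TraceLens contract format; the sketch only shows the declare-then-verify idea.

```python
from __future__ import annotations


def check_contract(output: dict, contract: dict) -> list[str]:
    """Return a list of human-readable violations; an empty list means pass."""
    violations = []
    for name in contract.get("required_fields", []):
        if name not in output:
            violations.append(f"missing required field: {name}")
    for name in contract.get("forbidden_fields", []):
        if name in output:
            violations.append(f"forbidden field present: {name}")
    text = str(output.get("answer", "")).lower()
    for phrase in contract.get("forbidden_phrases", []):
        if phrase.lower() in text:
            violations.append(f"forbidden phrase in answer: {phrase}")
    return violations


contract = {
    "required_fields": ["answer", "sources"],
    "forbidden_fields": ["internal_notes"],
    "forbidden_phrases": ["guaranteed refund"],
}
good = {"answer": "You can request a refund within 30 days.", "sources": ["policy.md"]}
bad = {"answer": "A guaranteed refund is on its way.", "internal_notes": "escalate"}

print(check_contract(good, contract))
print(check_contract(bad, contract))
```

Returning every violation rather than stopping at the first one makes the grader feedback more useful when reading a failed run.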

Less Good Fits

TraceLens is not trying to be:

  • A model benchmark leaderboard.
  • A full observability backend.
  • A prompt management platform.
  • A replacement for human review.
  • A hosted evaluation service.

You can still export TraceLens outputs into those systems, but the core package focuses on local, inspectable, CI-ready evaluation.

Choosing A First Example

You have...                               Start with
----------------------------------------  --------------------------------------
A Python function or local agent          examples/hello_world.py
A JSON HTTP endpoint                      examples/http_agent_eval.py
A strict behavioral contract              examples/contract_eval.py
A baseline and new candidate run          examples/noise_aware_regression.py
A safety/resource-sensitive workflow      benchmarks/high-stakes-autonomous/

Minimum Production Setup

For a real project, aim for:

  • 20 to 50 real tasks collected from actual failures or important workflows.
  • At least one must-pass grader for safety or format constraints.
  • At least one score-contributor grader for quality.
  • num_runs > 1 for non-deterministic agents.
  • A canary baseline for critical behavior.
  • CI that fails on blocking regressions.
  • A habit of reading transcripts when a metric changes.
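
Two of the items above, combining a must-pass grader with a score-contributor grader and running each task more than once, can be sketched together. The `summarize_runs` function, the run dictionary keys, and the 0.7 quality threshold are assumptions for illustration, not TraceLens behavior.

```python
def summarize_runs(runs, quality_threshold=0.7):
    """Aggregate repeated runs of one task.

    Each run is a dict with:
      must_pass: bool   result of a must-pass grader (safety or format)
      quality:   float  score from a score-contributor grader
    A run passes only if the must-pass grader passed AND quality clears
    the (hypothetical) threshold.
    """
    passed = [r["must_pass"] and r["quality"] >= quality_threshold for r in runs]
    return {
        "num_runs": len(runs),
        "pass_rate": sum(passed) / len(runs),
        "mean_quality": sum(r["quality"] for r in runs) / len(runs),
        "must_pass_failures": sum(not r["must_pass"] for r in runs),
    }


# Four runs of the same task against a non-deterministic agent.
runs = [
    {"must_pass": True, "quality": 0.90},
    {"must_pass": True, "quality": 0.60},   # fails on quality
    {"must_pass": False, "quality": 0.95},  # fails the must-pass grader
    {"must_pass": True, "quality": 0.80},
]
summary = summarize_runs(runs)
print(summary)
```

Note that the third run fails despite its high quality score: a must-pass grader acts as a gate, not as one more term in an average.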

Ready to start?

If TraceLens fits, the fastest path is Getting Started (5 min). If you'd rather understand the moving parts first, read Core Concepts & Glossary.