
Backtesting AI Agents: Replay to Catch Regressions

20 min read

A clean editorial illustration evoking record, replay, diff, and judge as a temporal evaluation loop

54% of enterprises now run AI agents in production, and 41% rank unreliable performance as the top adoption blocker (Master of Code, 2026). The interesting part isn’t the size of the problem. It’s the shape. The gap isn’t “the model is bad.” It’s that we ship config changes across four context surfaces (a new CLAUDE.md rule, an installed skill, a model swap, an MCP server upgrade) and find out what they break from a confused user.

Quant finance ran into the analogous problem thirty years ago. They called it backtesting: never test a strategy on a single price, test it on a known history under varying assumptions. The tooling for agent backtesting almost exists. The transcripts are already on disk. The judges already work. We just have not named the discipline yet.

This post is the naming, the four-step loop, and a working reference CLI. The worked example is a real session recorded in this site’s repo.

Key Takeaways

  • 54% of enterprises run AI agents in production, and 41% rank unreliable performance as their top blocker (Master of Code, 2026). Catching regressions before they ship is the missing eval discipline.
  • Backtesting an agent: record a real session, replay it under a new configuration, diff the outputs structurally, and ask an LLM judge to score the delta.
  • Enterprise agentic systems show a 37% gap between lab benchmark scores and real-world deployment performance (Ampcome, 2026). Replaying real sessions closes the gap that synthetic benchmarks open.
  • The reference implementation is agent-backtest, a Bun CLI that runs against the JSONL transcripts Claude Code already writes. Worked example included: stripping the system context from a real session collapsed correctness from baseline to 0 out of 5.
  • Backtesting does not replace human review. It shifts where the human spends attention, from “did the suite pass” to “did the agent’s behavior change in ways the suite cannot see.”

Why is agent evaluation an unsolved problem in 2026?

41% of teams running agents in production rank unreliable performance as their biggest adoption blocker. Only 39.8% currently use offline evaluation, and 32.5% run real-time A/B testing (Master of Code, 2026). Lab benchmarks miss real failures by a 37% margin (Ampcome, 2026). The discipline is real; the coverage is the gap.

Three things make agent evaluation different from earlier eval problems.

Agents are non-deterministic by construction. Identical inputs produce different execution paths, because the model picks tools, gets results, and reasons over them in ways that compound across turns. A unit test asks “given X, do I get Y?” Agents do not have that shape.

Multi-turn flows accumulate decisions. A bad first tool choice biases the next three turns. A subtle context loss in turn five becomes the wrong final answer in turn nine. The interesting bugs live in the trajectory, not in any single response.

Standard observability sees traces, not meaning. Spans return 200. Latency is fine. The user-facing decision quality has dropped, and your dashboards show green. This is the silent-regression failure mode, and it is the one teams miss most often after a CLAUDE.md edit nobody flagged.

The shape of the gap explains the data. Of the teams already running agents, 55.4% use tracing, 44.3% use guardrails, but only 39.8% use offline evaluation (Master of Code, 2026). Tracing tells you what happened. Eval tells you whether what happened was good. The lower number is where the regressions hide.

What does backtesting an agent actually look like?

Backtesting an agent is a four-step loop. Record a real session as a stable transcript. Replay it under a new configuration. Diff the outputs structurally. Ask an LLM judge to score the delta. Each step is necessary; none is sufficient on its own.
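The loop composes cleanly as four functions. A minimal sketch, assuming illustrative shapes; the names below are not the agent-backtest API:

```typescript
// Illustrative shapes for the four-step loop; not the agent-backtest API.
type Rec = { id: string; finalText: string; toolNames: string[] };
type Diff = { toolSeqEqual: boolean; jaccard: number };
type Verdict = { regression: "none" | "minor" | "major"; rationale: string };

async function backtest(
  record: (sessionPath: string) => Promise<Rec>,
  replay: (baseline: Rec, config: Record<string, unknown>) => Promise<Rec>,
  diff: (baseline: Rec, replayed: Rec) => Diff,
  judge: (baseline: Rec, replayed: Rec, d: Diff) => Promise<Verdict>,
  sessionPath: string,
  config: Record<string, unknown>,
): Promise<Verdict> {
  const baseline = await record(sessionPath); // 1. record a real session
  const replayed = await replay(baseline, config); // 2. replay under new config
  const d = diff(baseline, replayed); // 3. structural diff
  return judge(baseline, replayed, d); // 4. LLM judge scores the delta
}
```

Keeping the four stages as separate functions is what lets the harness later decouple them into separate commands with durable artifacts between each.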

The quant analogy is the load-bearing one. Quants call it backtesting because the test runs against known history, not against forecasted futures. A new strategy is rerun on years of price data; if it would have lost money in 2008, it does not get capital today. Agent backtesting tests in the same direction: take a session that already happened, rerun it under a candidate configuration, and see whether the candidate would have made the same call.

I wrote about the foundational backtesting technique for application-level temporal systems in a prior post. The same shape applies to agents, with one structural difference: the input is not a synthetic event stream but a normalized recording of an agent run.

The reason the judge step matters at all is that structural diffs are necessary but insufficient. Tool-sequence and token-overlap checks can both pass while a real regression slips through. A 2025 LLM-as-judge survey found this is precisely where rubric-based judging earns its keep, especially during regression testing after model or prompt updates (LLM-as-a-Judge survey, 2025).

Each step has its own failure modes. The rest of the post walks through them.

How do you record an agent session deterministically?

Claude Code already writes a JSONL transcript per session at ~/.claude/projects/<encoded-cwd>/<session-id>.jsonl. The recording step is normalization, not capture. The interesting work is what you strip and what you preserve.

Three things to strip. Timestamps, because they make every diff between runs noisy. Synthetic command envelopes (<local-command-caveat>, <command-name>, <command-message>), because they look like user prompts but they are slash command bookkeeping the harness injects. If you keep them, your replay starts from the wrong assumption every time. Raw tool inputs, because they contain absolute file paths and snapshots that change between runs. Replace each with a stable hash so the diff has something to compare without false positives.
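The strip-and-hash step is small. A minimal sketch, assuming the envelope tags appear as literal XML-style markers in the user text (the exact JSONL field layout varies by Claude Code version):

```typescript
import { createHash } from "node:crypto";

// Illustrative normalizer; the exact envelope tag set and record fields
// are assumptions, not the agent-backtest implementation.
const ENVELOPE =
  /<(local-command-caveat|command-name|command-message)>[\s\S]*?<\/\1>/g;

// Collapse a raw tool input (absolute paths, snapshots, etc.) to a short
// stable fingerprint that still changes when the input really changes.
function hashInput(input: unknown): string {
  return createHash("sha256")
    .update(JSON.stringify(input))
    .digest("hex")
    .slice(0, 8);
}

// Strip slash-command bookkeeping; return null for turns that were
// nothing but bookkeeping, so they can be dropped from the record.
function normalizeUserText(text: string): string | null {
  const stripped = text.replace(ENVELOPE, "").trim();
  return stripped.length > 0 ? stripped : null;
}
```

The null return matters: a replay seeded with a bookkeeping "prompt" starts from the wrong assumption every time, which is exactly the failure the strip step exists to prevent.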

Four things to preserve. The filtered user prompts (everything that survived the envelope strip). The assistant turns with text and tool-call sequences. The tool result hashes so you can detect when the same tool returned different results. The environment fingerprint: Claude Code version, gitBranch, cwd, model name, and a hash of the CLAUDE.md file. Without the fingerprint, you cannot tell whether a future replay is fair or contaminated. Stanford’s research on Agentic Context Engineering puts a number on the cost of unmanaged context: incremental, structured updates reduce drift by up to 86% compared to unmanaged approaches (arXiv, 2025). Fingerprinting is the smallest possible version of that discipline.
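Checking whether a replay is fair reduces to diffing fingerprints. A minimal sketch; the field names loosely mirror the record shape this post uses and are assumptions, not the harness's exact schema:

```typescript
// Environment fingerprint fields; names are illustrative assumptions.
type Fingerprint = {
  version: string;
  gitBranch: string;
  model: string;
  claudeMdHash: string;
};

// Return the list of fields that drifted between baseline and replay.
// Empty list: the replay is fair. Non-empty: read the verdict against
// the drift, not on its own.
function fingerprintDrift(baseline: Fingerprint, replay: Fingerprint): string[] {
  return (Object.keys(baseline) as (keyof Fingerprint)[]).filter(
    (k) => baseline[k] !== replay[k],
  );
}
```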

A normalized record from agent-backtest, against a real session in this repo, looks like this:

{
  "schemaVersion": 1,
  "id": "site-external-links",
  "source": {
    "tool": "claude-code",
    "version": "2.1.114",
    "originalSessionId": "5a2f208f-...",
    "cwd": "/path/to/your-repo",
    "gitBranch": "master",
    "model": "claude-opus-4-7"
  },
  "config": { "claudeMdHash": "a3f1c0..." },
  "turns": [
    { "role": "user", "text": "make sure to open external links in new tab" },
    { "role": "assistant", "text": "", "toolCalls": [{ "name": "Grep", "inputHash": "288ba126" }] }
  ]
}

That shape is enough to feed any of the next three steps. It is also enough to reason about what the recording is, and is not, evidence for. Read on for the part most people get wrong.

Why do structural diffs miss real regressions?

Structural diffs flag drift. They do not classify it. Two replays of the same prompt can produce equally good plans with different wording; one can be slightly better, one slightly worse, and a token-overlap metric will not tell you which is which. A real regression can pass overlap thresholds. A false alarm can fail them.

The discipline is to treat the structural diff as the floor signal and the LLM judge as the ceiling. The floor is cheap: tool sequence equality, final-text Jaccard overlap, length delta. The ceiling is more expensive but bounded: a rubric with three or four axes, an LLM judge that scores them on a small integer scale, a regression flag with three buckets (“none”, “minor”, “major”). Run the floor first, run the ceiling on anything the floor flagged or anything that is going into a release.
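The floor metrics are small enough to sketch inline. A minimal version, assuming whitespace tokenization for the overlap metric:

```typescript
// Floor-signal metrics; a minimal sketch, not the agent-backtest internals.

// Token-overlap Jaccard over whitespace-split, lowercased tokens.
function jaccard(a: string, b: string): number {
  const ta = new Set(a.toLowerCase().split(/\s+/).filter(Boolean));
  const tb = new Set(b.toLowerCase().split(/\s+/).filter(Boolean));
  if (ta.size === 0 && tb.size === 0) return 1; // both empty: identical
  let inter = 0;
  for (const t of ta) if (tb.has(t)) inter++;
  return inter / (ta.size + tb.size - inter);
}

// Exact tool-sequence equality: same tools, same order.
function toolSeqEqual(a: string[], b: string[]): boolean {
  return a.length === b.length && a.every((name, i) => name === b[i]);
}
```

Cheap as these are, remember they only flag drift; a 0.24 overlap can be a faithful plan and a 0.9 overlap can hide a wrong decision, which is why the ceiling exists.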

Two failure modes worth naming. The first is single-shot scoring. A single judge run is volatile. Score N=3 minimum, take the median, record the variance. The variance itself is signal: if the judge cannot agree with itself, the output is genuinely ambiguous and probably needs a human read.
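The median-of-N rule is a few lines. A sketch that takes the median of N judge scores and records the max-minus-min spread as the simplest stand-in for the variance:

```typescript
// Aggregate N judge runs (N >= 3 per the rule above): median as the score,
// max-min spread as a simple disagreement proxy. A sketch, not the harness.
function aggregateScores(scores: number[]): { median: number; spread: number } {
  if (scores.length === 0) throw new Error("no scores to aggregate");
  const sorted = [...scores].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  const median =
    sorted.length % 2 === 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
  return { median, spread: sorted[sorted.length - 1] - sorted[0] };
}
```

A large spread is the "judge cannot agree with itself" signal: route that output to a human rather than trusting the median.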

The second is same-family judging. ICLR 2026’s preference-leakage paper showed that LLM judges from the same family as the generator overrate their relatives, sometimes substantially (Preference Leakage, 2025). Where you can, judge with a different family than the one you replayed. Where you cannot, record the judge model in every verdict so the bias is at least auditable.

Our finding: The most useful thing about a judge verdict is not the score; it is the rationale. A judge that returns correctness: 2 and the sentence “the replay only produced a plan and did not actually implement the changes” is doing two jobs at once: classifying the regression and telling you which axis to investigate. Treat the score as a triage signal and the rationale as the artifact.

This is also the section where most “agent eval” posts stop. A judge run is not the end of the loop. A judge run is the start of a human read.

A worked example: stripping CLAUDE.md and replaying a real session

I recorded a real Claude Code session in the repo this post lives in. The original prompt was small and concrete: “make sure to open external links in new tab.” The baseline run, opus-4-7 with the full CLAUDE.md (design context, no-emdash rule, Bun preference, code-search priorities), did the obvious right thing. It found the markdown pipeline in astro.config.mjs, configured rehype-external-links to add target="_blank" rel="noopener noreferrer", audited the four hardcoded anchor sites in components, ran bun run build, and committed. Twelve tool calls in the first user-turn arc. Final summary: “Build succeeded and external links in MDX posts now render with target=\"_blank\" rel=\"noopener noreferrer\".”

Then I ran two replays under different configurations.

Replay A: faithful config, sandboxed environment

opus-4-7, full CLAUDE.md, but the replay was constrained: read-only tools (Read, Grep, Glob) and permissionMode: "plan". Tool sequence: 43 calls, much heavier than baseline. Final output: a careful plan that correctly identified the same four anchor sites. Token-overlap Jaccard against the baseline final text: 0.24.

Verdict: correctness 2 of 5, style 2 of 5, toolUse 1 of 5. Regression flag: major. The judge’s rationale: “the replay only produced a plan and did not actually implement the changes or run the build, whereas the baseline edited files and verified the build succeeded.”

The verdict is correct in the strict sense. The replay did not complete the task. It is also misleading. The model and the system prompt were faithful to the baseline. The reason the replay scored as a regression was the sandbox: the replay ran in plan mode with read-only tools, so the agent could not have completed the edit even if it understood the task perfectly. A sandbox mismatch will be scored as a regression. This is the methodological footgun. The verdict has to be read alongside the replay environment, not on its own.

Replay B: real config drift

haiku-4-5, stripped system prompt (no design context, no Bun preference, no engineering rules), same read-only tools. Tool sequence: zero. The agent did not call a single tool. Final output: a confused refusal that asked the user to exit plan mode and offered to “configure Claude Code to open external links in new tabs” as a settings change. Token-overlap Jaccard against the baseline: 0.078.

Verdict: correctness 0 of 5, style 1 of 5, toolUse 0 of 5. Regression flag: major. Rationale: “the replay misinterprets the request as a plan-mode configuration issue, makes no tool calls, and fails to address the user’s coding task entirely.”

That one is a real regression. The model and the missing system context made the agent unable to even understand the prompt as a coding task.

Reading the deltas

The shape of the regression tells you which axis broke. Replay A has a high tool count and a faithful plan: classic sandbox mismatch. Replay B has zero tool calls and a misread prompt: classic context loss. The score was the same; the underlying story was completely different.

This is also why N=1 is enough to learn from. The numbers exist; the methodology is what lets you read them.

What does a good backtest harness look like?

agent-backtest is the v0.1 reference. Four commands (record, replay, diff, judge) plus a run orchestrator. Bun and TypeScript. Authentication piggybacks on the local Claude Code install via @anthropic-ai/claude-agent-sdk, so there is no ANTHROPIC_API_KEY to manage and no separate billing surface. If you are signed into Claude Code, the harness inherits the credentials.

Five design choices carry most of the load.

Hashed tool inputs in the record format, not raw inputs. Raw inputs change between runs (timestamps, snapshot ids, absolute paths). Hashes collapse that to a stable fingerprint that is still sensitive to a real change.

Replays default to permissionMode: "plan" and allowedTools: ["Read", "Grep", "Glob"]. This means a replay cannot mutate the working tree by accident. Widening the surface is a per-replay flag, not a default. If you forget the flag, the worst thing that can happen is a misleading verdict, not a corrupted repo.
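As a config object, the sandbox default is tiny. A sketch of the shape the post describes, intended to pass through to the agent SDK's query options; verify the option names against your installed @anthropic-ai/claude-agent-sdk version:

```typescript
// Sandboxed replay defaults; a sketch of the shape this post describes,
// not a verified SDK contract.
const replayDefaults = {
  permissionMode: "plan" as const, // replay cannot mutate the working tree
  allowedTools: ["Read", "Grep", "Glob"], // read-only tool surface
};

// Widening the surface is an explicit per-replay override, never a default.
// "acceptEdits" here is an illustrative example of a wider permission mode.
const widened = {
  ...replayDefaults,
  permissionMode: "acceptEdits" as const,
};
```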

Diff truncates the baseline to the first user-turn arc. Baselines often contain follow-up prompts that the replay is not testing. Comparing a single-turn replay against a full multi-turn baseline produces inflated tool counts and meaningless overlap deltas. Slice the baseline to match the replay’s scope.
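The slice itself is a short walk over the turns. A minimal sketch: keep everything up to, but excluding, the second user turn:

```typescript
type Turn = { role: "user" | "assistant"; text: string };

// Truncate a baseline transcript to its first user-turn arc, so a
// single-turn replay is compared against a matching scope. A sketch;
// the real harness may slice on different boundaries.
function firstArc(turns: Turn[]): Turn[] {
  let userSeen = 0;
  const out: Turn[] = [];
  for (const t of turns) {
    if (t.role === "user" && ++userSeen === 2) break; // stop at 2nd user turn
    out.push(t);
  }
  return out;
}
```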

Judge is a separate command emitting structured verdicts to its own directory. You can re-judge without re-replaying. Judge runs are the cheapest part of the loop; replay runs are the most expensive. Decoupling them means you can iterate on the rubric without reburning the API budget.

One small JSON summary per run, durable artifact on disk. Same shape as the application-backtesting harness from the prior post: regression thresholds in the runner, not the prompt. The agent (or the human) reads one JSON, opens the rationale only when a flag fires, and never tails a log.

A run invocation chains all four:

agent-backtest run \
  --input ~/.claude/projects/<encoded-cwd>/<session>.jsonl \
  --id site-external-links \
  --model claude-haiku-4-5-20251001 \
  --system-file prompts/stripped.md \
  --label haiku-stripped \
  --max-turns 10 \
  --cwd /path/to/your-repo \
  --allowed-tools Read,Grep,Glob

Everything the harness produces lives under recordings/, replays/, diffs/, verdicts/. Four directories, four shapes, no hidden state. The whole stack is the kind of thing a CLAUDE.md and a few skills can drive end-to-end if you want it on a schedule.

Where does backtesting fall down?

Backtesting can fail badly when used badly, because the verdict reads authoritative even when the methodology is wrong. Six anti-patterns are worth naming explicitly.

Sandbox-as-baseline. Comparing a sandboxed replay (read-only, plan mode) against a fully-executing baseline produces false-positive regressions. Either match the replay's tool surface to the baseline's, or call out the sandbox in the verdict so the score is read against the right denominator.

One-shot scoring. A single judge run is volatile. Score N=3 minimum, take the median, record the variance. If the judge cannot agree with itself, the output is ambiguous and probably needs a human read.

Same-family judge. Preference leakage means judges from the same family as the generator overrate the generator (Preference Leakage, 2025). Cross-family judging where you can. Record the judge model in every verdict regardless.

Replay drift from environment. If the agent reads files, the files change between baseline date and replay date. Pin the replay’s cwd to a git worktree at the baseline commit. Otherwise Read returns post-baseline files and the replay is contaminated by future state. The same logic applies to MCP layers that change between turns: pin the MCP version too.

Treating the score as ground truth. The verdict is a triage signal. Major regressions deserve a human read of the diff. An automatic alert that rolls back a deploy because the judge said major is exactly the failure mode you are pretending to prevent.

Backtesting as the only eval. It tells you whether a config change made this session worse. It cannot tell you what a new prompt category will do. Run it alongside, not instead of, your forward-looking eval suite. The 2026 LLM-regression-testing literature is consistent on this point: gold sets and replay loops are complements, not alternatives (RAG triad and gold sets in 2026, 2026).

The compounding effect of these anti-patterns is that a badly-shaped harness produces more confidence in worse signal, which is the worst possible direction for an evaluation system to drift. Build the harness with the limits on the same page as the verdict.

Frequently Asked Questions

How is this different from agent eval suites like LangSmith or Braintrust?

Eval suites run new prompts against benchmark expectations. Backtesting replays real production sessions against config variants. They are complementary, not substitutional. A 2026 observability survey put it well: reliable agents need unit evals on discrete steps, LLM-as-judge regression suites for subjective output quality, and continuous production trace sampling to catch real-world drift (Agent Observability 2026, 2026). Backtesting fills the second slot, against the sessions the first two have already touched.

Do I need an API key, or does it use my Claude Code subscription?

The reference CLI uses @anthropic-ai/claude-agent-sdk, which authenticates against the local Claude Code install. No ANTHROPIC_API_KEY is required if you are signed in to Claude Code already. The replay and judge calls run against the same credentials your editor uses, on the same plan.

Can I replay sessions that wrote files?

Yes, but pin the replay’s cwd to a git worktree at the baseline commit. Otherwise Read returns post-baseline files and the replay is contaminated by future state. The simplest version is a one-line script that creates a worktree at the baseline commit, runs the replay, and discards the worktree afterwards.
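A sketch of that script as a small shell helper; the function name is hypothetical and not part of agent-backtest, and the replay command you pass in is up to you:

```shell
# Hypothetical helper (not part of agent-backtest): run a replay command
# inside a throwaway worktree pinned to the baseline commit, so Read never
# sees post-baseline file state.
replay_pinned() {
  local commit="$1"; shift
  local wt
  wt="$(mktemp -d)/wt"                           # path must not exist yet
  git worktree add --detach "$wt" "$commit" >/dev/null 2>&1
  ( cd "$wt" && "$@" )                           # run the replay in the pin
  local status=$?
  git worktree remove --force "$wt" >/dev/null 2>&1
  return $status
}
# usage (illustrative): replay_pinned <baseline-commit> agent-backtest replay ...
```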

How many sessions should I replay?

Start with one. The post’s worked example is N=1 and was already enough to show a major regression on a configuration change, plus a methodological footgun nobody had warned me about. Scale up the catalog as the replays earn their keep. A useful rule: every time a CLAUDE.md change introduces a regression in production, capture the offending session, add it to the replay set, and never let that class of regression land again.

What should the rubric measure?

For most agent work, three axes are enough: correctness (did the replay address the prompt as well as the baseline?), style (did the tone, structure, and terseness drift?), and tool use (did the replay choose tools sensibly given the same evidence?). Add a fourth axis only if you have a domain reason. More axes do not produce more signal; they produce more variance.
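Turning an axis delta into the three-bucket regression flag can be as small as this; the thresholds below are illustrative assumptions, not agent-backtest's:

```typescript
// Map a baseline-minus-replay correctness delta (on the 0-5 scale) to the
// three-bucket regression flag. Thresholds are illustrative assumptions.
function regressionBucket(correctnessDelta: number): "none" | "minor" | "major" {
  if (correctnessDelta <= 0) return "none"; // replay matched or beat baseline
  return correctnessDelta >= 2 ? "major" : "minor";
}
```

Under these assumed thresholds, a replay that drops correctness from 5 to 0, like Replay B in the worked example, lands squarely in "major".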

The model upgrade we don’t see coming

The model upgrade we do not see coming is the one that has already shipped. The next time a CLAUDE.md rule gets a small edit, a skill installs, a model swaps, an MCP server upgrades: agent behavior changes. Without backtesting, the first signal is a confused user. With backtesting, the signal is a verdict file in a directory next to the diff.

Quant finance figured this out by building the discipline of running new strategies against old market histories before risking capital. Agent teams have the analogous infrastructure already on disk: the JSONL transcripts Claude Code writes, the judge models we already use for evals, the MCP layer that already records what it returned. We just have not named the practice or pointed the tools at each other.

This post is the naming. agent-backtest is the working reference. The interesting work from here is everyone else’s catalog of replays growing faster than ours.
