← All concepts

Paper Compute Concept

Agent Session Replay: Reconstructing AI Agent Runs from Captured Session Records

Logs tell you an event fired. Replay lets you re-read the captured prompts, responses, tool-use events, and results that produced it — the only honest way to debug a system whose same input doesn't always produce the same output.

Published May 19, 2026
Replay Debugging Agents Telemetry

Definition

Agent session replay is the ability to step through a captured agent session — prompts, responses, tool-use events, tool results, and metadata — in its original sequence so an engineer or another agent can reconstruct what happened in a specific run.

For non-deterministic systems like AI agents, replay isn’t a convenience. It’s the only honest way to answer the question that matters: what did the agent actually do?

A debugger inspects a live program. Replay inspects the record of a past run. When the terminal has scrolled, the context window has closed, and the agent won’t reproduce the same behavior twice, replay is what survives.

Agent session replay does not make an AI run deterministic. It makes the captured record inspectable.

What agent session replay reveals beyond raw logs

What replay gives you
Sequence Captured turns in the order they happened, so the agent-facing flow can be inspected as a timeline.
Stable inspection The recorded sequence produces the same view every time; re-execution is a separate action.
Checkpoints Specific points in the captured sequence can be inspected, exported, or used as comparison anchors.

Why agent session replay matters for debugging non-deterministic systems

Agents are non-deterministic by default. The same prompt produces different tool calls on different runs. A failing session and a succeeding session, on paper, can look identical. Without a record you can play back, debugging is guesswork.

Logs can tell you where the agent ended up. Replay shows you how it got there.

This matters most at the exact moment when it’s hardest to reconstruct by hand: when a session has crashed, timed out, or produced an expensive loop. The relevant information has already scrolled off the terminal. The context window is closed. The only surviving record is the one the capture layer wrote before the failure.

How agent session replay differs from logs and distributed tracing

Logs, tracing, and replay are different artifacts for different jobs.

Three layers, three jobs
LayerCapturesAnswers
LogsDiscrete events with timestamps"Did X happen?"
TracingRequest spans with parent-child links"Where was the time spent?"
ReplayThe captured behavioral record as a sequence"What did the agent do, and what context led there?"

Logs and traces are excellent for events, timing, and system boundaries. Replay is different because it preserves the agent-facing sequence in a form an engineer can step through — which is what matters for non-deterministic agents whose runs don’t reproduce on demand.

Agent session replay vs chat history

Chat history is usually a user-facing transcript. It may omit tool results, metadata, retries, timing, provider payload details, or structured identifiers.

Agent session replay is an engineering artifact. It preserves the captured sequence as structured data so engineers can inspect, search, compare, export, and audit what happened during a run. Chat history helps a user remember a conversation. Replay helps a team investigate behavior.

What an agent session replay looks like step by step

Replaying a session
Session 9be94ee3...
├── [0:00] user: "fix the Kafka topic bug"
├── [0:02] assistant: identifies likely error pattern
├── [0:04] tool: Read src/kafka/producer.ts
├── [0:07] tool: Bash "kafka-topics --list"
├── [0:09] assistant: topic missing — will create
├── [0:12] tool: Bash "kafka-topics --create ..."
├── [0:14] tool_result: topic created
├── [0:15] tool: Bash "npm test"
├── [0:45] tool_result: 5 passed, 0 failed
└── [0:47] assistant: resolved

Replay systems can let you step forward or backward through the captured sequence, and content-addressed systems can support jumping to specific records by hash. From there you can export the transcript for analysis, feed it to a skill extractor, or compare it against another session — a successful run, a different model, a different prompt — to see exactly where behavior diverged.

Replay is not re-execution

Agent session replay shows the run that actually happened. It does not automatically re-run prompts, repeat tool calls, or reproduce the same behavior from the model.

That distinction matters because AI agents are non-deterministic and tool calls can have side effects. A replay system should surface the original prompt, response, tool call, and tool result by default. Re-running from a checkpoint should be an explicit action with safeguards, especially when the original tools modified files, called APIs, sent messages, or touched production data.

What agent session replay requires

Replay depends on capture. A system cannot replay an agent run if it only has aggregate metrics, partial logs, or a terminal scrollback buffer.

Useful replay requires a durable record of the agent-facing sequence: prompts, model responses, tool-use events, tool results, relevant metadata, and the context included in each model request. The more complete and structured the captured record is, the more useful replay becomes for debugging, audit, comparison, and skill extraction.

What agent session replay does not prove

Replay shows the observable sequence of a run. It can show what the user sent, what the model returned, what tools were called, what results came back, and what metadata was recorded.

It does not prove the model’s internal reasoning. Engineers can infer likely causes from the captured context, but replay should not be treated as access to hidden model thought. The value of replay is not mind-reading; it is preserving the evidence trail of what actually happened.

What problems agent session replay solves for engineering teams

Specific jobs replay does better than any alternative:

  • Debugging non-deterministic failures. Compare a failing run to a passing one; the divergence shows up in prompt text, response, or tool-use sequence.
  • Supporting audit and incident response. Compliance, security, and incident reviews all benefit from “what did the agent do” as a primary artifact rather than a reconstructed narrative — though satisfying any specific audit standard depends on retention, access controls, and process around the archive.
  • Comparing behavior across runs. Compare tool-call sequences, compare prompt and context differences, and — where re-execution is supported — compare behavior across models from the same captured starting point.
  • Onboarding. A new engineer replays the team’s best sessions to see how the agents actually work, not how someone wrote them up.
  • Skill extraction. The captured archive can be used to turn recurring patterns into reusable skill files.

How Paper Compute implements agent session replay

paper CLI uses tapes, Paper Compute’s open-source telemetry/capture layer, to route Claude Code traffic through a proxy and write captured session records — user prompts, assistant responses, tool-use events, tool results, and metadata — to an archive. In tapes, records are content-addressed, which gives the archive useful integrity and traversal properties: records can be looked up by hash, identical records resolve to the same identifier by construction, and any modification produces a different identifier and surfaces during a read. Content-addressing is one layer of integrity, not a finished tamper-evidence guarantee; end-to-end audit assurances still depend on access controls, retention, and chain of custody around the archive.

Replay is the read interface over that captured record. tapes deck is the interactive TUI; tapes search and tapes checkout are the programmatic consumers. If you’re running agents against production systems, replay is the artifact your incident response depends on. That’s the difference between “we think the agent did X” and “the agent did exactly this, here’s the hash.”

The point is not to guess better after something goes wrong. The point is to keep the record that makes guessing unnecessary.

Frequently asked questions

What is agent session replay? +
Agent session replay is the ability to step through a captured agent session — prompts, responses, tool-use events, tool results, and metadata — in its original sequence so an engineer or another agent can reconstruct what happened in a specific run and infer likely causes from the recorded context. Replay is durable inspection for non-deterministic systems: the recorded sequence can be inspected forward or backward, exported for analysis, or used as a comparison anchor against other runs.
Why do AI agents need session replay? +
AI agents are non-deterministic by default. The same prompt can produce different tool calls on different runs, and a failing session and a succeeding session can look identical on paper. Without a record an engineer can play back, debugging is guesswork. Replay matters most at the moment when reconstruction by hand is hardest: when a session has crashed, timed out, or produced an expensive loop, and the relevant context has already scrolled off the terminal.
How does agent session replay differ from logs and tracing? +
Logs, tracing, and replay are different artifacts for different jobs. Logs record discrete events with timestamps and answer 'did X happen?'. Tracing records request spans with parent-child links and answers 'where was the time spent?'. Replay reads the captured behavioral record as a sequence and answers 'what did the agent do, and what context led there?'. Logs and traces are excellent for events, timing, and system boundaries; replay is different because it preserves the agent-facing sequence in a form an engineer can step through, which matters for non-deterministic agents whose runs don't reproduce on demand.
Is agent session replay the same as time-travel debugging? +
Close, with one important difference. Time-travel debugging is a term borrowed from deterministic programs: the same inputs always produce the same outputs, so stepping backward and forward is an equivalent view. Agents are non-deterministic; replaying with the same inputs can produce different outputs. Session replay answers 'what did the agent actually do this run,' which time-travel debugging doesn't guarantee in a stochastic system.
Does agent session replay require a specific SDK? +
No. When capture happens at the proxy layer, replay reads from the network record — there is no SDK to adopt. Any tool that can route model traffic through a supported proxy path can become replayable, assuming the integration captures enough of the request, response, tool-use events, and metadata to reconstruct the sequence. The replay surface — inspection, programmatic export, comparison across runs — is the same regardless of which tool produced the record.
What happens when a replayed tool call has side effects? +
Replay should surface the original tool call and original tool result without re-executing the tool by default. Re-executing a tool call can repeat side effects — creating files, sending messages, modifying data, or making external API calls. Re-running from a checkpoint should be an explicit action with clear safeguards around side effects.

Where to go next