Definition
Agent session replay is the ability to step through a captured agent session — prompts, responses, tool-use events, tool results, and metadata — in its original sequence so an engineer or another agent can reconstruct what happened in a specific run.
For non-deterministic systems like AI agents, replay isn’t a convenience. It’s the only honest way to answer the question that matters: what did the agent actually do?
A debugger inspects a live program. Replay inspects the record of a past run. When the terminal has scrolled, the context window has closed, and the agent won’t reproduce the same behavior twice, replay is what survives.
Agent session replay does not make an AI run deterministic. It makes the captured record inspectable.
What agent session replay reveals beyond raw logs
| Sequence | Captured turns in the order they happened, so the agent-facing flow can be inspected as a timeline. |
|---|---|
| Stable inspection | The recorded sequence produces the same view every time; re-execution is a separate action. |
| Checkpoints | Specific points in the captured sequence can be inspected, exported, or used as comparison anchors. |
Why agent session replay matters for debugging non-deterministic systems
Agents are non-deterministic by default. The same prompt produces different tool calls on different runs. A failing session and a succeeding session, on paper, can look identical. Without a record you can play back, debugging is guesswork.
Logs can tell you where the agent ended up. Replay shows you how it got there.
This matters most at the exact moment when it’s hardest to reconstruct by hand: when a session has crashed, timed out, or produced an expensive loop. The relevant information has already scrolled off the terminal. The context window is closed. The only surviving record is the one the capture layer wrote before the failure.
How agent session replay differs from logs and distributed tracing
Logs, tracing, and replay are different artifacts for different jobs.
| Layer | Captures | Answers |
|---|---|---|
| Logs | Discrete events with timestamps | "Did X happen?" |
| Tracing | Request spans with parent-child links | "Where was the time spent?" |
| Replay | The captured behavioral record as a sequence | "What did the agent do, and what context led there?" |
Logs and traces are excellent for events, timing, and system boundaries. Replay is different because it preserves the agent-facing sequence in a form an engineer can step through — which is what matters for non-deterministic agents whose runs don’t reproduce on demand.
Agent session replay vs chat history
Chat history is usually a user-facing transcript. It may omit tool results, metadata, retries, timing, provider payload details, or structured identifiers.
Agent session replay is an engineering artifact. It preserves the captured sequence as structured data so engineers can inspect, search, compare, export, and audit what happened during a run. Chat history helps a user remember a conversation. Replay helps a team investigate behavior.
What an agent session replay looks like step by step
Session 9be94ee3... ├── [0:00] user: "fix the Kafka topic bug" ├── [0:02] assistant: identifies likely error pattern ├── [0:04] tool: Read src/kafka/producer.ts ├── [0:07] tool: Bash "kafka-topics --list" ├── [0:09] assistant: topic missing — will create ├── [0:12] tool: Bash "kafka-topics --create ..." ├── [0:14] tool_result: topic created ├── [0:15] tool: Bash "npm test" ├── [0:45] tool_result: 5 passed, 0 failed └── [0:47] assistant: resolved
Replay systems can let you step forward or backward through the captured sequence, and content-addressed systems can support jumping to specific records by hash. From there you can export the transcript for analysis, feed it to a skill extractor, or compare it against another session — a successful run, a different model, a different prompt — to see exactly where behavior diverged.
Replay is not re-execution
Agent session replay shows the run that actually happened. It does not automatically re-run prompts, repeat tool calls, or reproduce the same behavior from the model.
That distinction matters because AI agents are non-deterministic and tool calls can have side effects. A replay system should surface the original prompt, response, tool call, and tool result by default. Re-running from a checkpoint should be an explicit action with safeguards, especially when the original tools modified files, called APIs, sent messages, or touched production data.
What agent session replay requires
Replay depends on capture. A system cannot replay an agent run if it only has aggregate metrics, partial logs, or a terminal scrollback buffer.
Useful replay requires a durable record of the agent-facing sequence: prompts, model responses, tool-use events, tool results, relevant metadata, and the context included in each model request. The more complete and structured the captured record is, the more useful replay becomes for debugging, audit, comparison, and skill extraction.
What agent session replay does not prove
Replay shows the observable sequence of a run. It can show what the user sent, what the model returned, what tools were called, what results came back, and what metadata was recorded.
It does not prove the model’s internal reasoning. Engineers can infer likely causes from the captured context, but replay should not be treated as access to hidden model thought. The value of replay is not mind-reading; it is preserving the evidence trail of what actually happened.
What problems agent session replay solves for engineering teams
Specific jobs replay does better than any alternative:
- Debugging non-deterministic failures. Compare a failing run to a passing one; the divergence shows up in prompt text, response, or tool-use sequence.
- Supporting audit and incident response. Compliance, security, and incident reviews all benefit from “what did the agent do” as a primary artifact rather than a reconstructed narrative — though satisfying any specific audit standard depends on retention, access controls, and process around the archive.
- Comparing behavior across runs. Compare tool-call sequences, compare prompt and context differences, and — where re-execution is supported — compare behavior across models from the same captured starting point.
- Onboarding. A new engineer replays the team’s best sessions to see how the agents actually work, not how someone wrote them up.
- Skill extraction. The captured archive can be used to turn recurring patterns into reusable skill files.
How Paper Compute implements agent session replay
paper CLI uses tapes, Paper Compute’s open-source telemetry/capture layer, to route Claude Code traffic through a proxy and write captured session records — user prompts, assistant responses, tool-use events, tool results, and metadata — to an archive. In tapes, records are content-addressed, which gives the archive useful integrity and traversal properties: records can be looked up by hash, identical records resolve to the same identifier by construction, and any modification produces a different identifier and surfaces during a read. Content-addressing is one layer of integrity, not a finished tamper-evidence guarantee; end-to-end audit assurances still depend on access controls, retention, and chain of custody around the archive.
Replay is the read interface over that captured record. tapes deck is the interactive TUI; tapes search and tapes checkout are the programmatic consumers. If you’re running agents against production systems, replay is the artifact your incident response depends on. That’s the difference between “we think the agent did X” and “the agent did exactly this, here’s the hash.”
The point is not to guess better after something goes wrong. The point is to keep the record that makes guessing unnecessary.