Every commercial aircraft carries a black box. Not because crashes are common — modern aviation is extraordinarily safe — but because when something goes wrong at 35,000 feet, you need to know exactly what happened. No guessing. No reconstructing from memory. The data is there, or it isn’t.
Agent systems don’t have black boxes. They should.
I was an hour deep into a Claude Code session, designing a Postgres integration for an agent memory system. Tool calls, schema decisions, trade-off discussions — all living in the context window. Then the session crashed. The context window closed and the entire conversation evaporated.
No write-ahead log. No checkpoint. No recovery path. An hour of design work, gone. Not because the model was bad, but because there was no durable record of what happened.
I got lucky. I’d been running tapes as a proxy between Claude and the API, and every request and response had been written to a local SQLite database. I pointed a fresh Claude session at the tapes DB, had it query the conversation history, and reconstructed the full context — design decisions, tool calls, exactly where I’d left off. The integration shipped to production that night. The full story is here.
But this isn’t a recovery anecdote. It’s a systems design problem that distributed computing solved thirty years ago.
Every production database writes a Write-Ahead Log (WAL) before committing a transaction. Every message queue checkpoints consumer offsets. Every event-sourced system can replay its entire history from the log. These aren’t optional features — they’re foundational primitives that exist because distributed systems assume failure.
Agent systems don’t assume failure. They assume the session will complete, the context window will hold, and the provider will stay up. When any of those assumptions break, you lose everything.
The primitives map directly:
| Primitive | What It Means for Agents |
|---|---|
| WAL | Every tool call, prompt, and response logged before execution. If the session dies, the log survives. |
| Checkpointing | Periodic snapshots of agent state — context, decisions made, progress markers. Resume from the last good checkpoint instead of starting over. |
| Replay | Reconstruct a session from its log, exactly as it happened. Debug failures by replaying the exact sequence of events. |
| Event Sourcing | The log is the source of truth. Derive any view of the session from the immutable event stream. |
None of this is novel computer science. It’s infrastructure engineering that agent systems haven’t adopted yet.
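The write-ahead discipline from the table is small enough to sketch. This is a hypothetical illustration, not the tapes schema: the table name, columns, and `log_then_run` helper are all made up. The only invariant that matters is that the log entry commits *before* the tool executes.

```python
import json
import sqlite3
import time

def open_wal(path=":memory:"):
    """Append-only log of agent events. WAL mode keeps writes durable on disk."""
    db = sqlite3.connect(path)
    db.execute("PRAGMA journal_mode=WAL")  # no effect for :memory:, durable on disk
    db.execute(
        "CREATE TABLE IF NOT EXISTS agent_log ("
        "seq INTEGER PRIMARY KEY AUTOINCREMENT, "
        "kind TEXT NOT NULL, "
        "payload TEXT NOT NULL, "
        "ts REAL NOT NULL)"
    )
    return db

def log_then_run(db, tool_name, args, tool_fn):
    """Write-ahead discipline: commit the intent BEFORE executing the tool."""
    db.execute(
        "INSERT INTO agent_log (kind, payload, ts) VALUES (?, ?, ?)",
        ("tool_call", json.dumps({"tool": tool_name, "args": args}), time.time()),
    )
    db.commit()  # durable even if tool_fn crashes on the next line
    result = tool_fn(**args)
    db.execute(
        "INSERT INTO agent_log (kind, payload, ts) VALUES (?, ?, ?)",
        ("tool_result", json.dumps({"tool": tool_name, "result": result}), time.time()),
    )
    db.commit()
    return result
```

If the process dies between the two commits, the log still records that the tool call was attempted, which is exactly the information a recovery path needs.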
A useful agent telemetry system needs to capture every interaction as a content-addressed node in a Merkle DAG:
```
node {
  hash        -- content-addressed ID (SHA-256 of role + content + parent)
  role        -- user | assistant | tool_call | tool_result
  content     -- the actual payload
  parent      -- pointer to the previous node
  timestamp   -- when it happened
  model       -- which model processed this
  tokens_in   -- input token count
  tokens_out  -- output token count
  latency_ms  -- round-trip time
}
```
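Deriving a node ID is a one-liner. The exact serialization tapes uses isn't specified here, so treat the field order, separator, and empty-string root convention as assumptions:

```python
import hashlib

def node_hash(role, content, parent):
    """SHA-256 over role + content + parent, per the schema above.
    The NUL separator and '' for a root node are assumptions."""
    data = "\x00".join([role, content, parent or ""])
    return hashlib.sha256(data.encode("utf-8")).hexdigest()

# The same inputs always produce the same hash; changing the content
# or the lineage produces a different one.
root = node_hash("user", "Design a Postgres integration", None)
child = node_hash("assistant", "Here is a schema sketch...", root)
```

Because the parent hash is part of the input, each node's ID commits to the entire history behind it, which is what makes the graph tamper-evident.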
Content-addressing matters for three reasons: identical content produces identical hashes, so duplicates deduplicate for free; any tampering with content or lineage changes the hash, making the history verifiable; and immutable parent pointers let you branch from any node without rewriting anything that came before.
This is what tapes stores in its local SQLite database. Every message — user prompts, assistant responses, tool calls, tool results — gets written as a node with a hash, a role, content, and a parent pointer. The full conversation history becomes a traversable, searchable, cryptographically verifiable graph.
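Reconstructing a session from such a store is just a walk up the parent pointers. A minimal sketch, assuming a hypothetical `nodes` table rather than the actual tapes schema:

```python
import sqlite3

def replay(db, tip_hash):
    """Follow parent pointers from the newest node back to the root,
    then reverse to get the conversation in its original order."""
    chain, cursor = [], tip_hash
    while cursor is not None:
        row = db.execute(
            "SELECT role, content, parent FROM nodes WHERE hash = ?", (cursor,)
        ).fetchone()
        if row is None:
            raise KeyError(f"missing node {cursor}")
        chain.append((row[0], row[1]))
        cursor = row[2]
    return list(reversed(chain))
```

This is the shape of the recovery described earlier: point a fresh session at the store, walk the chain from the last node, and the full context comes back in order.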
Recording telemetry is the starting point. The real value compounds as you move up the stack: replay turns failures into reproducible debugging sessions; checkpoints turn crashes into resumable pauses; and the accumulated history becomes the raw material for spotting failure patterns before they repeat.
This is the trajectory. Black box recorders don’t just explain crashes — they prevent them.
And the same telemetry that enables self-healing also closes the accountability gap. When your security team asks what the agent did, when compliance needs an audit trail, when your provider changes terms and you need to prove your session history is yours — the data is there. I wrote about this problem in detail in why agent visibility matters: organizations are banning agent tools not because the tools are bad, but because there’s no visibility into what they do. Telemetry solves both problems with the same infrastructure.
The simplest architecture that works: a transparent proxy between your agent and the model provider.
No code changes to your agent. No SDK integration. Point your agent at the proxy instead of the provider, and every interaction gets recorded into a durable, content-addressed session store. You get semantic search across your entire history, the ability to branch or replay from any point, and conversation checkpointing that survives crashes.
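The recording step of such a proxy is small enough to sketch. This is a hand-wavy illustration, not tapes itself: `send` stands in for whatever actually forwards the request to the provider, and the `exchanges` schema is invented for the example.

```python
import json
import sqlite3
import time

class RecordingProxy:
    """Sits between the agent and the provider: log the request,
    forward it, log the response. No changes to the agent itself."""

    def __init__(self, send, db_path=":memory:"):
        self.send = send  # callable that talks to the real provider
        self.db = sqlite3.connect(db_path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS exchanges ("
            "id INTEGER PRIMARY KEY AUTOINCREMENT, "
            "request TEXT, response TEXT, latency_ms REAL, ts REAL)"
        )

    def __call__(self, request):
        start = time.time()
        response = self.send(request)           # forward to the provider
        latency_ms = (time.time() - start) * 1000
        self.db.execute(
            "INSERT INTO exchanges (request, response, latency_ms, ts) "
            "VALUES (?, ?, ?, ?)",
            (json.dumps(request), json.dumps(response), latency_ms, start),
        )
        self.db.commit()                        # durable before returning
        return response
```

From the agent's point of view nothing changed; it just calls the proxy instead of the provider, and every exchange lands in the local store.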
This is what tapes does. It’s open source, runs locally, and works with any model provider.
Teams are starting to collect data on how their developers use AI — acceptance rates, model usage, friction points. That’s a good start. But it doesn’t go far enough.
You also need to collect data on how your agents build software. Every tool call, every decision branch, every failure and recovery. Not for dashboards. For durability.
The gap between “agent that works” and “agent you can trust in production” is telemetry. Black box recorders are infrastructure. The teams that build for durability from the start are the ones that’ll still be running agents when everyone else is dealing with bans, outages, and audit failures.
Give your agents a black box. Start with tapes.
We're launching soon. Subscribe for early access.