Agents Need Black Box Recorders

Every commercial aircraft carries a black box. Not because crashes are common — modern aviation is extraordinarily safe — but because when something goes wrong at 35,000 feet, you need to know exactly what happened. No guessing. No reconstructing from memory. The data is there, or it isn’t.

Agent systems don’t have black boxes. They should.

A Dead Session

I was an hour deep into a Claude Code session, designing a Postgres integration for an agent memory system. Tool calls, schema decisions, trade-off discussions — all living in the context window. Then the session crashed. The context window closed and the entire conversation evaporated.

No write-ahead log. No checkpoint. No recovery path. An hour of design work, gone. Not because the model was bad, but because there was no durable record of what happened.

I got lucky. I’d been running tapes as a proxy between Claude and the API, and every request and response had been written to a local SQLite database. I pointed a fresh Claude session at the tapes DB, had it query the conversation history, and reconstructed the full context — design decisions, tool calls, exactly where I’d left off. The integration shipped to production that night. The full story is here.

But this isn’t a recovery anecdote. It’s a systems design problem that distributed computing solved thirty years ago.

Distributed Systems Already Solved This

Every production database writes a Write-Ahead Log (WAL) before committing a transaction. Every message queue checkpoints consumer offsets. Every event-sourced system can replay its entire history from the log. These aren’t optional features — they’re foundational primitives that exist because distributed systems assume failure.

Agent systems don’t assume failure. They assume the session will complete, the context window will hold, and the provider will stay up. When any of those assumptions break, you lose everything.

The primitives map directly:

Primitive	What It Means for Agents
WAL	Every tool call, prompt, and response logged before execution. If the session dies, the log survives.
Checkpointing	Periodic snapshots of agent state — context, decisions made, progress markers. Resume from the last good checkpoint instead of starting over.
Replay	Reconstruct a session from its log, exactly as it happened. Debug failures by replaying the exact sequence of events.
Event Sourcing	The log is the source of truth. Derive any view of the session from the immutable event stream.

None of this is novel computer science. It’s infrastructure engineering that agent systems haven’t adopted yet.

What the Black Box Records

A useful agent telemetry system needs to capture every interaction as a content-addressed node in a Merkle DAG:

node {
  hash        -- content-addressed ID (SHA-256 of role + content + parent)
  role        -- user | assistant | tool_call | tool_result
  content     -- the actual payload
  parent      -- pointer to the previous node
  timestamp   -- when it happened
  model       -- which model processed this
  tokens_in   -- input token count
  tokens_out  -- output token count
  latency_ms  -- round-trip time
}

Content-addressing matters for three reasons:

1.Integrity — every node’s hash includes its parent’s hash. Tamper with any message and every downstream hash breaks. You get cryptographic proof that the log hasn’t been modified.
2.Deduplication — identical tool calls or system prompts hash to the same node. Storage stays compact even across thousands of sessions.
3.Branching — like git, you can fork a session from any node. Retry a failed tool call with different parameters. Explore alternative approaches from the same starting point.

This is what tapes stores in its local SQLite database. Every message — user prompts, assistant responses, tool calls, tool results — gets written as a node with a hash, a role, content, and a parent pointer. The full conversation history becomes a traversable, searchable, cryptographically verifiable graph.

From Recovery to Self-Healing

Recording telemetry is the starting point. The real value compounds as you move up the stack:

1.Recovery — replay a crashed session from the log. Reconstruct context, resume where you left off. This is the baseline.
2.Diagnosis — query across sessions to find failure patterns. Which tool calls fail most often? Where do agents get stuck in loops? What prompts produce hallucinations?
3.Prevention — detect anomalies in real-time. Token consumption spikes, repeated tool call failures, context window degradation. Flag these before the session fails.
4.Self-healing — agents that checkpoint themselves, detect when they’re drifting, and recover automatically. Not a human replaying a log — the agent doing it for itself.

This is the trajectory. Black box recorders don’t just explain crashes — they prevent them.

And the same telemetry that enables self-healing also closes the accountability gap. When your security team asks what the agent did, when compliance needs an audit trail, when your provider changes terms and you need to prove your session history is yours — the data is there. I wrote about this problem in detail in why agent visibility matters: organizations are banning agent tools not because the tools are bad, but because there’s no visibility into what they do. Telemetry solves both problems with the same infrastructure.

The Architecture

The simplest architecture that works: a transparent proxy between your agent and the model provider.

No code changes to your agent. No SDK integration. Point your agent at the proxy instead of the provider, and every interaction gets recorded into a durable, content-addressed session store. You get semantic search across your entire history, the ability to branch or replay from any point, and conversation checkpointing that survives crashes.

This is what tapes does. It’s open source, runs locally, and works with any model provider.

The Invisible Gap

Teams are starting to collect data on how their developers use AI — acceptance rates, model usage, friction points. That’s a good start. But it doesn’t go far enough.

You also need to collect data on how your agents build software. Every tool call, every decision branch, every failure and recovery. Not for dashboards. For durability.

The gap between “agent that works” and “agent you can trust in production” is telemetry. Black box recorders are infrastructure. The teams that build for durability from the start are the ones that’ll still be running agents when everyone else is dealing with bans, outages, and audit failures.

Give your agents a black box. Start with tapes.