Company · February 19, 2026

The Missing Primitive in Agent Infrastructure

bdougie, CEO

When agents are toys, observability is optional. When agents are experiments, failure is tolerable. When agents are production systems executing real work, neither is acceptable.

In 2026, visibility matters. Logs. Traces. Metrics. Evaluation dashboards. Cost tracking. All of it is useful, but none of it sufficient. The next phase of agent infrastructure won’t be defined by visibility. It will be defined by durability.

Before agents can heal, they must be replayable. And before they are replayable, they must have memory.

The Gap That Keeps Showing Up

There’s a pattern across production deployments. Teams have logs. They have metrics. They’re collecting traces. But when an agent makes a decision they can’t explain, when a workflow fails at step 47 of 100, the ability to understand why vanishes with the session.

The infrastructure exists. OpenTelemetry is the default. Langfuse, Phoenix, Helicone, LangSmith, and AgentOps provide dashboards and cost tracking. But there’s a primitive most tools don’t address: durable, full-fidelity session capture with deterministic replay and checkpointing.

Without it, every agent run is ephemeral. You can see that something happened. You can’t rewind the tape to understand why—or resume from the exact moment things broke.

tapes.dev is a transparent proxy built to close that gap. Zero code changes. Full session capture. Deterministic replay. And it’s the foundation for something bigger: the infrastructure layer that makes agents not just observable, but durable, orchestratable, and eventually self-healing.

What the Ecosystem Has Built (and Where It Stops)

By early 2026, agent observability has consolidated around a solid foundation. OpenTelemetry traces non-deterministic loops and tool choices. Microsoft shipped multi-agent extensions. Splunk and Dynatrace went GA on agent monitoring. The leading platforms do what they were designed to do well: cost tracking per step, prompt and response logging, evaluation frameworks, and real-time alerting.

Nobody ships agents to production without these capabilities. They’re essential.

But they share a limitation that becomes visible under pressure. Most produce ephemeral or summarized data. Once a session ends, you cannot deterministically replay it. You cannot resume from a specific decision point. You can review what happened at a summary level, but you can’t step through the session examining each choice with full context.

The sessions disappear. And that creates a gap.

The Problem with Agent “Visibility”

It’s not observability if all you can see is which tools were called, which prompts were sent, which responses came back, and how much the run cost. All of that is helpful, but when something goes wrong, it won’t help you fix it.

Traces Without Replay

Collecting traces is table stakes. The value comes from what you do with them. Right now, most teams aren’t doing anything meaningful with the data they collect.

Non-deterministic paths make debugging guesswork. When an agent fails, you want the exact sequence of tool calls, API responses, and reasoning steps. Most tools give you summaries. They don’t give you the ability to step through each decision point like a debugger, examining the full context of what the agent “knew” at that moment.

There’s no rewind button. Your agent made a bad decision at step 47 of a 100-step workflow. You can’t resume from step 46. You start over, re-prompt, and hope the LLM takes the same path. It won’t.

Historical sessions can’t be queried as first-class data. As agents become longer-running and more autonomous, the ability to query what an agent decided three days ago—in the context of what it knew then—isn’t a nice-to-have. It’s a requirement. Without a durable capture layer, that data doesn’t exist to query.

Production agents need both: deterministic replay and durable, queryable history. Distributed systems solved this decades ago. Agents are about to rediscover why it mattered.

The Missing Primitive: Durable Session Capture

What’s missing isn’t another dashboard. It’s a durable execution layer. A system that captures agent sessions in full fidelity—not summaries or extracted metadata. The actual sequence of decisions, requests, and responses. Durably stored, content-addressed, deterministically replayable.

  1. Durable session capture (tapes). Full-fidelity history, deterministic replay, checkpoint/resume. Debugging becomes inspection instead of reconstruction. Audit trails become real.
  2. Sandboxed orchestration. Isolated agent runtimes, lifecycle control, replay inside the sandbox. Removes the “secure vs. observable” tradeoff. Failures are contained and recoverable.
  3. Operator system. Policies, reproducible environments, guardrails, governance. Turns agent execution into a controllable system with verifiable state transitions.
  4. Self-healing workloads. Automated recovery and iterative improvement. Systems can safely resume, rollback, and learn from durable history.

“If you can’t rewind a system to an exact state and step through it, you don’t truly understand it. And if you don’t understand it, you cannot make it reliable.”

This is the primitive we’ve built first.

Step One: Make Agent Sessions Durable

What if observability worked inside the sandbox with zero code changes?

tapes.dev is a transparent proxy. Point your agent at localhost:8080 (or the sandbox hostname), and it captures every raw request and response between your agent and model providers. It stores them durably using content-addressable SQLite by default. Then it gives you tools to explore, replay, and checkpoint.
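
The content-addressed storage idea can be sketched in a few lines of Python. This is an illustration of the concept only; the table layout and hashing details here are invented, not tapes’ actual schema:

```python
import hashlib
import json
import sqlite3

# Illustrative schema only -- not tapes' actual storage format.
db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE exchanges (hash TEXT PRIMARY KEY, request TEXT, response TEXT)"
)

def capture(request: dict, response: dict) -> str:
    """Store a request/response pair keyed by a hash of its content."""
    payload = json.dumps({"request": request, "response": response}, sort_keys=True)
    digest = hashlib.sha256(payload.encode()).hexdigest()
    # Content addressing makes writes idempotent: an identical exchange
    # hashes to the same key, so re-capturing never duplicates rows.
    db.execute(
        "INSERT OR IGNORE INTO exchanges VALUES (?, ?, ?)",
        (digest, json.dumps(request), json.dumps(response)),
    )
    return digest

h = capture(
    {"model": "gpt-4o", "messages": [{"role": "user", "content": "hi"}]},
    {"role": "assistant", "content": "hello"},
)
row = db.execute("SELECT response FROM exchanges WHERE hash = ?", (h,)).fetchone()
```

Because the key is derived from the content itself, the same exchange always lands at the same address, which is what makes sessions queryable and replayable after the fact.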

Searchable archives. Every session is queryable. Find the moment an agent called a specific tool, returned an error, or made a decision you need to understand.

Deterministic replay. Run a session again, step by step, exactly as it happened. Not a simulation. The actual sequence of calls and responses.
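
To make that concrete, here is a toy model of deterministic replay: recorded responses are served back in order, and any request that diverges from the tape is rejected rather than silently re-sent to a model. The class and field names are hypothetical; tapes’ real replay sits at the proxy layer:

```python
# Toy model of deterministic replay: instead of calling the provider,
# a replay session serves the recorded response for each request, and
# refuses to continue if the run diverges from what was captured.
# Hypothetical structure -- not tapes' actual implementation.

class ReplaySession:
    def __init__(self, tape):
        self.tape = tape   # list of (request, response) pairs, in order
        self.cursor = 0

    def send(self, request):
        """Serve the recorded response for the next step, or flag divergence."""
        recorded_request, response = self.tape[self.cursor]
        if request != recorded_request:
            raise RuntimeError(f"divergence at step {self.cursor}")
        self.cursor += 1
        return response

tape = [
    ("lookup refund policy", "30-day window"),
    ("draft reply", "You are within the refund window."),
]
session = ReplaySession(tape)
assert session.send("lookup refund policy") == "30-day window"
```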

Checkpoint and resume. tapes checkout <hash> resumes from an exact state. This works because tapes uses content-addressable hashing—similar to how git works. Each conversation turn has its own hash based on the model, the content, and the previous hash in the conversation. Agent failed at step 47? Rewind to step 46, adjust, continue. You can branch conversations, retry from any point, and fork sessions at any decision point.
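
The hash chain described above behaves much like a git commit graph and can be sketched with nothing but the standard library. The exact inputs to tapes’ hash may differ; this shows only the structure:

```python
import hashlib

# Sketch of the hash chain described above: each turn's hash depends on
# the model, the content, and the previous turn's hash, so two branches
# share hashes exactly up to the point where they fork.
# Illustrative only -- the real inputs to tapes' hash may differ.

def turn_hash(model: str, content: str, prev_hash: str) -> str:
    data = f"{model}\x00{content}\x00{prev_hash}".encode()
    return hashlib.sha256(data).hexdigest()

GENESIS = "0" * 64
h1 = turn_hash("gpt-4o", "read ticket #812", GENESIS)
h2 = turn_hash("gpt-4o", "query knowledge base", h1)

# Forking at h1 ("checkout") starts a new branch with an identical prefix.
h2_fork = turn_hash("gpt-4o", "query updated knowledge base", h1)
assert h2 != h2_fork  # branches diverge only after the fork point
```

Since a turn’s hash commits to its entire ancestry, resuming from `h1` is unambiguous: any two sessions that share that hash share the exact same history up to it.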

Deck TUI and web explorer. Navigate session history with interfaces designed for humans, not just log parsers.

Think of it as long-term care for your agent sessions. The primitive that makes audit trails, reflection loops, and verifiable execution possible.

What This Looks Like When Something Breaks

Say you’re running an agent that processes customer support tickets. It reads the ticket, queries an internal knowledge base, drafts a response, checks it against policy, and sends it. Fifty steps, give or take.

On Tuesday morning, a customer complains that the agent gave incorrect refund information. Without durable session capture, you dig through logs, try to reconstruct what the agent “saw” at each step, and maybe find the problem. Maybe not. The session data has already been summarized or discarded.

With tapes running as a sidecar, you search for that session by timestamp or content. You step through the exact sequence. At step 31, the knowledge base returned outdated policy information, and the agent built its response on top of that. You checkpoint at step 30, update the knowledge base, and replay from that point to verify the fix produces a correct response.

No re-prompting. No guessing. No hoping the LLM takes the same path.

“That’s the difference between observability that tells you what happened and infrastructure that lets you understand why—and recover from the exact point of failure.”

Why Durability Changes the Physics

You cannot orchestrate what you cannot replay. You cannot secure what you cannot audit. You cannot build self-healing systems on ephemeral execution.

The industry is racing toward orchestration frameworks and multi-agent systems. But orchestration without durability is choreography on sand. When something breaks, the state disappears. Durability changes the physics.

Once sessions are replayable, isolation becomes possible without sacrificing insight. And that leads to the next layer.

Step Two: Sandboxed Orchestration

Production agents live in tension. Secure or observable. Pick one. Highly isolated sandboxes often obscure execution details. Rich telemetry often implies broader permissions.

That tradeoff is artificial—but only if execution is durable inside the sandbox.

The next layer we’re building is sandboxed orchestration: isolated agent environments where every decision is recorded, replayable, and auditable. Security and observability stop competing and start reinforcing each other.

Durable capture inside isolated runtimes removes the blind spots that make production agents fragile. Orchestration becomes more than scheduling—it becomes controlled execution.

Step 1 makes agent execution replayable. Step 2 makes agent execution schedulable and isolatable.

Think about containers before Kubernetes. You could package software. You could run it. You could inspect logs. But without orchestration: no isolation guarantees, no scaling control, no workload management, no structured lifecycle.

Agents today are like pre-orchestration containers. They run. They produce output. They crash. You manually intervene.

Sandboxed orchestration for agents is what Kubernetes was for containers:

  1. Defined environments. Each agent runs in a known, reproducible context.
  2. Controlled lifecycle. Start, stop, pause, resume—with durable state at every transition.
  3. Managed isolation. Failures are contained. One agent crashing doesn’t take down the fleet.
  4. Observable execution inside boundaries. Full visibility without sacrificing security.
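
As a sketch of what a controlled lifecycle means in practice, here is a minimal state machine that journals every transition, so the agent’s state at any point can be reconstructed. It is purely illustrative, not an API of tapes or any orchestrator:

```python
from enum import Enum, auto

# Minimal lifecycle state machine: every transition is journaled, so
# the state at any point in time is reconstructable from the log.
# Purely illustrative -- not an API of tapes or any orchestrator.

class State(Enum):
    CREATED = auto()
    RUNNING = auto()
    PAUSED = auto()
    STOPPED = auto()

ALLOWED = {
    State.CREATED: {State.RUNNING},
    State.RUNNING: {State.PAUSED, State.STOPPED},
    State.PAUSED: {State.RUNNING, State.STOPPED},
    State.STOPPED: set(),
}

class AgentLifecycle:
    def __init__(self):
        self.state = State.CREATED
        self.journal = [State.CREATED]   # durable transition log

    def transition(self, target: State):
        if target not in ALLOWED[self.state]:
            raise ValueError(f"illegal transition {self.state} -> {target}")
        self.state = target
        self.journal.append(target)      # checkpoint at every transition

agent = AgentLifecycle()
agent.transition(State.RUNNING)
agent.transition(State.PAUSED)
agent.transition(State.RUNNING)
```

Rejecting illegal transitions is what turns execution into verifiable state transitions: the journal can only ever contain paths the policy allows.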

Without orchestration, durable capture is forensic. With orchestration, it becomes operational.

The Trajectory

At Paper Compute, we started with observability because it’s the prerequisite for everything else in the agent infrastructure stack. You can’t debug what you can’t see. You can’t optimize what you can’t measure. And you can’t heal what you don’t understand.

tapes is step one. It gives you the durable, auditable record of every agent session. But the reason we built it isn’t just to solve today’s debugging problem. It’s because everything we want to build next requires full-fidelity session capture as a foundation.

The trajectory: observable agents → orchestratable agents → self-healing agents. Each step depends on the one before it. You can’t orchestrate what you can’t observe. You can’t build self-healing systems without durable records of what went wrong, what worked, and why.

This is what we mean by “distributed systems primitives for AI agents.” The same patterns that made distributed systems reliable—durable execution, deterministic replay, verifiable state transitions—are exactly what agents need to move from demos to production.

Self-healing is not a feature toggle. It’s the result of layered infrastructure:

  1. Durable memory. Full-fidelity capture of every decision and interaction.
  2. Isolated execution. Sandboxed runtimes where failures are contained.
  3. Deterministic replay. The ability to rewind and step through any session.
  4. Controlled state transitions. Verifiable, auditable changes at every step.

When an agent can rewind, compare, and resume safely, recovery stops being manual. When systems can query their own execution history, improvement becomes possible.

But none of that exists without the first primitive.

Why This Matters Now

Three forces are converging.

Long-running agents are generating massive histories that logs cannot meaningfully represent. Regulated domains are demanding verifiable provenance, not screenshots of dashboards. Self-improving systems require full-fidelity history to learn from past decisions.

We are moving from “agents as tools” to “agents as infrastructure.” Infrastructure requires primitives. Durability is the first one.

Jason Warner recently argued that third-party agents face structural headwinds because everything collapses into the model—and first-party agents from model providers will always outperform independent builders. If that’s true, then the ability to observe, replay, audit, and iteratively improve your agents inside secure runtimes becomes a survival skill. For teams building outside model providers, full-fidelity observability and durable infrastructure might be among the few differentiators you actually control.

Start With the Primitive

Check out our Quick Start and see the difference between watching a system and understanding it.

We are building distributed systems primitives for AI agents—beginning with memory and moving towards self-healing. The primitive is here. Now we build on top of it.
