Thoughts · February 23, 2026

Show Your Work

bdougie, CEO

I used to joke with my bootcamp mentees: “We aren’t shipping rockets. Push the code and see what happens.” It was good advice, and it kept people moving. For a long time, it was true.

It’s not true anymore. We are shipping rockets now.

When an AI Model Shows Up in a Pentagon Meeting

This past weekend, the Defense Secretary summoned Anthropic CEO Dario Amodei to the Pentagon to answer for how Claude is being used militarily. According to reporting from TechCrunch, Claude was allegedly used during the January 3 special operations raid that resulted in the capture of a Venezuelan president. The DoD is now threatening to designate Anthropic a “supply chain risk,” a label typically reserved for foreign adversaries, not because Anthropic did something wrong, but because they asked questions about how their model was being used.

Anthropic is asking the right question. The problem is that the infrastructure to answer it doesn’t exist yet. If the agent itself had telemetry baked in (durable logs of what it did, what tools it called, what decisions it made), the question answers itself. That’s the same gap your security team is going to hit after your agent touches a production database. The scale is different. The problem is identical.

The Same Story, Different Scale

This same story is playing out everywhere at different scales.

Amazon banned Claude Code for internal developers. Not because the tool is bad. Anthropic is one of Amazon’s largest investments. They banned it because they couldn’t see what it was doing inside their systems. Earlier this month, Anthropic blocked third-party tools like OpenCode and OpenClaw from using subscription OAuth tokens after developers discovered they could route agentic workloads through flat-rate plans in ways that bypassed rate limits entirely. Anthropic’s own engineer said the problem was that third-party tools “generate unusual traffic patterns without any of the usual telemetry.” OpenCode, which had over 107,000 GitHub stars, had to push a commit removing Claude support entirely, citing “Anthropic legal requests.”

Every single one of these bans has the same root cause. Not bad models. Not bad intentions. There’s no visibility. When you can’t see what’s happening, you shut the door.

Agents Are Built to Disappear. Compliance Requires the Opposite.

Most agent systems right now are built to disappear. A session starts, tokens flow, something happens, the session ends. The context window closes and the record evaporates. There’s no durable log of what tool was called, what decision was made, what code was written, what file was touched. It’s stateless by design.
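What a durable record looks like in practice: every tool call gets appended to an immutable log before and after execution, so the session can be reconstructed later. A minimal sketch, not any particular framework’s API; the function names and the JSONL path are illustrative:

```python
import hashlib
import json
import time
from pathlib import Path

LOG = Path("agent_session.jsonl")  # append-only session log

def log_event(event: dict) -> str:
    """Append an event to the session log; return its content hash."""
    event = {**event, "ts": time.time()}
    line = json.dumps(event, sort_keys=True)
    with LOG.open("a") as f:
        f.write(line + "\n")
    return hashlib.sha256(line.encode()).hexdigest()

def run_tool(name: str, args: dict, tool_fn):
    """Record a tool call and its result around the actual execution."""
    log_event({"type": "tool_call", "tool": name, "args": args})
    result = tool_fn(**args)
    log_event({"type": "tool_result", "tool": name, "result": repr(result)})
    return result

# A trivial "tool" run through the logged path.
out = run_tool("add", {"a": 2, "b": 3}, lambda a, b: a + b)
```

The point of hashing each line is that the log becomes tamper-evident: an auditor can verify entries were not rewritten after the fact.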

That works fine when you’re experimenting. It stops working the moment anyone asks you to prove what happened.

SOC2 auditors don’t ask if your AI is smart. FedRAMP reviewers don’t care how your model benchmarks on HumanEval. They ask: can you reconstruct what happened? Can you demonstrate who had access to what? Can you replay the incident? Can you show the change control?

Right now, for most teams running agents, the honest answer is no. SOC2 controls around monitoring, change management, and access logging don’t disappear because the actor is an agent.

Compliance teams know it. Security teams know it. That’s why the bans keep coming. Bans are what organizations reach for when they don’t have operator infrastructure. They’re not a solution. They’re a delay.

Can You Show Your Work?

You might say: okay, but most of us aren’t shipping AI into military raids. Fair. But the trajectory is clear. Agents are already touching production systems. They’re executing shell commands, accessing secrets, modifying code, triggering deployments, interacting with regulated data. The “toy” phase is ending. The “critical system” phase is here, whether teams are ready for it or not.

And readiness, in this context, means one thing: can you show your work?

Observable Infrastructure Is What Lets You Keep Moving

This is what we’re building toward at Paper Compute. tapes exists because agent infrastructure needs the same primitives every distributed system eventually learned it needed: logs, traces, replayability, durable state. It records every request and response between your agent and your model providers so you can inspect, search, and verify exactly what happened. Because you need to know what happened, whether or not anything went wrong.

When your security team asks what the agent did last Tuesday, you have the answer. When you want to replay a session to debug a failure, you can. When your provider changes their terms overnight and you need to migrate, your session history is yours, not theirs.

“The point isn’t fear. The point is that observable infrastructure is what lets you keep moving.”

It’s the difference between banning a tool because you can’t see it and actually being able to use it in the places that matter.

The Observable Era Is the One That Scales

Compliance doesn’t require weaker AI. It requires observable AI.

The developers and teams who will build on agents in regulated environments, in enterprise contexts, in anything that eventually touches a procurement process or a security review, won’t be the ones who found the best model. They’ll be the ones who built for accountability from the start.

The push-and-see era was good while it lasted. But the observable era is the one that scales.

If you’re building agentic workflows and need to close the accountability gap, tapes is open source and built for exactly this. It sits between your agents and your model providers as a proxy, recording every request and response into durable, content-addressed sessions. No code changes required. You get semantic search across your entire session history, conversation checkpointing, and the ability to branch or replay from any point. Read the full announcement or go straight to the repo.
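The “no code changes” claim works the way any recording proxy does: you repoint the SDK’s base URL at the proxy instead of the provider. A hedged sketch of the pattern, not tapes’s documented setup; the port is a placeholder and `my-agent` is a hypothetical command, so check the tapes repo for the actual configuration:

```shell
# Point an existing agent at a local recording proxy.
# ANTHROPIC_BASE_URL is the env var Anthropic's SDKs read for a custom endpoint;
# the address below is a placeholder for wherever the proxy is listening.
export ANTHROPIC_BASE_URL="http://localhost:8080"

# Run the agent unchanged; every request and response now flows through
# the proxy and lands in a durable session log.
my-agent --task "refactor the billing module"
```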

