Brian’s recent post, the true cost of Claude Code, looked at the subsidy under modern AI coding tools. I wanted to see what the subsidy was actually paying for inside my own workflows, so I pulled nineteen days of Claude Code session data captured with the paper CLI: 3,697 messages across two projects and three models (Opus, Sonnet, and Haiku). Because every request was captured at the HTTP boundary, I had the model used, the token counts, the cache reads and writes, and the way each message connected to the ones before it. That’s more than a bill or a single transcript can show.
Prompt caching saved 82% of my input cost. The savings are real, but they also make a particular kind of architectural waste cheap enough to ignore: prompts that grow without limit, context blocks attached to the wrong tasks, sub-agents that rewrite their parent’s cache prefix. Caching subsidizes that architecture by burying it inside an aggregate number that looks healthy. The only way to see what is being subsidized is to look at the workflow underneath: which prompts stayed stable, which kept mutating, where parents forked into children, and whether the reused context actually contributed to useful work.
Prompt caching is just the observable surface of a deeper system: AI workflows behave as stateful, branching systems, and current telemetry only sees them as isolated calls.
A few years ago, “an AI call” was a single request. One prompt in, one response out. Logs, traces, and billing dashboards were enough to reason about it.
That world is over. Modern agent workflows are different on five axes at once: they run for a long time, they keep state across turns, they branch into sub-tasks, they call other agents, and they rewrite their own prompts as they go. A single coding session can spawn sub-agents, read dozens of files, run tools, and end up with a prompt that bears little resemblance to where it started. The individual model call is the cheap, well-instrumented part. The workflow around it is what decides whether the work was efficient, durable, or wasteful. Traditional telemetry doesn’t see any of it.
Request logs show one request at a time. Distributed traces show service-to-service flows, not prompt-to-prompt lineage. Billing dashboards aggregate everything into a total. None of those answer the questions that matter once a workflow is the unit of analysis:
These questions need different primitives: prompt lineage, context inheritance, cache topology, prompt mutation, sub-agent divergence, and session structure tracked over time.
Cheap reuse is not the same as good architecture.
Economic efficiency asks if we reuse tokens cheaply. Architectural efficiency asks if this context should have been reused at all. Those are different questions, and prompt caching only answers the first.
A workflow can be economically efficient and architecturally wasteful. It can also be architecturally clean and economically expensive when the work itself doesn’t repeat much. Confusing the two has concrete operational costs:
None of those show up on a bill. They show up in the topology of sessions captured over time.
Across nineteen days, my Claude Code usage hit 102.6 million input tokens with a 93.0% cache hit rate: 95.5M cache reads priced at 10% of base input, 5.0M cache writes priced at 125% of base input, and 2.1M fresh tokens at full price. Here and throughout, “cache hit rate” means the share of input tokens served as cache reads, computed across the whole nineteen-day window. That works out to about 17.9M base-equivalent tokens against a nominal 102.6M, an 82% savings on input.
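The base-equivalent figure is plain arithmetic, and it is worth redoing by hand. A minimal sketch using the rounded token counts and price multipliers quoted above; the variable and function names are mine, and because the inputs are rounded to one decimal the last digit of each result can differ slightly from the figures in the text.

```python
# Token counts in millions, as quoted in the paragraph above (rounded).
CACHE_READ_M = 95.5
CACHE_WRITE_M = 5.0
FRESH_M = 2.1

# Price multipliers relative to base input, as quoted above.
READ_MULT = 0.10
WRITE_MULT = 1.25
FRESH_MULT = 1.00


def base_equivalent(read_m: float, write_m: float, fresh_m: float) -> float:
    """Input volume expressed in full-price token equivalents (millions)."""
    return read_m * READ_MULT + write_m * WRITE_MULT + fresh_m * FRESH_MULT


nominal = CACHE_READ_M + CACHE_WRITE_M + FRESH_M
effective = base_equivalent(CACHE_READ_M, CACHE_WRITE_M, FRESH_M)
savings = 1 - effective / nominal

print(f"nominal input:   {nominal:.1f}M tokens")
print(f"base-equivalent: {effective:.1f}M tokens")
print(f"input savings:   {savings:.1%}")
```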
In dollars, actual input cost was $232 against a no-cache equivalent of $1,321, a saving of $1,089. The no-cache equivalent would have been about 6.6x my $200/month Claude Code Max plan price, so the cache subsidy and the plan subsidy compound on top of each other.
| Model | Actual cost | No-cache cost | Saved |
|---|---|---|---|
| Opus 4.6 | $220 | $1,274 | $1,053 (82.7%) |
| Sonnet 4.6 | $8 | $41 | $33 (80.9%) |
| Haiku 4.5 | $4 | $6 | $3 (42.3%) |
Dollars rounded to the nearest whole. Percentages computed on the unrounded underlying values, so a few rows may not visibly cross-foot.
The bill tells me what I spent, but it can’t tell me which sessions, which models, or which prompt shapes were responsible for the ratio, or whether any of that work was reusable in the first place. The next-most-aggregated number is the cache hit rate. It tells you how much input was reused. It can’t tell you whether the reused context deserved to be there.
A 95% cache hit rate can mean you built a stable workflow. It can also mean you built a very efficient junk drawer.
When I broke the hit rate apart by model and workload shape, the single healthy-looking number split into very different profiles.
| Model | Messages | Input tokens | Cache hit rate | Fresh tokens |
|---|---|---|---|---|
| Opus 4.6 | 767 | 82.7M | 95.6% | 0.1M |
| Sonnet 4.6 | 136 | 13.3M | 94.4% | 0.0M |
| Haiku 4.5 | 478 | 6.6M | 58.6% | 2.1M |
Haiku ran 353 separate root sessions to handle 478 messages, about 1.4 messages per session. Virtually all the fresh tokens in the dataset (2.1M of 2.2M across all three models) came from Haiku, which was my one-shot model: title generation, quick classifications, duplicate checks, small judgment calls. Each request had its own shape and rarely a long shared prefix to reuse. Opus and Sonnet, by contrast, ran inside long, stable sessions where the prefix was set once and reused constantly. A 95.6% hit rate on Opus is the prompt cache doing exactly what it’s supposed to do on workhorse workloads.
A healthy AI workflow doesn’t have one universal cache hit rate. It has different cache shapes for different kinds of work, and a single scalar smashes those distinctions together.
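To make the point concrete, here is a rough sketch of splitting one aggregate hit rate into per-model rates. The record schema (`model`, `cache_read`, `cache_write`, `fresh`) and the sample values are hypothetical, not my captured data; the hit-rate definition is the one used throughout this post, cache reads divided by total input tokens.

```python
from collections import defaultdict


def hit_rate_by_model(records):
    """Share of input tokens served as cache reads, grouped by model."""
    totals = defaultdict(lambda: {"read": 0, "input": 0})
    for r in records:
        t = totals[r["model"]]
        t["read"] += r["cache_read"]
        t["input"] += r["cache_read"] + r["cache_write"] + r["fresh"]
    return {m: t["read"] / t["input"] for m, t in totals.items() if t["input"]}


# Made-up records for illustration only.
records = [
    {"model": "long-context", "cache_read": 9000, "cache_write": 500, "fresh": 500},
    {"model": "long-context", "cache_read": 9500, "cache_write": 300, "fresh": 200},
    {"model": "one-shot", "cache_read": 0, "cache_write": 0, "fresh": 1000},
    {"model": "one-shot", "cache_read": 600, "cache_write": 0, "fresh": 400},
]

rates = hit_rate_by_model(records)
total_read = sum(r["cache_read"] for r in records)
total_input = sum(r["cache_read"] + r["cache_write"] + r["fresh"] for r in records)
overall = total_read / total_input

for model, rate in rates.items():
    print(f"{model}: {rate:.1%}")
print(f"overall: {overall:.1%}")
```

In this toy data the overall rate sits close to the long-context model's rate because that model dominates token volume, which is the same effect that hides Haiku inside my 93.0% aggregate.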
If your cheap one-shot model has the same cache profile as your expensive long-context model, one of them is probably lying about its job.
Even the per-model breakdown is still one number per row. Two sessions can post the same hit rate while behaving completely differently inside. That difference is the cache topology: the actual sequence of writes, reads, branches, mutations, and merges across a session. It is the shape the aggregates flatten, and it has to be captured directly.
[Figure: cache topology strips, one row per session type. Aggregate cache metrics flatten these differences. Each strip compresses one session into a sequence of write and read events.]
The clearest example from my own data of the stable workhorse row was an Opus session from April 29: 728 messages, 42.9M total input tokens, 42.2M of them cache reads. The session opened with a 31k-token context write, then every subsequent call read that prefix back and added incrementally. You can see it in the prompt sizes stepping up: 31k, 32k, 39k, 43k, 44k, 46k, 47k, 51k, 54k. Writes appeared when new context was added. Reads dominated everywhere else. That’s a topology, not a number.
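One way to turn a session into a strip of write and read events is sketched below. The message fields (`cache_write`, `cache_read`), the `min_write` threshold, and the sample session are all assumptions for illustration; the sample only loosely mirrors the shape described above and is not the real April 29 data.

```python
from itertools import groupby


def event_strip(messages, min_write=1000):
    """Label each message W (significant cache write), R (cache read), or '.'.

    min_write is a hypothetical threshold: small incremental writes that
    ride along with a read are treated as reads.
    """
    symbols = []
    for m in messages:
        if m["cache_write"] >= min_write:
            symbols.append("W")
        elif m["cache_read"] > 0:
            symbols.append("R")
        else:
            symbols.append(".")
    return "".join(symbols)


def run_lengths(strip):
    """Compress a strip into (symbol, count) runs."""
    return [(sym, len(list(group))) for sym, group in groupby(strip)]


# Made-up session: one opening write, steady reads, one mid-session write.
session = [
    {"cache_write": 31000, "cache_read": 0},
    {"cache_write": 400, "cache_read": 31000},
    {"cache_write": 300, "cache_read": 31400},
    {"cache_write": 500, "cache_read": 31700},
    {"cache_write": 7000, "cache_read": 32200},
    {"cache_write": 600, "cache_read": 39200},
    {"cache_write": 800, "cache_read": 39800},
]

strip = event_strip(session)
print(strip)
print(run_lengths(strip))
```

Two sessions with the same hit rate can produce very different strips, and the strip is the thing worth comparing.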
Once you can see sessions as shapes, four architectural patterns become legible.
The first pattern is the prompt that grows by accretion and never gets pruned. A system prompt starts at 800 tokens. Someone adds a rule, then a tool description, then a security note, then an edge case nobody remembers the reason for. Eventually every call carries thousands of tokens of instructions and no one knows which parts still matter.
The April 29 session is a quiet example. It logged 255 separate cache-write events across 728 messages, roughly one write every three turns, inside a session that still posted a 55x read-to-write ratio overall. The aggregate ratio says the workflow is healthy. The write distribution says the prompt kept mutating throughout. Economic efficiency and architectural drift can live in the same session.
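A simple way to tell "writes up front" from "writes throughout" is to bucket write events by position in the session. This is a sketch under my own assumptions: the function name, the four-bucket split, and the two toy sessions are illustrative, and only the 255 and 728 figures come from the data above.

```python
def writes_per_bucket(write_flags, buckets=4):
    """Count cache-write events in each equal-width slice of a session.

    write_flags is one boolean per message, True if that message
    triggered a cache write.
    """
    counts = [0] * buckets
    n = len(write_flags)
    if n == 0:
        return counts
    for i, wrote in enumerate(write_flags):
        if wrote:
            counts[i * buckets // n] += 1
    return counts


# Toy sessions of 12 messages each (made up for illustration).
front_loaded = [True] * 3 + [False] * 9    # context set once, then reused
drifting = [True, False, False] * 4        # prompt keeps mutating

print(writes_per_bucket(front_loaded))
print(writes_per_bucket(drifting))

# Cadence for the April 29 session, from the counts quoted above.
messages_per_write = 728 / 255
print(f"{messages_per_write:.2f} messages per write")
```

A front-loaded distribution suggests a prefix that was set and left alone. An even distribution, like the second toy session, is the signature of a prompt that never stopped changing.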
The second pattern is a reusable context block attached too broadly. One large “standard” prompt bundle ends up on coding tasks, title generation, classification, one-line edits, and small judgment calls.
For long coding work, that context may earn its keep. For a tiny utility task, it doesn’t. The right question is whether this task should have received that context at all, not whether the cache is reusing it.
The third pattern is the parent-child loop that changes its prompt shape as it delegates. From the outside the workflow looks like one task. At the cache layer it is several separate prompt families. The parent writes and reads one prefix. Each sub-agent writes its own. The session still has reuse, but the lineage is fragmented.
This pattern only shows up if the captured data keeps the parent-child structure. The paper CLI records a parent_hash on every node, which let me read my 3,697 messages as a tree: 408 conversation roots, 3,289 child messages branching off them, with the deepest single thread running 495 messages deep. Without that tree, every message looks like an isolated cache event and parent-child divergence is invisible.
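The tree statistics above can be rebuilt from nothing more than a parent pointer per message. A minimal sketch, assuming each node is a dict with a `hash` key and a `parent_hash` key that is `None` for roots; only the `parent_hash` name comes from the tool, the rest of the schema and the sample nodes are hypothetical. It also assumes the parent links contain no cycles, and it walks iteratively so a thread hundreds of messages deep does not hit Python's recursion limit.

```python
def tree_stats(nodes):
    """Return (roots, children, max_depth) for a parent-pointer forest.

    Depth counts messages along a thread, so a lone root has depth 1.
    A node whose parent is missing from the data is treated as a root.
    """
    parent = {n["hash"]: n.get("parent_hash") for n in nodes}
    depth = {}
    for h in parent:
        chain = []
        cur = h
        # Walk up until we reach a node with a known depth or fall off the top.
        while cur is not None and cur not in depth:
            chain.append(cur)
            p = parent[cur]
            cur = p if p in parent else None
        base = depth[cur] if cur is not None else 0
        for offset, node in enumerate(reversed(chain), start=1):
            depth[node] = base + offset
    roots = sum(1 for p in parent.values() if p is None or p not in parent)
    children = len(parent) - roots
    max_depth = max(depth.values(), default=0)
    return roots, children, max_depth


# Made-up forest: a -> b -> c, a -> d, and a separate root e.
nodes = [
    {"hash": "a", "parent_hash": None},
    {"hash": "b", "parent_hash": "a"},
    {"hash": "c", "parent_hash": "b"},
    {"hash": "d", "parent_hash": "a"},
    {"hash": "e", "parent_hash": None},
]

print(tree_stats(nodes))
```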
The fourth pattern is the short session that pays the 125% cache-write premium and gets little reuse back. Open the tool, dump a pile of context into one turn, do one thing, close the laptop.
The heaviest example from my own data was an Opus session from April 21: 14 messages, 87k tokens written to cache, 156k tokens read back, a 1.8x read-to-write ratio. The April 29 session sat at 55x for comparison. A session that reads back barely twice what it wrote is paying premium prices for almost no compounding benefit. It is a workflow choice with a cost that is small in isolation and large in aggregate when a team makes a habit of it.
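Flagging candidate context dumps only takes the message count and the read-to-write ratio per session. The sketch below uses the April 21 numbers quoted above; the function names and both thresholds (`max_messages`, `min_ratio`) are arbitrary placeholders of mine, not values derived from the data.

```python
def read_write_ratio(read_tokens, write_tokens):
    """Cache reads per cache-written token; infinite if nothing was written."""
    if write_tokens == 0:
        return float("inf") if read_tokens else 0.0
    return read_tokens / write_tokens


def is_context_dump(messages, read_tokens, write_tokens,
                    max_messages=20, min_ratio=5.0):
    """Heuristic: a short session that reads back little of what it wrote.

    Both thresholds are hypothetical and should be tuned per team.
    """
    ratio = read_write_ratio(read_tokens, write_tokens)
    return messages <= max_messages and ratio < min_ratio


# April 21 session, figures from the text: 14 messages, 87k written, 156k read.
april_21_ratio = read_write_ratio(156_000, 87_000)
print(f"read-to-write ratio: {april_21_ratio:.1f}x")
print(is_context_dump(14, 156_000, 87_000))
```

The flag is a prompt to look at the session, not a verdict. A one-off dump for a quick task will trip it just as readily as a habit will.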
The four patterns aren’t just diagnoses. Each one points at a concrete change.
If you find prompts that grow by accretion, prune them. Pull up the longest system prompts in your captured sessions and walk through them rule by rule. Anything nobody can defend should come out. The first request still pays the cache-write premium each time you change the prompt, so a smaller stable prefix is cheaper to maintain and easier to reason about.
If you find universal context blocks, separate work by shape. A tiny utility task should not inherit a coding agent’s full preamble. The framing should be “what is the minimum context this call needs to succeed,” not “what is the largest context the cache can absorb.”
If you find sub-agent divergence, normalize the prompt across the loop. Pick one stable prefix the parent and its children all share, then append the differences as a smaller tail. The cache stays usable across the tree, and the parent doesn’t have to rebuild context every time a child returns.
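Because prompt caching matches on the leading portion of a prompt, the order of blocks matters as much as their content. Here is a small sketch of the "shared prefix, small tail" idea, with prompts modelled as lists of text blocks; the block contents and helper names are invented for illustration and say nothing about how any particular agent framework assembles its prompts.

```python
def shared_prefix_len(a, b):
    """Number of leading blocks two prompts have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


# Hypothetical stable prefix shared by the parent and every sub-agent.
SHARED_PREFIX = ["system rules", "tool descriptions", "repo overview"]


def build_prompt(tail):
    """Stable prefix first, role-specific differences appended as a tail."""
    return SHARED_PREFIX + tail


parent = build_prompt(["plan the refactor"])
child = build_prompt(["you are the test-writing sub-agent", "write tests"])

# Same blocks, but the role line comes first, so the prefix no longer matches.
divergent_child = ["you are the test-writing sub-agent"] + SHARED_PREFIX + ["write tests"]

print(shared_prefix_len(parent, child))
print(shared_prefix_len(parent, divergent_child))
```

The divergent child carries exactly the same context as the normalized one. It just carries it in an order that gives it nothing in common with the parent's prefix.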
If you find context dumps, ask whether the work is really one-shot. A dump for one quick task is fine. A dump pattern repeating across a team is a workflow design choice, and the work can usually be reshaped into a session that pays the cache-write premium once instead of every time.
The bigger question is the same in all four cases: is this pattern one you chose, or one you ended up with? Telemetry doesn’t decide that for you. It puts the question in front of you instead of letting the bill answer it for you.
None of the four patterns above are visible at the request level, the trace level, or the bill level. They live at the session-and-topology level, which means surfacing them needs a different kind of telemetry than logging, tracing, or cost accounting. Four primitives matter:
Without these primitives, the patterns above stay invisible no matter how detailed the logs look. If you want to look at your own workflows the way I looked at mine, that is the data shape you need.
AI workflows are becoming systems. Architectures optimized around today’s cache subsidy may not survive tomorrow’s pricing, scale, or model changes. The expensive patterns feel like non-issues when they’re cheap. They become issues the moment the subsidy weakens, the model mix shifts, or the agent count grows.
The bill tells you what you spent. The transcript tells you what happened once. The topology tells you what shape your AI is settling into, and whether that shape will hold.
The time to inspect the shape is when the bill still looks fine.