Why do AI agent workflows need new telemetry primitives?

AI agent workflows are persistent, stateful, branching, multi-agent systems that mutate their own context turn by turn. Traditional telemetry, including request logs, distributed traces, and billing dashboards, was built for single-request services. It can show what happened in one call, but not whether prompts stayed stable across a session, whether a child agent inherited the parent's prefix, where context branched and merged, or how a workflow's prompt shape drifted week over week. Those questions need different primitives.

What telemetry primitives do AI workflows need?

Four foundational capabilities. HTTP-boundary capture so every model call is recorded in one consistent shape across tools instead of stitched together from per-tool SDKs. Queryable session structure so a week of agent sessions can be sorted by cache writes, grouped by model, and compared by prompt shape. Lineage preservation so every node knows its parent and parent-child divergence shows up as structure. Topology-aware analysis that treats a session as a sequence of write and read events with shape, not as a single hit-rate scalar. Together they enable analyses that matter for agent systems: prompt lineage, context inheritance, cache topology, prompt mutation, sub-agent divergence, and session structure tracked over time.

What is cache topology and why does it matter more than the cache hit rate?

Cache topology is the shape of cache events across a session: when prompts were written, when they were read, when they branched, when they mutated, when they merged back. The hit rate is a single scalar over those events. Two sessions can post nearly identical hit rates with completely different topologies. One reuses a stable prefix across a long workflow. The other keeps rewriting cache entries as the prompt mutates and still produces a healthy aggregate ratio. Only the topology distinguishes them, which is why aggregate metrics can describe spend but cannot describe architecture.

What's the difference between economic efficiency and architectural efficiency?

Economic efficiency asks whether you reused tokens cheaply. Architectural efficiency asks whether the context you reused deserved to be reused at all. Prompt caching answers the first and is silent on the second. A workflow can post a high cache hit rate (economically efficient) while serving thousands of tokens of context the specific call doesn't need (architecturally wasteful). The two have to be read together against the shape of the sessions that produced them.

Should every model have a high cache hit rate?

No. Long-context workhorse models running stable system prompts and project context should reach high cache hit rates because most of their input gets reused. One-shot utility models that handle requests with little shared context will run a lower hit rate, and that is healthy. If a cheap one-shot model and an expensive long-context model show the same cache profile, one of them is probably carrying context it doesn't need or failing to reuse context it should. The hit rate alone won't tell you which, which is why workload telemetry has to capture session shape, not just aggregate ratios.

What is sub-agent divergence, and how do you detect it?

Sub-agent divergence is when a parent agent spawns child agents that don't share the parent's prompt prefix. From the outside the workflow looks like one task. At the cache layer it is several separate prompt families, each writing its own cache instead of reading the parent's. Detecting it requires parent-child lineage in the captured data. Without a parent_hash on every node, every message looks like an isolated cache event and divergence is invisible. With lineage, you can compare prompt prefixes across each parent and child pair and ask whether the child inherited cleanly or rewrote the shared context.

What's the difference between high-value and low-value reusable context?

High-value reusable context is content the model would benefit from on most of the calls that include it: core repository structure for coding agents, stable tooling schemas, coding conventions, and project glossaries. The context is stable, the agent actually uses it, and the reuse pays back. Low-value reusable context is content the model is paying to carry but rarely uses on the call it's attached to: obsolete instructions, irrelevant architecture docs, duplicate rules, and edge-case handling for code paths the session never enters. The cache treats both kinds identically, so a high cache hit rate alone doesn't tell you whether the reuse was useful.

What's a practical way to start capturing AI workflow telemetry?

Capture a week of agent sessions at the HTTP boundary so model, token counts, cache reads, cache writes, and parent-child lineage are all joinable. Then ask which sessions created the most cache, which sessions wrote cache that never paid back in reads, which models have cache shapes that don't match their job, and which agent handoffs rewrote context they could have inherited. You don't need a full observability platform to start. You need session structure preserved in a queryable shape. The paper CLI is one way to assemble those primitives.

Prompt Caching Is Subsidizing Bad AI Architecture

Brian’s recent post, the true cost of Claude Code, looked at the subsidy under modern AI coding tools. I wanted to see what the subsidy was actually paying for inside my own workflows, so I pulled nineteen days of Claude Code session data from the paper CLI: 3,697 messages across two projects and three models (Opus, Sonnet, and Haiku). Because every request was captured at the HTTP boundary, I had the model used, the token counts, the cache reads and writes, and the way each message connected to the ones before it. That’s more than a bill or a single transcript can show.

Prompt caching saved 82% of my input cost. The savings are real, but they also make a particular kind of architectural waste cheap enough to ignore: prompts that grow without limit, context blocks attached to the wrong tasks, sub-agents that rewrite their parent’s cache prefix. Caching subsidizes that architecture by burying it inside an aggregate number that looks healthy. The only way to see what is being subsidized is to look at the workflow underneath: which prompts stayed stable, which kept mutating, where parents forked into children, and whether the reused context actually contributed to useful work.

Prompt caching is just the observable surface of a deeper system: AI workflows behave as stateful, branching systems, and current telemetry only sees them as isolated calls.

AI workflows have outgrown traditional telemetry

A few years ago, “an AI call” was a single request. One prompt in, one response out. Logs, traces, and billing dashboards were enough to reason about it.

That world is over. Modern agent workflows are different on five axes at once: they run for a long time, they keep state across turns, they branch into sub-tasks, they call other agents, and they rewrite their own prompts as they go. A single coding session can spawn sub-agents, read dozens of files, run tools, and end up with a prompt that bears little resemblance to where it started. The individual model call is the cheap, well-instrumented part. The workflow around it is what decides whether the work was efficient, durable, or wasteful. Traditional telemetry doesn’t see any of it.

Request logs show one request at a time. Distributed traces show service-to-service flows, not prompt-to-prompt lineage. Billing dashboards aggregate everything into a total. None of those answer the questions that matter once a workflow is the unit of analysis:

Which prompts stay stable across a session, and which keep mutating?
When a parent agent spawns a sub-agent, does the child inherit the parent’s prompt prefix or build its own?
Where does context branch, and where does it merge?
Which sessions paid to create cache that never paid back?
How is a workflow’s prompt shape drifting week over week?

These questions need different primitives: prompt lineage, context inheritance, cache topology, prompt mutation, sub-agent divergence, and session structure tracked over time.

Economic vs. architectural efficiency

Cheap reuse is not the same as good architecture.

Economic efficiency asks if we reuse tokens cheaply. Architectural efficiency asks if this context should have been reused at all. Those are different questions, and prompt caching only answers the first.

A workflow can be economically efficient and architecturally wasteful. It can also be architecturally clean and economically expensive when the work itself doesn’t repeat much. Confusing the two has concrete operational costs:

Runaway context growth. Prompts pick up rules and notes faster than anyone removes them.
Hidden cost amplification. Wasteful workflows look cheap while caching is generous and turn expensive the moment the discount weakens.
Prompt entropy. Sub-agents and projects drift apart slowly, and cache reuse degrades with them.
Degraded cache reuse. A high aggregate hit rate can hide sessions that keep rewriting their own prefix.
Hard to debug. When two sessions act differently, there is no way to compare what changed in the prompts that drove them.
Hard to compare architectures. No way to ask whether one team’s agent design reuses context better than another’s.

None of those show up on a bill. They show up in the topology of sessions captured over time.

What the data showed

Across nineteen days, my Claude Code usage hit 102.6 million input tokens with a 93.0% cache hit rate: 95.5M cache reads priced at 10% of base input, 5.0M cache writes priced at 125% of base input, and 2.1M fresh tokens at full price. Here and throughout, “cache hit rate” means the share of input tokens served as cache reads, computed across the whole nineteen-day window. That works out to about 17.9M base-equivalent tokens against a nominal 102.6M, an 82% savings on input.
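The arithmetic behind those figures can be reproduced in a few lines. This is a minimal sketch that uses only the token counts and price multipliers quoted above; the function and variable names are mine, chosen for illustration.

```python
# Token counts in millions, as reported in the paragraph above.
CACHE_READS = 95.5
CACHE_WRITES = 5.0
FRESH = 2.1

# Price multipliers relative to base input, as stated above.
READ_MULT = 0.10
WRITE_MULT = 1.25


def input_summary(reads, writes, fresh):
    """Return total input, hit rate, base-equivalent tokens, and savings."""
    total = reads + writes + fresh
    base_equivalent = reads * READ_MULT + writes * WRITE_MULT + fresh
    return {
        "total": total,
        "hit_rate": reads / total,  # share of input served as cache reads
        "base_equivalent": base_equivalent,
        "savings": 1 - base_equivalent / total,
    }


summary = input_summary(CACHE_READS, CACHE_WRITES, FRESH)
print(f"total input:     {summary['total']:.1f}M")
print(f"cache hit rate:  {summary['hit_rate']:.1%}")
print(f"base-equivalent: {summary['base_equivalent']:.1f}M")
print(f"input savings:   {summary['savings']:.1%}")
```

The savings figure depends only on the mix of reads, writes, and fresh tokens, which is why it says nothing about which sessions produced that mix.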

In dollars, actual input cost was $232 against a no-cache equivalent of $1,321, a saving of $1,089. The no-cache equivalent would have been about 6.6x my $200/month Claude Code Max plan price, so the cache subsidy and the plan subsidy compound on top of each other.

Model	Actual cost	No-cache cost	Saved
Opus 4.6	$220	$1,274	$1,053 (82.7%)
Sonnet 4.6	$8	$41	$33 (80.9%)
Haiku 4.5	$4	$6	$3 (42.3%)

Dollars rounded to the nearest whole. Percentages computed on the unrounded underlying values, so a few rows may not visibly cross-foot.

The bill tells me what I spent, but it can’t tell me which sessions, which models, or which prompt shapes were responsible for the ratio, or whether any of that work was reusable in the first place. The next-most-aggregated number is the cache hit rate. It tells you how much input was reused. It can’t tell you whether the reused context deserved to be there.

A 95% cache hit rate can mean you built a stable workflow. It can also mean you built a very efficient junk drawer.

Breaking the hit rate apart by model exposed very different workload shapes.

Model	Messages	Input tokens	Cache hit rate	Fresh tokens
Opus 4.6	767	82.7M	95.6%	0.1M
Sonnet 4.6	136	13.3M	94.4%	0.0M
Haiku 4.5	478	6.6M	58.6%	2.1M

Haiku ran 353 separate root sessions to handle 478 messages, about 1.4 messages per session. Virtually all the fresh tokens in the dataset (2.1M of 2.2M across all three models) came from Haiku, which was my one-shot model: title generation, quick classifications, duplicate checks, small judgment calls. Each request had its own shape and rarely a long shared prefix to reuse. Opus and Sonnet, by contrast, ran inside long, stable sessions where the prefix was set once and reused constantly. A 95.6% hit rate on Opus is the prompt cache doing exactly what it’s supposed to do on workhorse workloads.

A healthy AI workflow doesn’t have one universal cache hit rate. It has different cache shapes for different kinds of work, and a single scalar smashes those distinctions together.

If your cheap one-shot model has the same cache profile as your expensive long-context model, one of them is probably lying about its job.

Even the per-model breakdown is still one number per row. Two sessions can post the same hit rate while behaving completely differently inside. That difference is the cache topology: the actual sequence of writes, reads, branches, mutations, and merges across a session. It is the shape the aggregates flatten, and it has to be captured directly.
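To make the scalar-versus-shape distinction concrete, here is a small sketch that reduces a session to a strip of per-turn glyphs and compares it with the hit rate. The `Turn` record, the glyph letters, and both example sessions are hypothetical; they are not the paper CLI's data model or real captured sessions.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    write: int  # tokens written to cache on this turn
    read: int   # tokens read from cache on this turn
    fresh: int  # uncached input tokens on this turn


def glyph(turn: Turn) -> str:
    """Label a turn by its dominant token kind (ties favor read, then write)."""
    largest = max(turn.write, turn.read, turn.fresh)
    if largest == 0:
        return "."
    if largest == turn.read:
        return "R"
    if largest == turn.write:
        return "W"
    return "F"


def strip(turns) -> str:
    """Compress a session into one character per turn."""
    return "".join(glyph(t) for t in turns)


def hit_rate(turns) -> float:
    """Share of all input tokens that were served as cache reads."""
    total = sum(t.write + t.read + t.fresh for t in turns)
    return sum(t.read for t in turns) / total if total else 0.0


# Hypothetical session 1: one prefix write, then steady reuse.
stable = [Turn(20_000, 0, 0)] + [Turn(0, 20_000, 0)] * 4
# Hypothetical session 2: the prefix is rewritten between reads.
churning = [Turn(10_000, 0, 0), Turn(0, 40_000, 0),
            Turn(10_000, 0, 0), Turn(0, 40_000, 0)]

for name, session in (("stable", stable), ("churning", churning)):
    print(f"{name:9s} hit rate {hit_rate(session):.0%}  strip {strip(session)}")
```

Both sessions report the same hit rate, yet the strips differ: one writes once and reads, the other alternates writes and reads. A real classifier would need thresholds tuned to the workload; the dominant-kind rule here is only the simplest choice.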

Cache topology examples

Aggregate cache metrics flatten these differences. Each strip compresses one session into a sequence of write and read events.

Pattern	Model	Writes	Reads
Stable workhorse	Opus	120K	2.4M
Context dump	Haiku	45K	15K
Utility call	Haiku	-	-
Mutation candidate	Opus	87K	156K
Branching candidate	Opus	140K	800K

[Figure: per-session shape strips; the Msgs and Shape columns are not reproduced. Strip legend: write-heavy turn, read-heavy turn, mostly fresh input, ├ branch, no meaningful reuse.]

The same cache-read share can sit on top of stable reuse, short-lived dumps, prefix mutation, or branching prompt families.

The clearest example of the stable workhorse pattern in my own data was an Opus session from April 29: 728 messages, 42.9M total input tokens, 42.2M of them cache reads. The session opened with a 31k-token context write, then every subsequent call read that prefix back and added incrementally. You can see it in the prompt sizes stepping up: 31k, 32k, 39k, 43k, 44k, 46k, 47k, 51k, 54k. Writes appeared when new context was added. Reads dominated everywhere else. That’s a topology, not a number.

Four patterns the topology surfaces

Once you can see sessions as shapes, four architectural patterns become legible.

Prompt accretion: system prompts that only grow

The first pattern is the prompt that grows by accretion and never gets pruned. A system prompt starts at 800 tokens. Someone adds a rule, then a tool description, then a security note, then an edge case nobody remembers the reason for. Eventually every call carries thousands of tokens of instructions and no one knows which parts still matter.

The April 29 session is a quiet example. 255 separate cache-write events across 728 messages, roughly one write every three turns, sitting inside a session that still posted a 55x read-to-write ratio overall. The aggregate ratio says the workflow is healthy. The write distribution says the prompt kept mutating throughout. Economic efficiency and architectural drift can live in the same session.
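One way to look at a write distribution is to count write events and measure how long the prefix stays untouched between them. The sketch below assumes a per-turn list of cache-write token counts; that input format and the example list are hypothetical, and only the 728 and 255 figures come from the session above.

```python
def write_profile(writes_per_turn):
    """Summarize how often a session writes to cache.

    writes_per_turn: cache-write token count for each turn, in order.
    """
    events = sum(1 for w in writes_per_turn if w > 0)
    longest_quiet = current = 0
    for w in writes_per_turn:
        if w > 0:
            current = 0  # a write ends the current quiet stretch
        else:
            current += 1
            longest_quiet = max(longest_quiet, current)
    turns = len(writes_per_turn)
    return {
        "turns": turns,
        "write_events": events,
        "turns_per_write": turns / events if events else float("inf"),
        "longest_quiet_run": longest_quiet,
    }


# Hypothetical nine-turn session: an opening write, then small additions.
example = [31_000, 0, 0, 1_000, 0, 7_000, 0, 0, 4_000]
profile = write_profile(example)
print(profile)

# The same cadence figure from the reported April 29 totals.
reported_cadence = 728 / 255
print(f"April 29: one write every {reported_cadence:.2f} turns")
```

A short longest quiet run next to a high read-to-write ratio is the combination described above: cheap in aggregate, still mutating throughout.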

Universal context blocks: the same bundle attached to the wrong jobs

The second pattern is a reusable context block attached too broadly. One large “standard” prompt bundle ends up on coding tasks, title generation, classification, one-line edits, and small judgment calls.

For long coding work, that context may earn its keep. For a tiny utility task, it doesn’t. The right question is whether this task should have received that context at all, not whether the cache is reusing it.

Sub-agent divergence: child agents that fragment the cache prefix

The third pattern is the parent-child loop that changes its prompt shape as it delegates. From the outside the workflow looks like one task. At the cache layer it is several separate prompt families. The parent writes and reads one prefix. Each sub-agent writes its own. The session still has reuse, but the lineage is fragmented.

This pattern only shows up if the captured data keeps the parent-child structure. paper records a parent_hash on every node, which let me read my 3,697 messages as a tree: 408 conversation roots, 3,289 child messages branching off them, with the deepest single thread running 495 messages deep. Without that tree, every message looks like an isolated cache event and parent-child divergence is invisible.
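Reading messages as a tree needs nothing more than a parent pointer per node. The following is a rough sketch of how roots, children, and maximum depth could be counted; apart from `parent_hash`, the record layout (a `hash` key, `None` for roots) and the sample nodes are assumptions for illustration, not the paper CLI's actual schema. It also assumes the lineage has no cycles.

```python
def lineage_stats(nodes):
    """Count roots, children, and the deepest thread in a parent-linked set.

    nodes: iterable of dicts with 'hash' and 'parent_hash' (None for a root).
    Depth is counted in messages, so a root has depth 1.
    """
    parent_of = {n["hash"]: n["parent_hash"] for n in nodes}
    depth = {}

    def depth_of(node_hash):
        # Walk up iteratively until a node with known depth (or a root's
        # missing parent) is reached, then fill depths on the way back down.
        path = []
        cur = node_hash
        while cur in parent_of and cur not in depth:
            path.append(cur)
            cur = parent_of[cur]
        base = depth.get(cur, 0)
        for offset, h in enumerate(reversed(path), start=1):
            depth[h] = base + offset
        return depth[node_hash]

    roots = sum(1 for p in parent_of.values() if p is None)
    max_depth = max((depth_of(h) for h in parent_of), default=0)
    return {
        "roots": roots,
        "children": len(parent_of) - roots,
        "max_depth": max_depth,
    }


# Hypothetical capture: two conversations, one with a branch.
nodes = [
    {"hash": "a", "parent_hash": None},
    {"hash": "b", "parent_hash": "a"},
    {"hash": "c", "parent_hash": "b"},
    {"hash": "d", "parent_hash": "a"},
    {"hash": "e", "parent_hash": None},
]
stats = lineage_stats(nodes)
print(stats)
```

The walk is iterative rather than recursive so that a thread hundreds of messages deep does not run into Python's recursion limit.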

Context dumps: large cache writes with little reuse

The fourth pattern is the short session that pays the 125% cache-write premium and gets little reuse back. Open the tool, dump a pile of context into one turn, do one thing, close the laptop.

The heaviest example from my own data was an Opus session from April 21: 14 messages, 87k tokens written to cache, 156k tokens read back, a 1.8x read-to-write ratio. The April 29 session sat at 55x for comparison. A session that reads back barely twice what it wrote is paying premium prices for almost no compounding benefit. It is a workflow choice with a cost that is small in isolation and large in aggregate when a team makes a habit of it.
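The gap between a 1.8x and a 55x session can be put in cost terms with the price multipliers quoted earlier (reads at 10% of base input, writes at 125%). This is a simplified sketch: it counts only cached tokens, ignores fresh input and output, and compares against paying base price for every token. The function name is hypothetical.

```python
READ_MULT = 0.10   # cache reads, relative to base input price
WRITE_MULT = 1.25  # cache writes, relative to base input price


def cache_payback(write_tokens, read_tokens):
    """Read-to-write ratio and savings versus paying base price throughout."""
    cached_cost = write_tokens * WRITE_MULT + read_tokens * READ_MULT
    uncached_cost = write_tokens + read_tokens
    return {
        "read_to_write": read_tokens / write_tokens,
        "savings": 1 - cached_cost / uncached_cost,
    }


# April 21 session: 87k tokens written, 156k tokens read back.
april_21 = cache_payback(87_000, 156_000)
# Any session at a 55x read-to-write ratio, as on April 29.
at_55x = cache_payback(1, 55)

print(f"April 21: {april_21['read_to_write']:.1f}x, "
      f"savings {april_21['savings']:.0%}")
print(f"55x:      {at_55x['read_to_write']:.0f}x, "
      f"savings {at_55x['savings']:.0%}")
```

Under these assumptions the low-ratio session keeps roughly half of its input cost as savings, against close to 88% at 55x. The write premium is the same per token in both; only the amount of reuse spread across it differs.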

What this means in practice

The four patterns aren’t just diagnoses. Each one points at a concrete change.

If you find prompts that grow by accretion, prune them. Pull up the longest system prompts in your captured sessions and walk through them rule by rule. Anything nobody can defend should come out. The first request still pays the cache-write premium each time you change the prompt, so a smaller stable prefix is cheaper to maintain and easier to reason about.

If you find universal context blocks, separate work by shape. A tiny utility task should not inherit a coding agent’s full preamble. The framing should be “what is the minimum context this call needs to succeed,” not “what is the largest context the cache can absorb.”

If you find sub-agent divergence, normalize the prompt across the loop. Pick one stable prefix the parent and its children all share, then append the differences as a smaller tail. The cache stays usable across the tree, and the parent doesn’t have to rebuild context every time a child returns.
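A quick check for whether a child actually shares its parent's prefix is to measure how far the two prompts agree from the start. The sketch below works on prompts split into ordered blocks; the block granularity, the function names, and the example prompts are all hypothetical, and real provider caches match prefixes at their own granularity.

```python
def shared_prefix_len(parent_blocks, child_blocks):
    """Number of leading blocks that are identical in both prompts."""
    count = 0
    for p, c in zip(parent_blocks, child_blocks):
        if p != c:
            break
        count += 1
    return count


def inherited_share(parent_blocks, child_blocks):
    """Fraction of the parent's prompt that the child reuses as a prefix."""
    if not parent_blocks:
        return 0.0
    return shared_prefix_len(parent_blocks, child_blocks) / len(parent_blocks)


# Hypothetical prompts, split into ordered blocks.
parent = ["system rules", "tool schemas", "repo map", "task: refactor"]
# Child keeps the shared prefix and appends its difference as a tail.
normalized_child = ["system rules", "tool schemas", "repo map", "subtask: tests"]
# Child starts with its own preamble, so nothing lines up as a prefix.
divergent_child = ["sub-agent rules", "tool schemas", "repo map", "subtask: tests"]

print(inherited_share(parent, normalized_child))
print(inherited_share(parent, divergent_child))
```

The divergent child carries most of the same content, but because the first block differs, none of it counts as a shared prefix. That is the case where moving the differences into a tail pays off.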

If you find context dumps, ask whether the work is really one-shot. A dump for one quick task is fine. A dump pattern repeating across a team is a workflow design choice, and the work can usually be reshaped into a session that pays the cache-write premium once instead of every time.

The bigger question is the same in all four cases: is this pattern one you chose, or one you ended up with? Telemetry doesn’t decide that for you. It puts the question in front of you instead of letting the bill answer it for you.

What it takes to see this in your own workflows

None of the four patterns above are visible at the request level, the trace level, or the bill level. They live at the session-and-topology level, which means surfacing them needs a different kind of telemetry than logging, tracing, or cost accounting. Four primitives matter:

HTTP-boundary capture. Every model call recorded in one consistent shape, regardless of which tool made it. When you switch from Claude Code to the next tool, the data layer stays the same.
Queryable session structure. A week of sessions you can sort by cache writes, group by model, or compare by prompt shape. “Show me the five heaviest cache-write sessions this week” should be a query, not an afternoon of reading transcripts.
Lineage preservation. Every node knowing its parent. This is what makes sub-agent divergence visible, and what lets you trace one fan-out back to the call that started it.
Topology-aware analysis. A session treated as a sequence of write and read events with shape, not a single hit-rate scalar. This is what produces the figure above.
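As an illustration of what "queryable session structure" can mean, the sketch below loads a few per-message records into an in-memory SQLite table and asks for the heaviest cache-write sessions. The table layout, column names, and sample rows are hypothetical; they show the shape of the query, not the paper CLI's storage format.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE messages (
        session_id         TEXT,
        model              TEXT,
        cache_write_tokens INTEGER,
        cache_read_tokens  INTEGER,
        fresh_tokens       INTEGER
    )
    """
)

# Hypothetical captured messages: one row per model call.
rows = [
    ("s1", "opus", 31_000, 0, 0),
    ("s1", "opus", 1_000, 31_000, 0),
    ("s2", "haiku", 0, 0, 900),
    ("s3", "opus", 87_000, 0, 0),
    ("s3", "opus", 0, 87_000, 500),
]
conn.executemany("INSERT INTO messages VALUES (?, ?, ?, ?, ?)", rows)

# "Show me the five heaviest cache-write sessions" as a single query.
heaviest = conn.execute(
    """
    SELECT session_id,
           model,
           SUM(cache_write_tokens) AS writes,
           SUM(cache_read_tokens)  AS reads
    FROM messages
    GROUP BY session_id, model
    ORDER BY writes DESC
    LIMIT 5
    """
).fetchall()

for session_id, model, writes, reads in heaviest:
    print(session_id, model, writes, reads)
```

Once messages sit in a table like this, grouping by model or filtering to sessions whose reads never exceeded their writes is a change to the query rather than a new tool.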

Without these primitives, the patterns above stay invisible no matter how detailed the logs look. If you want to look at your own workflows the way I looked at mine, that is the data shape you need.

The time to inspect the shape is when the bill still looks fine

AI workflows are becoming systems. Architectures optimized around today’s cache subsidy may not survive tomorrow’s pricing, scale, or model changes. The expensive patterns feel like non-issues when they’re cheap. They become issues the moment the subsidy weakens, the model mix shifts, or the agent count grows.

The bill tells you what you spent. The transcript tells you what happened once. The topology tells you what shape your AI is settling into, and whether that shape will hold.

The time to inspect the shape is when the bill still looks fine → Get started with the paper CLI.
