Subagents don't just retry — they remember. tapes captures every agent trace — tool calls, reasoning steps, dead ends. An observer extracts what matters. The next batch of subagents inherits those observations. Each failure becomes institutional knowledge.
A distributed system keeps failing under load. Race conditions, cascading timeouts, flaky retries. No single debugging session cracks it. So you configure a harness — a meta-agent that spawns subagent variants, captures agent traces with tapes, and runs an observer that extracts observations from each run. The next batch of subagents inherits those observations. Each generation starts with everything the last one learned.
tapes captures every tool call, every LLM response, every error. After each generation, an observer reads the traces and extracts prioritized observations: [important] errors, crashes, and failed fixes; [possible] files created and refactors attempted; [informational] session goals, token usage, and dead ends explored. These get written to observations.md and injected into the next generation's context. The subagents don't just retry; they start with a map of what already failed and why.
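To make the three priority tags concrete, here is a minimal sketch of how an observer could sort trace events into them and render tagged lines. The event kinds, the `TraceEvent` shape, and the one-line-per-observation layout are assumptions for illustration, not the actual tapes trace schema or the real observations.md format.

```python
from dataclasses import dataclass

# Priority tags named in the text, highest priority first.
PRIORITY_ORDER = ["important", "possible", "informational"]

# Hypothetical mapping from trace event kinds to priority tags.
KIND_TO_PRIORITY = {
    "error": "important",
    "crash": "important",
    "failed_fix": "important",
    "file_created": "possible",
    "refactor_attempted": "possible",
    "session_goal": "informational",
    "token_usage": "informational",
    "dead_end": "informational",
}


@dataclass(frozen=True)
class TraceEvent:
    """Assumed minimal shape of one extracted trace event."""
    agent_id: str
    kind: str
    summary: str


def classify(event: TraceEvent) -> str:
    """Return the priority tag for an event; unknown kinds are informational."""
    return KIND_TO_PRIORITY.get(event.kind, "informational")


def render_observations(events: list[TraceEvent]) -> str:
    """Render events as tagged lines, most important first.

    sorted() is stable, so events with the same tag keep their input order.
    """
    ranked = sorted(events, key=lambda e: PRIORITY_ORDER.index(classify(e)))
    lines = [f"[{classify(e)}] {e.agent_id}: {e.summary}" for e in ranked]
    return "\n".join(lines) + "\n"
```

Writing the returned string to observations.md would give the next generation a file whose first lines are always the failures.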
Capturing agent traces with tapes, extracting observations, and building shared memory across generations.
The harness spawns 5 subagents, each with different strategy parameters — different debugging approaches, triage priorities, and fix heuristics. They run in parallel, each in its own isolated VM. tapes proxies every LLM call, capturing the full agent trace.
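The fan-out step might look roughly like the sketch below: one strategy per subagent, all launched concurrently. The `Strategy` fields mirror the three categories named above, but their values, the `run_subagent` stub, and the agent ID format are hypothetical. A real harness would boot a VM and route LLM calls through the tapes proxy where the stub returns.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass(frozen=True)
class Strategy:
    # Hypothetical knobs; the text only names these three categories.
    debugging_approach: str
    triage_priority: str
    fix_heuristic: str


@dataclass(frozen=True)
class RunResult:
    agent_id: str
    strategy: Strategy
    solved: bool


def run_subagent(agent_id: str, strategy: Strategy) -> RunResult:
    """Placeholder for launching one subagent in its own isolated VM.

    This stub only records that the run happened and reports no fix.
    """
    return RunResult(agent_id=agent_id, strategy=strategy, solved=False)


def run_generation(gen: int, strategies: list[Strategy]) -> list[RunResult]:
    """Run one subagent per strategy concurrently; results keep input order."""
    ids = [f"gen{gen}-agent{i}" for i in range(1, len(strategies) + 1)]
    # max_workers must be at least 1, even for an empty strategy list.
    with ThreadPoolExecutor(max_workers=max(1, len(strategies))) as pool:
        return list(pool.map(run_subagent, ids, strategies))
```

Threads are enough here because each worker would spend its time waiting on a VM, not computing. `pool.map` returns results in submission order, which keeps each result paired with the strategy that produced it.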
Gen 1 is done. Most subagents failed. But every agent trace is captured — every tool call, every reasoning step, every dead end. The observer runs tapes search to query past traces, walks the conversation DAGs, and extracts prioritized observations. Errors get flagged [important]. Files created get tagged [possible]. Session goals and token usage become [informational]. All of it gets written to observations.md.
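Walking a conversation DAG can be pictured with a sketch like this one, which starts at the last node of a run and collects every error on the way back to the root. The `Node` structure with parent links is an assumption made for illustration; it is not the tapes storage format, and it does not model what `tapes search` returns.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical conversation node: one message or tool result."""
    node_id: str
    content: str
    is_error: bool = False
    parents: list[str] = field(default_factory=list)


def ancestors_in_order(nodes: dict[str, Node], leaf_id: str) -> list[Node]:
    """Return the leaf and all of its ancestors, parents before children."""
    ordered: list[Node] = []
    visited: set[str] = set()
    # Iterative post-order DFS over parent links avoids recursion limits
    # on long traces. The boolean marks "parents already expanded".
    stack: list[tuple[str, bool]] = [(leaf_id, False)]
    while stack:
        node_id, expanded = stack.pop()
        if expanded:
            ordered.append(nodes[node_id])
            continue
        if node_id in visited:
            continue
        visited.add(node_id)
        stack.append((node_id, True))
        for parent_id in nodes[node_id].parents:
            if parent_id not in visited:
                stack.append((parent_id, False))
    return ordered


def errors_on_path(nodes: dict[str, Node], leaf_id: str) -> list[str]:
    """Collect error contents leading up to the leaf, earliest first."""
    return [n.content for n in ancestors_in_order(nodes, leaf_id) if n.is_error]
```

Because parents always come before children in the result, the errors read in the order the subagent hit them, which is the order an observer would want when reconstructing a dead end.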
The harness reads observations.md, understands what each subagent found and where each one got stuck, and designs the next batch around those observations. Gen 2 subagents don't explore blindly: they start with the root cause already identified, the regression commit already narrowed, and the timing window already measured. Each generation accumulates observations. The knowledge compounds.
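One way to picture the hand-off between generations is the sketch below, which reads tagged lines and keeps the highest-priority ones for the next batch's context. The `[tag] text` line format, the `max_items` budget, and the header wording are assumptions; the source does not specify how observations.md is laid out or how much of it gets injected.

```python
import re

# Priority tags named in the text, highest priority first.
PRIORITY_ORDER = ["important", "possible", "informational"]

# Assumed line format: "[tag] free-form observation text".
TAG_PATTERN = re.compile(r"^\[(important|possible|informational)\]\s+(.*)$")


def parse_observations(text: str) -> list[tuple[str, str]]:
    """Extract (tag, body) pairs from tagged lines; other lines are ignored."""
    pairs = []
    for line in text.splitlines():
        match = TAG_PATTERN.match(line.strip())
        if match:
            pairs.append((match.group(1), match.group(2)))
    return pairs


def build_context(text: str, max_items: int) -> str:
    """Keep the highest-priority observations that fit within max_items."""
    pairs = parse_observations(text)
    # Stable sort: observations with the same tag stay in file order.
    pairs.sort(key=lambda pair: PRIORITY_ORDER.index(pair[0]))
    kept = pairs[: max(0, max_items)]
    header = "Observations from earlier generations:"
    return "\n".join([header] + [f"- [{tag}] {body}" for tag, body in kept])
```

Capping by priority means that when the file outgrows the context budget, session goals and token counts fall away before failed fixes do.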
Each generation got closer. The agent traces captured the trajectory — from scattered exploration in Gen 1 to focused, surgical debugging by Gen 5. One agent nails both the root cause and the fix. The harness extracts the best observations and writes the final report.
Two lines tell the story. Observations accumulate as each generation's agent traces get processed — the knowledge base grows from 14 entries to 97. As observations pile up, more subagents converge on a working solution. Gen 1 explored blind and none solved it. By Gen 5, four out of five subagents produced a validated fix — they started with 97 observations telling them exactly where to look and what to avoid.
Open your laptop. The harness finished overnight. 25 agent runs across 5 generations — every agent trace captured by tapes. The accumulated observations, the convergence path, and the fix are all in the report.
Every failure gets written to observations.md. The failure space becomes a map, and the observations teach the next batch how to read it.
5 subagents per generation, 5 generations, thousands of tool calls each. Every LLM request, every reasoning step, every file read — tapes captures agent traces for all of it. That's a firehose. SQLite works for post-mortem analysis, but to self-heal in real time, you need a streaming pipeline. tapes publishes every trace event to Kafka. Flink SQL processes the stream and fires alerts. Snowflake sinks every trace for long-term analysis — cost per model, failure patterns across runs, convergence rates over time. The harness reacts mid-generation — killing stuck subagents, reallocating resources, folding failures into the next batch before it even launches.
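As a rough illustration of what a "stuck subagent" alert could compute, the sketch below flags an agent once it logs a set number of errors inside a sliding time window. It is a plain-Python stand-in for the windowed logic, not Flink SQL and not the pipeline's actual rule; the `TraceEvent` fields, the threshold, and the window size are all assumed.

```python
from collections import defaultdict, deque
from dataclasses import dataclass


@dataclass(frozen=True)
class TraceEvent:
    """Assumed minimal shape of one streamed trace event."""
    agent_id: str
    timestamp: float  # seconds since the run started
    is_error: bool


class StuckDetector:
    """Flag an agent once it logs `threshold` errors within `window_seconds`.

    Assumes events for a given agent arrive in timestamp order.
    """

    def __init__(self, threshold: int, window_seconds: float) -> None:
        self.threshold = threshold
        self.window_seconds = window_seconds
        self._errors: dict[str, deque[float]] = defaultdict(deque)

    def observe(self, event: TraceEvent) -> bool:
        """Consume one event; return True if the agent now looks stuck."""
        if not event.is_error:
            return False
        recent = self._errors[event.agent_id]
        recent.append(event.timestamp)
        # Drop errors that have slid out of the window.
        while recent and event.timestamp - recent[0] > self.window_seconds:
            recent.popleft()
        return len(recent) >= self.threshold
```

A harness consuming the stream could call `observe` on each event and stop the subagent the first time it returns True, rather than waiting for the generation to finish.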
Agent traces captured. Observations accumulated. The next generation starts where the last one failed.