Interactive Demo

What happens when agents remember failure.

Subagents don't just retry — they remember. tapes captures every agent trace — tool calls, reasoning steps, dead ends. An observer extracts what matters. The next batch of subagents inherits those observations. Each failure becomes institutional knowledge.

See it live on GitHub →


The Setup

You point agents at a problem they can't solve in one shot

A distributed system keeps failing under load. Race conditions, cascading timeouts, flaky retries. No single debugging session cracks it. So you configure a harness — a meta-agent that spawns subagent variants, captures agent traces with tapes, and runs an observer that extracts observations from each run. The next batch of subagents inherits those observations. Each generation starts with everything the last one learned.

What's observational memory? Every subagent run produces agent traces captured by tapes: every tool call, every LLM response, every error. After each generation, an observer reads the traces and extracts prioritized observations:

[important] errors, crashes, and failed fixes
[possible] files created, refactors attempted
[informational] session goals, token usage, dead ends explored

These get written to observations.md and injected into the next generation's context. The subagents don't just retry; they start with a map of what already failed and why.
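As a rough illustration of the three priority levels, here is a minimal sketch that tags trace events and renders them as the contents of observations.md. The event shape (`agent`, `kind`, `summary`), the kind names, and both function names are assumptions made for this example. The real observer presumably reads traces with more judgment than a lookup table, so treat the mapping as a stand-in.

```python
from dataclasses import dataclass

# Priority tags come from the text; the event kinds are hypothetical.
PRIORITY_BY_KIND = {
    "error": "important",
    "crash": "important",
    "failed_fix": "important",
    "file_created": "possible",
    "refactor_attempted": "possible",
    "session_goal": "informational",
    "token_usage": "informational",
    "dead_end": "informational",
}
PRIORITY_ORDER = ["important", "possible", "informational"]


@dataclass
class Observation:
    priority: str
    agent: str
    summary: str


def extract_observations(events):
    """Map raw trace events (dicts with 'agent', 'kind', 'summary') to observations."""
    observations = []
    for event in events:
        priority = PRIORITY_BY_KIND.get(event["kind"])
        if priority is None:
            continue  # unknown event kinds are ignored in this sketch
        observations.append(Observation(priority, event["agent"], event["summary"]))
    # Stable sort: most important first, original order kept within a priority.
    observations.sort(key=lambda o: PRIORITY_ORDER.index(o.priority))
    return observations


def render_observations_md(generation, observations):
    """Render observations as text suitable for writing to observations.md."""
    lines = [f"# Observations after generation {generation}", ""]
    for obs in observations:
        lines.append(f"- [{obs.priority}] {obs.agent}: {obs.summary}")
    return "\n".join(lines) + "\n"
```

Sorting by priority means that if the file ever has to be truncated to fit a context window, the failed fixes and crashes are the last thing to be cut.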

jcard.toml — subagent harness

~ — you, the human

This is the last command you run. Everything after this is the harness working autonomously — spawning subagents, capturing agent traces with tapes, extracting observations, and building shared memory across generations.

Act I — Generation 1

First generation: explore the failure space

The harness spawns 5 subagents, each with different strategy parameters — different debugging approaches, triage priorities, and fix heuristics. They run in parallel, each in its own isolated VM. tapes proxies every LLM call, capturing the full agent trace.

inside harness — spawning gen 1

harness gen 1
5 subagents, each with a different strategy
├── agent-1a   strategy: logs-first        → stuck
├── agent-1b   strategy: reproduce-first   → wrong fix
├── agent-1c   strategy: trace-backwards   → partial
├── agent-1d   strategy: isolate-service   → red herring
└── agent-1e   strategy: bisect-commits    → close

each subagent: isolated VM · agent traces captured
observer extracts → observations.md → next generation

observation priorities:
  [important]       errors, crashes, failed fixes
  [possible]        files created, refactors attempted
  [informational]   session goals, token usage, dead ends
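The fan-out in the diagram above can be sketched as a small parallel runner. The strategy names are taken from the diagram; `run_generation` and the `run_subagent` callable are hypothetical names, and a thread pool stands in here for whatever the harness actually uses to launch isolated VMs.

```python
from concurrent.futures import ThreadPoolExecutor

# Strategy names come from the diagram; everything else is illustrative.
GEN1_STRATEGIES = {
    "agent-1a": "logs-first",
    "agent-1b": "reproduce-first",
    "agent-1c": "trace-backwards",
    "agent-1d": "isolate-service",
    "agent-1e": "bisect-commits",
}


def run_generation(strategies, run_subagent, observations=""):
    """Run one subagent per strategy in parallel and collect results by agent name.

    run_subagent(name, strategy, observations) is a hypothetical callable that
    would launch the isolated VM and return that run's outcome.
    """
    with ThreadPoolExecutor(max_workers=len(strategies)) as pool:
        futures = {
            name: pool.submit(run_subagent, name, strategy, observations)
            for name, strategy in strategies.items()
        }
        # result() blocks until each subagent finishes (or re-raises its error).
        return {name: future.result() for name, future in futures.items()}
```

Passing `observations` as an argument keeps the same runner usable for later generations, where the text from observations.md would be supplied instead of an empty string.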

Act II — Read the Agent Traces

The observer extracts what matters

Gen 1 is done. Most subagents failed. But every agent trace is captured — every tool call, every reasoning step, every dead end. The observer runs tapes search to query past traces, walks the conversation DAGs, and extracts prioritized observations. Errors get flagged [important]. Files created get tagged [possible]. Session goals and token usage become [informational]. All of it gets written to observations.md.
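To make "walking the conversation DAG" concrete, here is a minimal sketch over an assumed node layout: each node is stored under its hash and records the hash of its parent, with the root having no parent. This layout is an assumption for illustration only; the page does not describe how tapes stores traces, and `walk_to_root` and `find_leaves` are hypothetical helpers, not tapes commands.

```python
def walk_to_root(nodes, leaf_hash):
    """Return the conversation path from the root down to the given leaf.

    nodes: dict mapping node hash -> {"parent": parent hash or None, ...}
    """
    path = []
    seen = set()
    current = leaf_hash
    while current is not None:
        if current in seen:
            raise ValueError(f"cycle detected at {current}")
        seen.add(current)
        path.append(current)
        current = nodes[current]["parent"]
    path.reverse()  # collected leaf-to-root, returned root-to-leaf
    return path


def find_leaves(nodes):
    """Leaves are nodes that no other node names as its parent."""
    parents = {node["parent"] for node in nodes.values()}
    return sorted(h for h in nodes if h not in parents)
```

Under this layout, every leaf is the end of one line of attempted reasoning, so walking each leaf back to the root recovers both the successful path and the abandoned branches.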

inside harness — analyzing agent traces

inside harness — building observations

This is observational memory at work. The harness doesn't retry with the same prompt. It reads observations.md, understands what each subagent found and where each one got stuck, and designs the next batch around those observations. Gen 2 subagents don't explore blindly — they start with the root cause already identified, the regression commit already narrowed, and the timing window already measured. Each generation accumulates observations. The knowledge compounds.
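One simple way to picture the injection step is prepending the text of observations.md to the task prompt. The sketch below assumes exactly that; the function name, the prompt wording, and the `max_chars` budget are all illustrative, and it assumes the file lists the highest-priority observations first so that truncation drops the least important entries.

```python
def build_next_generation_prompt(task, observations_md, max_chars=8000):
    """Prepend accumulated observations to the task prompt for the next generation."""
    notes = observations_md.strip()
    if len(notes) > max_chars:
        # Keep the head of the file: assumed to hold the highest-priority entries.
        notes = notes[:max_chars].rstrip() + "\n[truncated]"
    if not notes:
        return task  # generation 1 has nothing to inherit
    return (
        "Observations from earlier generations (do not repeat failed fixes):\n"
        f"{notes}\n\n"
        f"Task:\n{task}"
    )
```

With an empty observations file the prompt is just the task, which matches the first generation exploring without prior knowledge.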

↻ generations 2 through 4 running autonomously ↻

Act III — Convergence

Generation 5 cracks it

Each generation got closer. The agent traces captured the trajectory — from scattered exploration in Gen 1 to focused, surgical debugging by Gen 5. One agent nails both the root cause and the fix. The harness extracts the best observations and writes the final report.

inside harness — gen 5 results

Two lines tell the story. Observations accumulate as each generation's agent traces get processed — the knowledge base grows from 14 entries to 97. As observations pile up, more subagents converge on a working solution. Gen 1 explored blind and none solved it. By Gen 5, four out of five subagents produced a validated fix — they started with 97 observations telling them exactly where to look and what to avoid.

[Chart: two panels, gen 1 through gen 5. Left panel, observations (cumulative): rising to 97 at gen 5. Right panel, subagents that solved it: rising from 0 at gen 1 to 4/5 at gen 5.]

The Results

You inspect the observation trail

Open your laptop. The harness finished overnight. 25 agent runs across 5 generations — every agent trace captured by tapes. The accumulated observations, the convergence path, and the fix are all in the report.

~ — you, inspecting results

This is the unlock. Subagents that build institutional knowledge from failure. Every agent trace is captured. The observer distills traces into prioritized observations. Each generation inherits what the last one learned — every dead end, every partial success, every wrong turn recorded in observations.md. The failure space becomes a map, and the observations teach the next batch how to read it.

◆ the infrastructure that makes this possible ◆

Kafka + Flink + Snowflake

25 subagents generate a lot of agent traces

5 subagents per generation, 5 generations, hundreds of trace events each. Every LLM request, every reasoning step, every file read: tapes captures agent traces for all of it. That's a firehose. SQLite works for post-mortem analysis, but to self-heal in real time, you need a streaming pipeline. tapes publishes every trace event to Kafka. Flink SQL processes the stream and fires alerts. Snowflake sinks every trace for long-term analysis: cost per model, failure patterns across runs, convergence rates over time. With that pipeline, the harness can react mid-generation by killing stuck subagents, reallocating resources, and folding failures into the next batch before it even launches.

subagent VMs (5 per gen, all streaming concurrently)
  │  every LLM call routed through tapes proxy
  ▼
tapes proxy → intercepts traces, publishes to kafka
  ▼
kafka topic: agent.telemetry.raw
  │  key: root_hash (one stream per conversation)
  │  volume: ~12k events across 25 subagent runs
  ├────────────┬────────────┬────────────┐
  ▼            ▼            ▼            ▼
flink sql    telemetry    subagent     snowflake
  │ anomaly    consumer     harness      sink
  │ detection  (live)       (self-heal)  (long-term)
  ▼
agent.telemetry.alerts
  │
  └──▶ harness kills stuck agent, captures traces, fixes
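The keying choice in the diagram (one stream per conversation, keyed by root_hash) can be sketched as below. The topic name comes from the diagram; the event fields and `publish_trace_event` are hypothetical. The function accepts any producer object with a kafka-python style `send(topic, value=..., key=...)` method, so it is not tied to how tapes itself publishes.

```python
import json

RAW_TOPIC = "agent.telemetry.raw"  # topic name from the diagram


def publish_trace_event(producer, event):
    """Send one trace event, keyed by its conversation's root_hash.

    `producer` is any object with a kafka-python style
    send(topic, value=..., key=...) method; the event fields are hypothetical.
    """
    key = event["root_hash"].encode("utf-8")
    value = json.dumps(event, sort_keys=True).encode("utf-8")
    return producer.send(RAW_TOPIC, value=value, key=key)
```

With kafka-python, the producer could be created as `KafkaProducer(bootstrap_servers="localhost:9092")`, where the broker address is a placeholder. Because Kafka's default partitioner routes equal keys to the same partition, keying by root_hash keeps each conversation's events in order for downstream consumers.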

Why Kafka? Scale. A single subagent run generates hundreds of agent trace events — tool calls, LLM responses, token counts, reasoning steps. Multiply that by 5 subagents per generation and 5 generations, and you're pushing 12,000+ trace events through the pipeline. Kafka handles the throughput. Flink processes traces in real time. Snowflake sinks every trace for historical analysis — query cost-per-fix across months, find which strategies converge fastest, build dashboards on agent behavior at scale.

telemetry consumer — agent.telemetry.raw

alerts consumer — agent.telemetry.alerts

Self-healing in real time. Without this layer, you wait for each generation to finish before learning anything. With Kafka + Flink, the harness sees a stuck subagent in 30 seconds — kills it, snapshots the agent traces, folds the failure into observations, and spawns a replacement. All within the same generation. Meanwhile, every trace lands in Snowflake — so tomorrow you can query which failure patterns burned the most tokens and which strategies converged fastest.
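The kill-and-replace loop can be sketched in a few lines. In the pipeline described here the detection happens in Flink SQL; the Python below is only a stand-in for that logic, using the 30-second figure from the text. The `last_progress` map and the `kill` and `respawn` hooks are hypothetical.

```python
STUCK_AFTER_SECONDS = 30  # threshold from the text


def find_stuck_agents(last_progress, now, threshold=STUCK_AFTER_SECONDS):
    """Return agents whose last progress event is at least `threshold` seconds old.

    last_progress: dict mapping agent name -> timestamp (seconds) of the most
    recent trace event that counted as progress.
    """
    return sorted(
        agent for agent, ts in last_progress.items() if now - ts >= threshold
    )


def heal(last_progress, now, kill, respawn):
    """Kill and replace every stuck agent; returns the names that were replaced."""
    stuck = find_stuck_agents(last_progress, now)
    for agent in stuck:
        kill(agent)     # hypothetical hook: stop the VM and snapshot its traces
        respawn(agent)  # hypothetical hook: start a replacement in the same generation
    return stuck
```

What counts as "progress" is the real design decision: an agent that keeps emitting tool calls while looping on the same error is arguably stuck too, so a production detector would likely look at repeated patterns as well as silence.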

Failure is the training data.

Agent traces captured. Observations accumulated. The next generation starts where the last one failed.

Get started →
