Product · March 17, 2026

Logs Are the Self-Healing Feedback Loop

bdougie, Co-Founder

Something funny happens when you watch an agent play Pokémon in the terminal.

You see lines like this scroll by:

NAV | Moving down toward (10, 4)
BATTLE | Won! Charmander defeated Squirtle
MILESTONE | Reached Route 1!
BACKTRACK | Cleared pre-party snapshots

At first glance it looks like debugging output, maybe something Claude Code is generating. But that’s not what’s happening. Those lines are produced by the agent itself. It narrates its own decisions: every navigation choice, battle outcome, and backtrack trigger.

The agent runs headless inside a stereOS VM. PyBoy emulates Pokémon Red with no display server, running roughly 100× faster than real time. No emulator window. No vision model. Just memory reads and button inputs. Claude Code isn’t generating that output. It’s piping raw stdout from the subprocess.
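
For the curious, the headless setup is only a few lines. A minimal sketch, assuming PyBoy v2’s API and the commonly documented Pokémon Red coordinate addresses; the actual agent code differs:

from pyboy import PyBoy

pyboy = PyBoy("pokemon_red.gb", window="null")  # no display server
pyboy.set_emulation_speed(0)                    # uncapped, far faster than real time

pyboy.button("down")      # queue a button input
pyboy.tick()              # advance one frame

x = pyboy.memory[0xD362]  # player X (wXCoord in the pokered disassembly)
y = pyboy.memory[0xD361]  # player Y (wYCoord)
print(f"NAV | Moving down toward ({x}, {y})")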

The logs are the only window into the system.

So the agent was intentionally designed to be verbose. You should be able to read the terminal output and follow every decision step by step. That turns out to matter a lot. Because those same logs power everything that comes next.

The problem everyone keeps running into

If you’ve spent time working with agents, this probably sounds familiar.

An agent can work on something for hours, explore solutions, figure out what fails, and then the session ends. Everything it just learned disappears. Every new run starts from zero. That means every few hours of work feels like onboarding a brand-new intern.

This frustration is everywhere right now.

The default answer has been memory. Agent memory files. Markdown summaries. Vector databases. But a few hours of real work already produces thousands of tokens. Compress that aggressively and you lose the important details. Don’t compress it enough and the context window becomes unusable.

The system either forgets what matters or overwhelms itself trying to remember everything.

What’s missing isn’t memory. It’s observation.

Agents don’t need to remember everything they said. They need to notice the patterns in what they did. Where they got stuck. What strategies failed. Which approaches actually worked.

From logs to telemetry

Those exact agent log lines stream through a tapes proxy into Kafka. From there, the data fans out:

Agent → tapes Proxy → Kafka (agent.telemetry.raw)
                          ├→ telemetry-consumer (JSONL)
                          ├→ Flink SQL (anomaly detection)
                          │    └→ Kafka (agent.telemetry.alerts)
                          └→ DuckDB (ad-hoc queries)
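
The proxy leg is conceptually simple: tail the agent’s stdout, forward each prefixed line. A hedged sketch using kafka-python; the topic name comes from the diagram above, the field names are illustrative assumptions:

import json, sys, time
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

for raw in sys.stdin:                  # the agent's piped stdout
    line = raw.strip()
    if "|" not in line:
        continue                       # forward only prefixed telemetry lines
    prefix, message = line.split("|", 1)
    producer.send("agent.telemetry.raw", {
        "ts": time.time(),
        "kind": prefix.strip(),        # NAV, BATTLE, MILESTONE, BACKTRACK
        "message": message.strip(),
    })
producer.flush()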

Flink SQL jobs run anomaly detection in real time:

STUCK_LOOP: the agent repeats the same position more than ten times within thirty seconds (a sketch of this rule follows below).

TOKEN_SPIKE: token usage suddenly jumps beyond twice the rolling average.
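
Here is what the STUCK_LOOP job could look like in pyflink. The topic, threshold, and window come from above; the table definitions, column names, and the five-second slide are assumptions, and the DDL that registers both tables is omitted:

from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# Assumes agent_telemetry_raw and agent_telemetry_alerts are Kafka-backed
# tables already registered, with event_time as a watermarked rowtime column.
t_env.execute_sql("""
    INSERT INTO agent_telemetry_alerts
    SELECT session_id,
           'STUCK_LOOP' AS alert_type,
           position,
           COUNT(*)     AS repeats,
           window_start,
           window_end
    FROM TABLE(
        HOP(TABLE agent_telemetry_raw,
            DESCRIPTOR(event_time),
            INTERVAL '5' SECOND,     -- slide
            INTERVAL '30' SECOND))   -- window size from the rule above
    GROUP BY session_id, position, window_start, window_end
    HAVING COUNT(*) > 10
""")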

Alerts feed back into SQLite as nodes, while the JSONL sink and DuckDB handle cross-session queries. Kafka and Flink handle real-time reinforcement learning; JSONL and DuckDB handle observational memory.
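
Cross-session questions then become one-liners. A sketch of an ad-hoc DuckDB query over the JSONL sink; the file layout and column names are assumptions:

import duckdb

duckdb.sql("""
    SELECT kind, COUNT(*) AS events
    FROM read_json_auto('telemetry/*.jsonl')
    GROUP BY kind
    ORDER BY events DESC
""").show()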

The agent isn’t just logging what happened. It’s generating the signals that guide the next run.

What the logs revealed

The hard parts of Pokémon Red were not what I expected. They were game mechanics invisible to an agent.

There is an 8-second cooldown on every door transition. The game enforces this so you don’t accidentally walk back through a door you just entered. The agent had no idea. It would walk through the bedroom door, immediately try to move, hit the cooldown, and interpret the lack of movement as being stuck. Hundreds of turns burned before I figured it out.

Talking to NPCs turned out to be critical. Without NPC conversations, the agent wandered the map aimlessly. The context clues for what to do next, where to go, what items to find, all of that comes from dialogue.

All of this nuance had to be discovered through trial and error, logged in the Pokédex, and fed back into the next run. The telemetry pipeline is what made those discoveries persist. Without it, each generation of agents would rediscover the same traps from scratch.
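
The fixes themselves are usually tiny once the trap has a name. An illustrative guard for the door cooldown, not the actual agent code; the 8-second figure is the one discovered above:

import time

DOOR_COOLDOWN_S = 8.0  # discovered through trial and error, documented nowhere

def is_stuck(pos, last_pos, last_door_time):
    if time.time() - last_door_time < DOOR_COOLDOWN_S:
        return False   # movement is expected to fail right after a door
    return pos == last_pos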

The evolution loop

I took inspiration from ClaudePlaysPokémon and DeepMind’s AlphaEvolve paper. My initial approach was brute force in a single-threaded loop, but I quickly started spinning up 10 agents at a time to speedrun, then learn from the results.

The agent treats its navigation parameters (stuck threshold, door cooldown, waypoint skip distance, axis preference) as a genome. Each generation, 10 variants run in parallel. A fitness function scores them. The winner survives.
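
In sketch form, one generation looks roughly like this. The parameter names are the ones above; the defaults, mutation step, and fitness function are assumptions:

import random
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class Genome:
    stuck_threshold: int = 10
    door_cooldown: float = 8.0
    waypoint_skip: int = 3
    axis_preference: str = "x_first"

def mutate(g: Genome) -> Genome:
    return replace(
        g,
        stuck_threshold=max(1, g.stuck_threshold + random.randint(-2, 2)),
        waypoint_skip=max(1, g.waypoint_skip + random.randint(-1, 1)),
    )

def run_generation(parent: Genome, fitness, n: int = 10) -> Genome:
    variants = [parent] + [mutate(parent) for _ in range(n - 1)]
    # in practice the n variants run in parallel; fitness() scores each run
    return max(variants, key=fitness)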

But before AlphaEvolve kicks in, I run a Factorial Learning Environment pass. FLE explores the space: if you take two steps forward and get blocked, take three steps back and reevaluate. The agent figures out what works and writes the successful route to JSON inside the Pokédex. The next time it runs, it already knows where the door is.
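
Persisting a discovered route is deliberately boring. An illustrative version; the Pokédex file layout is an assumption:

import json
from pathlib import Path

def save_route(pokedex_dir: str, name: str, waypoints: list) -> None:
    path = Path(pokedex_dir) / "routes.json"
    routes = json.loads(path.read_text()) if path.exists() else {}
    routes[name] = waypoints           # e.g. [[10, 4], [10, 8], [12, 8]]
    path.write_text(json.dumps(routes, indent=2))

save_route("pokedex", "bedroom_to_route_1", [[10, 4], [10, 8], [12, 8]])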

FLE builds the map. AlphaEvolve optimizes the route.

Cold start: 1 out of 10 generations improved. With historical telemetry: 4 out of 10.

Run   Historical entries   Gens improved   Final score
1     0                    1/10            39,415
2     10                   3/10            12,836
3     20+                  3/10            17,319
4     30+                  4/10            39,423

Run 4 explored for 7 generations before finding a breakthrough at Gen 8 by touching parameters no previous run had tried. The historical observer showed the standard parameter space was exhausted, which pushed the LLM to explore new dimensions. That’s the feedback loop in action: signals from past runs shaping the decisions of future ones.

Observational memory

The observer reads the tapes database after each run. It extracts noteworthy events through heuristic pattern matching: errors, file creations, token usage anomalies. It tags them by priority and writes them to markdown files alongside the database. No LLM calls required.
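
In spirit it is a regex pass over the session events. A hedged sketch; the table and column names are guesses at the tapes schema, not its real one:

import re, sqlite3

PATTERNS = [
    ("high",   re.compile(r"error|exception|traceback", re.I)),
    ("medium", re.compile(r"STUCK_LOOP|TOKEN_SPIKE")),
    ("low",    re.compile(r"MILESTONE|created file", re.I)),
]

def observe(db_path: str = "tapes.db", out: str = "observations.md") -> None:
    rows = sqlite3.connect(db_path).execute(
        "SELECT ts, message FROM events ORDER BY ts").fetchall()
    with open(out, "w") as f:
        for ts, msg in rows:
            for priority, pattern in PATTERNS:
                if pattern.search(msg):
                    f.write(f"- [{priority}] {ts}: {msg}\n")
                    break              # first matching priority wins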

  Agent reads observations.md + historical_insights.md at startup
                │
                ▼
  Agent runs → tapes Proxy → Kafka (agent.telemetry.raw)
                                ├→ Flink (real-time alerts)
                                ├→ JSONL sink
                                └→ DuckDB (cross-session queries)
                 ┌────────────────────────┘
                 ▼
  Historical Observer extracts patterns:
  token trends, recurring failures, efficiency deltas
                │
                ▼
  observations.md (single-session) + historical_insights.md (cross-session)
                │
                ▼
  Next generation reads these before mutating parameters

Battle memory works the same way. When the agent discovers a Pokémon’s weakness, it logs that in the Pokédex. The next time it encounters that Pokémon, it already knows the matchup. It sets up its team composition before the fight instead of discovering the weakness mid-battle. Losing a battle is the equivalent of a failed test. The agent goes back, iterates, and comes back with a better loadout.
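
The mechanism is the same JSON-in-the-Pokédex trick. An illustrative version; the file layout is an assumption:

import json
from pathlib import Path

MATCHUPS = Path("pokedex/matchups.json")

def record_weakness(opponent: str, winning_type: str) -> None:
    data = json.loads(MATCHUPS.read_text()) if MATCHUPS.exists() else {}
    data.setdefault(opponent, []).append(winning_type)
    MATCHUPS.write_text(json.dumps(data, indent=2))

def best_counter(opponent: str):
    if not MATCHUPS.exists():
        return None
    known = json.loads(MATCHUPS.read_text()).get(opponent)
    return known[-1] if known else None   # most recently confirmed counter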

The readable log prefixes (NAV |, BATTLE |, BACKTRACK |, MILESTONE |) were originally added so a human could follow the terminal output. But they also made pattern matching reliable. Flink can detect loops because the structure is predictable. The observer can classify anomalies because the keywords are consistent. The observability system emerged from something designed for readability.
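
That reliability is easy to see. One illustrative regex, matching the NAV format shown at the top of the post:

import re

NAV_RE = re.compile(r"^NAV \| Moving (\w+) toward \((\d+), (\d+)\)$")

m = NAV_RE.match("NAV | Moving down toward (10, 4)")
assert m is not None
direction, x, y = m.group(1), int(m.group(2)), int(m.group(3))
# repeated (x, y) pairs across consecutive matches is the STUCK_LOOP signal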

The gap

There is a clear gap. Every run eventually hits a plateau where the LLM proposes near-identical variants for multiple consecutive generations. The historical observer records the convergence but nothing acts on it yet. Run 4’s breakthrough happened despite that gap, not because of a designed escape mechanism.

Closing that loop, detecting convergence and injecting a diversification signal automatically, is the next step. That is the self-healing piece.
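
To be clear, this piece doesn’t exist yet. But its shape is easy to imagine; one plausible sketch, reusing the Genome dataclass from earlier, with every threshold invented:

def converged(recent: list, min_distinct: int = 3) -> bool:
    # Genome is a frozen dataclass, so variants are hashable and comparable
    return len(set(recent[-5:])) < min_distinct

def next_mutation_scale(base: float, recent: list) -> float:
    # widen the search when the last generations collapsed to near-identity
    return base * 5.0 if converged(recent) else base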

Sweeper: the same loop, applied to real codebases

Sweeper is a tool I built to show this pattern works beyond games. It takes Karpathy’s autoresearcher a step further.

You point it at any linter. It groups issues by file and fans out parallel Claude Code sub-agents. Each sub-agent is a stateless claude --print process. It reads one file, applies the fix, exits.

sweeper run --vm -c 10

          ┌─────────┼─────────┐
          ▼         ▼         ▼
    ┌──────────────────────────────┐
    │        Worker Pool           │
    │   (semaphore-bounded, N=10)  │
    └──┬───┬───┬───┬───┬───┬──────┘
       ▼   ▼   ▼   ▼   ▼   ▼
     ┌───┐┌───┐┌───┐┌───┐┌───┐
     │VM ││VM ││VM ││VM ││VM │ ◄── stereOS isolation
     └─┬─┘└─┬─┘└─┬─┘└─┬─┘└─┬─┘
    claude claude claude claude claude
       └─────┴──┬──┴─────┴─────┘

      streaming + telemetry + tapes
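
The fan-out itself can be sketched in a few lines of asyncio. The --print flag is the one mentioned above and the semaphore bound matches -c 10; the prompt wording and function names are illustrative:

import asyncio

async def fix_file(sem: asyncio.Semaphore, path: str, issues: str) -> int:
    async with sem:                    # bound concurrency, like -c 10
        proc = await asyncio.create_subprocess_exec(
            "claude", "--print",
            f"Fix these lint issues in {path}:\n{issues}",
        )
        return await proc.wait()       # stateless: one file, one fix, exit

async def sweep(files: dict, concurrency: int = 10) -> None:
    sem = asyncio.Semaphore(concurrency)
    await asyncio.gather(*(fix_file(sem, p, i) for p, i in files.items()))

# asyncio.run(sweep({"src/app.ts": "no-unused-vars at line 12"}))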

Every sub-agent session records to tapes. Token spend per linter, strategy effectiveness by round, whether you’re trending toward more fixes with fewer tokens over time. Run sweeper observe after a sweep and you get actual data on what’s working.

Karpathy’s autoresearcher runs a similar loop: agent edits code, runs experiment, evaluates, keeps or reverts. 100 experiments overnight on one GPU. Good design. But each run is stateless. Sweeper differs because tapes gives it observational memory. Stalled files get escalated based on what actually failed before, not a generic retry. Over time the system learns which approaches work for which categories of issues. That’s the difference between running experiments and accumulating knowledge.

I used contributor.info as a test case. 1,992 ESLint errors. Auto-fix handled 1,150 formatting issues. Sweeper dispatched 5 parallel agents across the remaining 842 errors in 99 files. Three rounds, 100% fix rate, ~54 minutes wall clock (PR #1741). Then I pointed it at the docs: 15 stale feature pages rewritten by parallel agents, each one taking a page, rewriting content, and adding screenshots from the live app (PR #1745). Same tool, different prompt. Sweeper doesn’t care if it’s fixing lint or rewriting prose.

Where this goes

The Pokémon agent wasn’t really about Pokémon. Sweeper wasn’t really about lint. Both were experiments in what happens when agents do real work and you actually record what they did.

The feedback loop (agent runs, telemetry persists, observer surfaces patterns, next run reads those patterns) applies well beyond games. What I’m building is essentially a runbook. A runbook to speedrun Pokémon. If anybody needs to go beat Brock or Misty, the pathway is already mapped.

That same pattern applies to large-scale refactors where each PR is a generation. Sprint telemetry revealing which modules have the highest revision rates. 5,000-line files that need to be broken apart. The iteration loop stays the same: run, observe, learn, run again.

The question isn’t whether agents can run tasks. We already know they can. The question is whether they can get better every time they run.

Try it

tapes records every agent session as structured telemetry. Every API call, every tool use, every decision, stored in a local SQLite database you can query, replay, and feed into the next run. If your agents are forgetting everything between sessions, this is the fix.

stereOS gives each agent its own isolated VM. Own CPU, own memory, API keys injected into the VM and never touching the host filesystem. When you’re running 10 agents in parallel, you need failures and credentials to stay contained. Clean teardown on exit, success or failure.

Agents that record what they do can learn from what they did. That’s the whole idea.

