Agent Played Pokémon for 1,000 Turns. It Never Left the Bedroom

I grew up playing Pokémon Red on a Game Boy Color. I had one strategy: get to Oak’s lab as fast as possible, pick a starter, and start grinding. No talking to NPCs. No exploring. Just speed.

So when I sat down to build an autonomous Pokémon Red agent, I had the same plan. Get to Oak’s lab. Choose Squirtle. Beat the rival. Simple milestones with a clear path.

The agent played for 1,000 turns. It never left the bedroom.

The First Evening

Getting the agent running inside a stereOS VM took most of the first evening. The interesting part was the shared mount — the VM maps the host directory into /workspace using virtio-fs, so frames and logs the agent writes inside the VM appear instantly on my Mac. But the host files are owned by macOS UID 501 and the VM runs as UID 1000, so output directories need open permissions or writes silently fail. That bidirectional mount is what makes the whole feedback loop work: the agent runs headlessly inside the VM, and I watch its screenshots and session logs update in real time on the host.
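A minimal sketch of the permission fix, assuming the agent creates its own output directories inside the mount. The `frames` and `logs` directory names and the `ensure_writable_dir` helper are hypothetical; only the /workspace mount and the UID mismatch come from the setup described above.

```python
import os
import stat
import tempfile
from pathlib import Path


def ensure_writable_dir(path: Path) -> Path:
    """Create `path` if needed and open its permissions to every user.

    World-writable (0o777) is an assumption that suits a throwaway dev VM,
    where the host user (UID 501) and the guest user (UID 1000) both write.
    """
    path.mkdir(parents=True, exist_ok=True)
    os.chmod(path, 0o777)  # chmod is not filtered by the umask, unlike mkdir(mode=...)
    return path


# Demo against a temp dir; inside the VM the root would be Path("/workspace").
root = Path(tempfile.mkdtemp())
for name in ("frames", "logs"):  # hypothetical output directories
    out = ensure_writable_dir(root / name)
    print(out.name, oct(stat.S_IMODE(out.stat().st_mode)))
```

Setting the mode explicitly at startup turns a silent write failure into something that either works or raises immediately.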

Once it booted, the agent completed Oak’s intro and landed in the bedroom. Then it stopped moving.

Red standing in his bedroom, stuck checking the SNES for 1,000 turns

I watched the logs scroll:

OVERWORLD | Map: 38 | Pos: (3, 6) | Action: a | Stuck: 0
OVERWORLD | Map: 38 | Pos: (3, 6) | Action: a | Stuck: 0
OVERWORLD | Map: 38 | Pos: (3, 6) | Action: a | Stuck: 0

A thousand times. Same position. Same action. Stuck counter at zero — the agent didn’t even know it was stuck. It was pressing A on an empty room, convinced a text box was open. The game was waiting for Red to walk. The agent was waiting for a dialogue that didn’t exist.
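For illustration, here is a sketch of a stuck counter that would have caught this. It is not the agent's original code. It assumes the counter is fed the raw position every turn, whatever state the agent believes it is in, so a false "text box open" signal cannot hide a frozen position. The `StuckCounter` name and the `(map_id, x, y)` tuple layout are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

Position = Tuple[int, int, int]  # (map_id, x, y), layout assumed for this sketch


@dataclass
class StuckCounter:
    """Counts consecutive turns in which the position did not change."""

    last: Optional[Position] = None
    count: int = 0

    def update(self, position: Position) -> int:
        if position == self.last:
            self.count += 1
        else:
            self.count = 0  # any movement resets the counter
        self.last = position
        return self.count


counter = StuckCounter()
for _ in range(5):
    stuck = counter.update((38, 3, 6))  # same map and tile as in the log above
print(stuck)
```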

The problem was a single memory address. The agent read 0xC4F2 to detect text boxes. After the intro, that address held the value 16. The agent treated any nonzero value as “text box active, press A.” But 0xC4F2 isn’t a game state flag. It’s a position in the Game Boy’s background tile map. The agent couldn’t tell a floor tile from a text box.

Walking in Circles

The Pokémon Red agent spinning in Red's bedroom, stuck at position (3, 6) for 1,000 turns

Fixing the memory address, by swapping 0xC4F2 for the actual game state flags at wd730, got the agent moving. But moving isn’t navigating.
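The difference between the two reads can be sketched as below. Memory is faked with a plain dict so the example runs without an emulator. The address 0xD730 is assumed from the symbol name wd730, and `TEXT_FLAG_MASK` is a hypothetical bit chosen for illustration, so check the disassembly before relying on either.

```python
from typing import Mapping

TILEMAP_ADDR = 0xC4F2   # a cell in the background tile map
FLAGS_ADDR = 0xD730     # wd730; address assumed from the symbol name
TEXT_FLAG_MASK = 0x40   # hypothetical bit, for illustration only


def text_box_active_buggy(mem: Mapping[int, int]) -> bool:
    """Original heuristic: any nonzero tile ID counts as an open text box."""
    return mem[TILEMAP_ADDR] != 0


def text_box_active_flag(mem: Mapping[int, int]) -> bool:
    """Fixed heuristic: test one bit of a game-state flag byte."""
    return bool(mem[FLAGS_ADDR] & TEXT_FLAG_MASK)


# Bedroom after the intro: the tile cell holds 16, and the flag byte is
# assumed to be clear because no dialogue is running.
bedroom = {TILEMAP_ADDR: 16, FLAGS_ADDR: 0x00}
print(text_box_active_buggy(bedroom))  # True: the false positive
print(text_box_active_flag(bedroom))   # False: free to walk
```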

The agent tried to traveling-salesman its way to Oak’s lab without talking to anyone. It picked a direction, walked until it hit a wall, picked another direction. No pathfinding. No map awareness. Just random walks with a vague compass heading toward the lab. It walked out the front door and immediately walked back in. Out. In. Out. In. The door was on the path to the nav target, and the agent had no concept of “I was just in there.”

I started logging conversations for context clues. Where had the agent been? What did it see? What decisions did it make at each fork? Without structured session data, debugging was archaeology — reading raw frame dumps and trying to reconstruct intent from button presses.

This is when I wired up tapes. Every interaction proxied and recorded. Every decision searchable. I could replay a stuck sequence and see exactly which memory read produced which action. The door loop became obvious in thirty seconds of session replay — what had taken an hour to find from frame screenshots.
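To show why structured records beat frame dumps, here is a small sketch of a searchable decision log. It is not the tapes API or its storage format. The `decisions` table, its columns, the `record` helper, and the map IDs in the demo data are all hypothetical.

```python
import json
import sqlite3

db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE decisions ("
    "turn INTEGER PRIMARY KEY, map_id INTEGER, x INTEGER, y INTEGER, "
    "reads TEXT, action TEXT)"
)


def record(turn, map_id, pos, reads, action):
    """Store one decision together with the reads that led to it."""
    db.execute(
        "INSERT INTO decisions VALUES (?, ?, ?, ?, ?, ?)",
        (turn, map_id, pos[0], pos[1], json.dumps(reads), action),
    )


# Illustrative door loop: the agent alternates between two made-up map IDs.
for turn in range(1, 7):
    inside = turn % 2 == 1
    record(turn, 37 if inside else 0, (3, 7) if inside else (5, 5),
           {"nav_target": "lab"}, "down" if inside else "up")

# Find every turn on which the map changed compared with the previous turn.
rows = db.execute("SELECT turn, map_id FROM decisions ORDER BY turn").fetchall()
transitions = [
    (cur[0], prev[1], cur[1])
    for prev, cur in zip(rows, rows[1:])
    if cur[1] != prev[1]
]
print(f"{len(transitions)} map changes in {len(rows)} turns")
```

A map change on nearly every turn is the door loop, and it shows up in one query instead of an hour of screenshots.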

The Second Evening

The second evening was about the algorithm. With telemetry in place, every bug had a paper trail.

First came a collision map that read PyBoy’s walkability grid directly — the agent could see walls before hitting them. Blind direction-cycling gave way to A* pathfinding. A door cooldown prevented re-entering buildings just exited. Oscillation detection tracked the last eight positions instead of just the previous one.
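A minimal sketch of two of those pieces: A* over a walkability grid, and an oscillation check over the last eight positions. The grid format (rows of 0/1, indexed `grid[y][x]`) and both function names are assumptions for this example, not the agent's actual code or PyBoy's data layout.

```python
import heapq
from collections import deque
from typing import Deque, List, Optional, Sequence, Tuple

Cell = Tuple[int, int]  # (x, y)


def astar(grid: Sequence[Sequence[int]], start: Cell, goal: Cell) -> Optional[List[Cell]]:
    """Shortest 4-connected path over walkable cells (truthy grid[y][x])."""

    def h(c: Cell) -> int:  # Manhattan distance, admissible on a grid
        return abs(c[0] - goal[0]) + abs(c[1] - goal[1])

    open_heap = [(h(start), 0, start)]
    came_from = {start: None}
    cost = {start: 0}
    while open_heap:
        _, g, cur = heapq.heappop(open_heap)
        if cur == goal:
            path = []
            while cur is not None:
                path.append(cur)
                cur = came_from[cur]
            return path[::-1]
        if g > cost[cur]:
            continue  # stale heap entry, a cheaper route was found already
        x, y = cur
        for nxt in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            nx, ny = nxt
            if 0 <= ny < len(grid) and 0 <= nx < len(grid[0]) and grid[ny][nx]:
                ng = g + 1
                if ng < cost.get(nxt, float("inf")):
                    cost[nxt] = ng
                    came_from[nxt] = cur
                    heapq.heappush(open_heap, (ng + h(nxt), ng, nxt))
    return None  # goal unreachable


def is_oscillating(history: Deque[Cell], max_distinct: int = 2) -> bool:
    """True when a full history window covers only a couple of tiles."""
    return len(history) == history.maxlen and len(set(history)) <= max_distinct


grid = [
    [1, 0, 1],
    [1, 0, 1],
    [1, 1, 1],
]
path = astar(grid, (0, 0), (2, 0))
print(path)

history: Deque[Cell] = deque(maxlen=8)  # last eight positions, as in the text
for i in range(8):
    history.append((3, 6) if i % 2 == 0 else (3, 7))
print(is_oscillating(history))
```

Comparing only against the previous position misses a two-tile bounce. A window of eight catches it, at the cost of a short delay before the detector fires.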

Next, an observation layer on top of tapes. A pure-stdlib reader that parses the session database, extracts errors, stuck events, and token usage, and writes prioritized observations the agent can read back. The agent started learning from its own history.
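The reader can be sketched with nothing but the standard library. The `events` table, its `kind`, `detail`, and `tokens` columns, and the priority order are hypothetical stand-ins, since the real session schema is not shown here.

```python
import sqlite3
from typing import List

PRIORITY = {"error": 0, "stuck": 1}  # lower sorts first; ordering is an assumption


def build_observations(db: sqlite3.Connection) -> List[str]:
    """Summarize errors, stuck events, and token usage, most urgent first."""
    rows = db.execute(
        "SELECT kind, detail, COUNT(*) FROM events "
        "WHERE kind IN ('error', 'stuck') GROUP BY kind, detail"
    ).fetchall()
    # Sort by kind priority, then by frequency (most frequent first).
    ranked = sorted(rows, key=lambda r: (PRIORITY[r[0]], -r[2], r[1]))
    notes = [f"[{kind}] {detail} (x{count})" for kind, detail, count in ranked]
    total = db.execute("SELECT COALESCE(SUM(tokens), 0) FROM events").fetchone()[0]
    notes.append(f"[usage] {total} tokens this session")
    return notes


# Demo session with made-up events.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE events (kind TEXT, detail TEXT, tokens INTEGER)")
db.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [
        ("stuck", "door loop", 0),
        ("stuck", "door loop", 0),
        ("stuck", "door loop", 0),
        ("error", "llm timeout", 0),
        ("llm_call", "choose move", 120),
        ("llm_call", "choose move", 80),
    ],
)
observations = build_observations(db)
print("\n".join(observations))
```

Writing the returned lines to a text file that the agent loads at startup would be one way to close the loop.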

The breakthrough was speed. PyBoy’s headless mode removes the 60fps cap and all rendering. The emulator runs roughly 100x faster than real-time. An entire run from boot to starter selection — what took me hours as a kid — finished in three seconds.

The agent selecting a starter Pokémon in Oak's lab

That unlocked everything. An AlphaEvolve-inspired evolution harness proposes parameter variants, races ten agents in parallel, and selects winners by fitness score. Door cooldown optimized from 8 turns down to 2 across two generations. All ten agents beat the rival. The winner finished in 650 turns.
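The loop of propose, race, and select can be sketched as follows. Everything here is a stand-in: `run_agent` is a toy fitness function instead of a real headless run, `mutate` uses random steps where the harness described above asks an LLM for variants, and the toy landscape puts its optimum at a cooldown of 2 only to mirror the story.

```python
import random
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, Tuple

Params = Dict[str, int]


def run_agent(params: Params) -> int:
    """Toy stand-in for one headless run; returns turns used (lower is better)."""
    return 650 + 40 * abs(params["door_cooldown"] - 2)  # invented landscape


def mutate(params: Params, rng: random.Random) -> Params:
    """Random proposal; a placeholder for an LLM-suggested variant."""
    child = dict(params)
    child["door_cooldown"] = max(0, params["door_cooldown"] + rng.randint(-3, 3))
    return child


def evolve(baseline: Params, generations: int, population: int,
           seed: int = 0) -> Tuple[Params, int]:
    rng = random.Random(seed)
    best, best_turns = baseline, run_agent(baseline)
    for _ in range(generations):
        variants = [mutate(best, rng) for _ in range(population)]
        # Race the whole population in parallel.
        with ThreadPoolExecutor(max_workers=population) as pool:
            scores = list(pool.map(run_agent, variants))
        turns, winner = min(zip(scores, variants), key=lambda pair: pair[0])
        if turns < best_turns:  # keep a variant only if it beats the incumbent
            best, best_turns = winner, turns
    return best, best_turns


baseline = {"door_cooldown": 8}
best, best_turns = evolve(baseline, generations=2, population=10)
print(best, best_turns)
```

Threads are enough for this toy. Real emulator runs are CPU-bound, so a process pool would likely be the better fit there.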

The agent's Squirtle battling the rival's Bulbasaur

The Milestones

I published the agent as an open source project with speed run milestones. The agent supports three strategy tiers — low (pure heuristic, no API calls), medium (LLM fallback when stuck), and high (LLM every turn). Fork it, improve the strategy, post your numbers:
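The three tiers boil down to a small dispatch rule. This is a sketch under assumptions: the `Tier` enum, the `choose_action` signature, and the stuck threshold of 8 are hypothetical, and the LLM call is stubbed with a plain function.

```python
from enum import Enum
from typing import Callable


class Tier(Enum):
    LOW = "low"        # pure heuristic, no API calls
    MEDIUM = "medium"  # LLM fallback when stuck
    HIGH = "high"      # LLM every turn


def choose_action(tier: Tier, stuck_turns: int,
                  heuristic: Callable[[], str], ask_llm: Callable[[], str],
                  stuck_threshold: int = 8) -> str:
    """Pick the next button press according to the strategy tier."""
    if tier is Tier.HIGH:
        return ask_llm()
    if tier is Tier.MEDIUM and stuck_turns >= stuck_threshold:
        return ask_llm()
    return heuristic()


def heuristic() -> str:
    return "up"      # stand-in for the pathfinding step


def ask_llm() -> str:
    return "left"    # stand-in for a model call


for tier in Tier:
    print(tier.value, choose_action(tier, stuck_turns=10,
                                    heuristic=heuristic, ask_llm=ask_llm))
```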

Milestone	Low	Medium	High
Get Squirtle + beat rival	~200	~200	~200
Reach Viridian City	~2,000	~1,000	~500
Reach Pewter City	~5,000	~3,000	~1,500
Beat Brock (1st gym)	~8,000	~5,000	~3,000
Clear Mt. Moon	~20,000	~10,000	~5,000
Elite Four	~300,000	~150,000	~80,000

What This Taught Me

Every failure in this project was invisible until I could see the state. The wrong memory address looked like the agent was working. The door loop looked like exploration. The oscillation looked like movement. Silent failures that present as progress are the hardest bugs in any system.

I’ve written about this pattern before, in two earlier posts: Agents Need Black Box Recorders, and If You Can’t See What Your Agent Did, Who’s Going to Trust It? The Pokémon agent made the theory concrete. The trajectory I described maps exactly to what happened over two evenings:

1. Recovery: replay what the agent saw. Find the wrong memory address.
2. Diagnosis: structured logs reveal the false text-box signal, the broken API, the oscillation pattern.
3. Prevention: collision maps and stuck detection catch failures before they loop.
4. Self-healing: the agent reads its own tapes, learns from previous sessions, and evolves its own parameters.

“I didn’t fix the agent by writing better prompts. I fixed it by recording everything and reading it back.”

What’s Next

The evolution harness is the beginning of something larger. Right now it works like this: the LLM proposes one parameter variant, the agent runs it, compares to baseline, keeps it if better. That’s the simplest version of a full evolutionary framework — and it’s already the foundation of how I think about agent improvement at Paper Compute.

The next step is closing the loop entirely. An agent that records its own sessions, observes its own failures, proposes its own mutations, and evaluates itself — without a human in the middle. Not a smarter model. A system that gets smarter by watching itself.

The Pokémon agent is a toy. The pattern isn’t. Every agent in production will hit its own version of 0xC4F2 — a silent failure that looks like success until someone checks the logs. The question is whether those logs exist.

Tapes started as the debugging tool for this agent. The problem turned out to be universal. Tapes sits between your agent and your model provider, recording every request and response. No code changes. No SDK. If it happened, it’s recorded. The repo is open source.

You can’t heal what you can’t see. Start by seeing everything.