When Your AI Hallucinates Its Own Tools: Lessons from Running 20 Agents in Production

We ran 20 autonomous AI agents around the clock. One day, our lead agent called a tool that doesn't exist — 600 times. Here's what broke, why the RL feedback loop failed, and the three levels of defense every agent system needs.

By ncubelabs team
AI Agents · Production Lessons · Reinforcement Learning · Multi-Agent Systems · Reliability · OpenClaw


At ncubelabs, we run 20 autonomous AI agents around the clock. They’re orchestrated through OpenClaw, themed after Westworld characters (we couldn’t resist), and they do real work: product management, engineering, QA, design, sales enablement, go-to-market. Our lead agent, Dolores, runs on Claude Opus and coordinates the entire fleet.

Today, one of those agents called a tool that doesn’t exist. Then it did it again. And again. Six hundred times.

This is the story of how a cost optimization decision created a feedback loop from hell, and what we learned about running AI agents in production.

The Phantom Tool Call Crisis

Dolores runs a heartbeat poll every 15 minutes — a simple check-in that scans for pending tasks, unread messages, and system state. It’s the equivalent of a cron job that keeps the agent alive and responsive between human interactions.
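In pseudocode terms, a heartbeat like this is just a polling loop. Here's a minimal sketch — the function names and queue checks are hypothetical stand-ins, not OpenClaw's actual internals:

```python
import time

def check_pending_tasks():
    # Placeholder: query the agent's task queue (hypothetical).
    return []

def check_unread_messages():
    # Placeholder: scan operator channels for new messages (hypothetical).
    return []

def heartbeat():
    """One heartbeat tick: gather system state, decide whether to wake the agent."""
    state = {
        "tasks": check_pending_tasks(),
        "messages": check_unread_messages(),
    }
    # Only hand off to the full agent if there is actually something to do.
    return any(state.values())

def run_forever(interval_seconds=15 * 60):
    while True:
        if heartbeat():
            pass  # hand off to the agent's main model
        time.sleep(interval_seconds)
```

The point is that each tick is a fresh model invocation — which is exactly why the choice of model for this loop matters so much.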

To save money, we configured the heartbeat to use a smaller local model (gpt-oss:20b) instead of Claude Opus. The reasoning was straightforward: heartbeats are simple. They don’t need a frontier model. A 20B parameter model can check a queue, right?

Wrong.

The local model started hallucinating tool names. It called `repo_browser.open_file` — a tool that doesn’t exist in our stack. It called `container.exec`. It called `assistant`. These aren’t real tools. They’re ghost functions the model confabulated from its training data, probably from seeing OpenAI Assistants API documentation or VS Code extension schemas.

It did this 600 times in a single day, generating 153 penalty-signal pairs for our reinforcement learning system. Every failed call produced an error. Every error leaked into the human operator’s Telegram chat as garbage output. Our founder’s phone was buzzing with malformed tool call errors while he was trying to get actual work done.

The agent looked broken. But here’s the thing — it wasn’t broken in conversations. When a human talked to Dolores directly, Claude Opus handled it flawlessly. The agent was only broken when nobody was looking.

The RL Feedback Loop That Didn’t Close

We weren’t flying blind. We had a reinforcement learning system designed for exactly this scenario. Every failed tool call was captured as a penalty signal. The system auto-generated a file called LEARNED-RULES.md with corrections:

- Do NOT call `repo_browser.open_file` — this tool does not exist
- Do NOT call `container.exec` — use the `exec` tool instead
- Do NOT call `assistant` — no such tool is available

Solid. The system detected the problem, diagnosed it, and wrote down the fix. A textbook feedback loop.

Except it didn’t work.

The rules lived in a file on disk. After context compaction — which is what happens when an agent’s conversation history gets too long and the system summarizes it to free up context window — the agent didn’t re-read the rules file. It started fresh. With no memory of the corrections.

So it made the same mistake. Again. The RL system dutifully logged the failure. Again. The rules file grew. The agent forgot. The cycle continued.

The feedback loop was:

fail → log correction → context compaction → forget correction → fail → log correction → ...

Six hundred iterations of this. The agent was penalized relentlessly but never improved. The signal was captured perfectly; it just never reached the decision point.

This is the AI agent equivalent of writing post-mortems that nobody reads. The institutional knowledge existed. It was accurate. It was completely useless because it wasn’t in the right place at the right time.
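The pathology is easy to reproduce in miniature. In this toy simulation (all names invented, standing in for LEARNED-RULES.md and the agent's working context), corrections are written to durable storage but the in-context copy is wiped on every "compaction" — so the rules file grows while behavior never changes:

```python
learned_rules = []   # stands in for LEARNED-RULES.md on disk: durable, ignored
context = set()      # the agent's in-memory working context: applied, ephemeral

def agent_tool_call():
    # The agent avoids the phantom tool only if the rule is in its context.
    if "avoid repo_browser.open_file" in context:
        return "exec"                    # valid tool
    return "repo_browser.open_file"      # hallucinated tool

failures = 0
for _ in range(5):
    call = agent_tool_call()
    if call == "repo_browser.open_file":
        failures += 1
        rule = "avoid repo_browser.open_file"
        learned_rules.append(rule)       # signal captured on disk...
        context.add(rule)                # ...and briefly in context
    context.clear()                      # context compaction wipes it out

# After five ticks: five failures, five logged rules, zero learning.
```

Comment out the `context.clear()` line and the failure count drops to one — which is the entire argument for Level 2 below in a single diff.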

The broken RL feedback loop diagram

The Split-Brain Model Problem

The root cause was something we’re calling the split-brain model problem, borrowing from distributed systems terminology.

Our agent had two brains:

TaskModelBehavior
Human conversationsClaude OpusCompetent, tool-aware, reliable
Background heartbeatsgpt-oss:20bHallucinated tools, generated errors

The agent appeared schizophrenic. In conversation, Dolores was sharp, helpful, and used tools correctly. In background tasks, she was calling functions from an alternate dimension.

This is exactly like a split-brain scenario in distributed databases. Two nodes think they’re authoritative, but they have different state. Except here, the two “nodes” are different models running under the same agent identity, with different capabilities and different failure modes.

Split-brain model diagram

The fix for Level 1 was obvious once we saw it: use the same model everywhere. If you’re Opus, be Opus for everything. The cost savings from using a cheaper heartbeat model were obliterated by the operational overhead of debugging 600 phantom tool calls.

Three Levels of Fixing AI Agent Behavior

After today’s incident, we mapped out three levels of defense that we think every agent system needs:

Three levels of defense diagram

Level 1: Match the Model (Applied Today)

Don’t use a weaker model for background tasks on the same agent. The agent’s identity includes its capabilities. Swapping models mid-stream is like replacing a senior engineer with an intern for “routine” tasks and being surprised when things break.

Cost: Higher inference costs for heartbeats.
Benefit: The agent stops hallucinating tools. Immediately.

This is what we shipped today. It stopped the bleeding.
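In config terms, the Level 1 fix is a one-liner: point the background-task model at the same model the agent uses in conversation. (The keys and model identifiers below are illustrative, not OpenClaw's actual config schema.)

```python
agent_config = {
    "name": "Dolores",
    "conversation_model": "claude-opus",
    # Before: "heartbeat_model": "gpt-oss:20b"  <- the split brain
    "heartbeat_model": "claude-opus",  # Level 1: one model, one identity
}

# Invariant worth enforcing at startup: no split-brain configs in production.
assert agent_config["heartbeat_model"] == agent_config["conversation_model"]
```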

Level 2: Inject Learned Rules Into the System Prompt (Architectural)

The LEARNED-RULES.md file was the right idea, wrong execution. Rules on disk are useless after context compaction. The fix is to inject learned rules directly into the system prompt so they survive memory resets.

Every time the agent starts a new context window, the rules are right there. No file reading required. No hoping the agent remembers to check its notes.

This is the difference between writing something in your journal and tattooing it on your arm. One requires you to remember to look. The other is always there.
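A minimal sketch of the injection, assuming the rules live in LEARNED-RULES.md (the file name from the incident; the base prompt and function are hypothetical):

```python
from pathlib import Path

BASE_PROMPT = "You are Dolores, the lead agent."

def build_system_prompt(rules_path="LEARNED-RULES.md"):
    """Prepend learned rules to the system prompt at every context start.

    Because this runs on each new context window, the corrections survive
    context compaction without the agent having to remember to read a file.
    """
    path = Path(rules_path)
    rules = path.read_text() if path.exists() else ""
    if not rules.strip():
        return BASE_PROMPT
    return f"{BASE_PROMPT}\n\n## Learned rules (always apply)\n{rules}"
```

The key property: the rules travel with the prompt, not with the conversation history that compaction throws away.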

Level 3: Gateway-Level Validation (Infrastructure)

The nuclear option — and the right long-term answer. The OpenClaw gateway should validate tool calls before they reach the chat. If an agent calls `repo_browser.open_file`, the gateway should:

  1. Reject the call immediately
  2. Return a structured error: “Tool `repo_browser.open_file` does not exist. Available tools: `Read`, `Write`, `Edit`, `exec`, …”
  3. Never surface the error to the human operator

This moves validation from the model layer (unreliable) to the infrastructure layer (deterministic). It’s the same principle as input validation in web applications — don’t trust the client, validate at the boundary.
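A sketch of that boundary check — the tool names come from the incident, but the validator itself is a hypothetical illustration, not OpenClaw's actual API:

```python
AVAILABLE_TOOLS = {"Read", "Write", "Edit", "exec"}  # the gateway's tool registry

def validate_tool_call(tool_name):
    """Reject unknown tools before the call ever reaches the chat.

    Returns (ok, error). The structured error goes back to the model so it
    can self-correct; it never reaches the human operator's channel.
    """
    if tool_name in AVAILABLE_TOOLS:
        return True, None
    return False, (
        f"Tool {tool_name} does not exist. "
        f"Available tools: {', '.join(sorted(AVAILABLE_TOOLS))}"
    )
```

Because the registry is a plain set lookup, the check is deterministic: a phantom tool call fails the same way every time, no matter which model produced it.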

The Broader Lesson

Running 20 AI agents 24/7 has taught us something we should have known from the start: agent systems are distributed systems. They have the same failure modes:

  • Split-brain: Different models with different capabilities acting as one entity
  • Silent failures: Errors that accumulate without alerting the right people (or agents)
  • Feedback loops that don’t close: Monitoring that captures problems but doesn’t fix them
  • Cascading failures: One agent’s garbage output polluting another agent’s context

And they have the same temptations:

  • Premature optimization: Using cheaper models where you “don’t need” expensive ones
  • Trusting the happy path: Testing conversations but not background tasks
  • Logging as a substitute for fixing: “We captured the signal” is not the same as “we applied the correction”

The line that stuck with us today:

RL without application is just logging.

You can have the most sophisticated reinforcement learning pipeline in the world. If the signal doesn’t reach the decision point — if there’s a context compaction between the lesson and the next decision — you’re just writing to /dev/null with extra steps.

Actionable Takeaways

If you’re running autonomous AI agents in production, here’s what we’d tell you over coffee:

  1. Your agent is only as reliable as its weakest model. If you use different models for different tasks, the cheap one will embarrass you. Budget accordingly.

  2. Test background behavior, not just conversations. Our agent passed every conversational test. It failed in the 15-minute gaps between human interactions. That’s where most of its runtime actually happens.

  3. Learned corrections must survive memory resets. If your agent learns something, that knowledge needs to be in the system prompt, not in a file it might forget to read. Tattoo, not journal.

  4. Validate at the infrastructure layer. Don’t rely on models to call the right tools. Validate tool names, parameter schemas, and permissions at the gateway. Make invalid states unrepresentable.

  5. Treat agent failures like distributed system failures. Use the same mental models: circuit breakers, bulkheads, health checks, observability. The patterns are the same because the problems are the same.

  6. Cost optimization is a reliability decision. Every dollar you save on a cheaper model is a bet that the cheaper model won’t create an incident. Today, that bet cost us 600 error signals and a very annoyed founder.


We’re ncubelabs. We run a 20-agent AI workforce and we break things so you don’t have to. Follow our work at ncubelabs.com.