Your Agents Pass Every Benchmark—Then Memory Breaks Them in Production — Ananya Roy at AI Engineer Melbourne 2026
There's a pattern that's starting to repeat across organizations experimenting with AI agents. The benchmarks look phenomenal. In controlled testing environments, the agent answers questions correctly, completes tasks reliably, and behaves exactly as expected. Your evaluation metrics are solid. Leadership is impressed.
Then you deploy to production.
The agent breaks in ways that are hard to predict and harder to fix. Maybe it gets confused by contradictory information in a conversation. Maybe it loses track of what the user asked three messages ago. Maybe it hallucinates context, making confident assertions about things that were never discussed. Maybe the same input produces wildly different outputs because the agent's state management is fragile.
These failures aren't happening because the model is dumb or the prompt is bad. They're happening because memory—the ability to maintain coherent state across time and context—is fundamentally broken at scale.
Benchmarks are usually stateless. A single query, a single turn. The agent reads the input, generates the output, and the conversation is over. Reality is different. Real conversations have history. Users ask follow-up questions. They contradict themselves. They expect the agent to remember what was said ten messages ago. They feed the agent contradictory instructions to see which one it listens to. In the real world, agents operate in a soup of context that's messy, ambiguous, and constantly growing.
This is where the complexity explodes. As a conversation grows, the context window fills up. What does the agent drop? What does it keep? How does it decide what's relevant? If you keep everything, eventually you run out of tokens. If you drop things, you might drop something critical. And if you try to summarize the conversation to make room for more, the summary itself becomes a source of errors—details get lost or misrepresented.
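The keep-versus-drop trade-off above can be made concrete. The sketch below is purely illustrative, not anyone's production code: `count_tokens` and `summarize` are placeholders for whatever tokenizer and summarization call your own stack provides, and the whole point is that the summary step is itself a lossy, error-prone compromise.

```python
from collections import deque

def trim_context(messages, token_budget, count_tokens, summarize):
    """Keep the most recent messages within a token budget.

    `count_tokens` and `summarize` are stand-ins for the tokenizer
    and summarization call a real deployment would supply.
    """
    kept = deque()
    used = 0
    # Walk backwards so the newest messages win the budget.
    for msg in reversed(messages):
        cost = count_tokens(msg)
        if used + cost > token_budget:
            break
        kept.appendleft(msg)
        used += cost
    dropped = messages[: len(messages) - len(kept)]
    if dropped:
        # Everything that fell off the window is compressed into one
        # summary message -- the lossy step where details get lost.
        kept.appendleft(summarize(dropped))
    return list(kept)
```

Even this toy version exposes the failure mode the paragraph describes: whatever falls outside the budget survives only as a summary, so any detail the summarizer misrepresents is gone for good.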
Then there's the problem of contradictions. Users change their minds. They say conflicting things. They test the agent to see if it will remember earlier statements that contradicted later ones. A human in a conversation handles this fluidly, usually by asking clarifying questions. An agent doesn't have that instinct. It either latches onto the latest instruction or blends the contradictions into incoherent output.
The scale problem compounds quickly. An agent handling three simultaneous conversations can manage context. At three thousand, state management breaks down. Concurrency bugs, race conditions, and state bleed between conversations become common.
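One minimal defence against state bleed is to never let two conversations touch shared mutable state without isolation. The sketch below is an in-process illustration only, assuming a simple per-conversation lock; a real deployment would put this behind an external store with expiry, not a Python dict.

```python
import threading
from collections import defaultdict

class ConversationStore:
    """Keep each conversation's history behind its own lock so
    concurrent requests can't bleed context between conversations.
    Illustrative sketch; not a production state store."""

    def __init__(self):
        self._registry_lock = threading.Lock()
        self._locks = {}
        self._state = defaultdict(list)

    def _lock_for(self, conversation_id):
        # Lazily create exactly one lock per conversation.
        with self._registry_lock:
            return self._locks.setdefault(conversation_id, threading.Lock())

    def append(self, conversation_id, message):
        with self._lock_for(conversation_id):
            self._state[conversation_id].append(message)

    def history(self, conversation_id):
        with self._lock_for(conversation_id):
            # Return a copy so callers can't mutate shared state.
            return list(self._state[conversation_id])
```

The design choice worth noting is the copy in `history`: handing out the live list is exactly the kind of shortcut that works at three conversations and corrupts state at three thousand.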
What works requires discipline. It means accepting that agents can't just accumulate unbounded context. It means designing memory systems explicitly—deciding what gets remembered, what gets summarized, what gets discarded. It means building monitoring that detects when agents are losing coherence. It means testing in conditions that look more like production: long conversations, contradictory instructions, parallel requests.
The teams that succeed here are the ones that stop treating memory as something the LLM handles automatically and start treating it as a system design problem. They're instrumenting their agents to understand what's breaking. They're experimenting with architectures that separate short-term context from long-term memory. They're building retrieval systems that let agents find relevant history without being overwhelmed by irrelevant details.
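A toy version of that short-term/long-term split might look like the following. The keyword-overlap scoring is a deliberately naive stand-in for the embedding-based retrieval a real deployment would use; everything here is an assumption for illustration, not a description of any particular team's architecture.

```python
class TieredMemory:
    """Separate a small verbatim short-term window from a long-term
    store that is searched on demand, so the prompt carries relevant
    history instead of everything. Sketch only: keyword overlap
    stands in for real embedding retrieval."""

    def __init__(self, short_term_size=4):
        self.short_term = []        # recent turns, kept verbatim
        self.long_term = []         # older turns, retrieved on demand
        self.short_term_size = short_term_size

    def add(self, message):
        self.short_term.append(message)
        if len(self.short_term) > self.short_term_size:
            # The oldest turn graduates into long-term memory.
            self.long_term.append(self.short_term.pop(0))

    def recall(self, query, k=2):
        # Score long-term entries by naive word overlap with the query.
        words = set(query.lower().split())
        scored = sorted(
            self.long_term,
            key=lambda m: len(words & set(m.lower().split())),
            reverse=True,
        )
        return scored[:k]

    def build_context(self, query):
        # Relevant long-term snippets first, then the recent window.
        return self.recall(query) + self.short_term
```

The shape matters more than the scoring function: the agent's prompt is assembled from a bounded recent window plus a targeted lookup, rather than from an ever-growing transcript.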
Ananya Roy is digging into exactly these production problems at AI Engineer Melbourne 2026 on June 3-4, sharing what actually works when agents need to remember and manage state at the scale of real deployments.
