AI Engineer Melbourne

Your Agents Pass Every Benchmark—Then Memory Breaks Them in Production

You add memory to your agent, it works great in testing, and you ship it. A few weeks later, outputs start getting worse and nobody can figure out why. The agent is pulling in old information that’s no longer true, retrieving context that’s loosely related but clutters its reasoning, and sometimes carrying forward bad data that quietly corrupts every response after it. Standard evals won’t catch any of this because they test single turns, not how memory behaves over hundreds of sessions. In this talk, we will walk through practical design principles and evaluation patterns you can implement to detect memory degradation before your users notice it. You’ll walk away knowing how to design and evaluate memory-enabled agents so that memory actually makes your agent more reliable instead of silently breaking it.

Ananya Roy

Ananya Roy is a Staff AI/ML Specialist Solution Architect at Databricks, based in Sydney. She works directly with enterprise AI teams across the globe on the hardest part of agent development: getting agents to work reliably in production, not just in POCs. She spends most of her time on agent evaluation, reliability patterns, and the failure modes that only show up after deployment. She has published technical work on responsible AI, agent evaluation pipelines, and autonomous AI assistants, and has spoken at Data + AI Summit, Data Intelligence Day, and other technical conferences. Before Databricks, she was at AWS helping customers build and ship AI solutions.