Our AI Hallucinated in Production: How We Fixed It With Evals — Yicheng Guo at AI Engineer Melbourne 2026
There's something particularly jarring about discovering that an AI system you deployed to production is making things up. Not failing gracefully, not returning errors, but confidently generating false information and presenting it as fact.
This is what happened at REA Group, one of Australia's largest property platforms. An AI system intended to help with real estate operations started hallucinating, confidently asserting property details, market information, or procedural steps that were completely fabricated. The damage wasn't immediately obvious — some users noticed, some didn't — but the risk was clear. Property decisions made on false information can be costly.
The instinctive response to hallucinations is often to reach for more sophisticated models or more fine-tuning. But Yicheng Guo's team took a different approach: they built an evaluation framework to systematically understand when and how their system hallucinates, and then used those insights to prevent hallucinations at scale.
The key insight is that hallucinations aren't random. They follow patterns. A system might hallucinate reliably in certain domains (property history), under certain conditions (when asked about rare properties), or when processing certain types of input. If you can characterise those patterns through evaluation, you can design around them.
Building an effective evaluation framework requires specificity. Generic "does the model hallucinate?" evaluations are almost useless. You need to ask: Does it hallucinate in this specific domain? Does it hallucinate when information is incomplete? Does it hallucinate when asked to extrapolate beyond training data? Can it distinguish between what it knows and what it's guessing?
For REA Group's use case, that meant building evals that tested the system against known property data, against edge cases where information was incomplete or conflicting, and against scenarios designed to trigger hallucinations. The evals needed to be specific enough to be actionable — not just "is this true?" but "if we deployed this system with this configuration in this use case, what percentage of outputs would contain false information?"
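An eval suite of this shape can be sketched in a few lines. This is a minimal illustration, not REA Group's actual harness: the case fields, the domain names, and the pass/fail check are all assumptions. The core idea is that each case pairs a prompt with the facts an answer must not contradict, and the harness reports a hallucination rate per domain rather than a single global score.

```python
from dataclasses import dataclass

@dataclass
class EvalCase:
    """One eval case: a prompt plus the ground-truth facts an answer
    must not contradict. Field names are illustrative."""
    prompt: str
    domain: str            # e.g. "property_history", "market_data"
    known_facts: dict      # ground truth, e.g. {"bedrooms": "3"}
    condition: str = "complete"  # "complete", "incomplete", "conflicting"

def hallucination_rate(cases, answer_fn, check_fn):
    """Run every case through the system (answer_fn) and report the share
    of answers that fail the fact check (check_fn), grouped by domain."""
    failures, totals = {}, {}
    for case in cases:
        answer = answer_fn(case.prompt)
        totals[case.domain] = totals.get(case.domain, 0) + 1
        if not check_fn(answer, case.known_facts):
            failures[case.domain] = failures.get(case.domain, 0) + 1
    return {d: failures.get(d, 0) / totals[d] for d in totals}
```

A real `check_fn` might use string matching against structured listing data, or a stronger model as a judge; the per-domain breakdown is what turns "does it hallucinate?" into the actionable "where does it hallucinate, and how often?".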
This revealed something important: the system's hallucination rate wasn't uniform. It was much worse in certain domains and under certain conditions. That knowledge enabled targeted interventions: restricting the system's scope to domains where it performed reliably, building guardrails that prevented it from answering questions outside its reliable zone, or augmenting it with retrieval systems that reduced the need for generation and the corresponding risk of hallucination.
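One of those interventions, the scope guardrail, can be sketched as a simple gate in front of the model. The domain names, classifier, and refusal message here are hypothetical stand-ins; the point is that the set of allowed domains comes directly from the eval results, not from intuition.

```python
# Domains the evals showed to be reliable -- illustrative names only.
RELIABLE_DOMAINS = {"current_listings", "sales_process"}

def guarded_answer(question, classify_domain, generate):
    """classify_domain and generate are stand-ins for a domain classifier
    and the underlying model call. Questions outside the reliable zone
    are refused instead of answered."""
    domain = classify_domain(question)
    if domain not in RELIABLE_DOMAINS:
        return "I can't answer that reliably. Please check the listing directly."
    return generate(question)
```

The refusal branch is where a retrieval system could be slotted in instead: rather than declining outright, the guardrail routes the question to a source-grounded lookup, reducing how much the model has to generate from memory.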
The framework also created a feedback loop. By systematically evaluating deployed systems against real-world outcomes, the team could detect when performance degraded and investigate why. They could catch hallucination drift before it affected large numbers of users.
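That feedback loop amounts to tracking the hallucination rate over a rolling window of evaluated outputs and alerting when it crosses a threshold. The window size and threshold below are illustrative defaults, not values from the talk:

```python
from collections import deque

class DriftMonitor:
    """Rolling hallucination-rate monitor. Each evaluated production
    output is recorded as hallucinated (True) or clean (False);
    an alert fires once a full window exceeds the threshold."""
    def __init__(self, window=500, threshold=0.05):
        self.results = deque(maxlen=window)
        self.threshold = threshold

    def record(self, hallucinated):
        self.results.append(bool(hallucinated))

    @property
    def rate(self):
        return sum(self.results) / len(self.results) if self.results else 0.0

    def alert(self):
        # Only alert on a full window, so a few early failures
        # don't trigger a spurious page.
        return len(self.results) == self.results.maxlen and self.rate > self.threshold
```

Wired up per domain, a monitor like this catches the drift the section describes: a domain whose eval-time rate was acceptable starts degrading in production, and the team is alerted before many users are affected.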
This approach is valuable because it's honest about the capabilities and limitations of current AI systems. You're not trying to eliminate hallucination entirely (which is likely impossible). You're characterising it, understanding it, and managing it. You're building systems that work reliably within defined boundaries and fail safely outside those boundaries.
It's also practical. Most organisations can't build custom models or undertake massive fine-tuning projects. But most organisations can build evaluation frameworks and use those frameworks to make better decisions about how to deploy and constrain existing systems.
This matters now because hallucination-prone AI systems are being deployed into production everywhere. Teams are building systems that generate information used in real decisions without always understanding how often those systems are making things up. The gap between "this system sometimes hallucinates" and "we've characterised when and how it hallucinates and built safeguards accordingly" is the difference between hoping for the best and actually managing risk.
Building the evaluation discipline that catches hallucination in testing, not in production, is the kind of practical engineering work that separates systems that work from systems that fail.
Yicheng Guo, Senior Machine Learning Engineer at REA Group, is presenting this talk at AI Engineer Melbourne 2026 on June 3-4.
