Our AI Hallucinated in Production: How We Fixed It With Evals
We shipped one of REA Group’s first generative AI features to production: Property Highlights, which turns long real-estate listings into three skimmable takeaways. The demo was easy; real traffic wasn’t—hallucinations showed up in front of real users.
This talk covers how we built an evaluation stack to launch safely at scale. Basic guardrails (three bullets, length limits) didn’t catch the failures that mattered: made-up features, off-brand tone, and useless copy. We built a review tool for side-by-side prompt/model testing, defined a rubric for factuality, usefulness, and language quality, and scaled it with an LLM-as-judge calibrated to expert reviews to score thousands of listings daily. We then tied evals to real user feedback and business metrics, including a 10% engagement lift.
You’ll get a practical pipeline and a repeatable way to iterate on LLM features using evals, not vibes.
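The calibration step the abstract mentions, checking an LLM-as-judge against expert reviews before trusting it at scale, can be sketched roughly as follows. The three rubric dimensions come from the talk description, but every name, score scale, and threshold below is a hypothetical illustration, not REA Group's actual implementation:

```python
from dataclasses import dataclass
from statistics import mean

# Rubric dimensions from the abstract; the 1-5 scale is an assumption.
RUBRIC = ("factuality", "usefulness", "language_quality")

@dataclass
class Scores:
    factuality: int        # 1-5
    usefulness: int        # 1-5
    language_quality: int  # 1-5

def agreement(judge: list[Scores], expert: list[Scores],
              tolerance: int = 1) -> dict[str, float]:
    """Per-dimension fraction of listings where the LLM judge lands
    within `tolerance` points of the expert reviewer."""
    result = {}
    for dim in RUBRIC:
        hits = [abs(getattr(j, dim) - getattr(e, dim)) <= tolerance
                for j, e in zip(judge, expert)]
        result[dim] = mean(hits)
    return result

def judge_is_calibrated(agree: dict[str, float],
                        threshold: float = 0.8) -> bool:
    """Gate: only run the judge unattended once it tracks experts
    closely enough on every rubric dimension."""
    return all(v >= threshold for v in agree.values())
```

Once a judge passes a gate like this on a human-labelled sample, it can score thousands of listings daily that experts could never review by hand, with periodic re-calibration against fresh expert labels.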
Yicheng Guo
Yicheng Guo is a Senior Machine Learning Engineer at REA Group, delivering AI features for Australia’s largest property portal, realestate.com.au. He previously worked at Google and spent a decade in DevOps/SRE and systems engineering, specialising in productionising and scaling AI products to millions of users with a focus on reliability and security.
At REA, Yicheng led engineering for generative AI features, agentic systems, and image-processing pipelines, and helped build the evaluation stack that enables safe deployment to millions of property seekers. He prioritises practical, production-grade solutions over theory.
Yicheng regularly shares lessons from deploying AI at scale through internal and external conferences.