AI Engineer Melbourne

We fired our LLM judge

Every team building with LLMs hits the same wall. Someone in the room asks: “Why are we doing evals? Isn’t this slowing us down?”

And honestly, if your evals aren’t tied to anything real, they probably are. You’re running benchmarks, getting green ticks, and your agent is still giving users wrong answers in production. That’s the trap of generic evals: they feel like progress, but they’re just noise.

This talk is about how we built evals for a production agentic AI system in financial services: real customers, real money, regulated environment. It is also about how those same evals became the engine for something most teams never get to: replacing our expensive LLM judge with a tiny distilled model that now runs on 100% of production traffic.

Abdul Karim

Abdul Karim is Principal Applied Scientist — Machine Learning Data Flywheel at Commonwealth Bank, where he designs evaluation and data flywheel systems for production agentic LLM applications in financial services. Previously he was Senior Applied Scientist at Microsoft, building eval pipelines and fine-tuning LLMs and SLMs for enterprise customers, and AI Research Lead at Leonardo.AI, where he helped architect the Phoenix foundational image generation model. He holds a PhD in Computer Science from Griffith University and an MSc from GIST, South Korea.

Jack Silman

Jack Silman is an experienced software engineer and technical lead focused on building great customer experiences through applied AI.