AI Engineer Melbourne

Evaluation Precedes Evolution: Rubrics as the Load-Bearing Infrastructure of Self-Improving Agents

The 2025–2026 wave of “self-evolving” agents (prompt-tuning loops, memory accumulation, agent swarms, GEPA, ReasoningBank) shares a structure that is sometimes lost in the jargon: every one of them is hill-climbing on a judge. The judge is the fitness function. When it’s sharp, the agent compounds. When it’s vague, the loop drifts confidently in the wrong direction.

This talk argues that rubrics, not prompts or scaffolds, are the load-bearing infrastructure of agent improvement. We’ll walk through three concrete failures from recent work: prompt optimizers that regressed without rollback (OpenAI), memory systems that hurt performance as they grew (ReasoningBank), and 18 months of capability gains that delivered almost no reliability gain (Princeton). All three share a root cause: the rubric was the bottleneck, and nobody was looking at it.

Then we’ll build one. Five principles for a rubric that can actually drive evolution: stack deterministic checks before semantic ones, score failures explicitly, measure beyond accuracy, version the rubric itself, and keep it cheap. You’ll leave with a checklist you can apply to your next agent before you ship a single optimization loop.
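
As a rough illustration of how some of these principles might fit together, here is a minimal Python sketch. It is not taken from the talk: the names `RUBRIC_VERSION`, `Verdict`, and `evaluate`, the 2000-character limit, and the 0.5 pass threshold are all hypothetical, and the semantic judge is passed in as a plain function rather than a real model call. It shows deterministic checks running before the semantic judge, failures recorded with explicit labels, and every verdict stamped with a rubric version. Measuring beyond accuracy is left out for brevity.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical version tag: bump it whenever a check or threshold changes,
# so scores produced under different rubrics are never compared blindly.
RUBRIC_VERSION = "v1"


@dataclass
class Verdict:
    rubric_version: str
    score: float
    failure: Optional[str]  # explicit failure label, or None on a pass
    judge_called: bool      # False when a cheap check settled the outcome


def evaluate(output: str, semantic_judge: Callable[[str], float]) -> Verdict:
    """Score one agent output; semantic_judge is assumed to return 0.0-1.0."""
    # Deterministic checks run first: they are cheap and unambiguous.
    if not output.strip():
        return Verdict(RUBRIC_VERSION, 0.0, "empty_output", False)
    if len(output) > 2000:  # illustrative limit, an assumption
        return Verdict(RUBRIC_VERSION, 0.0, "too_long", False)

    # Only outputs that pass the cheap checks reach the semantic judge.
    score = semantic_judge(output)
    if score < 0.5:  # illustrative threshold, an assumption
        return Verdict(RUBRIC_VERSION, score, "low_semantic_score", True)
    return Verdict(RUBRIC_VERSION, score, None, True)
```

In this sketch an empty output is rejected without ever calling the judge, which is one way to keep the rubric cheap, and the failure label tells an optimization loop why a candidate lost rather than only that it did.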

Tanya Dixit

Tanya Dixit is a Forward Deployed Engineer at Google, partnering with enterprise customers across APAC to ship production AI systems. Her work spans agentic AI, voice AI, and multimodal architectures, with a deep focus on banking, financial services, and healthcare. She also supports Google’s university partnerships program in healthcare AI. Based in Sydney, Tanya writes and speaks regularly on moving voice and agent systems from demo to production reliability.