Year round learning for product, design and engineering professionals

Stop vibing your agents to production: applying ML discipline to agent development — Justin Barias at AI Engineer Melbourne 2026


There's a pattern in early-stage agent deployments that's become disturbingly common: someone gets an idea for what an agent should do, they start tinkering with prompts and parameters, they tweak and adjust until it seems to work, and then they ship it to production with crossed fingers and a hope that it keeps working.

This is the "vibe-based" approach to agent development. It's understandable. Agents are new. The territory is unmapped. There's a sense of experimentation and discovery. But it's also deeply risky. You're essentially deploying an unvalidated system to handle real tasks with real consequences.

This wouldn't fly with traditional machine learning. You wouldn't train a model, give it a once-over, and push it to production. You'd build a comprehensive evaluation framework. You'd measure performance on held-out test data. You'd monitor it in production and watch for drift. You'd version everything so you could roll back if something breaks. You'd have clear criteria for when a model is acceptable and when it's not.

But agents developed through manual prompt engineering often bypass all of this. There's no systematic evaluation. There's no versioning. There's no monitoring that would catch degradation. There's definitely no rollback strategy if someone tries a new prompt and it breaks everything.

The consequence is systems in production that no one truly understands. Why does the agent work for this case but not that one? What changed between the version that was fine yesterday and the one that's broken today? If the agent is producing bad outputs, is it the prompt, the model, the retrieval system, or something else entirely? Without systematic evaluation, these questions become impossible to answer.

This is where ML discipline becomes essential. Not because agents are just another ML problem—they're not—but because the same principles apply: you need to know what you're measuring, you need to measure it consistently, and you need to have confidence that your system is actually doing what it's supposed to do.

Start with evaluation. What does success actually look like for your agent? Not vague notions like "it works better." Concrete, measurable criteria. If it's a customer service agent, maybe that's resolution rate and customer satisfaction scores. If it's a code review assistant, maybe that's accuracy of identified bugs and false positive rate. Define these metrics clearly. Build test cases that measure them. Run your agent against those test cases every time you make a change.
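That loop can be as small as a script. The sketch below assumes a hypothetical `run_agent` callable standing in for however you invoke your agent; the test cases and the pass criteria are illustrative, not a prescribed framework:

```python
# Minimal evaluation harness sketch. `run_agent` is a stand-in for whatever
# invokes your agent; the cases and criteria here are illustrative.
from dataclasses import dataclass
from typing import Callable

@dataclass
class TestCase:
    prompt: str
    passes: Callable[[str], bool]  # a concrete, measurable criterion

def evaluate(run_agent: Callable[[str], str], cases: list[TestCase]) -> float:
    """Run every test case and return the overall pass rate."""
    passed = sum(1 for c in cases if c.passes(run_agent(c.prompt)))
    return passed / len(cases)

# Toy example: a stub agent and two criterion-based cases.
cases = [
    TestCase("refund for order 123", lambda out: "refund" in out.lower()),
    TestCase("reset my password", lambda out: "password" in out.lower()),
]
echo_agent = lambda prompt: f"Handling: {prompt}"
rate = evaluate(echo_agent, cases)
print(f"pass rate: {rate:.0%}")
```

Run this on every change and gate releases on the number it prints, rather than on whether the output "feels" right.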

Version everything. The prompt, the model, the retrieval system, the configuration. Not just the latest version, but a clear record of what changed and why. This gives you the ability to understand what caused a regression if one appears. It gives you the ability to roll back if something breaks.
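One lightweight way to do this is to content-hash the entire configuration so any production run can be traced to the exact prompt, model, and retrieval settings it used. The field names below are illustrative assumptions, not a specific framework's schema:

```python
# Sketch: derive a version identifier from the full agent configuration,
# plus a human-readable note of what changed and why. Field names are
# illustrative assumptions.
import hashlib
import json
from datetime import datetime, timezone

def version_record(config: dict, note: str) -> dict:
    """Hash the config deterministically so regressions can be bisected."""
    blob = json.dumps(config, sort_keys=True).encode()
    return {
        "version": hashlib.sha256(blob).hexdigest()[:12],
        "changed_at": datetime.now(timezone.utc).isoformat(),
        "note": note,       # what changed and why
        "config": config,   # prompt, model, retrieval settings, parameters
    }

record = version_record(
    {"model": "example-model", "prompt": "You are a support agent...", "top_k": 5},
    note="tightened refund-policy wording in the system prompt",
)
print(record["version"])
```

Because the identifier is derived from the configuration itself, rolling back is just re-deploying the config stored under an earlier version.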

Monitor in production. What does your agent actually do when it's running? What questions does it get asked? How often does it fail? What patterns appear in the failures? This requires instrumentation—logging that captures enough information to understand what's happening without capturing sensitive data. It requires dashboards that show you whether your agent is behaving normally.
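A sketch of that instrumentation: one structured log line per agent call, capturing the outcome, latency, and agent version, while hashing the prompt rather than logging it raw. The wrapper and field names are assumptions for illustration:

```python
# Sketch: wrap every agent call in structured logging that records enough
# to spot failure patterns without storing sensitive user content.
import hashlib
import json
import logging
import time

logging.basicConfig(level=logging.INFO, format="%(message)s")
log = logging.getLogger("agent")

def instrumented_call(run_agent, prompt: str, version: str) -> str:
    start = time.monotonic()
    ok = True
    try:
        return run_agent(prompt)
    except Exception:
        ok = False
        raise
    finally:
        log.info(json.dumps({
            "agent_version": version,
            # hash the prompt so failures can be grouped without logging it raw
            "prompt_sha": hashlib.sha256(prompt.encode()).hexdigest()[:10],
            "ok": ok,
            "latency_ms": round((time.monotonic() - start) * 1000, 1),
        }))
```

Logs shaped like this feed directly into dashboards: failure rate by agent version, latency over time, and clusters of prompt hashes that keep failing.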

Implement systematic improvement. When you find failure modes, document them. Build test cases around them. Fix the underlying issue. Measure whether your fix actually worked. This is how you go from vibe-based tweaking to genuine improvement.
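In code, that discipline amounts to turning every documented failure mode into a permanent regression test. The helpers and example failure below are illustrative assumptions:

```python
# Sketch: a failure-mode registry that only ever grows. Each documented
# failure becomes a test case, so a fix can be measured, not assumed.
failure_cases = []  # (prompt, pass-criterion) pairs; never shrinks

def document_failure(prompt: str, passes) -> None:
    """Record a failure mode as a test case so the fix can be verified."""
    failure_cases.append((prompt, passes))

def regression_pass_rate(run_agent) -> float:
    """Fraction of documented failures the current agent now handles."""
    if not failure_cases:
        return 1.0
    hits = sum(1 for prompt, ok in failure_cases if ok(run_agent(prompt)))
    return hits / len(failure_cases)

# Example: a bug was found where the agent ignored cancellations. Capture it:
document_failure("cancel my subscription",
                 lambda out: "cancel" in out.lower())
# After deploying a fix, measure whether it actually worked:
print(regression_pass_rate(lambda p: f"Confirmed: {p}"))
```

The point is the asymmetry: vibe-based tweaking forgets old failures, while a registry like this makes every past failure a standing check on every future change.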

Organizations that succeed will treat agent development like serious engineering. Not over-engineering that slows everything down, but the kind of discipline that means you can ship confidently, understand what's happening in production, and continuously improve based on actual data.

Justin Barias is bringing this rigorous perspective from large-scale government AI systems to AI Engineer Melbourne 2026 on June 3-4, showing how to move agent development beyond vibe-based experimentation into systematic, measurable improvement.
