
What we learned at the Melbourne AI Unconference

A Saturday at Stone & Chalk, four parallel tracks, ~50 practitioners, no speakers, no agenda — and a surprising amount of signal.

On Saturday 11 April 2026 we ran an AI Engineer unconference in Melbourne. No keynotes, no slide decks rehearsed for weeks, no panel of the usual suspects. Just a room full of engineers, founders, consultants, researchers, data scientists and product people, a stack of cue cards, four corners of the venue and the law of two feet — if a session isn’t useful to you, please leave; it’s not rude, it’s the method.

What came out of it was the kind of conversation you can’t manufacture. Below is a short tour of what we heard. If any of it sparks something, the full report is a much longer, deeper write‑up — themes, ideas, quotes, references, per‑session deep dives, and identifications of the people, papers and projects attendees referenced through the day.

Other write-ups

Ben Hogan: What I Learned at Melbourne’s AI Engineering Unconference

Lessons from Melbourne’s AI Engineering Unconference: evals consume 50% of engineering time, synthetic scenarios unlock tacit knowledge from domain experts, and the real bottleneck in AI engineering is organisational — not technical.

Javier Candeira: Evals and LLMs as Judges

Session notes. Content not attributed; if you know, you know.


Banner for AI engineer conference Melbourne. Text reads: "AI Engineer Melbourne, June 3rd and 4th, 2026, Federation Square."

If this seems valuable to you, we have AI Engineer Melbourne coming up in June 2026 focused on these topics and more.


The bottleneck has moved

The single most consistent observation across every technical session: AI has made it trivial to produce code, so the real work has moved to either side of the keyboard.

Upstream, the hard part is now eliciting tacit decision logic from time‑poor subject matter experts who often cannot articulate their own rules beyond “when I see it, I’ll know.” Downstream, code review, PR review, UAT and compliance sign‑off become the new bottleneck — especially in domains like payroll or ISO compliance where correctness is non‑negotiable. One participant called the effect “a tsunami of code.” Another, half‑joking, half‑not: “AI amplifies pre‑existing organisational problems. If teams don’t understand the problem, AI makes it easier to produce more rubbish, faster.”

The corollary, raised in the spec‑driven development conversation: if code can be regenerated from a markdown spec, the spec — not the code — becomes the source of truth. And the audience for that spec is increasingly the agent, not the human. Which means we now need specs precise enough that two different agents could implement the same feature differently and still produce functionally equivalent outcomes. It’s a higher bar than most teams’ specs clear today.

“Local” is not one thing

The local‑models conversation was probably the strongest single session of the day, and it spent its first 20 minutes refusing to let “local” collapse into “runs on my laptop.” At least four meanings need to be kept separate:

  • On‑device — phones, laptops, Apple’s Neural Engine, and (potentially soon) WebGPU pushing models into the browser.
  • Open weights / open source — control over model behaviour, safety filters, and the ability to inspect what you’re running.
  • Bounded or sovereign deployment — ad‑hoc per‑project stacks, not necessarily on hardware you own. One attendee uses NixOS to spin inference up locally, on Groq, or wherever the project demands.
  • Air‑gapped — for hard privacy.

The warning that closed the loop is one to internalise: “privacy is a property of the system, not one part of the system.” A locally running model with web‑connected tools can be more vulnerable than a well‑governed cloud API.

The economic frame was striking too. One prediction that landed: “tokens become the currency of 2026.” As cloud token prices rise and on‑device compute keeps getting better, hybrid routing — local first, escalate to a frontier model on complexity — is shaping up to be the near‑term default architecture.
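The hybrid-routing idea can be sketched in a few lines. This is a minimal illustration, not anything demonstrated at the session: the function names (`run_local_model`, `run_frontier_model`) and the complexity heuristic are placeholder assumptions.

```python
# Hypothetical local-first router: answer with a small on-device model,
# escalate to a frontier API only when the task looks complex.
# `run_local_model` and `run_frontier_model` are illustrative stand-ins.

def looks_complex(prompt: str) -> bool:
    # Crude stand-in heuristic: long or multi-step prompts escalate.
    return len(prompt) > 2000 or "step by step" in prompt.lower()

def route(prompt: str, run_local_model, run_frontier_model) -> str:
    if looks_complex(prompt):
        return run_frontier_model(prompt)  # pay for frontier tokens
    answer = run_local_model(prompt)       # cheap on-device tokens first
    # Optional second check: escalate if the local model punts.
    if answer.strip().lower().startswith("i don't know"):
        return run_frontier_model(prompt)
    return answer
```

In a real system the heuristic would be a learned or rules-based classifier, but the shape is the same: local by default, frontier on demand.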

On agents: don’t let them roam

The richest engineering insight of the day, from the agentic architecture session: if you already know the shape of the work, don’t use a free‑roaming agent.

Treat the process as a deterministic workflow, and call the LLM only at the specific steps that need it. Force structured outputs (JSON with enums) so failures are detectable immediately. Cap runs. Externalise state — long tool loops re‑send context on every iteration, so cost grows quadratically and very long context windows actively degrade quality (“it gets stupider” was the phrase). For multi‑agent systems, lightweight handoff patterns like “I was here” files beat per‑agent persistent memory.

One participant put the whole thing crisply: “You do not need to let the agent do everything. Maybe you use AI to write a program that runs your fixed workflow — even faster and more predictable.” Sometimes the right use of AI is to write the deterministic code, not to be the runtime.
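The pattern described above can be sketched as a fixed pipeline that calls the model at exactly one step, forces enum-constrained JSON, and caps retries. A minimal sketch: `call_llm` is a placeholder for whatever client you use, and the enum values and retry cap are illustrative assumptions.

```python
import json

# Deterministic workflow, LLM at one specific step.
# `call_llm` is a placeholder for your model client.

ALLOWED_LABELS = {"approve", "reject", "needs_review"}  # enum-style outputs
MAX_ATTEMPTS = 3  # cap runs so a bad loop can't spiral

def classify_ticket(ticket_text: str, call_llm) -> str:
    prompt = (
        "Classify this ticket. Reply ONLY with JSON like "
        '{"label": "approve"} using approve, reject or needs_review.\n\n'
        + ticket_text
    )
    for _ in range(MAX_ATTEMPTS):
        raw = call_llm(prompt)
        try:
            label = json.loads(raw)["label"]
        except (json.JSONDecodeError, KeyError, TypeError):
            continue  # malformed output is detected immediately; retry
        if label in ALLOWED_LABELS:
            return label
    return "needs_review"  # deterministic fallback, never an open-ended loop

def process(ticket_text: str, call_llm) -> dict:
    # The workflow itself is plain code; the model is called at one step.
    label = classify_ticket(ticket_text, call_llm)
    routed = "human" if label == "needs_review" else "auto"
    return {"ticket": ticket_text, "label": label, "routed_to": routed}
```

The point of the enum plus the cap: a failure is a detectable, retryable event with a known fallback, not a silent derailment three tool calls later.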

The end of vibes‑based testing

The evals session opened with the blunt observation that “a lot of people just do vibes‑based testing and there’s clearly no future” in that, and proceeded to map out a working playbook:

  • Build an eval UI for non‑technical reviewers. Colour‑code agent‑generated spans green. Use pass / fail / needs‑review / irrelevant. Avoid the word confidence — users read it as a meaningful percentage even when it isn’t.
  • Bootstrap with synthetic scenarios. Have the model play both sides of 100 fake conversations and hand them to a human assessor.
  • Generate with a strong model (Sonnet‑class), judge with a cheap one (Haiku‑class) — where nuance permits.
  • Treat evals as analytics. The annotations and traces are second‑order data; the same way click instrumentation made web apps measurable, evals make agent behaviour measurable.
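The playbook above can be sketched as a tiny judge-and-aggregate loop. The four verdict labels come from the session; everything else (`call_judge_model`, the prompt wording, the trace shape) is an illustrative assumption, not the session's implementation.

```python
# LLM-as-judge with the session's four reviewer labels, then
# evals-as-analytics aggregation. `call_judge_model` stands in for a
# cheap (Haiku-class) model; the trace shape is illustrative.

VERDICTS = {"pass", "fail", "needs_review", "irrelevant"}  # no "confidence"

def judge(trace: dict, call_judge_model) -> dict:
    prompt = (
        "Given this agent transcript and the expected behaviour, reply with "
        "exactly one word: pass, fail, needs_review or irrelevant.\n\n"
        f"Expected: {trace['expected']}\nTranscript: {trace['transcript']}"
    )
    verdict = call_judge_model(prompt).strip().lower()
    if verdict not in VERDICTS:
        verdict = "needs_review"  # unparseable judge output goes to a human
    # The annotation itself is second-order data: keep it with the trace.
    return {**trace, "verdict": verdict}

def summarize(judged: list) -> dict:
    # Aggregate verdicts the way click instrumentation aggregates clicks.
    counts = {v: 0 for v in VERDICTS}
    for t in judged:
        counts[t["verdict"]] += 1
    return counts
```

Run the synthetic scenarios through `judge`, route every `needs_review` to a human, and `summarize` becomes the dashboard number you watch across model versions.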

And the sharp question the room couldn’t fully answer, raised by one of the attendees: “if I saw a code base that only had tests for the happy path, I’d be like, are these tests even testing the code? What failure modes do your evals actually catch — and what failure modes come from bad evals?” Worth sitting with.

Authorship, disclosure and what we owe each other

The day closed with the most original conceptual contribution: a proposal for a non‑judgmental disclosure language for AI involvement in creative and technical work, building on Daniel Miessler’s AI Influence Level and motivated by Cory Doctorow’s recent piece on non‑consensual AI slop.

The pitch: declare your outputs (words, graphics, code, tests — each at a level from 0 = entirely human to 5 = entirely AI), and declare the inputs you’ll accept from contributors. It’s not about banning AI. It’s about people with different comfort levels still being able to collaborate without ending relationships. The metaphor that did the work: food labels. “This is not about hating peanuts. It’s about you knowing that they’re peanuts.” The intellectual lineage is honest about itself — Datasheets for Datasets and Model Cards for Model Reporting used the food‑label metaphor first for ML transparency.
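The scheme is concrete enough to encode. A toy sketch, with the 0-to-5 scale as described in the session and every field name being our own illustrative assumption rather than part of the proposal:

```python
from dataclasses import dataclass

# Toy encoding of the proposed disclosure language: declare AI involvement
# per output kind (0 = entirely human ... 5 = entirely AI) and the maximum
# level you'll accept from contributors. Field names are illustrative.

@dataclass
class Disclosure:
    outputs: dict               # e.g. {"words": 1, "code": 4, "tests": 5}
    max_accepted_input: int = 5 # 5 = accept anything, 0 = human-only

def compatible(contribution_level: int, project: Disclosure) -> bool:
    # Coordination, not conversion: check the tag instead of arguing.
    return 0 <= contribution_level <= project.max_accepted_input
```

The value is the handshake, like a food label: a contributor checks the tag before sending work, and nobody has to convert anybody.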

The driving analogy was even sharper: AI‑use preferences are like driving speeds. Everyone thinks their own speed is correct; faster drivers are lunatics, slower drivers are inconsiderate. A shared tag lets people coordinate without either side needing to convert the other.

A few things to take home

If you only take five things from the day:

  • Write the spec for the agent, not the human — and include the non‑functional requirements explicitly.
  • Force structured outputs so failures are immediately detectable and retryable.
  • Don’t use a roaming agent if a deterministic workflow will do — call the LLM at specific steps.
  • Build evals that non‑technical reviewers can actually use — pass/fail, colour‑coded traces, no “confidence” scores.
  • Treat privacy as a system property — a local model with the wrong tool wiring is not private.

Read the full report

The longer write‑up has the details — nine cross‑cutting themes, ~20 actionable ideas, curated quotes, open questions, identifications of the people, papers, books, courses, articles and ~50 tools/projects referenced through the day, and a deep dive into each session.


All sessions ran under Chatham House Rule. Attendees in the report are anonymised; only third parties they referenced (Donald Horne, Andrej Karpathy, Daniel Miessler, Cory Doctorow, Hamel Husain, Shreya Shankar, Michael Feathers, Timnit Gebru, Margaret Mitchell, Henri‑Paul Motte and others) are identified, with sources.

Thanks to Stone & Chalk for the venue, MLAI Australia for the partnership, and to everyone who showed up, wrote on a card, dot‑voted, demoed something, and trusted strangers in a room with a recording phone in the middle of it. We’ll do it again.

If you want more of this kind of thing, the AI Engineer conference is in Melbourne 3–4 June, the Energy Hackathon follows on Friday 5 June, and there’s a week of evening events around it.
