Melbourne AI Engineer Unconference — Report of the Day
Date: Saturday 11 April 2026
Venue: Stone & Chalk, Melbourne
Format: Open Space Technology — three 45‑minute sessions across four parallel tracks, plus opening, demos and closing
Participation: ~50 practitioners (engineers, founders, consultants, researchers, data scientists, product people) across AI engineering, software engineering with AI, privacy/security, and people/skills
Conduct: Chatham House Rule — nothing in this document is attributed to identified attendees. Speakers are anonymised by default; third parties they referenced (authors, researchers, public figures) are identified where useful.
Other write-ups from participants
- Ben Hogan, What I Learned at Melbourne’s AI Engineering Unconference. Lessons from Melbourne’s AI Engineering Unconference: evals consume 50% of engineering time, synthetic scenarios unlock tacit knowledge from domain experts, and the real bottleneck in AI engineering is organisational, not technical.
- Javier Candeira, Evals and LLMs as Judges. Session notes. Content not attributed; if you know, you know.
If this seems valuable to you, we have AI Engineer Melbourne coming up in June 2026 focused on these topics and more.
2. Cross‑cutting themes
Nine themes ran through more than one session. Each one is worth its own conversation.
2.1 The bottleneck has moved — and it is now upstream and downstream of the code
Across the SDLC, AI architecture and spec‑driven development discussions, the same observation kept returning: AI makes it trivial to produce code, so the real work has moved to either side of the keyboard. Upstream, the hard part is eliciting tacit decision logic from subject‑matter experts who often cannot articulate their rules beyond “when I see it, I’ll know.” Downstream, code review, PR review, UAT and compliance sign‑off become the new bottlenecks, especially in domains like payroll or ISO compliance where correctness is non‑negotiable. One participant called the effect of AI‑assisted code generation a “tsunami of code” that threatens to drown review capacity; another described asymmetric contractual liability (liability caps set at 10× the contract value) as a reason enterprises cannot move fast even when AI could. The through‑line: AI amplifies pre‑existing organisational strengths and weaknesses. Teams without strong requirements discipline and quality gates get more rubbish, faster.
2.2 “Local” is not a single axis
The local‑models conversation (session 2) refused to let “local” collapse into “runs on my laptop.” Participants surfaced at least four distinct meanings worth separating: (a) on‑device execution for latency, cost and offline use (phones, Apple’s Neural Engine, WebGPU in the browser); (b) open weights/open source for control over model behaviour, safety filters and inspection; (c) bounded or sovereign deployment — ad‑hoc per‑project stacks (one attendee described using NixOS to spin inference up or down, locally or on Groq, as part of the project’s own orchestration); and (d) air‑gapped for hard privacy (the DGX Spark demo, Raul’s on‑device LLM). The warning that closed the loop was equally important: “privacy is a property of the system, not one part of the system” — a locally running model with web‑connected tools can be more vulnerable than a cloud API if the surrounding stack is poorly configured.
2.3 Spec as the source of truth — and specs written for agents, not humans
Spec‑driven development (and the phrase “spectral jail”) was argued as the natural corollary of generative code: if code can be regenerated from a well‑written markdown spec, then the spec becomes the canonical artifact, not the code. But the twist that made this feel genuinely new was that the audience for the spec is now the agent. One participant proposed a practical quality bar: two different agents should be able to implement the same feature via different approaches and still produce functionally equivalent outcomes. That demands far tighter specs — explicit input/output contracts, explicit enumerations, and, critically, explicit non‑functional requirements (performance, accessibility) rather than leaving them as architectural lore. “Wishy‑washy” specs were said to produce code that is “running 95% of the time and you don’t know that it’s not going to behave correctly that 5% of the time.”
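The "two agents, functionally equivalent outcomes" bar can be made checkable by running both implementations against the same input/output contract. The sketch below is illustrative only: the `Status` enumeration, the overdue-days rule and the two functions `impl_a` and `impl_b` are hypothetical stand-ins for a spec and for code produced by two different agents.

```python
from enum import Enum
from typing import Callable, Iterable, List


class Status(Enum):
    """Explicit enumeration taken from the (hypothetical) spec."""
    ACTIVE = "active"
    SUSPENDED = "suspended"
    CLOSED = "closed"


def impl_a(days_overdue: int) -> Status:
    """Stand-in for agent A's implementation: plain branching."""
    if days_overdue <= 0:
        return Status.ACTIVE
    if days_overdue <= 30:
        return Status.SUSPENDED
    return Status.CLOSED


def impl_b(days_overdue: int) -> Status:
    """Stand-in for agent B's implementation: table-driven."""
    thresholds = [(0, Status.ACTIVE), (30, Status.SUSPENDED)]
    for limit, status in thresholds:
        if days_overdue <= limit:
            return status
    return Status.CLOSED


def disagreements(
    f: Callable[[int], Status],
    g: Callable[[int], Status],
    inputs: Iterable[int],
) -> List[int]:
    """Return every input on which the two implementations differ."""
    return [x for x in inputs if f(x) != g(x)]


# Boundary values come straight from the spec's thresholds.
boundary_inputs = [-1, 0, 1, 30, 31, 365]
mismatches = disagreements(impl_a, impl_b, boundary_inputs)
print(mismatches)  # [] means the two approaches agree on the contract
```

A non-empty result points at an input the spec left ambiguous, which is usually a prompt to tighten the spec rather than to pick a winner.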
2.4 Control loops, evals and the end of vibes
Every technical session eventually converged on the same engineering truth: you cannot improve what you cannot measure, and the measurement substrate for agentic and LLM systems is not yet solved. The evals session opened with the blunt observation that “a lot of people just do vibes‑based testing and there’s clearly no future” in that. From there the discussion mapped out several useful tools: structured outputs and enumerated fields so failures are detected immediately; synthetic scenario generation to bootstrap eval datasets where no labeled data exists; AI‑generated “both sides” conversation traces for human review; pass / fail / needs‑review / irrelevant rubrics that non‑technical reviewers can actually use; and the cost strategy of generating with a strong model (Sonnet) and judging with a cheap one (Haiku). Two sharp caveats were raised: (1) LLM‑as‑judge is not always needed — at small scale, feeding reviewer comments directly back into prompt engineering can eliminate bugs before they recur; (2) a happy‑path eval gives false confidence, exactly like a unit test that only checks the happy path.
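As a rough illustration of the four-way rubric, the sketch below stores reviewer verdicts as an enumeration and computes a pass rate over the traces that were actually scored. The `Review` record and the `summarise` helper are assumptions made for this example, not something shown on the day.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum
from typing import Iterable, Optional


class Verdict(Enum):
    PASS = "pass"
    FAIL = "fail"
    NEEDS_REVIEW = "needs-review"
    IRRELEVANT = "irrelevant"


@dataclass
class Review:
    trace_id: str
    verdict: Verdict
    why: str = ""  # free-text reason captured from the reviewer


def parse_verdict(raw: str) -> Verdict:
    """Reject anything outside the enumeration so bad labels fail loudly."""
    return Verdict(raw.strip().lower())


def summarise(reviews: Iterable[Review]) -> dict:
    """Count verdicts; the pass rate ignores needs-review and irrelevant traces."""
    counts = Counter(r.verdict for r in reviews)
    scored = counts[Verdict.PASS] + counts[Verdict.FAIL]
    pass_rate: Optional[float] = counts[Verdict.PASS] / scored if scored else None
    return {"counts": counts, "pass_rate": pass_rate}


reviews = [
    Review("t1", parse_verdict("pass")),
    Review("t2", parse_verdict("FAIL"), "quoted the wrong policy"),
    Review("t3", parse_verdict("needs-review")),
    Review("t4", parse_verdict("pass")),
]
summary = summarise(reviews)
print(summary["pass_rate"])
```

Keeping the "why" next to each fail is what lets reviewer comments flow back into prompt changes at small scale, before an LLM judge is needed at all.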
2.5 Determinism vs autonomy — and the case for workflows instead of agents
The AI architecture session surfaced the single most practical insight of the day for anyone actually shipping agents: if you already know the shape of the work, don’t let the agent roam. Treat the process as a deterministic workflow and invoke the LLM only at the specific steps that need it, inside a harness that forces progression. Ways of imposing determinism were shared in depth: temperature and top‑k tuning; structured outputs with JSON enums; ReAct‑style tool loops with validators between iterations; ring‑fencing tool calls by “intent” (the Agent Gateway project); and, most radically, using AI to write the deterministic code that runs the workflow instead of using an agent at runtime at all. The balancing cost — and the open problem — is that over‑constraining the agent destroys the reason you wanted one.
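A minimal sketch of the "deterministic workflow, LLM at one step" idea follows. The ticket-routing scenario, the category names and the `llm` callable are all hypothetical; in practice `llm` would wrap whatever model client a team uses. The harness owns the order of steps, validates the model's JSON against an enumeration, retries a bounded number of times and falls back deterministically.

```python
import json
from typing import Callable, List

ALLOWED_CATEGORIES = {"billing", "outage", "other"}  # hypothetical enum


def classify(ticket: str, llm: Callable[[str], str], max_attempts: int = 3) -> str:
    """The single LLM step: expects JSON with an enumerated `category`."""
    prompt = f"Classify this ticket as one of {sorted(ALLOWED_CATEGORIES)}: {ticket}"
    for _ in range(max_attempts):
        raw = llm(prompt)
        try:
            category = json.loads(raw)["category"]
        except (json.JSONDecodeError, KeyError, TypeError):
            continue  # malformed output: retry
        if isinstance(category, str) and category in ALLOWED_CATEGORIES:
            return category
    return "other"  # deterministic fallback once retries are exhausted


def run_workflow(ticket: str, llm: Callable[[str], str]) -> List[str]:
    """The harness, not the model, decides the phase transitions."""
    steps = ["received"]
    category = classify(ticket, llm)
    steps.append(f"classified:{category}")
    queue = {"billing": "finance", "outage": "on-call", "other": "triage"}[category]
    steps.append(f"routed:{queue}")
    return steps


def make_flaky_llm() -> Callable[[str], str]:
    """Fake model for the example: two bad replies, then a valid one."""
    replies = iter(["not json", '{"category": "refund"}', '{"category": "billing"}'])
    return lambda prompt: next(replies)


result = run_workflow("I was charged twice", make_flaky_llm())
print(result)  # ['received', 'classified:billing', 'routed:finance']
```

The model never chooses the next step here, which is exactly the autonomy being traded away.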
2.6 Memory, context and the economics of long runs
Long unsupervised agent runs expose two related failure modes. The first is context economics: re‑sending accumulated context on every tool iteration creates quadratic cost growth, and allowing a run to approach very large windows (a million tokens) not only costs more but degrades performance — “it gets stupider” was the phrase used. The second is agent amnesia and handoff: per‑agent persistent memory does not compose across teams of agents. A charmingly simple pattern was described: have each agent leave an “I was here” (IWH) file in a directory describing its progress; the next agent reads it, deletes it, and leaves its own if unfinished. Elsewhere, TersoDB — a database‑backed file system recording every mutation — was discussed as a way to support true time‑travel replay/rollback for long runs, in contrast to audit logs or simple snapshots. The generalisable principle: externalise state; cap runs; break jobs into phases.
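The quadratic growth is easy to see with a little arithmetic. The sketch below assumes a simple model in which every turn re-sends a fixed base prompt plus everything accumulated on earlier turns; the token counts are made-up illustrative numbers, and the "phased" figure assumes each phase restarts from the base prompt because state has been externalised.

```python
def total_input_tokens(turns: int, base_tokens: int, tokens_per_turn: int) -> int:
    """Sum the context re-sent on every turn of a tool loop.

    Assumed model: turn i (0-indexed) sends the base prompt plus everything
    accumulated over the previous i turns.
    """
    return sum(base_tokens + i * tokens_per_turn for i in range(turns))


def closed_form(turns: int, base_tokens: int, tokens_per_turn: int) -> int:
    """Same total written as n*b + t*n*(n-1)/2, making the n^2 term explicit."""
    return turns * base_tokens + tokens_per_turn * turns * (turns - 1) // 2


# Illustrative numbers only: 2,000-token base prompt, 1,500 tokens added per turn.
capped = total_input_tokens(25, 2_000, 1_500)     # one 25-turn run
uncapped = total_input_tokens(100, 2_000, 1_500)  # one 100-turn run
phased = 4 * capped                               # four 25-turn phases

print(capped, uncapped, phased)  # 500000 7625000 2000000
```

Under these assumptions, four times as many turns costs about fifteen times as many input tokens in a single run, while splitting the same work into four capped phases costs four times as many.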
2.7 Safety is a system property, not a model property
Security and safety came up in both the AI architecture session and the local‑models session. The group was consistent and pointed: do not rely on an LLM alone to be the last line of defence. Real‑world failures discussed included a retailer shopping assistant that could be coerced into adding self‑harm items to a cart when a user expressed suicidal intent; Amazon jailbreak anecdotes; and the Air Canada chatbot that gave bad bereavement‑fare advice the airline was then held liable for. The pattern recommended: intent classification at the input, deterministic scanners/filters for PII/PCI at the output, format constraints as a final gate, and a clear‑eyed recognition that rare failures dominate reputational impact even when the average case is fine. One related thread (from the local‑models session) was specifically about controllability of safety filters: processing mental‑health counselling transcripts required a locally fine‑tuned Llama model with relaxed safety constraints because hosted models blocked trauma content. Safety is contextual.
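The "deterministic scanner at the output" part of that pattern might look something like the sketch below. It is deliberately narrow and makes several assumptions: only email addresses and card-like digit runs are handled, the regular expressions are simplistic, and a Luhn checksum is used to cut down false positives on card numbers. A production filter would cover far more.

```python
import re

# Simplistic patterns, assumed for illustration only.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
CARD_CANDIDATE_RE = re.compile(r"\b\d(?:[ -]?\d){12,18}\b")


def luhn_valid(digits: str) -> bool:
    """Luhn checksum: double every second digit from the right."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def redact(text: str) -> str:
    """Replace emails and Luhn-valid card numbers before text leaves the system."""
    text = EMAIL_RE.sub("[EMAIL]", text)

    def _card(match: re.Match) -> str:
        digits = re.sub(r"\D", "", match.group())
        return "[CARD]" if luhn_valid(digits) else match.group()

    return CARD_CANDIDATE_RE.sub(_card, text)


sample = "Contact jo@example.com, card 4111 1111 1111 1111, order 1234 5678 9012 3456."
cleaned = redact(sample)
print(cleaned)
```

Because the scanner is plain code, it behaves the same way on every call, which is the property the group wanted from the last gate.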
2.8 The implicit social contract around AI authorship is failing
The final thematic thread, carried mainly in the closing demos but echoed elsewhere, argued that the problem with AI‑generated artifacts is not their existence but the hidden nature of their creation. The project presented during demos proposes a non‑judgmental disclosure language, inspired by Daniel Miessler’s AI Influence Level (AIL), in the spirit of food labels (“contains peanuts”), Creative Commons badges and pronoun stickers. It separates the author’s declared outputs (e.g. words: AIL 0 human‑written; code: AIL 2; tests: AIL 3) from the inputs a maintainer will accept from contributors. The guiding analogy was driving speed: everyone thinks their own AI‑use norm is the correct one, so an explicit tag lets people with incompatible comfort levels still collaborate without ending relationships. Cory Doctorow’s 2 March 2026 piece on non‑consensual AI slop was invoked as the motivating evidence that the current implicit contract isn’t working, and both Datasheets for Datasets and Model Cards for Model Reporting were named as the academic lineage for food‑label metaphors in ML transparency.
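To make the outputs-versus-accepted-inputs split concrete, here is one possible encoding. It is an assumption of this write-up, not the project's actual format: levels are plain integers on the 0 to 5 AIL range, a maintainer's policy is read as a maximum acceptable level per artifact kind, and any kind the maintainer has not listed is treated as human-only.

```python
from typing import Dict, List

AIL_MIN, AIL_MAX = 0, 5  # the 0-5 range of the AIL scale


def over_limit(declared: Dict[str, int], accepted_max: Dict[str, int]) -> List[str]:
    """Return the artifact kinds whose declared level exceeds the maintainer's limit."""
    exceeded = []
    for kind, level in declared.items():
        if not AIL_MIN <= level <= AIL_MAX:
            raise ValueError(f"{kind}: AIL {level} is outside {AIL_MIN}-{AIL_MAX}")
        # Kinds the maintainer has not listed are treated as human-only (AIL 0).
        if level > accepted_max.get(kind, AIL_MIN):
            exceeded.append(kind)
    return sorted(exceeded)


# Hypothetical policy mirroring the example levels quoted on the day.
maintainer_accepts = {"words": 0, "code": 2, "tests": 3}
contribution = {"words": 1, "code": 2, "tests": 3}

flagged = over_limit(contribution, maintainer_accepts)
print(flagged)  # ['words']
```

The check only reports a mismatch; what happens next is a conversation between people, which fits the non-judgmental intent.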
2.9 Australia, sovereignty and the case for local capability
Woven through the opening and the hardware demos was a quieter nationalistic thread. The sarcastic “lucky country run by second‑rate intellects” framing from Donald Horne’s 1964 book The Lucky Country was invoked explicitly (and, as a participant noted, Australia’s biggest export is smart people who end up in Silicon Valley). Against that backdrop, more startups producing more complex technology was positioned as materially better for Australia’s economic resilience, especially given the awkward geopolitical position of selling commodities to China while being closely allied with the US. The DGX Spark in the corner of the room was not just a piece of kit; it was evidence that data‑centre‑class architecture (Blackwell, ARM64, CUDA) can now sit on a desk, enabling local capability work that previously required a cloud contract. Running LLMs on a phone or iPad was presented as the same argument at a different scale: sovereignty over data and execution.
3. Big ideas worth taking home
A selective, opinionated list of the concrete ideas from the day most worth stealing.
- Write the spec for the agent, not the human. Include input/output contracts, explicit enumerations, and non‑functional requirements. Aim for a spec tight enough that two agents produce functionally equivalent implementations via different approaches.
- Force structured outputs. JSON with enumerated fields turns soft failures into hard, immediately detectable ones you can retry deterministically.
- If the workflow is fixed, don’t use an agent at runtime. Use AI to write the deterministic code, then run the code.
- Call the LLM at specific steps of a deterministic harness, not as a free‑roaming agent, whenever you can tolerate the loss of autonomy.
- Cap runs. Externalise state. Break jobs into phases. Long agent runs eat context quadratically and degrade in quality.
- Pattern: “I was here” (IWH) handoff files. A lightweight way to pass progress between agents without relying on per‑agent memory.
- Instrument evals like analytics. Evals are the AI‑era equivalent of click/funnel analytics: they produce second‑order data that drives improvement.
- Build an eval UI for non‑technical reviewers. Colour‑code agent‑generated spans green. Use pass / fail / needs‑review / irrelevant. Close the loop like a bug tracker. Avoid the word confidence.
- Cost strategy for LLM‑as‑judge: generate with a strong model (Sonnet‑class), judge with a cheap one (Haiku‑class), where nuance permits.
- Bootstrap evals with synthetic scenarios. Have the model play both sides of 100 fake conversations, then hand them to a human assessor.
- Stimulus‑based feedback beats abstract interviews. SMEs who can’t articulate rules will critique specific synthetic traces at length.
- Build a non‑AI interface first that forces labelling. Pass/fail + “why” captures evolving requirements over time.
- Treat privacy as a whole‑system property. Local model + web‑connected tool + shaky config = cloud risk without cloud benefits.
- Use local when cloud safety filters are in the way. A locally fine‑tuned Llama was needed to process mental‑health counselling transcripts.
- Hybrid routing is the near‑term architecture: local first, escalate to frontier models on complexity. OpenRouter and LiteLLM were named.
- Hedge cloud token price risk. One prediction from the local‑models session: tokens become the currency of 2026.
- Consider SSD‑streaming to run very large models locally — weights on disk, stream needed parts into memory. (See AirLLM below.)
- Legacy modernisation is not a translation problem — it’s a verification problem. You can convert stored procedures to TypeScript; proving behavioural equivalence without comprehensive tests is the real work.
- Adopt a disclosure language for AI authorship based on Daniel Miessler’s AIL: declare your outputs (words, graphics, code, tests) and the inputs you will accept from contributors.
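The hybrid-routing item in the list above ("local first, escalate to frontier models on complexity") can be sketched in a few lines. Everything here is hypothetical: the complexity heuristic is crude on purpose, and `local` and `frontier` are plain callables standing in for real model clients rather than the OpenRouter or LiteLLM APIs.

```python
from typing import Callable, Tuple

REASONING_HINTS = ("prove", "refactor", "multi-step")  # assumed keywords


def estimate_complexity(prompt: str) -> int:
    """Crude heuristic: long prompts and reasoning keywords score higher."""
    score = len(prompt.split()) // 50
    score += sum(hint in prompt.lower() for hint in REASONING_HINTS)
    return score


def route(
    prompt: str,
    local: Callable[[str], str],
    frontier: Callable[[str], str],
    threshold: int = 2,
) -> Tuple[str, str]:
    """Try the local model first; escalate on complexity or an empty answer."""
    if estimate_complexity(prompt) >= threshold:
        return "frontier", frontier(prompt)
    answer = local(prompt)
    if not answer.strip():
        return "frontier", frontier(prompt)
    return "local", answer


def local_model(prompt: str) -> str:
    return "local answer"


def frontier_model(prompt: str) -> str:
    return "frontier answer"


easy = route("Summarise this note", local_model, frontier_model)
hard = route(
    "Refactor this module and prove the multi-step migration is safe",
    local_model,
    frontier_model,
)
print(easy, hard)
```

Any escalation sends the prompt off the device, so the privacy-as-a-system-property point from the same list applies to the router as much as to the models.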
4. Memorable quotes
Attributions are omitted by Chatham House Rule; we’ve preserved phrasing lightly edited for readability.
“Whoever comes is the right people; whatever happens is the only thing that could have happened; when it starts, it’s the right time; when it’s over, it’s over.” — on the Open Space principles.
“Lucky and dumb… Australia is a lucky country run by second‑rate intellects.” — opening riff on Donald Horne.
“It’s more expensive than the car that I’m driving.” — on the DGX Spark.
“It’s the only computer in history that’s gone up in value since you bought it, because RAM has gone up in value.” — also on the DGX Spark.
“You can’t limit the agent too much because you lose all the benefit of it, but you can’t let it free because it will do all the dominant [wrong] things.” — on ring‑fencing.
“If it’s running 95% of the time and you don’t know that it’s not going to behave correctly that 5% of the time.” — on wishy‑washy specs.
“Code isn’t source of truth. The spec is the source of truth. The code is what’s running — but correctness is measured against the spec.”
“You do not need to let the agent do everything. Maybe you use [AI] to create a program that runs your fixed workflow — even faster and more predictable.”
“If you’re trying to funnel that into [spec‑driven] development, the whole sort of art of business analysis needs to adapt. Software engineering is already hugely adapted to AI and business analysis is moving very slowly.”
“Tokens will become the currency of 2026.” — on cloud token pricing pushing local adoption.
“Privacy is a property of a system, not one part of the system.”
“AI amplifies pre‑existing organisational problems. If teams don’t understand the problem, AI makes it easier to produce more rubbish, faster.”
“Specs written for agents, not humans, must be far more specific — precise enough that two different agents can implement them differently but still produce the same functional result.”
“A lot of people just do vibes‑based testing and there’s clearly no future [in that].”
“If I saw a code base that only has tests for the happy path, I would be like, are these tests even testing the code? What failure modes of the production system do your evals catch — and what failure modes come from bad evals?”
“Is it testing? Because it’s not the same concept as software unit tests.” — on what evals actually are.
“First I want to sell this: I’m doing it for control over the full stack. I don’t sell it as better than ChatGPT — you can’t compete at that level. The point is I own it.”
“I have used this when we’re in the bushes and my wife and myself were discussing a topic and we didn’t have internet. Literally we didn’t have internet because we’re in the middle of nowhere.” — on offline use cases for local assistants.
“This is not about hating peanuts. It’s about you knowing that they’re peanuts.” — on the food‑label metaphor for AI disclosure.
“Everyone thinks their own [driving] speed is correct; faster drivers are lunatics and slower drivers are dangerously inconsiderate.” — applied to AI‑use norms.
“Any words that come into this project, I expect a human to have written. If you give me a PR, you can write me a letter; if you send a commit, the commit message, I expect it to be written by a human. It would be ridiculous if I did anything but [AIL] 2 and 3 for code and tests — but I could.”
“Having a startup is lonely.”
“What I do is summarise all this and it ends up being an incredible resource… capturing the exhaust fumes of a day like today and turning them into something actionable and long‑lasting.” — on recording and anonymising the sessions.
“Don’t smile because it happened, baby, cry because it’s over.” — Sabrina Carpenter.
5. Open questions the day left on the table
These are genuine unresolved questions that multiple participants engaged with. They are a good prompt for future sessions, writing, or experimentation.
- How do you ring‑fence an agent tightly enough to be safe, without destroying the autonomy that made it useful?
- Are evals tests, or observability? Or is the answer “both, and treat them differently depending on whether you’re gating a POC, measuring production, or debugging”?
- What are the failure modes of bad evals? (No one had a confident answer; the open challenge was raised explicitly.)
- How much spec is enough? How do you avoid recreating waterfall while still producing specs precise enough for cross‑agent consistency?
- How do you prove behavioural equivalence when modernising a legacy system that has no comprehensive tests and decades of undocumented edge cases?
- Is the right modernisation unit the application (rewrite) or the data (port and make customer‑owned)?
- Can homomorphic / semantic encryption (LangGraph’s Agent Crypt proof‑of‑concept) make cloud escalation privacy‑preserving enough to outcompete pure local execution at scale?
- Will WebGPU genuinely turn the browser into a local inference surface, and if so, does that change distribution economics for AI‑powered apps?
- What happens to Scrum when a sprint’s planned work completes overnight? Is sprint planning obsolete, or more important than ever?
- Does the non‑judgmental AI disclosure language get adopted voluntarily, or does it only work when embedded in codes of conduct and publisher/conference policy?
- What is the right division of labour in AI‑first spec writing between product managers, UX designers, BAs and engineers? Does the designer/engineer role merge?
6. References — people, papers, books, articles, tools
Where a reference is to a third party the attendees named (not an attendee), we’ve identified them with sources. Attendees themselves are not identified.
6.1 People referenced (non‑attendees)
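- Donald Horne — Australian author, invoked via his 1964 book The Lucky Country. The famous line, “Australia is a lucky country run mainly by second rate people who share its luck”, was paraphrased on the day as “run by second‑rate intellects” and framed explicitly as sarcasm, as Horne intended it.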
- Martin Fowler — cited in the intro as a figure associated with the agile‑era practices that underpin Open Space Technology.
- Andrej Karpathy — not named directly, but the author of both nanoGPT and nanochat, which were referenced in the SDLC session (nanoGPT) and the DGX Spark demo (nanochat training in ~8–12 hours on the Spark).
- Timnit Gebru and Margaret Mitchell — not named directly, but the lead authors of Datasheets for Datasets and Model Cards for Model Reporting respectively, referenced as prior work using the food‑label metaphor for ML transparency.
- Daniel Miessler — security/tech writer; creator of the AI Influence Level (AIL) disclosure scale that foundationally inspired the Purposeful/Per‑Leia project presented in the demos.
- Cory Doctorow — author and activist; his 2 March 2026 piece “No one wants to read your AI slop” was invoked as evidence that the implicit social contract around AI‑generated artifacts is failing. A speaker mentioned knowing Doctorow via Creative Commons.
- Hamel Husain and Shreya Shankar — creators of the widely recommended AI Evals For Engineers & PMs course on Maven.
- Michael Feathers — referenced as the author of Working Effectively with Legacy Code — the canonical reference for modernising codebases that lack tests.
- Henri‑Paul Motte — 19th‑century French history painter; used as a metaphor for LLMs as “engravings of the masters” — accessible reproductions of work previously out of reach. (And yes, Motte = hill, so “Henry Hill” — the same name as the protagonist of Goodfellas.)
- Sabrina Carpenter — credited with “Don’t smile because it happened, baby, cry because it’s over,” a lyric that inverts the saying “Don’t cry because it’s over, smile because it happened.” That original saying is commonly (mis)attributed to Dr Seuss and sometimes to Gabriel García Márquez; its true origin is disputed, so it is worth treating as public‑domain folk wisdom rather than the work of any single author.
- Michel Pleifer and Maya Soren — named as collaborators on the Purposeful/Per Leia AI‑disclosure project. Public footprints are light; they were described on the day as a “luminary in the AI tech scene” (Pleifer) and as involved in fan‑fiction / publishing norms (Soren).
6.2 Papers and articles
- Gebru et al., Datasheets for Datasets (2018) — documentation metaphor for ML datasets.
- Mitchell et al., Model Cards for Model Reporting (2019) — benchmarked evaluation documentation for trained models.
- Daniel Miessler, AI Influence Level (AIL) v1.0 — the 0–5 scale for declaring AI involvement in content.
- Cory Doctorow, No one wants to read your AI slop (Pluralistic, 2 March 2026) — argument that non‑consensual AI artefacts break social contracts.
- Anthropic interpretability papers — referenced in passing (“Anthropic papers”) during the local‑models discussion on open‑weights analysis; see the Transformer Circuits thread for this line of work.
6.3 Books
- Donald Horne — The Lucky Country (1964). The opening riff.
- Michael Feathers — Working Effectively with Legacy Code (2004). Cited directly in the SDLC legacy‑modernisation discussion.
6.4 Courses
- AI Evals For Engineers & PMs — Maven by Hamel Husain and Shreya Shankar. Explicitly recommended in the evals session.
6.5 Tools, projects and products referenced
Organised roughly by where they fit in the AI stack.
Agents, harnesses and frameworks
- Claude Code — repeatedly discussed, both positively (throughput) and critically (harness updates breaking hooks/tool‑use).
- Cursor — mentioned as a context for using Claude Code.
- Anthropic SDKs — recommended over subscription/headless scraping for automation.
- MCP (Model Context Protocol) — discussed as an alternative to CLI‑based tool wiring for more interactive flows.
- GCP Agent Engine (Vertex AI Agent Builder) — praised for out‑of‑the‑box tracing/memory/deployment for multi‑agent systems.
- TersoDB — described as a Linux‑style filesystem running inside a Rust database, recording every mutation for time‑travel replay. (Public information is limited; the reference in the session was second‑hand.)
- Agent Gateway — a Linux Foundation–aligned project for ring‑fencing agent tool/A2A calls against declared user intent. (See e.g. the agentgateway.dev project.)
- DSPy (Stanford) — strongly recommended for typed/structured outputs and functional prompting to reduce variance and token waste.
- STORM — Stanford’s system that has multiple generated personas debate a topic before synthesising a report.
- Klein / Cline and orchestration layers (referenced as “Klein Kanban”) — mentioned as examples of emerging project‑level orchestrators that manage dependencies across an entire project.
- Roo Code — referenced as “Robo” in the spec‑driven discussion, used to produce a “complex document” from validated requirements before breaking it into Jira stories.
- openspec — used for writing functional specifications that an agent implements.
- spec kit — GitHub’s spec‑kit, referenced as an earlier generation of documentation‑to‑code scaffolding.
- superpowers skill set and its brainstorming / visualization companion skills — used to interview users and generate multiple HTML mockups in a single conversation. (See the superpowers skill pack and related work; skills here are Claude Code–style packaged capabilities.)
Local inference, routing and serving
- Ollama — demoed running on the DGX Spark.
- ComfyUI — demoed with Stable Diffusion 1.5.
- vLLM — referenced for fast inference serving.
- OpenRouter — hybrid local/frontier routing.
- LiteLLM — cross‑provider proxy layer.
- NVIDIA NeMo — mentioned alongside the DGX Spark ecosystem.
- AirLLM / Flash‑style SSD streaming — approach for running very large models from disk. See AirLLM.
- WebGPU — speculated as the coming distribution channel for in‑browser local inference.
- Apple Neural Engine — discussed as hardware well‑positioned for on‑device inference.
- NVIDIA DGX Spark — Blackwell‑based, ARM64, CUDA 13, ~128 GB unified memory; demoed on‑site.
Models named
- Google Gemma (distilled from the Gemini family) — framed as the coming wave of strong small local models.
- Qwen / Qwen3 — demoed on Ollama and used by one attendee on Groq’s free tier.
- Llama — fine‑tuned locally for the mental‑health transcript use case.
- DeepSeek — mentioned as a target for interpretability/autoencoder work.
- Whisper / “Whisper Turbo” — used for transcription.
- Mistral Voxtral — a voice model mentioned in passing.
- Claude Sonnet / Haiku / Opus — discussed for generation/judge trade‑offs.
- Mythos — named alongside Opus as a model/tool in the legacy‑refactoring discussion (details sparse).
Transcription and observability
- Deepgram and AssemblyAI — both named for diarisation.
- Weights & Biases — tried for evals, found “too technical” as an end‑user review UI.
- PostHog — example of live monitoring tooling.
Encryption, privacy, sovereignty
- NixOS — used by one speaker to spin dev/inference environments per project.
- Groq — used on a free tier to serve open models when “local” meant sovereign rather than on‑device.
- LangChain / LangGraph — ecosystem cited for a research proof‑of‑concept called Agent Crypt using semantic encryption to allow encrypted delegation to remote models. (Homomorphic encryption was raised as the broader conceptual frame.)
Legacy / SDLC / requirements
- Notion — used for meeting recording/transcription and turning conversations into user stories.
- Copilot — used inside one organisation for meeting synthesis where Anthropic models weren’t permitted.
- Confluence and Jira — the traditional target of requirements ingestion.
- Playwright — E2E testing in legacy contexts.
- UiPath — cited both as an RPA incumbent and for its emerging exploratory‑testing direction.
- Google Cloud Model Armor — used as a guardrail product; described as “hit or miss.”
Hardware / platform
- Dell and ASUS — named as DGX Spark / GB10 vendors in Australia.
- Apple Mac Studio — discussed as an alternative local compute platform (e.g. 512 GB unified memory configurations).
- Stone & Chalk — the venue.
- Project Zero — the coffee shop across the road, recommended for a lychee‑fermented single origin.
Real‑world incidents referenced
- The Air Canada chatbot case — bereavement‑fare advice the airline was held liable for.
- An unnamed retailer’s shopping assistant that could be coerced into adding self‑harm items to cart when a user expressed suicidal intent.
- Waymo — discussed in an autonomy/tolerance‑for‑mistakes analogy.
7. Session‑by‑session deep dives
7.1 Opening and orientation
The opening established tone, rules and the day’s grid. Sam from MLAI Australia pitched the Australian startup community mission: more complex industries, more startups, free co‑working and community for founders (with the aside that “having a startup is lonely”). The facilitator walked attendees through Open Space principles — whoever comes is the right people; whatever happens is the only thing that could have happened; when it starts is the right time; when it’s over, it’s over — and the law of two feet: if a session stops being useful, leave. It’s not rude. It is the method.
Attendees wrote session ideas on cue cards, discussed them briefly with neighbours, submitted them for clustering into themes and dot‑voted the results. Recording and anonymisation were discussed openly: participants consented to having their phones record sessions for summarisation, with the explicit promise that the output would be anonymised and that redactions were available on request.
The four opening tracks that emerged were:
- Agentic system architectures (the AI Architecture session — §7.2)
- AI‑driven software development life cycle (became the local‑AI SDLC discussion — §7.3)
- Privacy and data (folded partly into §7.3 and the demos — §7.5)
- AI skills, juniors, and upskilling (people/skills track)
Two participants asked for specific accommodations worth noting as culture: one participant workshopping a product asked to curate the output of their session because the output was the thing they were designing; another gave a gentle request for speakers to enunciate and project, as an English‑second‑language, hard‑of‑hearing participant — a small intervention that visibly improved the day.
The DGX Spark sat on a table, running demos roughly every 20–30 minutes. Its presence anchored the day’s recurring question: what can you now do locally that used to require the cloud?
7.2 AI Architecture / Agentic system architectures
The richest technical session of the day. It opened with a discussion of TersoDB — a Linux‑style filesystem running inside a Rust database that records every file mutation, enabling time‑travel replay for agent runs. The argument that landed was that tracing captures LLM calls; audit logs capture discrete events; rollback requires capturing every mutation, and only a database‑backed filesystem naturally gives you that. A counter‑argument for Git‑style version control (human‑readable, familiar) was raised but ultimately the room leaned toward database‑backed for production: queryable, remotely accessible, and supporting tenant isolation once agents leave the dev laptop and start running in shared environments.
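The difference between "snapshots" and "every mutation" is easiest to see in miniature. The sketch below is not TersoDB's API (public details are limited); it is a hypothetical in-memory append-only log showing why recording each write and delete makes it possible to rebuild the file state at any earlier point in a run.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

Entry = Tuple[str, str, Optional[str]]  # (operation, path, content)


@dataclass
class MutationLog:
    """Append-only record of file mutations; state is derived by replay."""
    entries: List[Entry] = field(default_factory=list)

    def write(self, path: str, content: str) -> int:
        self.entries.append(("write", path, content))
        return len(self.entries)  # sequence number after this mutation

    def delete(self, path: str) -> int:
        self.entries.append(("delete", path, None))
        return len(self.entries)

    def state_at(self, seq: int) -> Dict[str, str]:
        """Replay the first `seq` mutations to rebuild the files at that point."""
        files: Dict[str, str] = {}
        for op, path, content in self.entries[:seq]:
            if op == "write" and content is not None:
                files[path] = content
            else:
                files.pop(path, None)
        return files


log = MutationLog()
log.write("plan.md", "draft 1")
log.write("plan.md", "draft 2")
log.delete("plan.md")

print(log.state_at(2))  # {'plan.md': 'draft 2'}
print(log.state_at(3))  # {}
```

Rolling back is then just choosing an earlier sequence number, which is the property the room wanted for long agent runs.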
The conversation then moved into the most practically useful thread of the day — why agents skip steps, ignore tool instructions, and break harnesses, and what to do about it:
- Temperature and top‑k tuning — towards zero for determinism.
- Structured outputs (JSON + enums) — turn silent failures into detectable ones.
- Deterministic workflow + LLM at specific steps — the harness decides the phase transitions.
- ReAct‑style tool loops with validators between iterations.
- Ring‑fencing via intent‑based access control — Agent Gateway was introduced here.
- Stop using an agent at runtime — for fixed workflows, have AI write the code that runs the workflow.
A use case grounded the conversation: a field‑service troubleshooting agent that inspects graphs, correlates signals, and produces on‑site actions for a serviceman. The practical advice landed hard: that sounds like an agentic workflow, not a free‑roaming agent — let the harness drive and call the LLM only where creativity is needed.
The conversation then opened two further fronts. Token economics: long tool loops resend context on each iteration, so cost grows quadratically; very long windows both cost more and degrade performance. Practical mitigations: cap runs (e.g. 25 turns), break jobs into phases, externalise state so the next iteration doesn’t restart from zero. Multi‑agent memory: per‑agent memory features don’t compose; the “I was here” (IWH) file pattern was described as a lightweight way for agent teams to hand off progress — each agent reads the note, deletes it, and leaves its own if unfinished.
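The IWH handoff is simple enough to sketch directly. The file name `IWH.json` and the JSON fields are assumptions for this example; the session described only the behaviour: read the previous note, delete it, and leave your own if work remains.

```python
import json
import tempfile
from pathlib import Path
from typing import List, Optional

IWH_NAME = "IWH.json"  # hypothetical file name for the "I was here" note


def pick_up(workdir: Path) -> Optional[dict]:
    """Read and delete the previous agent's note, if there is one."""
    note = workdir / IWH_NAME
    if not note.exists():
        return None
    progress = json.loads(note.read_text())
    note.unlink()
    return progress


def hand_off(workdir: Path, agent: str, done: List[str], remaining: List[str]) -> None:
    """Leave a note only if the job is unfinished."""
    if not remaining:
        return
    payload = {"agent": agent, "done": done, "remaining": remaining}
    (workdir / IWH_NAME).write_text(json.dumps(payload))


with tempfile.TemporaryDirectory() as tmp:
    workdir = Path(tmp)
    first_seen = pick_up(workdir)  # None: nobody was here yet
    hand_off(workdir, "agent-1", done=["schema"], remaining=["api", "tests"])
    second_seen = pick_up(workdir)  # the next agent reads and deletes the note
    note_gone = not (workdir / IWH_NAME).exists()
    hand_off(workdir, "agent-2", done=["schema", "api", "tests"], remaining=[])
    finished_clean = not (workdir / IWH_NAME).exists()

print(second_seen["remaining"])  # ['api', 'tests']
```

Because the note lives in the working directory rather than in any agent's memory, it survives a change of agent, model or harness.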
Frameworks and techniques named: GCP Agent Engine for out‑of‑the‑box multi‑agent scaffolding and deployment; DSPy (from Stanford) for typed/functional prompting — one attendee demonstrated extracting a numeric annual revenue with a confidence score from a financial disclosure statement, replacing verbose prose with a float; STORM, also Stanford, for debate‑style report generation where multiple personas argue before a synthesis step (loosely compared to mixture‑of‑experts, though different).
The session closed on safety and security, which set up later threads. The key framing was: don’t rely on the LLM alone. Intent classification at the input; deterministic PII/PCI scanning at the output; strict format constraints as a final gate. Real‑world failures were shared — a retailer shopping assistant that failed the self‑harm test; Amazon jailbreak anecdotes; Air Canada’s chatbot. Google Cloud Model Armor was described as “hit or miss.” The Waymo analogy made the deeper point: the public tolerance threshold for rare machine failures is much lower than for rare human ones, and reputational damage from rare failures will dominate even when the average case is fine.
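A deterministic output scan of the kind described might look like this minimal sketch. The regular expressions, the function names and the withheld‑message text are assumptions for illustration; a production scanner would cover far more patterns. The Luhn checksum is the standard check used on payment card numbers and is included here to reduce false positives.

```python
import re

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# 13 to 19 digits, optionally separated by single spaces or hyphens.
CARD_RE = re.compile(r"\b\d(?:[ -]?\d){12,18}\b")


def luhn_ok(digits: str) -> bool:
    """Luhn checksum, used to cut false positives on long digit runs."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        n = int(ch)
        if i % 2 == 1:  # double every second digit from the right
            n *= 2
            if n > 9:
                n -= 9
        total += n
    return total % 10 == 0


def scan_output(text: str) -> list[str]:
    """Return the kinds of sensitive data found in a model reply."""
    findings = []
    if EMAIL_RE.search(text):
        findings.append("email")
    for match in CARD_RE.finditer(text):
        digits = re.sub(r"[ -]", "", match.group())
        if luhn_ok(digits):
            findings.append("card_number")
            break
    return findings


def release(text: str) -> str:
    """Final gate: block the reply if the scanner finds anything."""
    return "[withheld: sensitive data detected]" if scan_output(text) else text
```

Nothing here asks a model for its opinion, which is the point of the "don’t rely on the LLM alone" framing: the gate behaves the same way on every run.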
7.3 AI‑driven SDLC
A discussion among consultants, contractors, data scientists, and long‑time builders (including a head of HR/payroll tech at ReadyTech, an Open Electricity contractor, a Black Book AI principal consultant, Freshwater Futures and a Stuart International cloud architect) about how AI is reshaping the SDLC and where the real bottlenecks now sit.
The opening framing came from a participant who’d been thinking about an “AI SDLC” framework that maps the phases (planning, requirements, design, testing, build/architecture, deployment, monitoring/ops) against the AI levers available at each: Claude.md agents, MCP tools, packaged skills, scheduled tasks, CI workflows in GitHub Actions, and the emerging orchestrator tier exemplified by Cline‑style project orchestrators. Their honest report from running an AI‑augmented SDLC workshop: compressed into a single day, the group only got through plan, design, test and build. There is so much “juice” in deployment and monitoring that they couldn’t reach in the time available.
The dominant findings:
- The hardest part is eliciting tacit decision logic from time‑poor SMEs who cannot articulate their rules beyond “when I see it, I’ll know.” A vivid example: a government infrastructure authority trying to consolidate thousands of nuanced engineering comments on drawings — when to merge similar fence‑related comments vs keep them separate due to different regulatory purposes, with engineers unable to articulate why beyond pattern recognition.
- Government and enterprise constraints can consume millions before production. One case cited: nearly $2M spent on a project without production deployment, due to operating models, security/privacy work and repeated PoCs; even spinning up an Azure server could take weeks due to gatekeeping. AI doesn’t remove gatekeeping.
- Stimulus‑based feedback works. A successful approach from a financial services claims‑assessor chatbot project: generate hundreds of synthetic scenarios and AI‑produced conversation traces, then have SMEs rapidly critique what’s wrong. The synthetic stimulus unlocked detailed feedback that abstract questioning could not. (This is the same pattern as the synthetic‑scenarios technique used in the evals session — strong cross‑session validation of the approach.)
- A non‑AI interface that forces labelling first. Building a baseline interface that requires users to mark outputs good/bad with reasons creates a continuous feedback loop that captures evolving requirements over time. The example given: tunnelling work today, rail systems tomorrow — requirements need to keep adapting.
- Legacy modernisation is a verification problem, not a translation one. You can convert stored procedures to TypeScript modules with AI; proving behavioural equivalence without comprehensive tests is the actual challenge. Michael Feathers’ Working Effectively with Legacy Code was cited directly.
- Data may be the durable asset. Applications can be replaced; systems of record remain. One participant referenced learning at Xero that while functionality changes, the data remains valuable; others suggested making data more portable and customer‑owned.
- The “tsunami of code” reframes the SDLC bottleneck as PR review, testing and UAT — especially in high‑correctness domains like payroll. The advice: fix testing and review processes before going all‑in on AI‑driven velocity.
- Exploratory testing automation was raised as an emerging direction (referencing UiPath’s direction): an AI agent given a time window and a goal to “try to break stuff” without predefined test cases — akin to continuous monkey testing.
- Contractual liability in enterprise/government contracts (e.g. 10× contract value liability caps) makes experimentation harder. Modular, smaller pieces of work reduce risk but cut against enterprise sales incentives for fewer, larger engagements.
- Scrum under AI acceleration: a real debate about whether sprints make sense when work completes overnight, versus the view that planning and quality gates become more important, not less, as code generation accelerates — lest teams build “tech for tech’s sake.”
The meta‑observation that closed the session: AI amplifies pre‑existing organisational problems. If teams don’t understand the problem, AI makes it easier to produce more rubbish, faster. Domain expertise access is non‑negotiable; AI cannot be used to “skirt” the need for real understanding.
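The legacy modernisation finding above (verification, not translation) lends itself to a small illustration. One common way to gather evidence of behavioural equivalence is differential testing: run the old and new code on the same inputs and report any disagreement. The rounding rule and all three functions below are invented stand‑ins, not code from the session.

```python
import random
from typing import Callable


def legacy_rounding(cents: int) -> int:
    """Stand-in for legacy behaviour: round to the nearest 5 cents."""
    return ((cents + 2) // 5) * 5


def migrated_rounding(cents: int) -> int:
    """Stand-in for a converted version that preserves the behaviour."""
    return 5 * round(cents / 5)


def buggy_rounding(cents: int) -> int:
    """Stand-in for a plausible conversion that is wrong for negatives."""
    return int(cents / 5 + 0.5) * 5


def find_mismatches(old: Callable[[int], int], new: Callable[[int], int],
                    samples: int = 1000, seed: int = 0) -> list[int]:
    """Return the sampled inputs on which the two versions disagree."""
    rng = random.Random(seed)  # seeded so failures are reproducible
    inputs = [rng.randint(-10_000, 10_000) for _ in range(samples)]
    return [x for x in inputs if old(x) != new(x)]
```

An empty result is evidence, not proof: it only covers the sampled range. A non‑empty one, as with `buggy_rounding` on negative amounts, is exactly the kind of discrepancy that a line‑by‑line read of the translated code tends to miss.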
7.4 Local AI and local models
Despite the filename, this session turned into the local models conversation — and it was one of the strongest discussions of the day. It started with transcription tooling (good mics, Whisper Turbo, diarisation from Deepgram/AssemblyAI, Mistral’s Voxtral voice model) and then quickly expanded.
The opening thesis was that local models are reaching a tipping point: distilled models (Google’s Gemma family, distilled from Gemini‑class) are getting rapidly better; phones and laptops already carry sufficient compute; WebGPU could shortly allow web apps to push a model into the browser on load; Apple’s Neural Engine and similar on‑device accelerators make token generation velocity “insane.” One participant predicted “tokens become the currency of 2026” as cloud token prices rise and local options compete. A practical trick was raised for very large models: SSD streaming, storing weights on disk and streaming needed parts into RAM, enables laptop‑plus‑SSD setups to run models far larger than memory would otherwise allow (see AirLLM).
The single best reframing of the session came next: local is not one axis. A participant argued that we should mean at least four things by “local” — on‑device execution (phones/laptops/Neural Engine), open‑source/open‑weights control, bounded or sovereign deployment (ad‑hoc per‑project stacks, e.g. via NixOS spinning inference locally or onto Groq), and air‑gapped. This reframing underpinned the best piece of safety advice from the day: privacy is a whole‑system property. An air‑gapped box is not equivalent to a local model with browser‑connected tools — the latter can be more vulnerable than a cloud API in a well‑governed enterprise.
Two concrete use cases made the case for local strongly:
- Mental‑health counselling transcripts: commercial hosted models blocked trauma content on safety policy grounds. The team used a locally fine‑tuned Llama with relaxed safety constraints. This is an argument for control, not just privacy.
- University students teaching workshops: paid APIs are simply not affordable at student scale. Local models are the only way to run the scale of experiments students need.
A pattern emerged as the dominant near‑term architecture: hybrid routing — a simple router in front of multiple models that sends simple queries to a local model and escalates complex ones to a frontier provider. OpenRouter and LiteLLM were named. One attendee went further, describing an edge‑to‑cloud distributed compute stack where local motion/person detection runs on the edge device, authorisation runs on a phone, deeper analysis runs in the cloud — each tier adding value without redoing earlier steps.
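A minimal sketch of the routing idea, assuming cheap keyword and length rules stand in for whatever policy a real deployment uses. It does not use OpenRouter or LiteLLM; the `route` and `answer` functions, the word limit and the escalation terms are all hypothetical, and the two models are passed in as plain callables.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Route:
    target: str  # "local" or "frontier"
    reason: str


def route(query: str, max_local_words: int = 40,
          escalate_terms: tuple[str, ...] = ("prove", "refactor", "legal")) -> Route:
    """Pick a tier with cheap, inspectable rules (no model call)."""
    words = query.lower().split()
    if len(words) > max_local_words:
        return Route("frontier", "long query")
    hits = [t for t in escalate_terms if t in words]
    if hits:
        return Route("frontier", f"matched term: {hits[0]}")
    return Route("local", "short and simple")


def answer(query: str, local: Callable[[str], str],
           frontier: Callable[[str], str]) -> str:
    """Dispatch to whichever callable the router picked."""
    chosen = route(query)
    model = local if chosen.target == "local" else frontier
    return model(query)
```

Keeping the reason alongside the target makes routing decisions easy to log and audit, which matters once cost and privacy both depend on them.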
The final thread pushed back gently on “open weights = open.” Training a frontier‑class model from scratch is a tens‑of‑millions‑of‑dollars problem. But the counter‑argument was that you don’t need to train from scratch to do meaningful work — you can still analyse and interpret existing open models, run autoencoders on DeepSeek or Qwen, and so on — much of that for far less money, though time remains a constraint. NanoGPT was cited as an example project.
The session closed on an intriguing forward‑look: encrypted delegation as an alternative to full localisation. Homomorphic encryption as the general concept; a LangGraph research proof‑of‑concept called Agent Crypt based on “semantic encryption” as a concrete direction — encrypt the sensitive parts of a message before escalating to a remote model, so the model does useful work on material it cannot decrypt.
7.5 Spec‑driven development
The session that introduced — and partly stuck with — the phrase “spectral jail” to describe a world in which the importance of requirements/specs becomes paramount again. The central claim was that teams are shifting from “code as the source of truth” to “specs/requirements as the source of truth,” because requirements written in markdown can now drive code generation. The corollary discipline is making non‑functional requirements (NFRs — performance, accessibility) explicit in the spec rather than leaving them as architectural lore.
The proposed loop sounded almost old‑school: talk to users, document requirements, challenge the gaps, then generate code, test, deploy. But the modern twist was that specs are now written for agents, not humans, so they must be far more precise. A practical quality bar emerged: two different agents should be able to implement the same feature via different approaches and arrive at the same functional behaviour. That’s a tighter bar than most current specs clear.
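One way to picture that quality bar is a spec expressed as input/expected pairs that any implementation must satisfy. The sketch below is purely illustrative: the "slugify" feature, the cases and the three implementations are invented, with the two passing functions standing in for what two different agents might produce.

```python
import re

# Hypothetical spec for a "slugify" feature, written as input/expected pairs.
SPEC_CASES = [
    ("Hello World", "hello-world"),
    ("  Spec-driven   development ", "spec-driven-development"),
    ("Already-a-slug", "already-a-slug"),
]


def slugify_a(title: str) -> str:
    """Approach one: split on whitespace and join."""
    return "-".join(title.lower().split())


def slugify_b(title: str) -> str:
    """Approach two: regex substitution."""
    return re.sub(r"\s+", "-", title.strip().lower())


def slugify_naive(title: str) -> str:
    """A third attempt that a vaguer spec would have let through."""
    return title.lower().replace(" ", "-")


def failures(impl) -> list[str]:
    """Inputs for which an implementation departs from the spec."""
    return [inp for inp, want in SPEC_CASES if impl(inp) != want]
```

The first two approaches differ internally but agree on every case; the third fails only the case with repeated spaces, which is the sort of variance a precise spec is meant to exclude.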
Approaches discussed: a combination of openspec for functional specifications plus additional “harnessing” to capture technical constraints (architecture, input/output contracts, parameters). One participant flagged the rub: “if the spec is too wishy‑washy then it allows for too much variance in the code, and you’ve got something that might be running 95% of the time and you don’t know that it’s not going to behave correctly that 5% of the time.”
A real workflow was shared from inside one organisation that’s not allowed to use Anthropic models but can use Copilot: meetings recorded and transcribed, Copilot synthesises the discussion, key stakeholders confirm “yes, this is what we discussed,” the product owner takes responsibility for correctness, and the validated requirements get turned into a structured document via Roo Code (referenced as “Robo”) before being broken into Jira stories.
A second technique landed strongly: the brainstorming skill from the superpowers skill set, which interviews the user and simultaneously generates HTML mockups via a visualization companion — “do you mean this, this or this when you say that?” — compressing multiple rounds of user research into a single conversation.
The discussion grappled honestly with whether all this is just waterfall with extra steps. The reframe: spec‑driven development isn’t full‑system specification up front; it’s a planning baseline that prevents pure “vibe coding” while still leaving room for iteration. Newer models also need less steering for basic tasks (“build Spotify” yields a more credible result now than it did three months ago), which makes “generate first, refine with users” more viable than it was.
Two warnings closed the session:
- Prompts like “build Spotify” embed many assumptions. Asking the AI to list its assumptions — or to ask grooming questions back — is a practical way to surface and validate the hidden decisions before code is generated.
- POCs and production systems have different NFRs. Build a POC without a clear hypothesis and stakeholders will push to productionise it. Cue: “it took a week to generate the POC; why two months to build it properly?” The pressure spec‑driven development puts back down the pipeline lands hardest on business decision‑makers who must answer ambiguous product questions that previously got “answered” implicitly by whatever the developer built.
A sharp closing observation from a self‑described business analyst in the room: “software engineering has already hugely adapted to AI; business analysis is moving very slowly.”
7.6 Evals and quality
This was the shortest session but probably the most directly actionable. It started with a builder arguing for the point of a fully local assistant: not to compete with ChatGPT (you can’t), but to own the full stack — privacy, offline reliability, control. The concrete example — using the assistant in the bush with no internet to settle a disagreement with a partner — is worth carrying around as a counter‑example to the “but the frontier model is better” critique.
From there the session pivoted to evals and did a round of why do you care. The range of motivations captured the industry in miniature:
- Productionisation hardening (bank architect, compliance/ISO CTO, consulting lead) — “things look very good at the experimenting stage but take it to production, we have to harden it.”
- Closing the iteration loop on agents (AI‑native startup co‑founder) — “it’s really easy to build an agent, but it takes months to actually get it to do what you want.”
- Independence from a central AI team (Culture Amp PM) — “there’s only one AI team across ten product teams; I don’t want to wait two years to analyse our comments.”
- Domain‑specific benchmarking (financial ESG researcher) — “how do I trust LLM‑as‑judge for very specific criteria?”
- Capacity limits on human evaluation (Services Australia programmer) — “there’s going to be a lot more evaluation than people will have capacity to do.”
- Escape from vibes (AI startup PM) — “a lot of people just do vibes‑based testing.”
The practical contribution of the session was a detailed description of a working eval UI for non‑technical reviewers, built by a participant for clients:
- Load a full conversation trace.
- Colour‑code agent‑generated spans in green so the reviewer knows what to scrutinise.
- Four options: pass, fail, needs review, irrelevant.
- A bug‑ticket‑style “fixed and tested” workflow closes the loop with the prompt engineer.
- Deliberately avoid the word “confidence” — users read it as a meaningful percentage even when it isn’t.
- Weights & Biases was tried and found too technical for end‑user review.
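The review record behind a UI like this could be as small as the sketch below. The class and field names are assumptions, not the participant’s actual schema; the four verdicts and the "fixed and tested" flag come from the description above.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class Verdict(Enum):
    PASS = "pass"
    FAIL = "fail"
    NEEDS_REVIEW = "needs review"
    IRRELEVANT = "irrelevant"


@dataclass
class Review:
    trace_id: str
    verdict: Verdict
    comment: str = ""
    fixed_and_tested: bool = False  # set by the prompt engineer, bug-ticket style


def open_items(reviews: list[Review]) -> list[Review]:
    """Failures that nobody has marked as fixed and tested yet."""
    return [r for r in reviews
            if r.verdict is Verdict.FAIL and not r.fixed_and_tested]


def summary(reviews: list[Review]) -> dict[str, int]:
    """Counts per verdict; deliberately no 'confidence' percentage."""
    counts = Counter(r.verdict.value for r in reviews)
    return {v.value: counts.get(v.value, 0) for v in Verdict}
```

Reporting plain counts rather than a score follows the advice above about avoiding anything reviewers might read as a meaningful percentage.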
A key finding: the team expected to need LLM‑as‑judge and ended up not needing it, because feeding reviewer comments directly back into prompt engineering eliminated the bug before it recurred at scale. Useful caveat: this only works when you don’t have a recurring volume of the same bug.
Two more practical techniques landed:
- Synthetic scenario generation to bootstrap evals when no labeled data exists. The AI generates scenarios, plays both sides of the conversation, and hands a hundred fake transcripts to an assessor in a single day.
- Cost strategy for LLM‑as‑judge: generate with a stronger model (Sonnet‑class), judge with a cheaper one (Haiku‑class), within the limits of nuance.
A reframe worth internalising came from one attendee: evals produce second‑order data, analogous to click/funnel analytics in web apps. The value isn’t any single judgment; it’s the systematically captured annotations, traces and reviewer feedback that let you improve the system over time.
And the sharp caveat, from the learner in the room: “if I saw a code base that only has tests for the happy path, I would be like, are these tests even testing the code? What are the failure modes of the production system the evals catch, and what are the failure modes of bad evals?” No one had a confident answer. That’s the question.
The conceptual frame that emerged — but was left unresolved — was whether evals are tests (pre‑deployment gates, like unit tests) or observability (live performance measurement in production). The most honest answer: they are different things called by the same word, and you should be explicit about which one you mean in context. The recommended course for anyone getting serious about the topic was Hamel Husain and Shreya Shankar’s Maven course.
7.7 Closing demos and the disclosure project
The final session braided together three threads: a disclosure language for AI authorship, a DGX Spark demo, and a phone‑based local LLM demo.
The disclosure project — presented by an attendee with collaborators Michel Pleifer and Maya Soren — was the most original conceptual contribution of the day. It used the film Psycho as an opening rhetorical device: the presenter was ostensibly talking about Moonshine (a Linux video‑casting tool, like AirPlay for Linux) but the real subject was community and a parallel project called Purposeful / Per Leia about transparent AI authorship disclosure.
The core argument:
- The implicit social contract around AI‑generated artifacts is not working (citing Cory Doctorow, 2 March 2026).
- The disclosure should be non‑judgmental, in the spirit of Daniel Miessler’s AI Influence Level — “not about banning AI, but about letting you know there are peanuts in this thing.”
- Declarations come in two parts:
  - Outputs (what the author produced), declared per artifact type: words (AIL 0, human‑written), graphics (AIL 0), code (AIL 2), tests (AIL 3).
  - Inputs (what the maintainer will accept from contributors): PR descriptions, letters and commit messages must be human‑written; code and tests may involve AI at declared levels.
- The cultural analogy was striking: LLMs as engravings of the masters — accessible reproductions of cultural work, akin to Henri‑Paul Motte engravings hung in 1930s parlours. It’s a cultural practice. Some will like it, some won’t, and that’s fine.
- Metaphors mixed deliberately: food labels (“contains peanuts”), Creative Commons (“a promise, not an enforcement”), pronoun badges (a lightweight social signal). The central analogy was driving speed: everyone thinks their speed is right; a shared tag lets incompatible comfort levels still collaborate without ending relationships.
- Scope control was explicit: not an AI treaty, not legal enforcement, not an alignment mechanism, not a fight against slop, not a quality or effort judgment. Just an authorship disclosure vocabulary.
- Organisations and events (conferences, publishers, fan communities) can endorse or enforce via codes of conduct — a conference might disallow AI‑generated images except in talks about AI images; a fan‑fiction community might require handwritten contributions for certain IP.
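To make the two‑part declaration concrete, here is one way a project might record it as data. This is a sketch only: the dictionary layout and the `check_contribution` function are hypothetical, and the accepted levels for code and tests are assumed for illustration, since the talk said only that they "may involve AI at declared levels". In keeping with the "promise, not an enforcement" framing, the check compares declared values and does not try to detect AI use.

```python
# Levels follow the 0-5 AI Influence Level scale referenced in the talk.
# What the author declares about their own artifacts.
OUTPUTS = {"words": 0, "graphics": 0, "code": 2, "tests": 3}

# Highest AIL the maintainer accepts per contribution type (illustrative).
ACCEPTED_INPUTS = {
    "pr_description": 0,
    "commit_message": 0,
    "code": 2,
    "tests": 3,
}


def check_contribution(declared: dict[str, int]) -> list[str]:
    """Compare a contributor's declared levels with the accepted ones."""
    problems = []
    for kind, level in declared.items():
        if not 0 <= level <= 5:
            problems.append(f"{kind}: AIL {level} is outside 0-5")
        elif kind not in ACCEPTED_INPUTS:
            problems.append(f"{kind}: no policy declared")
        elif level > ACCEPTED_INPUTS[kind]:
            problems.append(
                f"{kind}: declared AIL {level}, "
                f"accepted up to {ACCEPTED_INPUTS[kind]}")
    return problems
```

An empty list means the declaration fits the maintainer’s stated comfort level; anything else is a prompt for a conversation, not a verdict on quality.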
Brad’s challenge — “peanut labels prevent death; why AI labels?” — got the clearest answer of the session: disclosure is not a safety mechanism, it is a communication mechanism. Without it, collaborations break over hidden assumptions. With it, people with incompatible preferences can still work together. Prior art was acknowledged: Datasheets for Datasets and Model Cards for Model Reporting used the food‑label metaphor first for ML transparency.
The DGX Spark demo — a compact Blackwell‑architecture Nvidia workstation, ARM64, CUDA 13, ~128 GB unified memory (with roughly 90 GB usable for GPU), running an Nvidia Ubuntu variant. Pricing was around AUD $7,500–12,000 depending on when you bought and how RAM prices had moved since. Dell, ASUS and a local reseller (“Multimedia Technology”) were named as vendors; the gold Founder’s Edition was noted. On‑site demos ran ComfyUI with Stable Diffusion 1.5 and Ollama serving Qwen3. The practical case for the machine: smaller 8B–40B parameter models, fine‑tuned for specific business tasks, run locally with no data leaving the box; NanoChat can be trained in about 8–12 hours on the device. Multiple units can be clustered. ARM64 ecosystem stability was named as still maturing.
The phone/iPad LLM demo (by Raul) was briefer and made the sovereignty argument at personal scale: running models directly on a device you control, in response to convenience‑driven privacy erosion from cloud‑only AI assistants.
The event closed with a paraphrase of the common “don’t be sad it’s over; be happy it happened” (attribution complicated, see §6.1), thanks to MLAI, Sam, Simon Shaw and Graham, and promotion of the upcoming AI Engineer conference (3–4 June) and Energy Hackathon (Fri 5 June) as the centrepieces of Melbourne’s AI Week.
Appendix A — A short glossary
A brief list, in case it is useful for readers new to some of the terms.
- AIL — AI Influence Level. Daniel Miessler’s 0–5 scale for declaring how much AI was involved in a piece of content.
- A2A — Agent‑to‑agent protocol.
- IWH — “I was here” — a pattern of handoff files agents leave in a shared directory to signal progress.
- NFR — Non‑functional requirement (e.g. performance, accessibility, availability).
- POC — Proof of concept.
- MCP — Model Context Protocol, a protocol for exposing tools and resources to LLM clients.
- MoE — Mixture of Experts.
- PII / PCI — Personally Identifiable Information / Payment Card Industry data.
- STT — Speech to text.
- UAT — User Acceptance Testing.
This report was compiled from session recordings made at Stone & Chalk, Melbourne, on 11 April 2026. All attendees are anonymised per Chatham House Rule; third parties they referenced are identified where useful. If any identified third party appears incorrectly here, or if any participant wishes a redaction, please contact the organisers.
