
Agentic Self-Healing in Production — Jack McNicol at AI Engineer Melbourne 2026


It's 2:47 AM on a Tuesday. Your data pipeline fails. A sensor stops reporting, a database connection drops, an API your system depends on starts returning errors. In the old world, this is when an engineer gets paged. They wake up, SSH into the server, run some diagnostics, and fix the problem. Maybe it takes ten minutes, maybe it takes an hour, maybe they can't figure it out and escalate.

In the new world, your system fixes itself. An AI agent monitors the pipeline, detects the failure, diagnoses the problem, and repairs it. By the time the engineer wakes up, the system is already healthy again. The engineer reviews what happened, learns if there's a pattern that needs architectural change, and goes back to bed.

This isn't science fiction. It's becoming operational reality for systems designed to support it. But it's not magic, and it requires careful thinking about both capability and guardrails.

The Self-Healing Architecture

Self-healing sounds simple: detect problems, fix them. But the gap between simple and safe is enormous. An agent that has the capability to repair systems also has the capability to make things worse. How do you give an autonomous system enough agency to actually fix things while maintaining enough constraints that it doesn't accidentally create new disasters?

The architecture starts with visibility. You need comprehensive monitoring that can tell you not just that something failed, but what failed and why. This isn't passive log collection. It's structured instrumentation that an agent can reason about. When the pipeline fails, the agent needs to be able to ask: what changed? What's the state of each component? What are the error messages? What are the patterns in the failure?

This information goes into a diagnostic system—essentially a reasoner that can take incomplete information and generate hypotheses about what's wrong. The agent looks at the symptoms and works backward to root cause. Is it a resource problem (out of memory, disk full)? Is it a connectivity problem (service unreachable, timeout)? Is it a configuration problem (wrong credentials, wrong parameters)? Is it a data problem (malformed input, unexpected schema)?
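
Continuing the hypothetical snapshot sketch above, the diagnostic step might map symptoms to ranked hypotheses in those broad categories. The heuristics below are purely illustrative; a real reasoner would be far richer:

```python
# Illustrative only: symptoms in, ranked hypotheses out.
from enum import Enum
from dataclasses import dataclass

class CauseCategory(Enum):
    RESOURCE = "resource"            # out of memory, disk full
    CONNECTIVITY = "connectivity"    # unreachable service, timeout
    CONFIGURATION = "configuration"  # bad credentials, wrong parameters
    DATA = "data"                    # malformed input, unexpected schema

@dataclass
class Hypothesis:
    category: CauseCategory
    description: str
    confidence: float  # the agent's own estimate, later used for escalation decisions

def diagnose(snapshot: "PipelineSnapshot") -> list[Hypothesis]:
    """A rough heuristic pass; a production system would combine rules with an LLM reasoner."""
    hypotheses: list[Hypothesis] = []
    for component in snapshot.failing_components():
        error = (component.last_error or "").lower()
        if "timeout" in error or "connection refused" in error:
            hypotheses.append(Hypothesis(CauseCategory.CONNECTIVITY,
                                         f"{component.name} cannot reach a dependency", 0.6))
        if component.metrics.get("disk_used_pct", 0) > 95:
            hypotheses.append(Hypothesis(CauseCategory.RESOURCE,
                                         f"{component.name} is nearly out of disk space", 0.8))
    return sorted(hypotheses, key=lambda h: h.confidence, reverse=True)
```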

Once you've narrowed down the likely cause, you can propose fixes. But this is where discipline matters: the agent doesn't just implement the fix. It proposes the fix, checks that the proposal is safe, applies it, and then verifies that the system actually recovered. If verification fails, it rolls back and tries a different approach.
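
A hedged sketch of that propose, check, apply, verify, roll back loop might look like the following; the callables are placeholders, not a real framework's API:

```python
# Illustrative only: the skeleton of a bounded repair loop.
from typing import Callable

def attempt_repair(
    hypotheses: list,
    propose_fix: Callable,       # hypothesis -> proposed action
    is_safe: Callable,           # action -> bool, the guardrail check
    apply_fix: Callable,         # executes the action
    verify_recovery: Callable,   # () -> bool, re-runs the health checks
    rollback: Callable,          # undoes the last applied action
    escalate: Callable,          # hands everything tried to a human
    max_attempts: int = 3,
) -> bool:
    """Propose, check, apply, verify; roll back and try again on failure."""
    attempted = []
    for hypothesis in hypotheses[:max_attempts]:
        fix = propose_fix(hypothesis)
        if not is_safe(fix):          # never execute an action outside the allowed space
            continue
        apply_fix(fix)
        attempted.append(fix)
        if verify_recovery():         # did the pipeline actually come back?
            return True
        rollback()                    # undo, then try the next hypothesis
    escalate(attempted)               # nothing worked: page an engineer with full context
    return False
```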

Safe Autonomy Through Patterns and Guardrails

The guardrails matter as much as the capability. You can't let an agent make arbitrary changes to production systems. But you also can't restrict it so much that it can't actually fix anything.

The pattern that works is constrained action spaces. The agent can restart services, but only ones that have been marked safe to restart. It can adjust parameters, but only within a defined range. It can query databases, but not modify them. It can add resources to a system, but not remove them. The constraints are configured explicitly and can be tightened or loosened based on experience and risk tolerance.
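One illustrative way to express such a constrained action space is a declarative policy that the safety check consults before anything runs. The action types, service names, and bounds here are invented for the example:

```python
# Illustrative only: an explicit, auditable description of what the agent may do.
ACTION_POLICY = {
    "restart_service": {
        "allowed_targets": ["ingest-worker", "metrics-exporter"],  # marked safe to restart
    },
    "set_parameter": {
        "batch_size": {"min": 10, "max": 500},          # adjustable only within this range
        "retry_delay_seconds": {"min": 1, "max": 60},
    },
    "query_database": {"read_only": True},              # queries yes, writes no
    "scale_workers": {"direction": "up_only", "max_total": 20},  # add capacity, never remove it
}

def is_safe(action: dict) -> bool:
    """Reject anything not explicitly permitted by the policy."""
    policy = ACTION_POLICY.get(action["type"])
    if policy is None:
        return False
    if action["type"] == "restart_service":
        return action["target"] in policy["allowed_targets"]
    if action["type"] == "set_parameter":
        bounds = policy.get(action["name"])
        return bounds is not None and bounds["min"] <= action["value"] <= bounds["max"]
    return True  # remaining action types carry their constraints in how they're executed
```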

Observability is crucial. Every action the agent takes gets logged with high fidelity. What did it try? Why did it try it? What was the effect? This creates an audit trail and also enables learning. Over time, you can analyze what works and what doesn't, what patterns of failures the agent handles well and which ones it struggles with.
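
The audit trail itself can be boringly simple. Here is a sketch of the kind of structured, machine-readable record each action might emit; the field names are illustrative:

```python
# Illustrative only: one high-fidelity record per agent action.
import json
from datetime import datetime, timezone

def log_action(action: dict, hypothesis: str, outcome: str, logger) -> None:
    """Record what the agent did, why it did it, and what the effect was."""
    logger.info(json.dumps({
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "action": action,          # what it tried
        "hypothesis": hypothesis,  # why it tried it
        "outcome": outcome,        # the effect: "recovered", "no_change", "rolled_back"
    }))
```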

Escalation policies create a safety boundary. Some problems the agent is authorized to fix autonomously. Others need approval before action—the agent diagnoses the problem and proposes a fix, but an engineer has to approve before the agent implements it. For the most sensitive systems, the agent might be able to alert and gather diagnostics but not make any changes at all.
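
A tiered policy like that can be expressed quite plainly. The tiers below mirror the ones just described, and the system names are placeholders:

```python
# Illustrative only: per-system autonomy tiers.
from enum import Enum

class AutonomyLevel(Enum):
    AUTO_FIX = "auto_fix"          # agent may repair without asking
    PROPOSE_ONLY = "propose_only"  # agent diagnoses and proposes; a human approves
    OBSERVE_ONLY = "observe_only"  # agent alerts and gathers diagnostics, changes nothing

ESCALATION_POLICY = {
    "metrics-exporter": AutonomyLevel.AUTO_FIX,
    "ingest-worker": AutonomyLevel.PROPOSE_ONLY,
    "billing-db": AutonomyLevel.OBSERVE_ONLY,
}

def may_act(system: str, approved_by_human: bool = False) -> bool:
    """Decide whether the agent is allowed to apply a fix to this system right now."""
    level = ESCALATION_POLICY.get(system, AutonomyLevel.OBSERVE_ONLY)  # default to the safest tier
    if level is AutonomyLevel.AUTO_FIX:
        return True
    if level is AutonomyLevel.PROPOSE_ONLY:
        return approved_by_human
    return False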

The agent itself should be able to ask for help. If it encounters a situation that's outside its training or confidence, it should escalate rather than guess. A system that stops and says "I don't know what to do, here's what I've tried, here's what I'm confused about" is more trustworthy than one that keeps trying until something breaks.
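
In code, that "ask for help" behaviour might be nothing more than a confidence floor; the threshold below is arbitrary and purely illustrative:

```python
# Illustrative only: escalate with context instead of guessing.
CONFIDENCE_FLOOR = 0.5

def next_step(best_confidence: float, attempts: list[str], open_questions: list[str]) -> dict:
    """Below the floor, stop and hand over everything tried so far."""
    if best_confidence < CONFIDENCE_FLOOR:
        return {
            "action": "escalate",
            "tried": attempts,                  # "here's what I've tried"
            "uncertain_about": open_questions,  # "here's what I'm confused about"
        }
    return {"action": "continue"}
```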

Learning from Failure

Jack McNicol has built systems that operate at scale, maintaining large platforms where failures are regular and the cost of downtime is high. The perspective he brings isn't theoretical. It's grounded in what actually works when real failures happen at 2 AM and you need systems to recover autonomously while preserving safety and predictability.

The lessons learned include some hard constraints: agents are better at fixing problems they've been designed to handle than at improvising. They need clear categories of failure and clear response patterns. They're good at following procedures exactly, but not at adapting those procedures to novel situations. They should be given enough autonomy to be useful, but not so much that they become unpredictable.

There's also a human factor. Engineers need to trust these systems enough to let them operate autonomously, but not so much that they stop paying attention. The best systems create a partnership where the agent handles routine problems and the human understands the system deeply enough to make judgment calls when something novel happens.

Implementation patterns emerge from this experience: canary actions (test a fix on a small subset before applying it broadly), staged rollout (apply to less critical systems before critical ones), observability-first design (instrument heavily before automating), and clear success criteria (define what "fixed" actually looks like before the agent tries to fix anything).
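
As an illustration of the canary and success-criteria patterns working together, a sketch (with invented names) might look like this:

```python
# Illustrative only: canary first, explicit success check, then staged rollout.
from typing import Callable, Sequence

def canary_apply(
    targets: Sequence[str],
    apply_fix: Callable[[str], None],
    meets_success_criteria: Callable[[str], bool],  # defined *before* the fix is attempted
    canary_fraction: float = 0.1,
) -> bool:
    canary_count = max(1, int(len(targets) * canary_fraction))
    canaries, remainder = targets[:canary_count], targets[canary_count:]

    for target in canaries:
        apply_fix(target)
    if not all(meets_success_criteria(t) for t in canaries):
        return False                  # canary failed: stop before touching anything else

    for target in remainder:          # staged rollout to the rest
        apply_fix(target)
    return all(meets_success_criteria(t) for t in targets)
```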

Building for Real Operations

Jack McNicol builds digital products and mentors teams, and he does it by blending thoughtful architecture with hands-on delivery. He understands that self-healing systems aren't just about AI capability. They're about designing human-AI collaboration so that both are better together than either could be apart.

The session will cover the patterns that make self-healing possible: how to structure monitoring and diagnostics so agents can reason about failures, how to design safe action spaces that give agents enough agency to be useful, how to set guardrails that prevent disasters, real-world lessons about what works and what doesn't, and the organizational changes needed to operate with agentic systems in production.

Jack will be discussing these patterns and architectures at AI Engineer Melbourne 2026, June 3-4.
