Web Directions Conferences (and more)

AI Agents Are Distributed Systems — Lovee Jain at AI Engineer Melbourne 2026

John Allsopp, 30th March, 2026

AI Agents Are Distributed Systems: Applying Distributed Systems Thinking to Agent Engineering

There's a curious blind spot in how many people approach AI agents: they think of them as monolithic systems. You give an agent a task, the agent processes it, the agent returns an answer. Simple cause and effect.

Real AI agents are nothing like that. They're distributed systems. They make API calls to retrieve information. They orchestrate multiple operations in sequence. They interact with other services and agents. They fail partially. They have consistency issues. They encounter network latency. They struggle with coordination.

Many of the problems that AI agents face are the same ones distributed systems engineers have been working on for decades. And the solutions developed for those problems often apply directly to agents.

This reframing changes how you think about building agents. It shifts the conversation from "is this agent smart enough?" to "is this system reliable enough?" It transforms agent engineering from a machine learning problem into a systems engineering problem.

Consider partial failures. In a single-process program, an operation usually either works or fails as a whole. In distributed systems, you have partial failures: a request is sent, but the remote API times out; one component responds but another doesn't; a multi-step task completes its first steps and then fails partway through. These aren't edge cases; they're normal operation.

AI agents face exactly the same problem. An agent might successfully retrieve information from one source but fail to retrieve from another. It might generate part of a response before hitting an error. It might make five correct decisions and then fail on the sixth. Handling this well requires distributed systems thinking: how do you structure the system so that partial failures are caught, understood, and handled gracefully?

The distributed systems answer is redundancy, retries, and clear failure semantics. You structure the system so that operations are retryable. You separate concerns so that failures in one component don't cascade through the entire system. You build observability that reveals partial failures before they affect users.
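As a rough illustration of what "retryable" can mean in practice, here is a minimal sketch of retrying with exponential backoff and jitter. The names `TransientError` and `call_with_retries` are hypothetical, not from the talk, and the sketch assumes the wrapped operation is idempotent, so running it more than once is safe.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for failures that are safe to retry (e.g. a timeout)."""


def call_with_retries(operation, max_attempts=3, base_delay=0.1, sleep=time.sleep):
    """Run `operation`, retrying on TransientError with exponential backoff and jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the failure to the caller
            # The base delay doubles each attempt; jitter spreads out simultaneous retries.
            delay = base_delay * (2 ** (attempt - 1))
            sleep(delay + random.uniform(0, delay))
```

Only errors marked as transient are retried. Anything else propagates immediately, which keeps the failure semantics clear: a retry is a deliberate decision, not a blanket catch-all.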

These patterns apply directly to agents. An agent should be able to retry operations. It should isolate failures in one capability from cascading through the entire agent. It should have observability that distinguishes between "the agent completed successfully," "the agent partially failed," and "the agent completely failed."
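One way to make those three outcomes explicit is to run each capability in isolation and report what succeeded alongside what failed. The sketch below is an assumption about how that might look; `Outcome`, `RunReport`, and `run_steps` are illustrative names, not part of any agent framework.

```python
from dataclasses import dataclass, field
from enum import Enum


class Outcome(Enum):
    SUCCESS = "success"
    PARTIAL = "partial_failure"
    FAILURE = "failure"


@dataclass
class RunReport:
    results: dict = field(default_factory=dict)  # step name -> return value
    errors: dict = field(default_factory=dict)   # step name -> error description

    @property
    def outcome(self):
        if not self.errors:
            return Outcome.SUCCESS
        # Some steps produced results despite errors: a partial failure.
        return Outcome.PARTIAL if self.results else Outcome.FAILURE


def run_steps(steps):
    """Run each named step in isolation so one failure does not abort the rest."""
    report = RunReport()
    for name, step in steps.items():
        try:
            report.results[name] = step()
        except Exception as exc:
            report.errors[name] = repr(exc)
    return report
```

A caller can then branch on `report.outcome` and decide, for example, whether a partial result is still worth returning to the user.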

Consistency is another distributed systems problem that applies directly to agents. In distributed systems, you often can't guarantee that all parts of the system see the same data at the same time. You have eventual consistency: if updates stop, all parts of the system converge on the same data, but at any given moment they may disagree. This creates problems and requires careful design.

AI agents face the same challenge. An agent might retrieve information about the state of a system, make decisions based on that information, and then execute those decisions — only to discover that the state has changed since it last queried. It's working with stale data. How do you handle that?

Distributed systems have developed patterns: versioning, timestamps, optimistic locking, conflict resolution. These patterns can be applied to agents. You version the data the agent is working with. You include timestamps in decisions. You build systems that detect when the agent's assumptions about state are violated.
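A minimal sketch of optimistic locking, assuming a hypothetical in-memory `Store` standing in for whatever system the agent reads from and writes to: the agent records the version it read, and the write is rejected if the version has moved on in the meantime.

```python
from dataclasses import dataclass


class StaleStateError(Exception):
    """Raised when the state changed after the agent read it."""


@dataclass
class Versioned:
    value: dict
    version: int


class Store:
    """Hypothetical in-memory store standing in for a real system of record."""

    def __init__(self, value):
        self._current = Versioned(dict(value), 1)

    def read(self):
        # Return a copy so callers cannot mutate the stored state directly.
        return Versioned(dict(self._current.value), self._current.version)

    def write(self, new_value, expected_version):
        # Compare-and-set: apply the write only if nobody else wrote in between.
        if expected_version != self._current.version:
            raise StaleStateError(
                f"expected v{expected_version}, found v{self._current.version}"
            )
        self._current = Versioned(dict(new_value), expected_version + 1)
        return self._current.version
```

On a `StaleStateError`, the agent can re-read the state and reconsider its decision rather than acting on stale data.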

Coordination is another classic distributed systems problem. How do you get multiple components to work together reliably when they're not always in sync, when failures can happen, when parts might be slow or unresponsive? This is exactly what happens when an agent needs to coordinate multiple API calls or orchestrate multiple operations.

The distributed systems solutions (consensus algorithms, timeouts, circuit breakers, coordination protocols) apply to agents. Most agents don't need Byzantine fault tolerance, but they do need basic coordination strategies that keep the system functioning when things don't go perfectly.
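To make one of these concrete, here is a simplified circuit breaker sketch. The class name, thresholds, and injectable `clock` are assumptions for illustration. After a run of consecutive failures it stops calling the dependency for a cool-down period, then lets a single trial call through.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is skipped because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_after=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after  # cool-down in seconds
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, operation):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("circuit open; skipping call")
            # Cool-down elapsed: fall through and allow one trial call (half-open).
        try:
            result = operation()
        except Exception:
            self.failures += 1
            # A failed trial call, or too many consecutive failures, (re)opens the circuit.
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        # Success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

Wrapping a flaky tool call in `breaker.call(...)` lets the agent fail fast and fall back to another strategy instead of repeatedly waiting on a dependency that is known to be down.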

Observability is where distributed systems thinking is most immediately applicable. Distributed systems are impossible to debug without good observability. You can't have a human following the execution path; you need structured logs, metrics, traces, and dashboards that reveal what's happening across the system.

Agents need exactly the same observability infrastructure. You need to see what decisions the agent made, what information it used, what operations it performed, where it failed. You need to trace the execution path. You need to understand the agent's internal state. Without this observability, you're flying blind.
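As a small sketch of what tracing an agent's execution path might involve, the hypothetical `Tracer` below emits one structured JSON event per step, all sharing a trace ID. A production setup would more likely use an established tracing library; this only illustrates the shape of the data.

```python
import json
import time
import uuid
from contextlib import contextmanager


class Tracer:
    """Hypothetical tracer: one JSON event per agent step, linked by trace_id."""

    def __init__(self, emit=print):
        self.trace_id = uuid.uuid4().hex
        self.emit = emit  # where events go; defaults to stdout

    @contextmanager
    def step(self, name, **attrs):
        start = time.monotonic()
        event = {"trace_id": self.trace_id, "step": name, **attrs}
        try:
            yield event  # the body may add fields, e.g. event["hits"] = 2
            event["status"] = "ok"
        except Exception as exc:
            event["status"] = "error"
            event["error"] = repr(exc)
            raise  # record the failure but do not swallow it
        finally:
            event["duration_ms"] = round((time.monotonic() - start) * 1000, 2)
            self.emit(json.dumps(event))
```

Used as `with tracer.step("search", tool="web") as event: ...`, each step records what ran, with which inputs, how long it took, and whether it failed, which is the minimum needed to reconstruct an execution path after the fact.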

The insight is that agent engineering isn't only a machine learning problem. It's also a systems engineering problem. The machine learning part, making the agent intelligent, is important, but it's only part of the challenge. The systems engineering part is equally important: making the agent reliable, observable, fault-tolerant, and able to operate in the real world, where things fail unpredictably.

Fortunately, decades of distributed systems research and practice have developed solutions to these problems. Teams building agents don't need to solve these problems from scratch. They can apply proven distributed systems patterns: redundancy, retries, circuit breakers, timeouts, observability, eventual consistency, graceful degradation.

This reframing is powerful because it directs attention to what actually matters. An agent that's somewhat less intelligent but highly reliable and observable is more valuable in production than an agent that's very intelligent but fragile and opaque. Distributed systems thinking helps build the former.

Lovee Jain, Senior Software Engineer, Google Developer Expert, and AWS Community Builder, is presenting this talk at AI Engineer Melbourne 2026 on June 3-4.
