The Machine-Testable Future: Why AI’s Transformative Impact May Be More Narrow Than We Think

A provocation: The domains where large language models will deliver truly transformative breakthroughs in the near term aren’t determined by human need or market size—they’re determined by whether we can automate the evaluation of their output.

There’s a pattern emerging in how large language models are being adopted across different domains. The conventional wisdom suggests that AI will transform knowledge work broadly: legal analysis, marketing, business operations, and software development alike. But I’m increasingly convinced that in the short term, we’re going to see a much more uneven landscape of transformation.

The key differentiator? Whether the output of a model can be machine-tested.

Code’s Decisive Advantage

Let’s start with what’s working. Software development has achieved genuine product-market fit with AI assistance across the entire spectrum of tools—from GitHub Copilot’s “spicy autocomplete” through conversational interfaces to sophisticated agentic systems like Claude Code and Cursor. We’re still early, but the adoption is real, widespread, and economically productive.

Why has this happened so decisively for code? Because we can test, because we can machine-test, the output of these systems. Not perfectly, not completely, but extensively and automatically. We can lint it. We can compile it. We can run it. We can execute test suites. We can deploy it to staging environments and observe its behavior.

This machine-testability creates two profound advantages. First, in production use, we get rapid feedback loops. When an AI generates code, we can quickly determine whether it works—often within seconds or less. Agentic systems can even self-correct, iterating on their output based on test results without human intervention. Increasingly, systems can automatically run tests, execute linting, trigger compilation, all without waiting for a human to notice something’s wrong.
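
To make that loop concrete, here’s a minimal sketch of what automated checking and self-correction might look like in Python. It assumes ruff and pytest are available, and `generate_code` is a hypothetical stand-in for a model call; it illustrates the pattern rather than any particular tool’s implementation.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path


def machine_check(source: str, test_file: Path) -> tuple[bool, str]:
    """Run generated Python through automated checks: parse, lint, test.

    Returns (passed, feedback) so an agent can iterate on failures.
    Assumes ruff and pytest are installed; any linter/test runner would do.
    """
    workdir = Path(tempfile.mkdtemp())
    module = workdir / "candidate.py"
    module.write_text(source)
    shutil.copy(test_file, workdir / "test_candidate.py")

    # 1. Does it even parse? (Python's closest analogue to compiling.)
    try:
        compile(source, str(module), "exec")
    except SyntaxError as err:
        return False, f"syntax error: {err}"

    # 2. Lint it.
    lint = subprocess.run(["ruff", "check", str(module)],
                          capture_output=True, text=True)
    if lint.returncode != 0:
        return False, f"lint failures:\n{lint.stdout}"

    # 3. Run the test suite against it.
    tests = subprocess.run(["pytest", "--quiet"], cwd=workdir,
                           capture_output=True, text=True)
    if tests.returncode != 0:
        return False, f"test failures:\n{tests.stdout}"

    return True, "all checks passed"


def self_correct(prompt: str, test_file: Path, generate_code, max_attempts: int = 3):
    """Agentic loop: generate, machine-check, feed the errors back, retry.

    `generate_code(prompt, feedback)` is a placeholder for a model call,
    not a real library function.
    """
    feedback = ""
    for _ in range(max_attempts):
        candidate = generate_code(prompt, feedback)
        ok, feedback = machine_check(candidate, test_file)
        if ok:
            return candidate
    return None
```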

Second, and perhaps more importantly, machine-testability transforms the training process. RLHF, post-training, and fine-tuning all become dramatically more efficient when you can automatically evaluate whether the output is correct. You can generate enormous volumes of training data. You can iterate quickly. You can do it cheaply. The feedback loop that makes these models better happens at machine speed and machine scale.
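
Here’s a rough sketch of why that matters on the training side: rejection sampling against automated tests, where candidates that pass become training examples and no human grader is in the loop. `sample_solutions` and `passes_tests` are hypothetical placeholders, not real library calls.

```python
from dataclasses import dataclass


@dataclass
class Problem:
    prompt: str     # task description given to the model
    test_code: str  # automated tests that define "correct" for this task


def build_training_set(problems, sample_solutions, passes_tests, k: int = 16):
    """Rejection sampling against automated tests, with no human grader.

    `sample_solutions(prompt, n)` and `passes_tests(candidate, test_code)`
    stand in for a model call and a sandboxed test runner respectively.
    """
    accepted = []
    for problem in problems:
        for candidate in sample_solutions(problem.prompt, n=k):
            # The evaluation step is a machine test, so it is cheap and
            # fast enough to run over enormous numbers of candidates.
            if passes_tests(candidate, problem.test_code):
                accepted.append({"prompt": problem.prompt,
                                 "completion": candidate})
    return accepted
```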

The Legal Contrast

Now consider law, another area where there is a lot of enthusiasm for adopting these technologies. People often use the metaphor that law is code for humans; Lawrence Lessig and others have explored the idea that legal systems are software for human societies. It’s an appealing analogy, but it breaks down in a critical way.

Law isn’t like software in any literal sense. And that creates a fundamental problem for AI systems.

When a language model produces legal advice or legal opinions, there’s no compiler. There’s no linter. There’s no way to run it and see where it fails. We need humans, expert humans, to evaluate the output. In [a recent interview with Nilay Patel of The Verge](https://www.theverge.com/podcast/807136/lexisnexis-ceo-sean-fitzpatick-ai-lawyer-legal-chatgpt-interview), Sean Fitzpatrick, CEO of LexisNexis, observed that the company employs hundreds of lawyers specifically to work with its AI systems, checking outputs and ensuring correctness.

This creates two critical bottlenecks. First, the evaluation is expensive. Lawyers are very well-paid professionals. If software had to be tested the same way, with expert humans reviewing every line, the economics would be completely different. We probably don’t even have enough sufficiently skilled software engineers to do that kind of testing at scale.

Second, not only are humans much more expensive, they are orders of magnitude slower than machines at pretty much anything machines can do at all. The iteration cycles for both training and production use become constrained by human review speed. You can’t generate massive training datasets and evaluate them automatically. Every improvement cycle requires expensive human time.

Marketing’s Mixed Signals

Marketing automation represents another interesting case study. In early 2023, there was a great deal of excitement about how these tools could automate much of the marketing process. That excitement has notably cooled. Meanwhile, coding tools like Cursor, Windsurf, and Claude Code have grown to hundreds of millions, or even billions, of dollars in annual recurring revenue within a year or two. There’s sustained momentum there that we’re not seeing in marketing automation.

Why? Because marketing output suffers from the same fundamental constraint as legal output: humans need to evaluate it. Is this copy compelling? Does it match our brand voice? Will it resonate with our audience? These aren’t questions a machine can answer definitively. Every piece of marketing content that an AI produces needs a human to assess whether it’s actually good, whether it’s acceptable, whether it achieves its purpose.

The automation potential is far more limited than with code. You can’t just generate thousands of marketing campaigns, run them through automated tests, and train your models on which ones “compiled” successfully.

Marketing has another confounding factor: it involves taste. Software can be verified for correctness in various ways, and factual correctness is a key aspect of legal advice and decisions too, but marketing is much more about what will appeal to people’s tastes. Testing the quality of a model’s output in these use cases is even more challenging.

The Knowledge Work Mirage

Then there’s the whole class of generic knowledge worker tools being integrated into Microsoft 365, Google Workspace, and similar platforms. I’m skeptical we’re seeing genuine product-market fit here. The signals suggest something different—perhaps these are sustaining innovations, table stakes features that major productivity platforms need to include to remain competitive. But transformative? I’m not convinced.

There might be one exception: tasks where the output doesn’t really matter that much. Forms that need filling, boxes that need ticking, bureaucratic requirements that nobody reads carefully. If the purpose is essentially performative rather than substantive, then AI-generated output might be perfectly adequate. But that’s not transformation—that’s automation of make-work.

Playing a Role

There’s one big category of use we have yet to cover, one that dwarfs legal and marketing use and is at least as big as, if not bigger than, the code generation use case: role-playing.

I must admit this tests my hypothesis, because role-playing is very much about taste, and evaluating its output requires humans at scale. Perhaps it’s the compelling nature of the use case itself that explains why people have engaged with these models in this kind of scenario for years, even before ChatGPT.

My Thesis: The Machine-Testable Threshold

In the near term, domains where AI output can be largely machine-tested are far more likely to see genuinely useful, transformative breakthroughs. Domains that require human evaluation for quality and correctness will see slower progress, more limited adoption, and less dramatic transformation.

This isn’t about what uses we might need or what would be valuable. Legal services are expensive and often inaccessible—there’s enormous potential value in making them more available through AI. Marketing is a massive industry where better tools could unlock tremendous productivity gains. But value isn’t the limiting factor. The limiting factor is whether we can create the feedback loops necessary to make these systems sufficiently good to deliver that value.

With code, we can. We’ve built decades of infrastructure for automated testing, compilation, linting, continuous integration. When an AI generates code, we have robust, fast, cheap ways to determine if it works. That infrastructure enables the rapid iteration cycles—both in production and in training—that make AI systems better.

Without that infrastructure, progress will be slower. Not impossible—we’re certainly seeing investment and innovation in legal AI, marketing AI, and other domains. But the pace of improvement will be constrained by the speed and cost of human evaluation. These systems will get better, but not at the same exponential rate we’re seeing with code generation.

What This Means for Innovation

What excites us about new technologies is not just how they might make existing work more efficient, though productivity gains certainly pay the bills and drive adoption. We get excited because they enable entirely new things that weren’t previously possible. Perhaps because they bend cost curves so much. Perhaps because they massively decrease the time it takes to achieve an outcome.

That’s where I think the machine-testability threshold becomes most significant. In domains where we can automate evaluation, we can push the boundaries of what’s possible much more quickly. We can experiment more freely. We can iterate more rapidly. We can discover novel applications and approaches.

In domains requiring human evaluation, innovation will be more incremental. These tools will make experts more productive—lawyers can draft documents faster, marketers can generate more copy variations. But the revolutionary applications, the things we can’t even imagine yet? Those are more likely to emerge in machine-testable domains.

This is my provocation: We should expect the landscape of AI transformation to be much more uneven than the hype suggests. Look for machine-testability as a leading indicator of where real breakthroughs will happen first. The domains where humans must evaluate every output will see progress, certainly, but it will be measured progress constrained by the fundamental economics and speed limits of human judgment.

The revolution is happening. But it’s not happening everywhere all at once, and it’s not happening simply because the ideas are compelling. It’s happening where the machines can test themselves.
