Systems of Judgment
In AI, the scarcest resource is judgment.
The workflows we’re automating with AI today were never designed to be optimal. They were designed around the limits of humans doing the work. A software team has a product manager, a designer, and an engineer not because the work inherently requires three people, but because no single person could hold all of that context simultaneously. The Apollo program employed rooms full of human computers because no single mind could run the calculations. The division of labor in knowledge work is, to a significant degree, a workaround for cognitive constraints.
Every major tool shift makes old processes faster, but also tends to unlock new processes. Spreadsheets didn’t just help ledger clerks move quicker; they made whole new kinds of analysis possible. The internet didn’t simply speed up the mail; it changed how people found each other, worked together, and built things. When the worker’s capabilities change, the work itself changes too.
AI should follow the same pattern. But most of what’s being built today still asks AI to do the human’s job the way the human used to do it. Because we don’t fully trust it yet, we demand granular observability and benchmarks for performance. And the easiest benchmark is the old one: can it produce the same work a human produced, by following the same process a human followed?
That’s why so much AI lives as copilots inside existing interfaces. It makes the current process faster, but leaves the process itself intact. The bigger opportunity is to give agents objectives instead of instructions, and let them reshape software around how agents actually work. The real question isn’t “How many tokens can I burn in my work?” It’s “How must my work look different now that intelligence is abundant and judgment is the scarce resource?” If we can answer that, the work itself will actually change.
Judgment sold separately
Intelligence is becoming abundant; judgment isn’t. AI can ingest every data point, surface every pattern, run every scenario. What it can’t do is decide what matters in contexts that are ambiguous, high-stakes, and novel. Intelligence can be purchased as a utility: you can buy reasoning by the token from a half-dozen providers, and the price drops every quarter. Judgment can’t be purchased. It has to be built, captured, and compounded over time.
This is good news for application builders, because it clarifies where value actually lives. It’s not in the intelligence. It’s in how you apply that intelligence to scale the judgment of the person using it. Intelligence tells you what’s in the data. Judgment decides what to do about it — what to prioritize, what tradeoff to accept, what outcome to optimize for. Intelligence is general. Judgment is specific to a domain, a context, a set of goals defined by a human.
This reframes the role of software entirely. The job of the application isn’t to present data for a human to interpret (the old workflow). It’s to do the interpretation, make a recommendation, and then capture what the human does with that recommendation — accept it, override it, modify it — and tie that decision to the eventual outcome. A system of judgment is software that treats the human decision as the most valuable data point in the entire pipeline, and builds a learning loop around it. And judgment compounds. Every decision made inside that system becomes training data for the next recommendation.
What a system of judgment actually does
A system of judgment does four things, in a loop:
1. It ingests domain context and makes a recommendation.
2. It captures the human’s actual decision: did they accept, modify, or reject the recommendation, and why?
3. It observes the outcome.
4. It uses that complete cycle to make the next recommendation better.
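One way to make the loop concrete is a sketch like the following. Everything in it is assumed for illustration (the JudgmentSystem class, the Cycle record, the stubbed recommendation and learning steps); it shows the shape of the loop, not a prescribed implementation.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Cycle:
    context: dict                    # domain context the system ingested
    recommendation: str              # what the system suggested
    decision: Optional[str] = None   # what the human actually did
    rationale: Optional[str] = None  # why they accepted, modified, or rejected
    outcome: Optional[float] = None  # observed result, e.g. a success score

class JudgmentSystem:
    def __init__(self) -> None:
        self.history: list[Cycle] = []  # the decision history is the asset
        self.override_rate = 0.0

    def recommend(self, context: dict) -> Cycle:
        # Stage 1: ingest domain context, make a recommendation.
        # A real system would call a model here; this sketch stubs it.
        cycle = Cycle(context=context, recommendation="approve")
        self.history.append(cycle)
        return cycle

    def capture_decision(self, cycle: Cycle, decision: str, rationale: str) -> None:
        # Stage 2: record the human decision, the most valuable signal in the pipeline.
        cycle.decision, cycle.rationale = decision, rationale

    def observe_outcome(self, cycle: Cycle, outcome: float) -> None:
        # Stage 3: tie the decision to the eventual outcome.
        cycle.outcome = outcome
        self._learn()

    def _learn(self) -> None:
        # Stage 4: feed complete cycles back into the next recommendation.
        # A real system might fine-tune or re-rank here; this sketch just
        # tracks how often experts override the recommendation.
        complete = [c for c in self.history if c.outcome is not None]
        overrides = [c for c in complete if c.decision != c.recommendation]
        self.override_rate = len(overrides) / len(complete)

system = JudgmentSystem()
cycle = system.recommend({"account": "ACME", "risk_score": 0.7})
system.capture_decision(cycle, decision="modify", rationale="limit too high")
system.observe_outcome(cycle, outcome=0.92)
```

The detail worth noticing is that the recommendation, the human decision with its rationale, and the outcome all land in the same record, so every complete cycle is immediately usable as a training example.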
The critical design insight is that the human decision is the most valuable signal in the entire pipeline. Not the input data. Not the model’s interpretation. The moment where an expert, with full context, chooses to override or refine the system’s suggestion — that’s where institutional knowledge gets created. And if you capture it, you can learn from it.
The shift isn’t “AI does the old job faster.” It’s “the work is redesigned so humans do the part that only humans can do — decide — and the system captures that decision and learns from it.” In this world, the most important design decision is which calls humans need to make and which the AI can handle on its own.
Product has always been the game
None of this is new in principle. Machine learning has always worked this way. The difference between a good ML system and a bad one has never been the model; it has been the feedback loop. And the quality of a feedback loop has always been a product design problem, not a model architecture problem.
You need people to actually use the system. You need their usage to generate observable outcomes. You need those outcomes to feed back into the model. The tightness of that loop — how clean the signal is, how naturally it integrates into the workflow — is what determines whether the system compounds or stagnates.
This is why product matters more in the AI era, not less. The best systems won’t feel like “AI tools.” They’ll feel like better ways to work. And the judgment data will accumulate invisibly, because the product was designed so that doing your job is training the system.
Building a vertical judgment system
When I learn about an AI company, there’s one question that separates a system of judgment from a smarter dashboard: what happens after the user acts on the recommendation?
If the answer is “nothing” — if the recommendation is delivered and the system moves on — it’s a dashboard with a language interface. The intelligence is there, but it’s not learning. There’s no loop.
If the answer is “we capture the decision, track the outcome, and use both to improve” — now you have something that compounds. You have a system where usage makes the product better in ways that can’t be replicated by a competitor starting from zero.
These systems will be vertical, because judgment is domain-specific. What counts as a good outcome in clinical care is different from commercial underwriting is different from legal negotiation. The feedback loops are different, the outcome horizons are different, the regulatory constraints are different. You can’t build a general-purpose judgment system any more than you can build a general-purpose expert.

This is why durable value lives in the application layer, not the model layer. Foundation models provide the reasoning. The application provides the domain-specific loop — recommend, capture the human decision, observe the outcome, improve. That loop, and the decision history it generates, is what can’t be bought off the shelf.
For builders reorienting toward this model, the starting point is identifying the highest-stakes judgment call that a domain expert makes repeatedly. Design the entire product around capturing that decision in context and tying it to outcomes. Everything else — the data ingestion, the model, the interface — is in service of making that loop work.
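To make that concrete, here is one hypothetical shape such a captured decision might take, using commercial underwriting (mentioned above) as the example domain. Every field name and the 12-month loss ratio metric are assumptions chosen for illustration, not a real product's schema.

```python
# Hypothetical record of one underwriting judgment call. The point is that
# the decision is captured in context and joined to its eventual outcome.
from dataclasses import dataclass
from datetime import date
from typing import Optional

@dataclass
class UnderwritingCall:
    # The context the expert saw at decision time
    submission_id: str
    industry: str
    requested_limit: float
    model_recommendation: str      # e.g. "decline" or "quote at 1.2x rate"
    # The judgment call itself, with the reasoning behind it
    expert_decision: str           # "accept", "modify", or "reject"
    expert_rationale: str
    decided_on: date
    # The outcome, joined months later (say, the 12-month loss ratio)
    loss_ratio_12m: Optional[float] = None
```

The design choice that matters is the join: the override rationale lives in the same record as the recommendation and the eventual loss ratio. Without that, the decision history can't teach the system anything.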
The window
Judgment loops compound, so the first system to reach flywheel velocity in a domain becomes very hard to displace. Not because of switching costs in the traditional sense, but because a competitor starting from zero has no decision history. They’re starting the learning loop from scratch while the incumbent’s is already spinning on customer-specific context: thousands of recommendation-decision-outcome cycles that only exist inside the system that captured them.
The founders who get this right won’t be the ones who built the best model or the most sophisticated agent. They’ll be the ones who understood the customer well enough to know which decisions matter most, designed a workflow that humans actually wanted to use, and quietly turned every judgment call into training data for what comes next.


