LLM Watch: Thoughts on Tech

The Gap of Judgement: The Missing Piece for Enterprise AI Transformation

Pascal Biese — Fri, 06 Mar 2026 10:51:30 GMT

Decades of automation investment have digitized the skeleton of operations. What remains - the unstructured, ambiguous, exception-laden work - is precisely what AI agents are now positioned to solve. But the challenge isn’t capability anymore. It’s control.

There is a strange paradox sitting at the heart of every large enterprise right now. Organizations have spent the better part of three decades and billions of dollars automating their operations. ERP systems, workflow engines, robotic process automation, business intelligence dashboards - the infrastructure of the modern firm is a monument to deterministic logic. And yet, look closely at what actually happens inside a finance or operations team on any given Tuesday, and you will find something surprising: people are still spending the majority of their time doing things that feel, instinctively, like they shouldn’t require a human at all.

All slides in this article were created courtesy of NotebookLM.

This isn’t a failure of effort or investment. It’s a structural property of the problem. Traditional automation is extraordinarily good at one specific thing: executing deterministic sequences on structured data. But enterprise reality is the opposite of deterministic. It is a landscape of intersecting, contradictory signals - an invoice that doesn’t match the PO, a vendor change request that cascades across seventeen open commitments, an exception that doesn’t fit any of the rules written into the system three years ago. Humans have always lived in that gap. Until now, nothing else could.

The Automation Plateau

The data here is uncomfortable in its persistence. NetSuite cites research showing that just 35% of finance professionals’ time goes to high-value insight work - the remaining 65% absorbed by routine data collection and validation. McKinsey puts the problem even more starkly: you cannot drive a business forward while spending 80% of your time on reporting and manual transactions. And despite near-universal investment in automation tooling - McKinsey’s 2024 CFO Pulse found 98% of finance leaders had invested in automation technologies in the prior twelve months - 41% of CFOs report that fewer than a quarter of their processes are actually automated.

This means that - if we oversimplify the numbers above for the sake of the argument - 60-70% of finance professional time is consumed by tasks that, in principle, should not require human judgment at all: gathering data across fragmented systems, reconciling numbers between spreadsheets and ERPs, managing exceptions that fall outside the logic of deterministic rules. That number has barely moved in a decade, despite massive investment in automation tooling.

The reason is visible in the shape of the productivity curve. Traditional automation follows a classic S-curve: rapid value creation early, followed by a plateau where incremental investment yields diminishing returns. What gets automated first is always the easiest - the structured, predictable, rule-bound work. What remains on the plateau is the residue: everything that requires context, judgment, cross-system interpretation, and the capacity to reason under ambiguity. The plateau is not a bug. It is the logical terminus of the deterministic approach.

The automation plateau is not evidence that organizations haven’t tried hard enough. It’s evidence that they’ve been using a fundamentally limited instrument - and have now reached the edge of what that instrument can do.

This distinction matters enormously for how we think about what comes next. The conversation in most boardrooms is still framed around whether AI will disrupt their industry, when the more operationally urgent question is much narrower and more tractable: can we finally automate the work that traditional automation has always failed to automate?

The Gap of Judgment

The architectural reason for the plateau has a name: the Gap of Judgment. It is the space between what deterministic automation can handle and what enterprise operations actually require. On one side of the gap sits everything that RPA and ERP were built for - if-then logic, structured data, predictable sequences. On the other side sits enterprise reality: unstructured reasoning, exception handling, cross-system translation, and the ability to make sense of situations that were never anticipated when the rules were written.

What makes the Gap of Judgment so durable is that it’s not simply a matter of complexity - it’s a matter of type. No amount of additional if-then rules bridges it, because the nature of the work on the other side of the gap is fundamentally probabilistic. Someone needs to reason about whether a given vendor exception is likely a data entry error or a legitimate dispute, and route it accordingly. Someone needs to look at a set of signals across four different systems and infer a coherent story about what’s happening to a payment. These are not lookup operations. They are inference operations. And inference, until very recently, was exclusively human territory.

Large Language Models changed this equation - not because they replaced the need for structured systems, but because they introduced, for the first time, something that can operate in the inference space. LLMs can handle ambiguity, reason through multi-step situations, and translate across incompatible data formats. The question that matters for enterprises is not whether these capabilities are real. It’s whether they can be deployed in a way that meets the control, compliance, and governance requirements of a regulated enterprise environment.

Three Stages, One Architecture

It is worth being precise about what “agentic AI” actually means in this context, because the term has been applied loosely to a spectrum of very different systems. The maturity path runs through three distinct stages, and conflating them leads to serious strategic errors.

Stage one - chatbots and copilots - is where most enterprise AI deployments currently live. The AI answers questions, generates drafts, suggests actions. A human receives the output and decides what to do with it. This is genuinely useful, but it does not address the automation plateau because it still requires a human in the critical path of every task. The bottleneck moves slightly, but does not disappear.

Stage two is where the substantive transformation begins. True agents don’t just answer, they execute. They can autonomously orchestrate multi-step processes, call APIs, read from and write to enterprise systems, and reason through sequences of actions that would previously have required sustained human attention. This is the capability that begins to close the Gap of Judgment in a meaningful way.

Stage three - the enterprise maturity path - describes the architectural progression through which an organization operationalizes true agency at scale. This is where the real design work begins, because raw agentic capability is necessary but not sufficient for enterprise deployment.

The path runs through three modes: Reactive (executing discrete tasks, read-only, stateless), Adaptive (building institutional knowledge through Bayesian confidence scoring), and Proactive (bounded autonomy with a live representation of enterprise state). Progression through these modes is not a software upgrade. It is a governance journey.
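To make "Bayesian confidence scoring" a little more concrete, here is a minimal sketch of one way it could work: a Beta-Bernoulli estimate of how often an agent's proposals are later validated by a human. The framework does not specify its scoring method, so the class name `BetaConfidence`, the uniform prior, and the sample outcomes are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class BetaConfidence:
    """Beta-Bernoulli estimate of how often an agent's proposals are validated."""

    # Pseudo-counts for a uniform Beta(1, 1) prior (an assumption, not from the framework).
    successes: float = 1.0
    failures: float = 1.0

    def update(self, validated: bool) -> None:
        """Record one human-validated (True) or human-rejected (False) proposal."""
        if validated:
            self.successes += 1
        else:
            self.failures += 1

    @property
    def mean(self) -> float:
        """Posterior mean probability that the next proposal is valid."""
        return self.successes / (self.successes + self.failures)


scorer = BetaConfidence()
for outcome in [True, True, True, False, True]:  # hypothetical review outcomes
    scorer.update(outcome)
print(round(scorer.mean, 3))
```

The point of the sketch is that confidence here is an accumulated record of reviewed outcomes, not a number the model reports about itself, which is why the Adaptive mode depends on deployment history.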

The Central Problem Is Control, Not Capability

This brings us to what is, in practice, the defining challenge of enterprise AI deployment - and the one that most technical discussions underweight. The question that keeps CIOs and compliance officers awake is not whether LLMs are capable enough to handle enterprise work. Increasingly, they demonstrably are. The question is whether they can do so in a way that satisfies the control, auditability, and regulatory requirements of a real enterprise operating environment.

The visual metaphor in the framework is apt: raw LLM capability is energetic and multidirectional, capable of operating across a huge range of tasks and contexts. Enterprise governance is a wall - immovable, intentional, and load-bearing. The productive relationship between these two things is not the LLM crashing through the wall. It is a deliberate architectural interface that lets the LLM’s reasoning capability operate while keeping its actions inside the compliance boundary.

LLMs can handle ambiguity and reason deeply. They cannot inherently operate within strict enterprise compliance. Deliberate architectural design is an absolute requirement. Trust is earned through architecture, not assumed from capability.

This reframing has significant practical consequences. It means that evaluating enterprise AI deployments primarily on the basis of model capability benchmarks is misleading. The relevant question is not “how capable is the model?” but “how well has the architecture been designed to make that capability safely operable in this environment?” These are different problems, and they require different expertise to solve.

The Enterprise Sandbox: A Controlled Execution Boundary

The architectural response to the control problem is what this framework calls the Enterprise Sandbox - a deliberate execution boundary inside which agentic reasoning operates, insulated from direct write access to production systems until outputs have cleared governance checks.

The architecture is worth tracing in detail because the design choices matter. Enterprise systems - SAP, ServiceNow, Excel - are connected to the sandbox through structured APIs. Data flows in, agentic processing happens inside the boundary, and outputs exit through a safety mechanism layer before reaching controlled output channels: human review queues and governed workflows. At no point does the agent touch a live production database directly.

The critical design principle here is inscribed at the bottom of the diagram: agents do not replace enterprise systems - they operate inside them. This is not a rip-and-replace architecture. The ERP is still the system of record. The workflow engine is still the workflow engine. The agent is a reasoning layer that can read, interpret, and propose - but the action still flows through the institution’s existing governance channels. This matters for adoption as much as it matters for safety. Organizations do not need to bet their operations stack on an unproven technology. They need to add an intelligent layer over infrastructure they already trust.

Simulation Before Action: The World Model Concept

One of the more technically interesting ideas in this architecture is the Enterprise World Model¹ - a live representation of enterprise state that agents can reason against before committing any action to a real system. The principle it embodies might be called simulation-before-act, and it deserves careful attention because it fundamentally changes the risk calculus of autonomous AI in enterprise environments.

Consider the specific example in the framework: an agent proposes to change vendor payment terms. In a traditional system, this kind of change would either require a human to manually trace all the downstream dependencies - open invoices, pending purchase orders, blocked payments - or it would simply go through and create cascading problems discovered only after the fact. The world model architecture routes that proposed action through a live simulation first. The agent sees 47 open invoices, 12 pending POs, 3 blocked payments. Constraint checks run against that snapshot. The action is either approved or blocked before a single production system is touched.

This is not a small increment over existing validation approaches. It is a qualitatively different capability, because it allows the system to reason about systemic effects - the kind of second- and third-order consequences that humans have always been responsible for tracing, and often fail to trace completely. A world model that can reliably predict cascading constraint violations before action represents a genuine expansion of what safe autonomous operation looks like.
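The simulation-before-act idea can be sketched in a few lines. This is an illustration under stated assumptions, not the framework's implementation: the `Snapshot` and `ProposedAction` types, the two constraint functions, and the 90-day policy limit are hypothetical. Only the counts (47 open invoices, 12 pending POs, 3 blocked payments) come from the example above.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass(frozen=True)
class Snapshot:
    """Point-in-time view of the enterprise state the agent reasons against."""
    open_invoices: int
    pending_pos: int
    blocked_payments: int


@dataclass(frozen=True)
class ProposedAction:
    description: str
    new_terms_days: int


# A constraint returns a violation message, or None if the action passes.
Constraint = Callable[[Snapshot, ProposedAction], Optional[str]]


def no_blocked_payments(s: Snapshot, a: ProposedAction) -> Optional[str]:
    # Hypothetical rule: terms cannot change while payments are blocked.
    if s.blocked_payments > 0:
        return f"{s.blocked_payments} blocked payments must be resolved first"
    return None


def terms_within_policy(s: Snapshot, a: ProposedAction) -> Optional[str]:
    # Hypothetical rule: payment terms may not exceed 90 days.
    if a.new_terms_days > 90:
        return "payment terms exceed the 90-day policy limit"
    return None


def simulate(
    snapshot: Snapshot, action: ProposedAction, constraints: List[Constraint]
) -> Tuple[bool, List[str]]:
    """Run every constraint against the snapshot; no production system is touched."""
    violations = []
    for check in constraints:
        message = check(snapshot, action)
        if message is not None:
            violations.append(message)
    return (not violations, violations)


snapshot = Snapshot(open_invoices=47, pending_pos=12, blocked_payments=3)
action = ProposedAction("Change vendor payment terms", new_terms_days=60)
approved, reasons = simulate(snapshot, action, [no_blocked_payments, terms_within_policy])
print(approved, reasons)
```

In this run the action is blocked because of the three blocked payments. The design choice worth noticing is that the constraints are plain deterministic functions: the agent proposes, but ordinary code decides.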

¹We use the term "world model" loosely here, to mean a stateful, dynamic representation of enterprise systems and processes. It's a pragmatic definition, without any appeal to physical simulation or digital-twin architectures.

Context Graphs and Multi-Layer Governance

The governance architecture adds another layer of verifiability through what the framework calls Context Graphs² - a mechanism for tracking the relationship between agent actions, predictions, and outcomes over time. The purpose is not just auditability after the fact, but active learning: the system accumulates evidence about the reliability of its own predictions, which feeds back into the confidence calibration of future actions.

The governance stack assembled here addresses a different class of risk at each layer. Pre-action simulation blocks constraint violations immediately - this is the world model mechanism working upstream of any action. Human approval gates provide structured review with the agent’s full reasoning chain visible - critically, not just the recommendation but the reasoning behind it, so that reviewers are not rubber-stamping opaque outputs. Append-only audit trails create a timestamped, field-level record of before-and-after state for every action - exactly what regulators and internal audit functions require.
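As an illustration of the append-only audit trail, here is a minimal sketch that records a timestamped, field-level before-and-after entry for every action. The hash chaining is my own addition to make tampering detectable; the framework does not say how append-only behavior is enforced. The class name `AuditTrail`, the field names, and the sample records are hypothetical.

```python
import hashlib
import json
from datetime import datetime, timezone


class AuditTrail:
    """Append-only log; each entry stores the hash of the previous one."""

    def __init__(self) -> None:
        self._entries = []

    @staticmethod
    def _digest(body: dict) -> str:
        # Stable serialization so the same entry always hashes the same way.
        return hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()

    def append(self, actor: str, record_id: str, field: str, before, after) -> dict:
        """Add one field-level change; existing entries are never modified."""
        body = {
            "ts": datetime.now(timezone.utc).isoformat(),
            "actor": actor,
            "record_id": record_id,
            "field": field,
            "before": before,
            "after": after,
            "prev_hash": self._entries[-1]["hash"] if self._entries else "",
        }
        entry = {**body, "hash": self._digest(body)}
        self._entries.append(entry)
        return dict(entry)

    def verify(self) -> bool:
        """Recompute the chain; any edited or reordered entry breaks it."""
        prev_hash = ""
        for entry in self._entries:
            body = {k: v for k, v in entry.items() if k != "hash"}
            if entry["prev_hash"] != prev_hash or self._digest(body) != entry["hash"]:
                return False
            prev_hash = entry["hash"]
        return True


trail = AuditTrail()
trail.append("agent-7", "VENDOR-001", "payment_terms_days", 30, 45)
trail.append("reviewer-2", "VENDOR-001", "approval_status", "pending", "approved")
print(trail.verify())
```

A production system would more likely lean on a database or ledger with write-once guarantees; the sketch only shows what "field-level, before-and-after, timestamped" looks like as a data structure.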

Together, these mechanisms represent something important: a shift from asking “do we trust AI?” as a categorical question, to building the empirical infrastructure through which trust can be earned and demonstrated incrementally. That is a much more tractable problem.

²Again, a pragmatic definition - for a much less flawed definition and in-detail explanation of context graphs, I want to recommend this piece from .

Integration Without Rip-and-Replace

One of the most practically consequential claims in this framework is the integration philosophy: agentic architecture sits above the existing tech stack, not in place of it. The specific systems named - SAP as system of record, ServiceNow as workflow orchestration, Excel as the finance lingua franca - are not incidental. They represent the actual landscape of enterprise infrastructure as it exists, not as architects might wish it looked.

Organizations have spent decades and enormous resources building, customizing, and integrating their core enterprise systems. A deployment approach that required wholesale replacement of that infrastructure would face prohibitive switching costs and organizational resistance - and rightly so, because the institutional knowledge embedded in those systems is real and valuable. An approach that treats the existing stack as the data substrate, and adds intelligent reasoning capability as a layer above it, sidesteps that objection almost entirely. The agents read and reason over existing data formats. SAP remains the system of record. Excel remains the finance lingua franca. Nothing that currently works stops working.

A Data-Driven Progression of Autonomy

How organizations actually move from here to a fully agentic operating model is one of the hardest questions in enterprise AI, and the framework offers a clear structural answer: phased progression, where each phase produces the empirical evidence that justifies the next. This is not a roadmap in the abstract planning sense. It is a feedback-driven escalation protocol.

Phase 1 - Shadow Mode. The agent runs in parallel with existing processes, with no write access. Pure calibration - the system generates predictions and recommendations, but nothing is acted on. The purpose is to accumulate accuracy data against which later claims about capability can be evaluated. This phase answers the question: how reliable is this system on our actual data, in our actual environment?

Phase 2 - Assisted Mode. The agent surfaces recommendations; humans review and approve before any action is taken. The bottleneck shifts from human analysis to human review - significantly faster, but the human remains in the critical path. Data from this phase reveals the failure modes and edge cases specific to this deployment context.

Phase 3 - Supervised Autonomy. Clean cases - those that meet confidence thresholds established in prior phases - execute autonomously. Exceptions route to human queues. The human’s role shifts from reviewer of all outputs to exception handler. The organization now has empirical data on where the system is reliable enough to trust without review.

Phase 4 - Full Autonomy. Governed execution inside the sandbox, with humans managing policy and audit rather than individual transactions. The agent operates with bounded autonomy; the human organization’s role is governance, not execution. This phase is only justified by the data accumulated in phases one through three.
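The four phases can be read as a routing rule applied to each proposed action. The sketch below is one possible encoding, with assumptions labeled: the 0.95 threshold, the route names, and the `within_policy` flag used in Phase 4 are hypothetical and not part of the framework.

```python
from enum import Enum


class Phase(Enum):
    SHADOW = 1
    ASSISTED = 2
    SUPERVISED = 3
    FULL = 4


def route(
    phase: Phase,
    confidence: float,
    threshold: float = 0.95,   # hypothetical; in practice calibrated from earlier phases
    within_policy: bool = True,  # hypothetical policy check for Phase 4
) -> str:
    """Decide what happens to one agent proposal in the given phase."""
    if phase is Phase.SHADOW:
        return "log_only"  # no write access; predictions are only recorded
    if phase is Phase.ASSISTED:
        return "human_review"  # every recommendation needs approval
    if phase is Phase.SUPERVISED:
        # Clean cases execute; everything else goes to a human queue.
        return "auto_execute" if confidence >= threshold else "human_queue"
    # FULL: governed execution; humans manage policy rather than transactions.
    return "auto_execute" if within_policy else "human_queue"


for phase in Phase:
    print(phase.name, route(phase, confidence=0.97))
```

Note that the threshold is an input, not a constant baked into the agent: it is the thing the earlier phases exist to estimate.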

The structure transforms trust from a prerequisite into a product. You do not need to decide, in advance, whether to trust AI with your accounts payable process. You run shadow mode, collect data, move to assisted mode, collect more data, and let the empirical record make the decision for you. This is how you should think about governance of complex systems generally - not as a policy problem but as an evidence accumulation problem.

The Compounding Institutional Learning Problem

The final and, in some ways, most important point in this framework concerns the competitive dynamics of agentic adoption - and why the historical intuition about the wisdom of being a fast follower no longer applies.

In past technology cycles - ERP, cloud migration - second-movers often captured comparable value to first-movers. The reason is that those technologies were, at their core, software implementations: the institutional knowledge required to operate them did not compound at the rate that the technology itself improved. A company that migrated to SAP in 2008 versus 2010 did not find itself at a permanently unbridgeable capability disadvantage by 2015.

Agentic AI is structurally different, because the value of the system is not primarily in the software. It is in the accumulated institutional memory - the thousands of validated exception patterns, the calibrated confidence models, the learned organizational context - that the system builds through actual deployment. An early-moving organization accumulating agentic experience today is building a data flywheel whose value compounds over time. A late mover cannot purchase that flywheel. It must be grown from scratch, from the beginning of the learning curve, in an environment where competitors are already operating at phase three or four maturity.

You cannot buy a fast-track to years of accumulated agentic experience. Every month of delay is not just delayed value - it is lost institutional learning that competitors are actively accumulating right now.

This is not an argument for recklessness. The governance architecture described above exists precisely to make disciplined, phased deployment possible and safe. But it is a sharp argument against treating agentic AI as a technology to evaluate seriously in twelve to eighteen months. The organizations beginning phase-one shadow deployments today are not just capturing early value - they are building the institutional knowledge base that will constitute a genuine competitive moat as capability matures.

What This Actually Means

The framework described here is not primarily a technology brief. It is an organizational design argument. The thesis is that the obstacles to deploying autonomous AI in the enterprise have always been more architectural and governance-related than they have been capability-related - and that the capability gap has now closed to the point where the architectural and governance questions have become the binding constraint.

The implication is that the organizations most likely to succeed with agentic AI are not necessarily those with the most sophisticated technical teams. They are the ones that approach deployment as a governance design problem: how do we build the sandbox that lets the reasoning capability operate within our compliance boundary? How do we design the progression through phases that produces the empirical evidence we need to expand autonomy responsibly? How do we structure human approval gates so that reviewers are genuinely informed rather than effectively rubber-stamping?

These are hard questions. But they are tractable ones - which is precisely what makes this moment feel different from the prior waves of enterprise AI investment that generated more hype than operational transformation. The gap of judgment has always been the hardest part of enterprise operations. For the first time, we have technology that can operate inside it, and an architecture that makes that operation controllable. The question is whether organizations have the governance imagination to use it.


Why AI Agents Disappoint

Maria Sukhareva — Thu, 13 Nov 2025 20:18:04 GMT

Welcome everyone! It has been a while since I featured a guest, but here we are - this time with a piece from Maria. Maria is a Principal AI Expert at Siemens and runs the AI Realist on Substack where she offers her honest view on today’s AI technology. She has been around Natural Language Processing (NLP) for over 15 years and has witnessed the evolution of this field first-hand.

As always, keep in mind that I don’t choose guest posts to validate my own opinion - I choose posts that I personally find interesting and that I think will be valuable to my audience (which is you). Obviously, I wouldn’t pick something I fundamentally disagree with, but diversity of opinions is important for a healthy discourse and there are too few voices of reason in a room full of noise. Anyway, the stage is hers.

For the past year, we’ve heard on multiple occasions that AI agents are going to save the day. LLMs can’t count the “r’s” in strawberries? Agents will! LLMs can’t send your emails? Agents will! LLMs can’t stop hallucinating? Agents, agents, agents…

The “agentic this, agentic that” noise is so strong that managers simply hand their IT departments vague tasks - “build something with agents” - and then go full surprised Pikachu face when it turns out the agents aren’t particularly useful.

Andrej Karpathy recently said:

They just don’t work. They don’t have enough intelligence, they’re not multimodal enough, they can’t do computer use and all this stuff

Amen to this! Finally, a word of reason amid the endless hype!

He added:

They don’t have continual learning. You can’t just tell them something and they’ll remember it. They’re cognitively lacking and it’s just not working… they are just cognitively lacking and are not working… It will take about a decade to work through all of those issues

Let me break down what is cognitively lacking and not working, and why it is going to take at least a decade to fix. I hope that the next time a marketer tries to sell you another autonomous agent that will replace your employees, boost efficiency, and whatnot, you can use these arguments to challenge their claims.

But first, let me tell you about the one place where agents currently work well: coding. They are an absolute must to introduce to your developers. AI-assisted coding works incredibly well. Tools such as GitHub Copilot Agents, Claude Code, and others have shown that they can boost developers’ productivity, take over routine tasks, and help teams code faster.

What is an agent

The best way to cut through all the hype around “agents” being autonomous, proactive decision-makers, or whatever marketing label happens to be trending, is to look at how agents are actually trained and built. This approach quickly demystifies what they are and reveals almost immediately why they fall short of the hype.

Modern LLMs are trained to select and use tools [source, source], enabling what is often called agentic behavior or agentic workflows. Let’s take a look at a typical dataset used to train LLMs for tool selection:

Here is an example from glaiveai/glaive-function-calling-v2, a commonly used open-source dataset for training function calling. As you can see in this dataset, the subset of tools is first defined in the system prompt. The system prompt lists the functions the model can access. Then there are examples of conversations that should trigger function execution. In effect, the model is trained to pick from a limited subset of tools specified in the system prompt.
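To show what "picking from a limited subset of tools" means at runtime, here is a small sketch. It is not taken from the dataset: the tool `convert_currency`, the prompt wording, and the JSON shape of the model output are hypothetical stand-ins, and no model is actually called.

```python
import json


def convert_currency(amount: float, rate: float) -> float:
    """Hypothetical tool: multiply an amount by an exchange rate."""
    return round(amount * rate, 2)


# The only tools the model is allowed to choose from.
TOOLS = {"convert_currency": convert_currency}

TOOL_SPECS = [
    {
        "name": "convert_currency",
        "description": "Convert an amount using a given exchange rate",
        "parameters": {"amount": "number", "rate": "number"},
    }
]


def build_system_prompt(specs: list) -> str:
    """List the available functions in the system prompt (illustrative wording)."""
    return "You have access to the following functions:\n" + json.dumps(specs, indent=2)


def dispatch(model_output: str):
    """Parse the model's function call and run it, rejecting anything not listed."""
    call = json.loads(model_output)
    name = call["name"]
    if name not in TOOLS:
        raise ValueError(f"model requested unknown tool: {name}")
    return TOOLS[name](**call["arguments"])


# Stand-in for what a trained model might emit for "convert 100 at rate 0.5".
model_output = '{"name": "convert_currency", "arguments": {"amount": 100, "rate": 0.5}}'
result = dispatch(model_output)
print(result)
```

Everything the "agent" can do is fixed by the `TOOLS` dictionary before the conversation starts, which is the limitation the rest of this article is about.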

A more elaborate approach is proposed in the paper Plan-and-Act. The idea is straightforward: train a system that

plans,
acts,
observes,
repeats

In theory, these systems can self-correct. When planning incorporates chain-of-thought reasoning, it offers a sort of “explanation” of why a tool was selected, though not an actual one, because it is all just generated text and can be hallucinated. These approaches are used for coding agents, and they work well in that setting. Unfortunately, that is not yet the case for general agents beyond coding.
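The plan, act, observe, repeat cycle can be sketched as a plain loop. This is a generic illustration, not the Plan-and-Act method itself: the planner below is a scripted function standing in for an LLM, and the tool names and the `run_agent` signature are hypothetical.

```python
from typing import Callable, Dict, List, Optional, Tuple

Step = Tuple[str, str]                # (tool name, argument)
HistoryItem = Tuple[str, str, str]    # (tool name, argument, observation)


def run_agent(
    goal: str,
    plan_step: Callable[[str, List[HistoryItem]], Optional[Step]],
    tools: Dict[str, Callable[[str], str]],
    max_steps: int = 5,
) -> List[HistoryItem]:
    """Plan, act, observe, repeat until the planner stops or the step budget runs out."""
    history: List[HistoryItem] = []
    for _ in range(max_steps):
        step = plan_step(goal, history)        # plan
        if step is None:
            break
        tool_name, argument = step
        observation = tools[tool_name](argument)  # act
        history.append((tool_name, argument, observation))  # observe
    return history


def scripted_planner(goal: str, history: List[HistoryItem]) -> Optional[Step]:
    """Stand-in for an LLM planner: search first, then summarize, then stop."""
    if len(history) == 0:
        return ("search", goal)
    if len(history) == 1:
        return ("summarize", history[-1][2])
    return None


tools = {
    "search": lambda query: f"results for {query}",
    "summarize": lambda text: text.upper(),
}
trace = run_agent("top contributors", scripted_planner, tools)
for item in trace:
    print(item)
```

Notice that each new plan is conditioned only on the history so far. Nothing in the loop checks whether an earlier observation was wrong, and nothing is retained once `run_agent` returns.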

In practice, an agent as currently built is far less intelligent than one might expect. It is essentially a next-token predictor that performs a sequence of reasoning steps and then selects from a predefined tool set. This can be powerful when tasks reduce to clear IF…THEN rules, instructions are simple, and the tool set and its combinations are relatively small.

And, indeed, recent studies show how poorly agents perform in real-world settings:

The paper WebArena: A Realistic Web Environment for Building Autonomous Agents notes that much prior evaluation occurs in sanitized, synthetic settings; when agents are tested in a more realistic web environment, success drops to about 11–14%, compared with 78% for humans.

Another paper, ASSISTANTBENCH: Can Web Agents Solve Realistic and Time-Consuming Tasks? (EMNLP 2024), confirms these results, showing that no agent surpasses roughly 26% accuracy on realistic web tasks.

Now let’s go through the challenges:

Lack of proactive learning

As you can see, nothing in this architecture assumes proactive learning. Suppose you ask the model once to find the best contributors on GitHub. It thinks for a while, makes mistakes, plans an approach, executes it with a few errors, corrects them, you help correct it, and finally it produces the right answer.

The next time you ask the question, the model will know nothing about what happened previously and may repeat the same mistakes or make new ones. There is currently no built-in proactive learning for AI agents.

Error propagation through LLMs

The pipeline relies heavily on planning, and the steps are sequential. It plans, executes, observes, and can repeat this cycle several times. If it makes a mistake in an early cycle, that error can propagate while the system tries to self-correct. There is no native mechanism to roll back to the failure point; an LLM simply keeps generating the next token. It cannot jump back N tokens and restart generation from there, so it may keep producing output and drift further off course.
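A back-of-the-envelope calculation shows why sequential steps are so unforgiving. The sketch assumes, purely for illustration, that every step succeeds independently with the same probability and that a single failure spoils the run; real agents violate both assumptions, and the 95% figure is not a measured value.

```python
def chain_success(per_step: float, steps: int) -> float:
    """Probability that all steps succeed, assuming independent, equally reliable steps."""
    return per_step ** steps


for n in (1, 5, 10, 20):
    print(n, round(chain_success(0.95, n), 3))
```

Under these assumptions, a step that is right 95% of the time yields a ten-step run that succeeds only about 60% of the time, and a twenty-step run about 36% of the time.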

There are mitigation mechanisms (branching search, self-consistency, Tree-of-Thought style exploration, self-critique and replanning, multi-agent debate, external memory) that can be built on top of the LLM, for example with prompting. These should be viewed as band-aids rather than fixes. Recent peer-reviewed papers show that errors still snowball in multi-step reasoning and planning, and that these methods often do not prevent the accumulation of earlier mistakes. Very recent analyses of multi-agent debate also report that agents can amplify one another’s mistakes and decrease final accuracy, for example by conforming to persuasive but incorrect reasoning.

Absence of multimodality

Coding agents also work well because they do not rely heavily on multimodality. They might occasionally consult a diagram, but most of the necessary information is expressed in natural language or code. Even better, much of it can usually be found in the same environment (e.g., your IDE), which helps the model complete its task.

It is different with real-world agents. Imagine an agent performing annual financial audit checks - this is, for now, unrealistic. Such an agent would need access to many systems: structured logs, invoices, receipts, regulations and laws, and possibly the employee directory. These data types span multiple modalities: text, speech, tables, charts, images, videos, and they do not all reside in one environment. You might need to email vendors for invoices; receipts could be stored in a database; logs might be in the cloud; and relevant laws may need to be retrieved from the internet.

The current agents simply lack the ability to juggle modalities and adapt to processes that cannot be easily defined with a small set of planned steps. In a scenario where an agent discovers that certain receipts are missing or unreadable, the best case is that it stops processing; the worst case is that it behaves sycophantically and tries to finish the process anyway, hallucinating details and ultimately producing inaccuracies that are hard to trace.

Absence of suitable processes for agent integration

The process above is essentially built with a human in mind as the actor. Humans have common sense; they know when to send an email and when to look in the database.

They learn proactively: If they emailed Martin once and he said he has no idea where the invoices are, they won’t email him again 30 seconds later with the same question.

Agents cannot do that. They need processes defined for agentic workflows: the best case is when the process can be written as IF…THEN rules; the next tier is when you can define the process in natural language with instructions. If none of this is possible and much of what you do depends on the outcome of each step, you have to decide dynamically what action to take. That action might be completely new - like realizing that all the invoices were moved from Martin’s PC to a newly introduced database - and you need to request permissions to access it. No agent can do this, obviously.

Here’s how the processes should be adjusted to the current agents:

Let us briefly look at the levels of automation:

Many companies’ marketing pitches for agentic solutions create the false impression that automation is already near the observer level. That would be exciting, but in reality the level of automation is closer to collaborator at best, probably at the very beginning. Coding agents are somewhat further along, crossing to the consultant level.

It will take a long time for expectations to be met, and progress slows as we approach the finish line - the so-called “last-mile” problem.

Reaching full automation, where the human is merely an observer, is extremely challenging and represents the true last mile of agentic AI.

Andrej Karpathy estimates 10 years; I would be surprised if I see it in my lifetime.

I hope you enjoyed this article. Don’t fall for hype and false promises. We can profit from technology only if we understand its limits and apply it to the right use cases.


Microsoft's Biggest Bet on Agents... Yet

Pascal Biese — Sat, 18 Oct 2025 14:42:34 GMT

In the quiet corridors of software architecture, something fundamental is shifting. Microsoft’s recently released Agent Framework - a convergence of their Semantic Kernel and AutoGen projects - is more than just another developer tool. It’s a crystallization of years of struggle with a deceptively simple question: How do we teach machines not just to respond, but to orchestrate?

The answer, embedded in thousands of lines of carefully considered code, reveals something crucial about how we’re reimagining the relationship between human intention and computational execution.

Beyond the Chatbot Paradigm

We’ve spent the past two years entranced by conversational AI. Ask a question, receive an answer. But this interaction model - elegant in its simplicity - obscures a more complex reality. Real work doesn’t happen in isolated question-answer pairs. It happens in workflows: interconnected sequences of decisions, handoffs, validations, and transformations that weave through organizations like invisible threads.

Microsoft’s Agent Framework acknowledges this explicitly. Where earlier AI frameworks focused on making individual agents more capable, this framework asks a different question: How do we coordinate multiple agents to accomplish what no single agent could achieve alone?

The distinction matters more than you might think.

The Graph Beneath the Surface

At the heart of the Agent Framework lies a graph-based architecture that’s both technically precise and philosophically revealing. Instead of treating agent interactions as reactive events - things that happen to agents - the framework models them as explicit data flows through a directed graph.

Consider the components:

Executors are nodes that receive input messages, perform their assigned tasks, and produce outputs. These can be AI agents powered by language models, or they can be deterministic functions - plain code that does exactly what it’s told. The framework doesn’t care. Both are first-class citizens in the workflow graph.

Edges define how messages flow between executors. They can be simple (A → B), conditional (A → B if X, else C), or dynamic (A → some agent decides → ?). This explicitness - making the routing logic visible and manipulable - transforms opaque agent interactions into legible, debuggable workflows.

Checkpointing captures workflow state at strategic moments, enabling long-running processes to pause, resume, and recover from failures. This isn’t glamorous technology, but it’s essential for production systems where reliability matters more than novelty.

What emerges is something rarely seen in AI frameworks: a system designed not for demos, but for the messy reality of enterprise workflows where processes span hours, days, or weeks, where human approval is required at specific junctures, and where failure recovery isn’t optional.
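To make the three components concrete, here is a minimal sketch in plain Python. It is not the Agent Framework's API: the `Workflow` class, its methods, and the invoice executors are hypothetical names chosen for illustration. It only shows the idea of executors as nodes, conditional edges between them, and a checkpoint recorded after every step.

```python
from dataclasses import dataclass, field
from typing import Any, Callable

# Hypothetical names throughout: this illustrates the idea of a workflow
# graph, not the actual Microsoft Agent Framework API.
Executor = Callable[[Any], Any]
Condition = Callable[[Any], bool]


@dataclass
class Workflow:
    executors: dict[str, Executor] = field(default_factory=dict)
    edges: dict[str, list[tuple[Condition, str]]] = field(default_factory=dict)
    checkpoints: list[tuple[str, Any]] = field(default_factory=list)

    def add_executor(self, name: str, fn: Executor) -> None:
        self.executors[name] = fn

    def add_edge(self, src: str, dst: str, when: Condition = lambda _: True) -> None:
        self.edges.setdefault(src, []).append((when, dst))

    def run(self, start: str, message: Any) -> Any:
        current = start
        while current is not None:
            message = self.executors[current](message)
            # Checkpoint: record which node finished and what it produced.
            self.checkpoints.append((current, message))
            # Follow the first outgoing edge whose condition matches.
            current = next(
                (dst for when, dst in self.edges.get(current, []) if when(message)),
                None,
            )
        return message


wf = Workflow()
wf.add_executor("parse", lambda text: {"amount": float(text)})
wf.add_executor("auto_approve", lambda inv: {**inv, "status": "approved"})
wf.add_executor("escalate", lambda inv: {**inv, "status": "needs_review"})
wf.add_edge("parse", "auto_approve", when=lambda inv: inv["amount"] < 1000)
wf.add_edge("parse", "escalate")  # fallback edge when no condition above matched

result = wf.run("parse", "250.0")
```

Because the routing lives in `edges` rather than inside the executors, you can inspect or change it without touching the code that does the work, and the `checkpoints` list gives you a trace of which node produced which message.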

Five Patterns, Infinite Combinations

The framework codifies five orchestration patterns, each suited to different coordination challenges:

Sequential Orchestration

Agents execute in order, each building on the previous agent’s output. Document review workflows are canonical examples: a summarization agent processes the raw text, a translation agent converts it to the target language, and a quality assurance agent validates the result. Linear, predictable, transparent.

Concurrent Orchestration

Multiple agents work in parallel on the same input, then their results are aggregated. Think of gathering pricing data from multiple suppliers simultaneously, or collecting diverse analytical perspectives on the same dataset. The workflow doesn’t wait for sequential completion - it exploits parallelism where tasks are independent.

Handoff Orchestration

Context-driven transfer of control between specialized agents. A customer support agent handles general inquiries but hands off to a technical expert when the conversation requires deep domain knowledge, then potentially to a billing specialist if payment issues arise. The routing logic responds dynamically to how the conversation unfolds.

GroupChat Orchestration

Agents collaborate in a shared conversational space, building on each other’s contributions until they converge on a solution. This pattern mimics how human teams brainstorm: multiple perspectives in dialogue, where the final answer emerges from collective deliberation rather than individual analysis.

Magentic Orchestration

Perhaps most intriguing: a manager agent coordinates a team of specialists for complex, open-ended problems. The manager doesn’t just route messages - it builds and refines a dynamic task ledger, creating goals and subgoals as understanding evolves. When the approach is unclear at the outset, this pattern provides structure without over-constraining the solution space.

These patterns aren’t mutually exclusive. Real workflows often combine them: sequential processing might hand off to a magentic manager for a complex subtask, which spawns concurrent agents for parallel data collection, which ultimately feeds back into the main sequence.
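The difference between the first two patterns is easy to see in code. The following is a rough sketch using Python's standard `asyncio` module, not the framework's own orchestration classes. The "agents" (`summarize`, `shout`, `quote_supplier`) and the price table are made-up stand-ins for real model or API calls.

```python
import asyncio


# Stand-in "agents": hypothetical async functions, not real model calls.
async def summarize(text: str) -> str:
    return text.split(".")[0] + "."


async def shout(text: str) -> str:
    return text.upper()


async def quote_supplier(name: str, item: str) -> tuple[str, float]:
    prices = {"acme": 10.0, "globex": 8.5, "initech": 9.25}  # made-up data
    await asyncio.sleep(0)  # yield control, as a real network call would
    return name, prices[name]


async def sequential(text: str, steps) -> str:
    # Sequential: each step receives the previous step's output.
    for step in steps:
        text = await step(text)
    return text


async def concurrent(item: str, suppliers: list[str]) -> dict[str, float]:
    # Concurrent: all suppliers are queried in parallel, then aggregated.
    quotes = await asyncio.gather(*(quote_supplier(s, item) for s in suppliers))
    return dict(quotes)


summary = asyncio.run(sequential("First point. Second point.", [summarize, shout]))
quotes = asyncio.run(concurrent("widget", ["acme", "globex", "initech"]))
cheapest = min(quotes, key=quotes.get)
```

Combining patterns then amounts to nesting: a step inside `sequential` can itself call `concurrent` and pass the aggregated result on.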

What Agentic Design Actually Means

Here’s where we need to think carefully. “Agentic” has become AI buzzword territory - overused and under-examined. But the Agent Framework’s architecture suggests a more precise definition:

Agentic systems are those where autonomous components make context-dependent decisions about what to do next, coordinated through explicit workflow graphs that make their collaboration patterns legible and controllable.

Three elements matter:

Autonomy: Individual agents reason about their tasks and decide which tools to invoke, but they operate within constrained boundaries defined by their instructions and available capabilities.
Coordination: The workflow graph explicitly defines how agents interact. This isn’t emergence from the bottom up - it’s intentional design from the top down.
Legibility: The framework makes agent behavior observable through OpenTelemetry integration. Every tool invocation, every agent handoff, every routing decision generates structured telemetry. You can see what your system is doing and why.

This balance - autonomy within structure, intelligence within guardrails - is what makes agentic systems viable for production use. Pure emergence is fascinating in research contexts but terrifying in systems that process financial transactions or medical records.

The Hidden Assumption

There’s an assumption embedded in workflow-based agent architectures that we should examine critically: the belief that complex objectives can be decomposed into manageable subproblems whose solutions can be recombined into an overall solution.

This assumption - inherited from classical software engineering - doesn’t always hold. Some problems resist decomposition. Some emergent properties only appear at higher levels of organization and can’t be achieved by assembling solutions to component problems.

Yet for a vast category of enterprise tasks, decomposition works remarkably well. Financial reporting pipelines. Supply chain automation. Customer onboarding sequences. Content creation workflows. These are naturally compositional: they consist of identifiable steps that can be extracted, optimized individually, and reassembled.

The Agent Framework’s bet is that most business value in AI agents will come from orchestrating these compositional workflows, not from magical emergent behavior. It’s a pragmatic bet, but one grounded in decades of experience building enterprise systems.

Type Safety in an Uncertain World

One technical detail deserves attention: the framework’s emphasis on strong typing and type-based routing. Messages flowing through the workflow graph carry type information that determines which edges they can traverse and which executors can process them.

This might seem like pedantic engineering in the age of neural networks that blur all boundaries. But consider what it enables:

Validation before execution: Type mismatches are caught at workflow design time, not discovered during production runs.
Clear contracts: Each executor declares what inputs it expects and what outputs it produces. No guessing.
Safer composition: When you connect executors, the type system ensures they can actually communicate.

In a domain defined by probabilistic outputs and unpredictable agent behavior, type safety provides an anchor. The messages might be generated by neural networks, but their flow through the system follows deterministic rules you can reason about.
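As an illustration of "validation before execution", here is a small sketch that checks type compatibility when two executors are wired together, before any message flows. It uses only Python's standard `typing.get_type_hints`. The `connect` helper and the `Invoice` and `Approval` types are hypothetical and are not how the framework itself implements routing.

```python
from dataclasses import dataclass
from typing import Callable, get_type_hints


@dataclass
class Invoice:
    vendor: str
    amount: float


@dataclass
class Approval:
    invoice: Invoice
    approved: bool


def extract(raw: str) -> Invoice:
    vendor, amount = raw.split(",")
    return Invoice(vendor.strip(), float(amount))


def approve(invoice: Invoice) -> Approval:
    return Approval(invoice, approved=invoice.amount < 1000)


def connect(src: Callable, dst: Callable) -> Callable:
    """Compose two executors, rejecting mismatched types at build time."""
    produced = get_type_hints(src)["return"]
    hints = get_type_hints(dst)
    # The first non-return annotation is the executor's input type.
    expected = next(t for name, t in hints.items() if name != "return")
    if produced is not expected:
        raise TypeError(
            f"{src.__name__} produces {produced.__name__}, "
            f"but {dst.__name__} expects {expected.__name__}"
        )
    return lambda message: dst(src(message))


pipeline = connect(extract, approve)  # OK: extract returns what approve expects
decision = pipeline("Acme GmbH, 420.00")

try:
    connect(approve, extract)  # Approval fed into a str input: rejected
    wiring_error = None
except TypeError as exc:
    wiring_error = str(exc)
```

The second `connect` call fails while the workflow is being assembled, not halfway through a production run. That is the whole point of declaring contracts up front.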

Human-in-the-Loop: Acknowledging Limits

The framework’s support for human-in-the-loop workflows deserves recognition. Certain operations - approving financial transactions over a threshold, making hiring decisions, publishing content to external audiences - shouldn’t be fully automated regardless of how capable our agents become.

The framework’s checkpointing system enables workflows to pause at designated points, expose their current state for human review, and resume after approval. This isn’t a concession to current AI limitations - it’s acknowledgment of a permanent constraint: some decisions carry consequences that require human accountability.

The architectural support for this pattern matters because it normalizes mixed-initiative systems where humans and agents collaborate rather than compete. The workflow doesn’t terminate when it needs human input - it pauses gracefully and provides context for the decision required.
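A pause-and-resume flow of this kind can be sketched in a few lines. The example below is an assumption-laden illustration, not the framework's checkpointing API: the step names, the `APPROVAL_THRESHOLD` value, and the use of a JSON string as the checkpoint store are all invented for the example.

```python
import json

# Hypothetical workflow: step names and the threshold are illustrative only.
STEPS = ["draft_payment", "await_approval", "execute_payment"]
APPROVAL_THRESHOLD = 10_000


def run(state: dict) -> dict:
    """Advance the workflow until it finishes or needs a human decision."""
    while state["step"] < len(STEPS):
        name = STEPS[state["step"]]
        if name == "await_approval":
            needs_human = state["amount"] >= APPROVAL_THRESHOLD
            if needs_human and state.get("approved") is None:
                state["status"] = "paused"
                return state  # pause here; the caller persists the checkpoint
            if state.get("approved") is False:
                state["status"] = "rejected"
                return state
        elif name == "execute_payment":
            state["executed"] = True
        state["step"] += 1
    state["status"] = "done"
    return state


# First run pauses at the approval step; the state is serialized to JSON.
checkpoint = json.dumps(run({"step": 0, "amount": 25_000}))

# Later (possibly in another process) a human approves and the run resumes.
restored = json.loads(checkpoint)
restored["approved"] = True
final = run(restored)
```

The workflow never blocks while waiting. It returns a serializable state that records exactly where it stopped, which is what lets a process span hours or days.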

What Executives Need to Understand

If you’re leading an organization considering AI agent deployment, Microsoft’s framework reveals several strategic implications:

Guided Autonomy: Progressive Trust Is All You Need

Pascal Biese — Thu, 18 Sep 2025 15:10:23 GMT

Last week, I watched an AI agent try to book a flight. It was a silly experiment, since I personally have no interest in booking flights via autonomous AI, but it’s a suitable use case for demonstrating a fundamental issue: poor agency scoping.

So, what happened? Within three minutes, it had opened seventeen browser tabs, attempted to purchase business class tickets to Paris (I wanted economy), and somehow ended up researching the history of French aviation. This wasn't a broken, sloppy prototype - it was a sophisticated agent with access to web browsing, memory, and calendar integration. It just had too much autonomy, too soon.

This experience is typical of something I've witnessed in dozens of AI agent projects: most people give these systems the keys to the kingdom before they've learned to open doors. The result? A crisis of trust that's holding back the entire field of agentic enterprise AI.

But there's a better way. Over the past year, as I worked with teams at dozens of companies, from legacy enterprises to tech startups, a pattern emerged. The teams succeeding with AI agents aren't the ones building the most sophisticated systems. They’re the ones implicitly following what I have started to call the Principle of Least Autonomy (PoLA, borrowed from the Principle of Least Privilege in information security) - a framework that treats agent development like training a new team member, where trust and independence are earned progressively through demonstrated competence.

The Agency Paradox: Why More Capability Means Less Control

Here's a truth about AI agents that decision makers tend to overlook: every increase in agency creates a corresponding decrease in control. This is not a bug, it's inherent. When you give an AI system the ability to make decisions and take actions on your behalf, you necessarily give up some ability to predict and constrain its behavior.

Traditional software doesn't work this way. When you write if (user.clicks) then (open.menu), you know exactly what will happen. But when you tell an AI agent to "handle customer support tickets," you're entering a world of non-deterministic behavior where the same input might generate wildly different outputs depending on context, training, and what the model had for breakfast (metaphorically speaking).

This creates an agency-control tradeoff - think of it as a seesaw:

High control, low agency: The AI suggests responses, humans review everything (copilot mode)
Medium control, medium agency: The AI makes decisions within boundaries, escalates edge cases
Low control, high agency: The AI operates independently, humans monitor outcomes

Most teams jump straight to the high-agency end because that's where the promised productivity gains live. It's also where things spectacularly fall apart.

The Principle of Least Autonomy: Start Smaller Than You Think

The Guided Autonomy framework begins with a counterintuitive principle: give your AI agents the minimum autonomy necessary to demonstrate value, not the maximum autonomy technically possible.

This is the opposite of how most teams approach agent development. Instead of asking "what's the most sophisticated thing we can build?", ask "what's the simplest useful behavior we can verify?"

Consider how Anthropic approached Claude's new computer-use capability. They didn't start by letting it control entire workflows. First, it learned to take screenshots. Then to identify UI elements. Then to click buttons. Only after proving competence at each level did it earn the right to chain these actions together. Even now, it operates with extensive guardrails and requires explicit user permission for sensitive actions.

This incremental approach might seem slow, but it's actually faster than the alternative: rebuilding trust after your agent goes rogue. A company I worked with learned this the hard way when their customer service agent achieved a 94% resolution rate but also promised refunds to anyone who mentioned the word "disappointed." I had to come in for the cleanup.

The CC/CD Framework: How Agents Actually Learn

Building reliable agents requires a fundamentally different development lifecycle than traditional software. Machine learning models need constant calibration, and AI agents even more so, because their outputs are usually much more complex (actions rather than classifications). Continuous Calibration / Continuous Development (CC/CD), which has emerged as an AI-specific variant of CI/CD, addresses this challenge. If CI/CD is about shipping code reliably, CC/CD is about shipping behavior reliably.

Continuous Development: Defining the Ladder Rungs

A key part of the Guided Autonomy development framework is about defining capability levels for AI agents. Think of it as creating a trust ladder with clearly defined rungs:

Level 1 - Observe and Report: Your agent watches workflows and identifies patterns but takes no action. A sales agent might analyze email threads and flag follow-up opportunities without sending anything.

Level 2 - Draft and Suggest: The agent creates content but requires human approval. That sales agent now drafts responses but waits for your review.

Level 3 - Act with Boundaries: The agent operates independently within strict constraints. It might send follow-ups to existing conversations but can't initiate new ones.

Level 4 - Autonomous with Oversight: Full agency with exception handling. The agent manages entire workflows but escalates unusual situations.

Each level needs three components:

Capability scope: What exactly can the agent do?
Success metrics: How do we measure competence?
Graduation criteria: What proves readiness for more autonomy?
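One way to keep these three components honest is to write them down as data instead of leaving them in a slide deck. The sketch below is my own illustration. The action names, metric names, and thresholds are placeholders, not recommendations, and would need to be defined per use case.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AutonomyLevel:
    name: str
    allowed_actions: frozenset[str]  # capability scope
    metric: str                      # success metric being tracked
    min_score: float                 # graduation criterion: required score...
    min_samples: int                 # ...over at least this many reviewed runs


# All thresholds below are made-up placeholders.
LADDER = [
    AutonomyLevel("observe_and_report", frozenset({"read"}),
                  "flag_precision", 0.90, 200),
    AutonomyLevel("draft_and_suggest", frozenset({"read", "draft"}),
                  "draft_acceptance", 0.85, 300),
    AutonomyLevel("act_with_boundaries", frozenset({"read", "draft", "reply"}),
                  "error_free_rate", 0.98, 500),
    AutonomyLevel("autonomous_with_oversight",
                  frozenset({"read", "draft", "reply", "initiate"}),
                  "error_free_rate", 0.99, 1000),
]


def is_allowed(level: int, action: str) -> bool:
    """Capability scope check: may an agent at this rung take this action?"""
    return action in LADDER[level].allowed_actions


def next_level(level: int, score: float, samples: int) -> int:
    """Promote by one rung only when the current rung's criteria are met."""
    current = LADDER[level]
    ready = score >= current.min_score and samples >= current.min_samples
    return min(level + 1, len(LADDER) - 1) if ready else level
```

With this in place, `is_allowed(1, "reply")` is false for a Level 2 sales agent (index 1), and `next_level` refuses to promote it until both the score and the sample count clear the bar.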

Continuous Calibration: Learning from Reality

Here's where AI agents diverge sharply from traditional software. You can't test all behaviors in advance because you don't know all the behaviors. Instead, you need continuous calibration - watching real-world performance and adjusting accordingly.

The calibration loop looks like this:

Run → Measure → Analyze → Adjust → Repeat

But not all measurements are created equal. The teams I've seen succeed focus on what I call "trust indicators":

Accuracy metrics: Is the agent correct?
Alignment metrics: Does it match your intentions?
Safety metrics: Does it avoid harmful actions?
Efficiency metrics: Does it improve outcomes?
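To show what one turn of the calibration loop can look like, here is a deliberately simple sketch. It assumes the agent reports a confidence score for each decision and that humans have labeled a sample of past runs as correct or not. The records, the 95% target, and the "raise the threshold by 0.05" adjustment rule are all illustrative assumptions. It covers only the accuracy indicator, and the other three would need their own measurements.

```python
# Each record is (agent_confidence, agent_was_correct) from one reviewed run.
# The data and the adjustment rule are illustrative assumptions.
def measure(records, threshold):
    """Accuracy over the cases the agent would have handled on its own."""
    acted = [correct for conf, correct in records if conf >= threshold]
    if not acted:
        return 1.0, 0
    return sum(acted) / len(acted), len(acted)


def calibrate(records, threshold, target=0.95, step=0.05, max_rounds=10):
    """Run -> Measure -> Analyze -> Adjust, repeated until the target holds."""
    history = []
    for _ in range(max_rounds):
        accuracy, handled = measure(records, threshold)   # run + measure
        history.append((round(threshold, 2), accuracy, handled))
        if accuracy >= target:                            # analyze
            break
        threshold = min(1.0, threshold + step)            # adjust: be stricter
    return round(threshold, 2), history


records = [
    (0.99, True), (0.97, True), (0.93, True), (0.91, True),
    (0.88, False), (0.86, True), (0.83, False), (0.80, True),
]
threshold, history = calibrate(records, threshold=0.80)
```

The `history` list makes the tradeoff visible: each tightening of the threshold raises accuracy on the cases the agent handles alone, but shrinks the number of cases it is allowed to handle.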

A credit card company using this approach discovered their fraud detection agent prototype was 99.2% accurate but had developed an unexpected bias against purchases from craft stores. The calibration process caught this before it affected customers, allowing them to adjust the training data and constraints.

The Measurement Challenge: When Good Enough Isn't

One of the hardest parts of Guided Autonomy (and AI development in general) is defining "good enough." When does an agent graduate from Level 2 to Level 3? When is 90% accuracy sufficient, and when do you need 99.9%?

The answer depends on what I call the "trust equation":

Trust = (Competence x Consistency x Recoverability) / Consequence

Competence: How well does it perform the task?
Consistency: How predictable is performance?
Recoverability: How easily can errors be fixed?
Consequence: What's the cost of failure?

An agent summarizing meeting notes might graduate at 85% accuracy because errors are easily caught and consequences are minor. An agent executing financial trades might need 99.95% accuracy with multiple safeguards because consequences are severe and potentially irreversible.
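A quick worked example of the trust equation, using the two agents above. The scales are my assumption for the sake of illustration: the three numerator factors are scored between 0 and 1, consequence is scored from 1 (trivial) upward, and the recoverability and consequence values are invented.

```python
def trust_score(competence, consistency, recoverability, consequence):
    """Trust = (Competence x Consistency x Recoverability) / Consequence."""
    if consequence <= 0:
        raise ValueError("consequence must be positive")
    return (competence * consistency * recoverability) / consequence


# Assumed scales: numerator factors in [0, 1], consequence >= 1.
# Meeting notes: modest accuracy, errors easy to fix, low stakes.
meeting_notes = trust_score(0.85, 0.90, 0.95, consequence=1)

# Trading: very high accuracy, but errors are hard to undo and costly.
trading = trust_score(0.9995, 0.99, 0.20, consequence=10)
```

Under these assumptions the meeting-notes agent scores roughly 0.73 and the trading agent roughly 0.02, even though the trading agent is far more accurate. Low recoverability and high consequence dominate, which is why the bar for graduation has to be so much higher there.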

This is why the evaluation infrastructure becomes critical. You need:

Observability: See what your agents are actually doing
Traceability: Understand why they made specific decisions
Controllability: Intervene when necessary
Reversibility: Undo actions when things go wrong
Damage Control: If reversibility is not possible, potential damage needs to be minimized by design

Progressive Autonomy in Practice: Three Implementation Patterns

Having worked with teams implementing Guided Autonomy, I've identified three patterns that consistently succeed:

Pattern 1: The Shadow Mode Strategy

Before giving an agent any real autonomy, run it in shadow mode alongside human workers. The agent performs all actions but doesn't execute them - it just logs what it would have done.

A logistics company used this approach for route optimization. For two months, their AI agent "shadowed" human dispatchers, generating routes that were compared against human decisions. Only when the agent consistently outperformed humans on efficiency while maintaining delivery success rates did it earn the right to make real routing decisions.

Shadow mode provides two critical benefits:

Risk-free learning from real-world complexity
Direct comparison against human baseline performance
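Mechanically, shadow mode is just a log that stores the agent's proposal next to the human's actual decision. Here is a minimal sketch of that idea. The `ShadowLog` class and the toy `propose_route` function are hypothetical, and a real comparison would score route quality instead of checking for exact agreement.

```python
from dataclasses import dataclass, field


@dataclass
class ShadowLog:
    """Records what the agent would have done next to what the human did."""
    entries: list[dict] = field(default_factory=list)

    def record(self, case_id: str, proposed: str, human: str) -> None:
        # The agent's proposal is only logged; the human action is what runs.
        self.entries.append(
            {"case": case_id, "proposed": proposed, "human": human}
        )

    def agreement_rate(self) -> float:
        if not self.entries:
            return 0.0
        same = sum(e["proposed"] == e["human"] for e in self.entries)
        return same / len(self.entries)

    def disagreements(self) -> list[dict]:
        return [e for e in self.entries if e["proposed"] != e["human"]]


def propose_route(stops: list[str]) -> str:
    # Hypothetical stand-in for the agent: visit stops alphabetically.
    return " > ".join(sorted(stops))


log = ShadowLog()
log.record("d-1", propose_route(["B", "A", "C"]), human="A > B > C")
log.record("d-2", propose_route(["D", "E"]), human="E > D")
```

The disagreements are the interesting part. Each one is a case to review to find out whether the agent or the human made the better call.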

Pattern 2: The Gradual Handoff

Instead of replacing entire workflows, gradually hand off specific subtasks as the agent proves competence.

A marketing team I advised took this approach with content creation:

Week 1-4: Agent suggests blog topics
Week 5-8: Agent creates outlines for approved topics
Week 9-12: Agent writes introductions and conclusions
Week 13-16: Agent drafts complete posts for review
Week 17+: Agent publishes directly for certain content types

Each handoff was contingent on success metrics from the previous phase. The agent earned autonomy through demonstrated competence, not arbitrary timelines.

Pattern 3: The Circuit Breaker Architecture

Build automatic constraints that kick in when agents exhibit unexpected behavior.

A financial services team set up “circuit breakers” for their crawling agent:

If error rate exceeds 3%: switch to manual review before accepting fetched documents
If robots.txt/sitemap rules change or access is denied: pause crawl and alert compliance
If download volume or new-source discovery spikes beyond baseline: require human approval to proceed
If entity/figure confidence falls below threshold (e.g., EPS, revenue, guidance): switch to human-in-the-loop extraction

These circuit breakers act like safety nets - enabling autonomy while preventing costly errors, rate bans, or compliance breaches. They’re not necessarily permanent: as the agent proves reliable, you can widen thresholds or retire specific checks.
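The first of those rules, the error-rate breaker, can be sketched as follows. This is an illustrative implementation under my own assumptions: a rolling window of the last 100 outcomes, a minimum of 20 samples before the breaker is allowed to trip, and a manual reset. The 3% limit is the only number taken from the example above.

```python
from collections import deque


class CircuitBreaker:
    """Trips to manual review when the recent error rate exceeds a limit."""

    def __init__(self, max_error_rate=0.03, window=100, min_samples=20):
        self.max_error_rate = max_error_rate
        self.min_samples = min_samples
        self.outcomes = deque(maxlen=window)  # True = error, False = success
        self.tripped = False

    def record(self, error: bool) -> None:
        self.outcomes.append(error)
        enough = len(self.outcomes) >= self.min_samples
        rate = sum(self.outcomes) / len(self.outcomes)
        if enough and rate > self.max_error_rate:
            self.tripped = True  # stays tripped until a human resets it

    def mode(self) -> str:
        return "manual_review" if self.tripped else "autonomous"

    def reset(self) -> None:
        self.outcomes.clear()
        self.tripped = False


breaker = CircuitBreaker()
for i in range(50):
    breaker.record(error=(i == 40))  # 1 error in 50 runs = 2%: still fine
mode_before = breaker.mode()
breaker.record(error=True)           # 2 errors in 51 runs = ~3.9%: trips
mode_after = breaker.mode()
```

Widening a threshold later is then a one-line configuration change (`max_error_rate`), which makes the "earn more room over time" idea cheap to put into practice.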

The Trust Events That Matter

Through observation and experimentation, I've identified five critical "trust events" that determine whether an agent successfully earns autonomy:

1. The First Failure: How the agent handles its first significant error reveals its true reliability. Does it recognize the failure? Can it recover gracefully? Does it learn from the mistake?

2. The Edge Case Test: When confronted with scenarios outside its training distribution, does the agent default to safe behavior or hallucinate solutions?

3. The Delegation Decision: When the agent first decides to delegate to a human rather than guess, it demonstrates judgment beyond mere pattern matching.

4. The Consistency Check: After 100, 1,000, or 10,000 operations, does performance remain stable or degrade? Consistency over time matters more than initial accuracy.

5. The Audit Moment: When you review the agent's historical decisions, do they make sense in retrospect? Can you understand its reasoning?

Teams that explicitly monitor and evaluate these trust events make better autonomy decisions than those relying on aggregate metrics alone.

When Not to Use Guided Autonomy

Despite its benefits, Guided Autonomy isn't always the right approach. Skip it when:

Stakes are permanently low: If errors don't matter, complex trust-building might be overkill
Full automation is impossible: Some tasks require human judgment that can't be earned
Speed trumps safety: In true emergency scenarios, you might need to grant emergency autonomy
Learning is the goal: Research environments where understanding limits matters more than production reliability

But these exceptions are rarer than most teams assume. Even "simple" tasks benefit from progressive trust-building.

The Path Forward: Building Your Trust Ladder

If you're building AI agents - and increasingly, who isn't? - here's how to implement Guided Autonomy in your projects:

Step 1: Map your agency levels. Define 3-5 clear autonomy levels for your use case. Each should be measurably different from the others.

Step 2: Instrument everything. You can't manage what you can't measure. Build observability from day one, not as an afterthought.

Step 3: Start lower than comfortable. Your first deployment should feel almost uselessly conservative. That's the point - you're building a foundation, not showing off.

Step 4: Define graduation criteria. Before deploying at any level, document exactly what success looks like and what earns advancement.

Step 5: Calibrate continuously. Schedule regular reviews. Weekly at first, then monthly as patterns stabilize.

Step 6: Communicate transparently. Users should understand what autonomy level they're interacting with and why.

The Compound Effect of Earned Trust

Something remarkable happens when you follow these principles: trust compounds. Each successful interaction makes the next one more likely to succeed. Users who see agents earn their autonomy become partners in the process rather than skeptics of the outcome.

A customer support team I worked with saw this firsthand. When they transparently showed customers that their AI agent was "Learning with your help" (Level 2), complaint rates dropped 60%. Customers became collaborators, providing feedback that accelerated the agent's advancement to Level 3.

This compound effect extends beyond individual agents. Organizations that successfully implement Guided Autonomy for one use case find subsequent implementations easier. The framework becomes organizational muscle memory.

The Future of Human-Agent Collaboration

As I write this, we're at an inflection point. The next generation of AI agents will have capabilities we can only begin to imagine - multimodal understanding, complex reasoning, long-term memory, and sophisticated tool use. But capability without trust is just potential energy.

This framework isn't about limiting AI agents. On the contrary, it's about unlocking their full potential through systematic trust-building. It's the difference between a powerful tool you're afraid to use and a reliable partner you can't imagine working without.

The teams winning with AI agents aren't the ones building the most sophisticated systems. They're the ones building trust systematically, earning autonomy incrementally, and creating agents that users actually want to work with.

Your AI agents are capable of remarkable things. But first, they need to earn the right to show you. Start with the smallest useful autonomy. Measure everything. Advance based on evidence, not hope. Build trust through demonstration, not declaration.

The ladder is there. The only question is whether you'll help your agents climb it.

What autonomy level are your AI agents operating at? Have you seen examples of earned trust in action? I'd love to hear about your experiences implementing progressive autonomy.

If you found this framework useful, consider sharing it with your team. The more organizations that adopt Guided Autonomy principles, the faster we'll all move toward truly reliable AI agents.

Coming Next: The Agency Stack - How to choose the right tools for building progressive autonomy into your AI agents, from observability platforms to evaluation frameworks.

A Sneak Peek at Microsoft's Prototyping Playbook

Aparna Chennapragada — Sat, 06 Sep 2025 14:48:19 GMT

I recently read Aparna Chennapragada’s short piece on "Prompt sets are the new PRDs", which made me listen to her appearance on Lenny’s podcast. Her ideas resonated with me and my gut feeling told me that other people would feel the same. So I reached out to her and offered her a guest post. While I mostly post about research, the changes in AI product management are something I’ve been very excited about in my daily work.

If you don’t know Aparna yet, she’s Microsoft's Chief Product Officer and responsible for AI strategy across their productivity tools. And with MS Copilot arguably being the AI tool with the highest C-level buy-in and enterprise adoption, that alone makes her a relevant voice in this space. Enjoy!

Human intent as the spec

“How would you ask your coworker for this?”

This is the..uh..prompt..I posed to folks in a meeting the other day.

I asked this not to arbitrarily anthropomorphize agents but because my intuition is this question forces the right granularity for the agents.

In some sense I see this as a variant of Ilya Sutskever’s observation: “Predicting the next token well means that you understand the underlying reality that led to the creation of that token.”

Next-token prediction worked because the task itself demanded depth: to predict words well, a model had to capture structure, context, and causality.

My hunch is that prompts that reflect human intent set a similar expectation for agents. They push the system toward usefulness rather than rote mechanics.

Good: “Tell me the most important things from the customer meeting. What were the key risks, and how did sentiment trend?”
Bad (too mechanical): “Summarize the transcript in 5 bullets.”
Bad (too tool-specific): “Run sentiment analysis on transcript.json and output JSON.”
Bad (too coarse): “Handle all my email.”

The difference lies in whether the prompt encodes judgment and priorities, the elements a human colleague would naturally understand, and, more importantly, whether it sits at the level at which you yourself would operate.

Prompt sets as teaching tools

Traditional PRDs were written for programmers. They locked requirements down in advance, then handed them off to be built. Prompt sets work differently. They are living artifacts: part specification, part training data.

Each prompt is an example that shapes how the agent behaves. Together, I see them almost as a curriculum that teaches the system what “good” looks like, how to correct mistakes, and where the boundaries are.

A multi-round game

Writing prompt sets is never a one-shot exercise. You start with a few, test them, see where the agent falls short, and refine. Each round closes the gap between what you asked for and what the system delivers.

It feels less like drafting a rigid contract and more like coaching. The spec can continuously evolve, and most importantly, I am finding that you need to put human intuition at the center: guiding, adjusting, raising the bar.

That’s why I keep coming back to this idea: Prompt sets are the new PRDs. They encode intent, they teach, and they set the rhythm for iteration.

❤️ If you enjoyed this little thought piece, give it a like and share it with your peers. I’ll follow up with my own ideas in a separate article soon™.

And don’t forget to subscribe to Aparna’s new Substack.