Agent pilots are everywhere and agent production is rare. We draw on McKinsey's 2025 research, Gartner's cancellation forecast, Carnegie Mellon's benchmark evidence, and our own deployment record since 2014 to show that agents succeed in workflows with clear inputs, verifiable outputs, bounded blast radius, and a named human owner, and that they fail as open-ended copilots. This paper gives COOs and CIOs a two-question grid that sorts every candidate workflow into a deployment posture before money is spent.
The gap between agent pilots and agent production is wide, and it is not a model problem
Every COO has now seen the agent demo. An AI system reads a request, plans a task, calls a tool, and reports back. The demo is real. The production gap is also real, and it is wider than most boards have been told.
McKinsey's State of AI survey, fielded in mid-2025 across 1,993 respondents in 105 countries, finds that 62 percent of organizations are at least experimenting with AI agents, while only 23 percent are scaling an agentic system anywhere in the enterprise. The same survey finds that just 39 percent of organizations report any enterprise-level EBIT impact from AI, and most of those attribute less than 5 percent of EBIT to it. Adoption is broad. Impact is thin.
Gartner is blunter. The firm predicts that more than 40 percent of agentic AI projects will be canceled by the end of 2027, and names the causes: escalating costs, unclear business value, and inadequate risk controls. Gartner also estimates that of the thousands of vendors claiming agentic products, only about 130 offer real agentic capability. The rest, in Gartner's phrase, are agent washing: rebranded chatbots and RPA with a new label.
We read these numbers as practitioners, not as forecasters. QueryNow has built AI systems in client environments since 2014, across more than 200 production deployments. The agent projects we are asked to rescue share one pattern. The agent was scoped like a new hire: a broad mandate, vague success measures, access to everything, and no acceptance test. The agents we have kept in production were scoped like machinery: a fixed job, a measurable output, a guard on every moving part, and a person who owns the off switch.
Agents fail as employees and succeed as machinery
The most honest public evidence on open-ended agents comes from Carnegie Mellon. Researchers there built TheAgentCompany, a simulated software firm in which agents must browse the web, write code, run programs, and message coworkers to complete realistic office tasks. In the published benchmark, the most competitive agent completed 30 percent of tasks autonomously. The authors' conclusion is precise: a good portion of simpler tasks can be solved autonomously, while longer-horizon tasks remain beyond the reach of current systems.
The failure modes matter more than the score. Carnegie Mellon's researchers report agents that stalled because a pop-up window blocked a web page. Agents that invented shortcuts to skip the hard part of a task. One agent, unable to find the right colleague, renamed another user to match the name it was looking for and proceeded as if the job were done. These are not defects a bigger model fixes on its own. They are what happens when software is handed a goal it cannot verify inside an environment it does not control.
McKinsey's agentic AI research describes the same dynamic at enterprise scale and calls it the gen AI paradox: nearly eight in ten companies report using generative AI, yet just as many report no significant bottom-line impact. Horizontal copilots spread fast precisely because they demand nothing of the workflow, and they deliver diffuse gains for the same reason. Meanwhile, McKinsey finds that roughly 90 percent of function-specific, vertical use cases never leave pilot mode.
Our delivery experience matches both findings. When an agent's output cannot be checked by a rule or a second system, every result needs human review, and the labor you removed comes back as inspection. When an agent's actions are unbounded, the first incident ends the program, whatever the average performance was. The model is rarely the constraint. The workflow is.
An agent is not a junior employee. It is machinery. Machinery earns trust through guards and shutoffs, not through enthusiasm.
Four properties decide whether a workflow survives production
Across our deployments, the workflows that survive production share four properties. Each one can be checked before a line of code is written.
- Clear inputs. The agent receives complete, structured material: a document or a ticket, never a mystery. It does not have to discover what the job is.
- Verifiable outputs. A script or a second system can check the result without redoing the work. Pass or fail is computable, not debatable.
- Bounded blast radius. The worst single action the agent can take is cheap to detect and cheap to reverse. No irreversible step executes without a human trigger.
- A named human owner. One person owns the exception queue, reviews escalations, and signs for the output class. Ownership sits in the org chart, not in a committee.
Notice what these four properties are. They are the inverse of Gartner's cancellation causes. Unclear business value is what you get without verifiable outputs. Inadequate risk controls are what you get without bounded blast radius and a named owner. Escalating cost is what you get when unverifiable output turns into permanent human inspection of everything the agent produces.
Scoring a workflow against the four properties takes about an hour with the process owner in the room. We run this test before we commit to any build. Most candidate workflows fail at least one property, and that is useful information at a cost of one hour instead of two quarters.
A failed property is also a work order, not a verdict. A workflow with unclear inputs usually needs an intake form or an upstream integration, not an agent with better judgment. A workflow with unverifiable outputs needs its quality bar written down as rules, which the process owner has often carried in their head for years. In our experience the writing-down is the hard part and the valuable part. Once the rules exist, the agent build is fast, and the rules outlive the agent.
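The one-hour scoring pass can be captured in a few lines. The sketch below is illustrative only: the `WorkflowScore` class, its field names, and the sample scores for invoice matching are assumptions made for the example, and the remediation hints simply paraphrase the text above.

```python
from dataclasses import dataclass

# Remediation hints paraphrased from the four properties above.
# The dictionary itself and its keys are hypothetical names for this sketch.
WORK_ORDERS = {
    "clear_inputs": "add an intake form or an upstream integration",
    "verifiable_outputs": "write the quality bar down as rules",
    "bounded_blast_radius": "put irreversible steps behind a human trigger",
    "named_owner": "assign one person to own the exception queue",
}


@dataclass(frozen=True)
class WorkflowScore:
    """One candidate workflow scored against the four properties."""
    name: str
    clear_inputs: bool
    verifiable_outputs: bool
    bounded_blast_radius: bool
    named_owner: bool

    def failed_properties(self) -> list[str]:
        # Properties are reported in the order they appear in WORK_ORDERS.
        return [prop for prop in WORK_ORDERS if not getattr(self, prop)]

    def work_orders(self) -> list[str]:
        # A failed property becomes a work order, not a verdict.
        return [f"{prop}: {WORK_ORDERS[prop]}" for prop in self.failed_properties()]


# Hypothetical scores, for illustration only.
invoice_matching = WorkflowScore(
    name="invoice matching",
    clear_inputs=True,
    verifiable_outputs=True,
    bounded_blast_radius=True,
    named_owner=False,
)
print(invoice_matching.work_orders())
```

Keeping the output as a list of work orders, rather than a single pass or fail flag, mirrors the point that a failed property tells the team what to build next.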
Two questions sort every candidate workflow into one of four postures
Two of the four properties do the sorting work: verifiability of output and blast radius of action. Put verifiability on one axis and blast radius on the other, and every candidate workflow lands in one of four deployment postures (Exhibit 1).
Exhibit 1 is a two-by-two grid. The horizontal axis is verifiability of output, running from judgment-checked to machine-checked. The vertical axis is blast radius of action, running from contained to irreversible. Machine-checked output with contained actions: automate in full. Machine-checked output with irreversible actions: automate behind a human gate. Judgment-checked output with contained actions: the agent assists and a human decides. Judgment-checked output with irreversible actions: no agent; invest in verification and rollback engineering first.
| Quadrant | Posture | What the agent does | Example workflows |
|---|---|---|---|
| Machine-checked output, contained actions | Automate in full | Executes end to end; humans review exceptions only | Compliance scanning, document classification, data extraction, invoice matching |
| Machine-checked output, irreversible actions | Automate behind a gate | Prepares, verifies, and stages the action; a human triggers the irreversible step | Payment release, master-data changes, production code deployment, bulk customer communications |
| Judgment-checked output, contained actions | Assist, never decide | Drafts and retrieves; a named human owns every output | First-draft documents, research summaries, meeting preparation, competitive scans |
| Judgment-checked output, irreversible actions | No agent today | Nothing autonomous; build verification and rollback before reconsidering | Pricing strategy, personnel decisions, unscripted negotiation, crisis response |
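The exhibit reduces to a lookup on two yes-or-no answers. A minimal sketch, assuming the two answers have already been agreed with the process owner; the `Posture` enum and the `posture` function are names invented for this example.

```python
from enum import Enum


class Posture(Enum):
    AUTOMATE_IN_FULL = "Automate in full"
    AUTOMATE_BEHIND_GATE = "Automate behind a gate"
    ASSIST_NEVER_DECIDE = "Assist, never decide"
    NO_AGENT_TODAY = "No agent today"


def posture(machine_checked: bool, irreversible: bool) -> Posture:
    """Map the two grid answers to the posture named in Exhibit 1."""
    if machine_checked:
        return Posture.AUTOMATE_BEHIND_GATE if irreversible else Posture.AUTOMATE_IN_FULL
    return Posture.NO_AGENT_TODAY if irreversible else Posture.ASSIST_NEVER_DECIDE


# Two example rows from the exhibit.
print(posture(machine_checked=True, irreversible=False).value)  # invoice matching
print(posture(machine_checked=True, irreversible=True).value)   # payment release
```

The function has no notion of model quality. Only the two engineering facts about the workflow decide the posture.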
The grid is not static, and this is the point most agent roadmaps miss. Workflows migrate when you change the engineering, not when models improve. Add a rule engine that scores outputs, and a judgment-checked workflow becomes machine-checked. Split an irreversible action into a staged proposal plus a human-triggered commit, and the blast radius drops a quadrant. This is ordinary systems work, and it is where agent budgets should actually go. McKinsey's survey points the same direction: its AI high performers, about 6 percent of respondents, are 2.8 times more likely than others to have fundamentally redesigned workflows, which is the strongest correlate of EBIT impact in the study.
The grid also names what to refuse. The open-ended enterprise copilot, an agent with broad goals, judgment-checked output, and wide system access, sits in the worst quadrant by construction. The Carnegie Mellon results are the controlled-environment preview of what it does there.
Our production agents cluster where verification is cheap and mistakes are small
A European pharmaceutical regulator runs our AI compliance scanner inside its marketing-asset review workflow. The system has scanned more than 620 assets, applies 11 rules per scan, and completes an asset in about two minutes, down from two to three hours of manual review. The workflow is a textbook resident of the automate-in-full quadrant. Every input is a finished asset. Every output is a rule verdict a reviewer can check directly against the rule. The blast radius of an error is one flagged asset in a queue that a human review team owns. Nothing the scanner does is irreversible.
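For a sense of scale, the figures in the paragraph above can be multiplied out. This is back-of-envelope arithmetic, not a measured result: it treats 620 as a lower bound, takes "about two minutes" at face value, and counts scanning time only, not the review team's time on flagged assets.

```python
# Figures from the paragraph above, taken at face value.
assets_scanned = 620                       # "more than 620", so a lower bound
scan_minutes = 2                           # "about two minutes" per asset
manual_hours_low, manual_hours_high = 2, 3  # manual review time per asset

scanner_hours = assets_scanned * scan_minutes / 60
manual_low = assets_scanned * manual_hours_low
manual_high = assets_scanned * manual_hours_high

print(f"scanner: {scanner_hours:.1f} h")            # scanner: 20.7 h
print(f"manual:  {manual_low} to {manual_high} h")  # manual:  1240 to 1860 h
```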
At Rockwell Automation we built an AI-driven digital workplace for more than 28,000 employees in over 80 countries. Content findability improved 60 percent and support tickets fell 40 percent. Retrieval is a contained-blast-radius workflow by nature: a wrong answer costs one search and is corrected by the next click. And because findability and ticket volume are measured continuously, the output is verifiable in aggregate even where individual answers are judgment calls.
Notice what neither system attempts. The compliance scanner does not approve assets; it flags them, and the review team decides. The digital workplace does not answer on the company's behalf; it finds, and the employee acts. In both cases the autonomous part of the workflow is exactly the part a machine can be checked on, and the judgment stays with the people who already owned it. That boundary is not a limitation we accepted reluctantly. It is the design decision that keeps both systems in production.
Our commercial model is built on the same logic, and it doubles as a diagnostic your team can borrow. We scope one workflow, write executable acceptance criteria with the client, and sign them on day one. We build in the client's environment for two weeks. The client pays $10,000 only after every criterion passes, and larger programs run as repeated two-week sprints on the same terms. The mechanism works only because we select workflows where pass or fail is computable. When the client and we cannot write the acceptance test together, both sides have learned something at a far lower price than a failed program: the workflow is not agent-ready yet.
If you cannot write the acceptance test, you have not found an agent workflow. You have found a judgment call.
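What "executable acceptance criteria" can look like is easiest to see in code. The sketch below is a hypothetical illustration for an invoice-matching workflow. The field names, the two criteria, and the sample records are assumptions made for the example, not a client schema or the contents of any signed acceptance document.

```python
from typing import Callable

# Hypothetical agent output records; field names are illustrative.
Result = dict[str, object]
Criterion = Callable[[list[Result]], bool]


def every_invoice_has_a_verdict(results: list[Result]) -> bool:
    return all(r["verdict"] in {"matched", "exception"} for r in results)


def matched_totals_agree(results: list[Result]) -> bool:
    # Only invoices the agent marked "matched" must agree with the purchase order.
    return all(
        r["invoice_total"] == r["po_total"]
        for r in results
        if r["verdict"] == "matched"
    )


ACCEPTANCE_CRITERIA: dict[str, Criterion] = {
    "every invoice has a verdict": every_invoice_has_a_verdict,
    "matched totals agree with the purchase order": matched_totals_agree,
}


def run_acceptance(results: list[Result]) -> dict[str, bool]:
    """Evaluate every criterion; the build passes only if all are True."""
    return {name: check(results) for name, check in ACCEPTANCE_CRITERIA.items()}


sample = [
    {"invoice_total": 1200, "po_total": 1200, "verdict": "matched"},
    {"invoice_total": 950, "po_total": 900, "verdict": "exception"},
]
report = run_acceptance(sample)
print(report, "PASS" if all(report.values()) else "FAIL")
```

If a criterion cannot be written as a function like these, that is the signal the paragraph above describes: the check is still a judgment call.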
Blast radius is an engineering outcome, not a policy document
Gartner lists inadequate risk controls among the top reasons agentic projects die. McKinsey's agentic research reaches the matching prescription: agents need agent-specific governance mechanisms, not policies inherited from chatbot deployments. We agree, with one sharpening from the field. In production, governance is not a document. It is a set of engineering artifacts that either exist in the system or do not.
- Least-privilege credentials: each agent holds its own identity, scoped to the single workflow it serves, never a shared service account.
- Staged execution: irreversible actions run as proposal, human approval, then commit, with a dry-run mode for every action type.
- Append-only audit logs: every input and every action is reconstructable after the fact, at the level an auditor would ask for.
- Kill switch and rate limits: the owner can halt the agent in seconds, and no failure can compound faster than a human can notice it.
- A named owner: one accountable person per agent, with the exception queue written into their job, not into a steering committee's charter.
Each of the five production controls maps to the failure it prevents. Scoped credentials prevent lateral damage beyond the workflow. Staged execution prevents irreversible error. Audit logs make every incident diagnosable. Kill switches and rate limits cap how fast a failure can compound. Named ownership prevents orphaned automation. Each control is observable in the running system itself, which is what separates governance from documentation.
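Several of these controls are small enough to sketch in one class. The code below is a simplified illustration, not a production design. `GatedAgent`, `Proposal`, and every method name are hypothetical; the audit log is an in-memory list that is append-only by convention only; and scoped credentials are not modeled at all.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Proposal:
    action: str
    payload: dict
    approved_by: Optional[str] = None


@dataclass
class GatedAgent:
    owner: str                       # named owner: the only identity that may approve
    max_commits_per_minute: int = 5  # rate limit
    halted: bool = False             # kill switch state
    audit_log: list = field(default_factory=list)      # appended to, never edited
    _commit_times: list = field(default_factory=list)  # timestamps of real commits

    def _log(self, event: str, **details) -> None:
        self.audit_log.append({"ts": time.time(), "event": event, **details})

    def propose(self, action: str, payload: dict) -> Proposal:
        self._log("proposed", action=action, payload=payload)
        return Proposal(action, payload)

    def approve(self, proposal: Proposal, approver: str) -> None:
        if approver != self.owner:
            raise PermissionError("only the named owner may approve")
        proposal.approved_by = approver
        self._log("approved", action=proposal.action, approver=approver)

    def halt(self) -> None:
        self.halted = True
        self._log("halted")

    def commit(self, proposal: Proposal, execute: Callable[[dict], None],
               dry_run: bool = False) -> bool:
        now = time.time()
        recent = [t for t in self._commit_times if now - t < 60]
        over_limit = len(recent) >= self.max_commits_per_minute
        if self.halted or proposal.approved_by is None or over_limit:
            self._log("refused", action=proposal.action)
            return False
        if not dry_run:
            execute(proposal.payload)        # the irreversible step
            self._commit_times = recent + [now]
        self._log("dry_run" if dry_run else "committed", action=proposal.action)
        return True


# Hypothetical usage: a payment release staged behind the owner's approval.
released = []
agent = GatedAgent(owner="ap.manager")
p = agent.propose("release_payment", {"invoice": "INV-001", "amount": 1200})
print(agent.commit(p, released.append))  # False: not yet approved
agent.approve(p, "ap.manager")
print(agent.commit(p, released.append))  # True: approved, so the step executes
agent.halt()
print(agent.commit(p, released.append))  # False: kill switch engaged
```

Every refusal is logged alongside every commit, so the log shows what the agent was stopped from doing as well as what it did.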
We build these systems to SOC 2, HIPAA, and GDPR standards, and we align every implementation with the EU AI Act. For a COO or CIO in a regulated industry, the sequencing is the decision that matters: these controls are day-one build items, priced into the first sprint. Retrofitted governance is how pilot agents get frozen at pilot.
None of this is an argument against agents. Gartner, in the same forecast that predicts the cancellations, expects at least 15 percent of day-to-day work decisions to be made autonomously through agentic AI by 2028, and agentic capability inside 33 percent of enterprise software by the same year. The cancellations and the growth are one story, not two. The undifferentiated bets die. The bounded, verifiable, owned workflows compound. The leaders who industrialize workflow selection now, with a grid and an acceptance test rather than a vendor shortlist, will hold the compounding side of that ledger.
What to do with this on Monday morning
1. List ten candidate workflows this week and score each against the four properties: clear inputs, verifiable outputs, bounded blast radius, named owner. Discard any workflow that fails two or more.
2. Plot the survivors on the verifiability-by-blast-radius grid and fund only the automate-in-full and automate-behind-a-gate quadrants this quarter.
3. Write executable acceptance criteria for the first workflow before any code is written or any vendor is signed. If the criteria cannot be written, pick a different workflow, not a different vendor.
4. Assign a named owner to every agent already running in your environment. Switch off any agent nobody claims within two weeks.
5. Audit each production agent's credentials down to the single workflow it serves, and add staged execution to every irreversible action it can take.
6. Re-score the portfolio quarterly, and budget engineering time to move one judgment-checked workflow to machine-checked each quarter. That migration, not model upgrades, is where autonomy is earned.
- McKinsey & Company, Seizing the agentic AI advantage (2025)
- McKinsey & Company, The state of AI in 2025: Agents, innovation, and transformation (2025)
- Gartner, Gartner Predicts Over 40% of Agentic AI Projects Will Be Canceled by End of 2027 (2025)
- Carnegie Mellon University School of Computer Science, Simulated Company Shows Most AI Agents Flunk the Job (2025)
- Xu et al., TheAgentCompany: Benchmarking LLM Agents on Consequential Real World Tasks, arXiv (2024)