Enterprise AI adoption is near universal, yet measured profit impact remains rare, and project abandonment more than doubled in a single year. We argue the cause is structural rather than technical: most pilots are scoped so that nothing about them is falsifiable, which means they can neither pass nor fail, only continue. Drawing on more than 200 production deployments since 2014, we show how executable acceptance criteria, backed by payment that is owed only when they pass, convert pilot theater into shipped systems.

Adoption is near universal; measured impact is not

Begin with two numbers that should not coexist. McKinsey's State of AI survey finds that 88 percent of organizations now use AI in at least one business function, and 62 percent are at least experimenting with AI agents. The same survey finds that only 39 percent of respondents report any enterprise-level EBIT impact from AI. McKinsey's definition of an AI high performer, an organization attributing 5 percent or more of EBIT to AI alongside significant reported value, fits roughly 6 percent of respondents.

The trend line is worse than the snapshot. S&P Global Market Intelligence's Voice of the Enterprise survey, fielded across more than 1,000 IT and business respondents in North America and Europe, finds that the share of companies abandoning most of their AI initiatives jumped from 17 percent to 42 percent in a single year. The average organization scrapped 46 percent of its AI proofs of concept before they reached production. Respondents named cost as a leading obstacle, followed by data privacy and security risk.

Gartner saw it coming. In July 2024 the firm predicted that at least 30 percent of generative AI projects would be abandoned after proof of concept by the end of 2025, and named the causes: poor data quality, inadequate risk controls, escalating costs, or unclear business value. The prediction proved conservative. Gartner also put the cost of these initiatives at 5 million to 20 million dollars apiece, depending on approach, which is what makes the abandonment rate a CFO problem rather than a lab curiosity.

Then came the result that made headlines. MIT's NANDA initiative, in its 2025 report The GenAI Divide: State of AI in Business, drew on 150 leadership interviews, a survey of 350 employees, and an analysis of 300 public deployments. Its conclusion: about 5 percent of enterprise generative AI pilots achieve rapid revenue acceleration. The other 95 percent, roughly, stall and deliver no measurable P&L impact (Exhibit 1).

Exhibit 1: The closer research gets to measured P&L, the higher the failure rate climbs

Four independent research efforts, arranged from prediction to measurement. Gartner (July 2024) predicted at least 30 percent of generative AI projects abandoned after proof of concept by end of 2025. S&P Global Market Intelligence (2025) measured 42 percent of companies abandoning most of their AI initiatives, up from 17 percent the prior year, with an average of 46 percent of proofs of concept scrapped before production. McKinsey (2025) found 61 percent of organizations report no enterprise-level EBIT impact from AI at all. MIT NANDA (2025) found roughly 95 percent of generative AI pilots deliver no measurable P&L impact. The stricter the financial yardstick, the worse the picture.

Note what did not cause this. Models improved every quarter across the survey period. Budgets grew. Tooling matured. The failure rate rose anyway. When outcomes get worse while inputs get better, the defect is not in the technology. It is in how the work is structured.

A pilot that cannot fail cannot succeed

Consider the anatomy of a standard enterprise AI pilot. A sandbox environment, detached from production data and production permissions. A use case chosen for visibility rather than measurability. A success definition written in adjectives: the pilot should demonstrate value, validate feasibility, generate learnings. A final demo to a steering committee, after which everyone agrees the results are promising and approves another phase.

Nothing in that structure can fail. There is no test the pilot could flunk, no number that would force a kill decision, no artifact a skeptic could run to check the claim. In the language of science, the pilot is unfalsifiable. And an unfalsifiable pilot cannot succeed either, because success and failure are both verdicts, and verdicts require a standard set in advance. What the pilot produces instead is a feeling, and feelings renew budgets without ever shipping software.

A pilot with no failure condition is not an experiment. It is a subscription.

The incentives hold this structure in place. The vendor wants the engagement extended. The sponsor wants optionality preserved and embarrassment avoided. The systems integrator bills by the month either way. Nobody in the room is paid to make the pilot falsifiable, so nobody does. The cost lands later, on the CFO, as a portfolio of initiatives at Gartner's 5 million to 20 million dollars each, a large share of which will be quietly written off.

We have built AI systems since 2014 and have more than 200 of them in production. The pattern predates generative AI; we watched the same loop run on earlier waves of chatbots and machine learning platforms. The technology changes. The unfalsifiable pilot survives every wave, because it is not a technical artifact. It is a contractual one.

The research blames integration; an accountability gap sits underneath

MIT's researchers locate the failure in what they call a learning gap. Generic tools work well for individuals because individuals adapt around them. Enterprises need systems that adapt to the workflow and retain feedback over time, and most pilots never get there. The report also documents a budget misallocation: more than half of generative AI spend targets sales and marketing tools, while the strongest returns sit in back-office automation, where outsourced cost can actually come out.

One MIT finding deserves more attention than it has received. Purchased solutions and external partnerships reached successful deployment about 67 percent of the time in the report's data. Internal builds succeeded at roughly a third of that rate. The conventional reading is that vendors have better technology. We read it differently, from the builder's side of the table: an outside firm can be put on falsifiable terms, with signed criteria and a payment trigger, and an internal team almost never is. Accountability, not capability, is the variable that moves.

In our experience, integration fails at the moment of definition, months before the first API call. 'The assistant integrates with our service desk' is a hope. 'The assistant resolves a defined ticket category end to end in the production service desk, verified by script against 100 held-out tickets at an agreed accuracy threshold' is a requirement a builder can be held to. Nearly every stalled pilot we have been asked to replace was missing the second sentence.

Acceptance criteria must execute, not impress

A criterion is executable when a script, not a meeting, decides whether it passed. That requires four properties. It names a dataset, held out and client-owned, so the system cannot be tuned to the test. It states a numeric threshold, so the pass line exists before the work starts. It runs in the client environment, on real permissions and real infrastructure, so a pass means production-grade rather than demo-grade. And it produces an artifact, a logged test run, that a skeptic can re-execute without trusting anyone in the room.
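The four properties can be sketched as a small data structure and a check. This is a minimal illustration in Python, not the form any particular contract takes: the names AcceptanceCriterion, RunRecord, and evaluate are hypothetical, the dataset and environment labels are placeholders, and the per-item scorer is assumed to come from whoever owns the held-out data.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Callable, Sequence


@dataclass(frozen=True)
class AcceptanceCriterion:
    """Hypothetical shape of one executable criterion."""
    name: str
    dataset_id: str    # held-out, client-owned dataset
    threshold: float   # pass line, fixed before the work starts
    environment: str   # where the test must run, e.g. the client tenant


@dataclass(frozen=True)
class RunRecord:
    """Logged artifact that a skeptic can compare against a re-run."""
    criterion: str
    dataset_id: str
    environment: str
    correct: int
    total: int
    score: float
    threshold: float
    passed: bool
    run_at: str


def evaluate(
    criterion: AcceptanceCriterion,
    items: Sequence[object],
    is_correct: Callable[[object], bool],
    environment: str,
) -> RunRecord:
    """Score every held-out item and return the verdict as a record."""
    if environment != criterion.environment:
        raise ValueError(
            f"must run in {criterion.environment!r}, not {environment!r}"
        )
    if not items:
        raise ValueError("held-out dataset is empty")
    correct = sum(1 for item in items if is_correct(item))
    score = correct / len(items)
    return RunRecord(
        criterion=criterion.name,
        dataset_id=criterion.dataset_id,
        environment=environment,
        correct=correct,
        total=len(items),
        score=score,
        threshold=criterion.threshold,
        passed=score >= criterion.threshold,
        run_at=datetime.now(timezone.utc).isoformat(),
    )


# Placeholder criterion and stand-in data: 200 items, 184 counted as correct.
criterion = AcceptanceCriterion(
    name="cited-answer rate",
    dataset_id="employee-questions-heldout",
    threshold=0.90,
    environment="client-tenant",
)
record = evaluate(criterion, list(range(200)), lambda i: i < 184, "client-tenant")
print(record)
```

The design choice worth noting is that evaluate refuses to run outside the named environment and returns a record rather than a boolean, so the verdict and the evidence for it travel together.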

Writing criteria this way is harder than it sounds, and the difficulty is the value. Most pilot scopes cannot survive the translation. 'Improve knowledge worker productivity' has no executable form, which is the earliest possible signal that the pilot will produce a feeling rather than a system. The discipline forces a sharper question: which single workflow, measured how, against what baseline? (Exhibit 2)

Exhibit 2: An executable criterion moves the pass or fail decision from a meeting to a script

Anatomy of an executable acceptance criterion, shown against its vague counterpart. The vague form ('the assistant should improve search') routes the verdict to a steering committee after a demo. The executable form names a held-out dataset (for example, 200 sampled employee questions), a numeric threshold (at least 90 percent answered with a correct source citation), an environment (client tenant, production permissions), and a runtime budget, and it produces a logged test run that anyone can re-execute. The decision point shifts from opinion after the demo to evidence before the invoice.

Vague pilot goals fail quietly and late; executable criteria fail loudly and early

Typical pilot goal	Executable acceptance criterion
Improve contract review efficiency	Extract the 14 defined clause types from 50 held-out contracts at 95 percent field accuracy or better, scored by script against counsel-approved answers
Demonstrate AI search on our knowledge base	Answer 200 sampled employee questions with a correct source citation in at least 90 percent of cases, measured in the client tenant on production permissions
Validate AI for marketing compliance review	Apply every codified review rule to 100 historical assets and match the human reviewer verdict on at least 90 percent, in under 3 minutes per asset
Explore agents for the service desk	Resolve one named ticket category end to end against 100 held-out tickets at the agreed accuracy threshold, inside the production service desk
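
To make the first row concrete, a scoring script for field accuracy might look like the sketch below. It rests on assumptions the table does not state: exact match after whitespace and case normalization is our stand-in for whatever matching rule counsel would actually approve, and the function name, field names, and sample values are hypothetical.

```python
from typing import Mapping, Optional, Sequence

# contract id -> field name -> extracted value (None when the clause is absent)
Extraction = Mapping[str, Mapping[str, Optional[str]]]


def _normalize(value: Optional[str]) -> str:
    """Collapse whitespace and ignore case; treat a missing value as empty."""
    return " ".join(value.split()).casefold() if value else ""


def field_accuracy(
    predicted: Extraction, approved: Extraction, fields: Sequence[str]
) -> float:
    """Share of (contract, field) pairs that match the approved answer."""
    total = 0
    correct = 0
    for contract_id, truth in approved.items():
        guess = predicted.get(contract_id, {})
        for name in fields:
            total += 1
            if _normalize(guess.get(name)) == _normalize(truth.get(name)):
                correct += 1
    if total == 0:
        raise ValueError("nothing to score")
    return correct / total


# Tiny illustrative sample: two contracts, two fields (the real test would
# use 50 contracts and 14 clause types).
approved = {
    "c1": {"governing_law": "England and Wales", "term": "24 months"},
    "c2": {"governing_law": "Delaware", "term": None},
}
predicted = {
    "c1": {"governing_law": "england and  wales", "term": "12 months"},
    "c2": {"governing_law": "Delaware"},
}
accuracy = field_accuracy(predicted, approved, ["governing_law", "term"])
passed = accuracy >= 0.95
print(accuracy, passed)
```

Iterating over the approved answers rather than the predictions means a contract the system skipped counts against it, which is the behavior a held-out test should have.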

This is not theoretical. A European pharmaceutical regulator runs an AI compliance scanner we built for its marketing-asset review. The system applies 11 codified rules per scan and processes an asset in about two minutes; the manual review it replaced took two to three hours per asset. It has now scanned more than 620 assets in production. Every one of those scans is itself a falsifiable event: each rule either fired correctly or it did not, and the runtime either held or it did not. A system born from executable criteria keeps generating its own evidence for as long as it runs.
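
The figures above imply a rough order of magnitude, worked through here as a back-of-envelope sketch. It treats "more than 620" as exactly 620 and "about two minutes" as exactly 2, so the outputs are approximations rather than measured results.

```python
ASSETS = 620             # floor of "more than 620" scanned assets
AUTOMATED_MINUTES = 2    # "about two minutes" per asset
MANUAL_HOURS = (2, 3)    # manual review took two to three hours per asset

# Total hours if every asset had been reviewed by hand, low and high bound.
manual_low, manual_high = (ASSETS * hours for hours in MANUAL_HOURS)

# Total hours of automated scanning for the same assets.
automated_hours = ASSETS * AUTOMATED_MINUTES / 60

# Per-asset speedup, low and high bound.
speedup_low, speedup_high = (
    hours * 60 / AUTOMATED_MINUTES for hours in MANUAL_HOURS
)

print(f"manual review: {manual_low} to {manual_high} hours")
print(f"automated scanning: {automated_hours:.1f} hours")
print(f"speedup per asset: {speedup_low:.0f}x to {speedup_high:.0f}x")
```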

Payment tied to passing moves the risk to the builder

Criteria alone are not enough, because criteria get renegotiated the moment they start to bind. The mechanism that makes them bind is commercial. Our standard terms: we scope one workflow with the client, sign executable acceptance criteria on day one, build inside the client's environment for two weeks, and the client pays $10,000 only after every criterion passes. No pass, no invoice. Larger programs run as repeated two-week sprints on the same terms.
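
The payment rule reduces to a few lines. The sketch below is illustrative only: the function name amount_owed and the criterion labels are hypothetical, and the fee is the $10,000 sprint price quoted above.

```python
from decimal import Decimal
from typing import Mapping

SPRINT_FEE = Decimal("10000")  # price of one two-week sprint, per the terms above


def amount_owed(
    results: Mapping[str, bool], fee: Decimal = SPRINT_FEE
) -> Decimal:
    """Return the full fee only if every signed criterion passed."""
    if not results:
        # No signed criteria means nothing can pass, so nothing is owed
        # and the engagement should not have started.
        raise ValueError("no acceptance criteria were signed")
    return fee if all(results.values()) else Decimal("0")


# Hypothetical sprint outcomes.
all_passed = {"cited-answer rate": True, "runtime budget": True}
one_missed = {"cited-answer rate": True, "runtime budget": False}

print(amount_owed(all_passed))   # full fee
print(amount_owed(one_missed))   # no pass, no invoice
```

There is deliberately no partial credit in this sketch: a pro-rated fee would reintroduce the negotiation the terms are meant to remove.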

For the CFO, this converts an open-ended program into a priced, contingent purchase. The maximum downside of testing a workflow is two weeks of internal attention. Spend is recognized only against verified capability. The AI line item stops behaving like research expense and starts behaving like procurement, with a unit price, a deliverable, and a pass condition.

For scoping discipline, the effect is sharper still. When the builder eats the miss, the builder refuses unverifiable scope, politely and early. The negotiation over criteria surfaces exactly the defects Gartner and S&P Global catalogue, poor data quality and unclear business value above all, before money moves rather than after. A builder on these terms will sometimes decline the work because the data cannot support a passable test. That conversation costs the client nothing. The conventional version of it costs millions and arrives 18 months late.

Spend that is recognized only against verified capability is the cheapest AI governance a CFO can buy.

The common objection is that two weeks is too short for enterprise AI. It is too short for a platform, and exactly right for one workflow. The two-week constraint is not an aesthetic preference; it is what keeps criteria honest. Scope that cannot pass a test in two weeks is scope that has not been decomposed, and undecomposed scope is where unfalsifiability hides.

Scale is a sequence of passed tests, not a leap

McKinsey's survey offers the clearest external evidence for this route. AI high performers are 2.8 times more likely than others to have fundamentally redesigned workflows around AI, 55 percent against 20 percent. Workflow redesign is not a platform purchase. It happens one workflow at a time, which is the same unit the two-week sprint operates on. The organizations that capture EBIT impact and the contract structure we are describing converge on the same grain of work.

Repetition compounds. For Rockwell Automation we built an AI-driven digital workplace that serves more than 28,000 employees across more than 80 countries; content findability improved 60 percent and support tickets fell 40 percent. For WS Audiology we delivered an employee experience portal with intelligent search for more than 10,000 employees across more than 30 countries, in eight weeks. Neither began as a moonshot. Both reached enterprise scale through bounded, verified increments.

Scale also has a regulatory dimension, and falsifiability helps there too. We build to SOC 2 and HIPAA standards where they apply. European deployments meet GDPR. Every implementation we ship is aligned with the EU AI Act. Because we build inside the client environment from day one, security and compliance teams review the real system on real infrastructure, not a vendor sandbox that will be re-architected later. An executable acceptance criterion and an audit requirement are the same kind of object: a written standard with a verifiable pass.

This is why the pilot paradox dissolves under these terms. Pilots multiply because they cannot end. Give every initiative a test it can fail, a clock that runs out in two weeks, and a payment that waits for the pass, and the portfolio sorts itself. What fails dies cheaply, on the builder's account. What passes is already in production, because it passed there.

What to do with this on Monday morning

1. Inventory every live AI pilot this week and record one fact for each: the executable test that decides whether it ships. Kill or convert any pilot that has no answer within 30 days.
2. Rewrite your next AI statement of work so every acceptance criterion names a held-out dataset, a numeric threshold, and the environment it runs in. Strike every adjective.
3. Tie at least one vendor payment this quarter to criteria passing in your own tenant, not to effort delivered. Expect the hard conversation about data quality to happen immediately; that is the point.
4. Scope production candidates as single workflows with a named owner and a measured baseline, never as platforms. Two weeks per workflow is the right clock.
5. Require every build to run in your environment from day one, so security and compliance review the real system rather than a sandbox replica.
6. Put a pilot ledger in front of the CFO monthly: spend to date, criterion, pass date or kill date. Treat any unfalsifiable line item as an unpriced liability.
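
The ledger in step 6 can start as something very small. The following is a sketch under stated assumptions: the LedgerLine fields mirror the three columns named above, the function name and every pilot name and figure are hypothetical, and a line is flagged when it lacks either a criterion or a pass or kill date.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass(frozen=True)
class LedgerLine:
    pilot: str
    spend_to_date: float
    criterion: Optional[str]        # executable test that decides whether it ships
    decision_date: Optional[date]   # pass date or kill date


def unpriced_liabilities(ledger: List[LedgerLine]) -> List[LedgerLine]:
    """Lines with no test that could fail them, or no date by which it must run."""
    return [
        line for line in ledger
        if not line.criterion or line.decision_date is None
    ]


# Illustrative entries only; none of these figures come from a real portfolio.
ledger = [
    LedgerLine("contract extraction", 10_000.0,
               "95% field accuracy on 50 held-out contracts", date(2025, 3, 14)),
    LedgerLine("knowledge assistant", 250_000.0, None, None),
    LedgerLine("service desk agent", 80_000.0,
               "resolve one ticket category on 100 held-out tickets", None),
]

flagged = unpriced_liabilities(ledger)
exposure = sum(line.spend_to_date for line in flagged)

for line in flagged:
    print(f"{line.pilot}: {line.spend_to_date:,.0f} spent, no verdict possible")
print(f"total exposure: {exposure:,.0f}")
```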

Sources

The enterprise AI pilot paradox: Why pilots multiply while production stalls