The 88% That Never Ship: Why Enterprise AI Agents Die in Pilot

Fieldway's take on the spring-2026 agent data cluster (Forrester/Anaconda/IDC via Digital Applied + GoGloby): 88% of agent pilots never reach production, and

June 16, 20266 min readBy Matthew Stublefield

Modern buildings with a circular structure connecting them.

Let me start with a definition, because the word "agent" has been stretched to mean almost anything this year. When people in enterprise software say "AI agent" in 2026, they mean a system that doesn't just answer a question but takes actions on its own – pulls data from a few places, makes a decision, kicks off a next step, and does it with some degree of independence. A chatbot answers. An agent acts. That distinction is the whole reason agents are exciting, and it's also the reason they're hard to ship.

Here's the state of play. Agents are everywhere – at least in the demo. Gartner found that 80% of enterprise applications shipped or updated in the first quarter of 2026 embed at least one AI agent, up from 33% in 2024. In two years, agents went from a niche experiment to a default feature. The question stopped being "can we build one." Almost everyone can build one now. The question became whether the thing you built survives contact with real users, real data, and real consequences.

Mostly, it doesn't. And the number that captures how badly is worth sitting with.

The number, and what it actually counts

Eighty-eight percent of enterprise AI-agent pilots never reach production. That figure comes from 2026 Forrester and Anaconda data compiled by Digital Applied, and it's easy to misread, so let me be precise about what it counts. A "pilot" is the trial run – the version a team builds to prove an agent can do a job before rolling it out for real. "Reaching production" means it graduates from that trial into actual daily use. Eighty-eight out of a hundred never make that jump. For every agent quietly doing real work somewhere, roughly eight others died on the way there.

They die in recognizable ways. A demo wows a steering committee and then never gets funded past the demo. A proof-of-concept works beautifully until it meets the messy edge cases of real data and quietly gets shelved. A rollout date slips, then slips again, until everyone stops asking about it. The agent didn't explode. It just never crossed the line, and after a while no one wanted to be the one to say so.

This isn't one outlier study. The same picture shows up from every direction. McKinsey's 2026 read found fewer than 10% of enterprises have scaled agents to deliver tangible value. An analysis of 340 enterprise deployments by the firm agenticonsult put the share with agents actually running in production with measurable outcomes at around 11%. Gartner expects more than 40% of agentic AI projects to be cancelled outright by 2027. Near-universal adoption, rare production, a wide graveyard in between. When this many independent datasets agree, the pattern is real.

The cause is not the thing you'll blame

When a pilot stalls, the reflex is to blame the model. It hallucinated. It wasn't smart enough. We should wait for the next version, which will surely fix it. That reflex is comforting because it points the finger at the vendor, and it's almost always wrong.

Look at what actually kills these projects. Forrester's root-cause analysis of the failures attributes 41% to unclear success criteria, 33% to insufficient tool or data access, and 26% to drift in evaluation coverage. Read that list slowly and notice what isn't on it: model quality. The agents aren't failing because the underlying model can't reason. They're failing because nobody wrote down what "working" would mean, nobody gave the agent the access it needed to do the job, and nobody kept checking whether it still worked after launch.

Unpack each one, because they're more ordinary than they sound. "Unclear success criteria" means the team built an agent without agreeing on how they'd know if it succeeded – so it could never definitively pass, and a thing that can't pass eventually gets quietly dropped. "Insufficient tool or data access" means the agent was asked to do a job but not given the keys to the systems the job requires, usually because the access was tangled in security or politics nobody resolved. "Evaluation drift" means the checks that confirmed the agent worked at launch stopped keeping up as the world changed around it. None of those are intelligence problems. They're organizational problems wearing a technical costume.

What the survivors have in common

If the 88% share causes of death, the surviving 12% share an operating profile – and it's far less impressive than you'd hope. No exotic architecture. No secret model. The agents that live tend to have three boring things in place. A named owner who actually controls the budget for the agent. A job scoped down to a single workflow with a binary test for success, instead of a vague open-ended "assistant" that's supposed to help with everything and is therefore accountable for nothing. And human checkpoints built into the first couple of months, so a person catches the agent's mistakes while trust is still being earned.

That's the whole differentiator. Not the model you chose – whether anyone could say, on day one, what the agent was for and how you'd know if it failed. The companies that win don't win the model fight. They win the scoping fight: the unglamorous, upstream work of deciding what the thing is supposed to do before turning it on. It's the least exciting part of the project and the one that determines whether the project exists in a year.

Evaluation coverage is the number that predicts survival

Here's the single most useful statistic I've seen on agents all year, and I'd put it on a wall if I ran an engineering org. First, the plain-language version of what "evaluation" means: it's an automated test that checks the agent's output is still good every time something changes – a new prompt, a new model version, a new tool. Think of it as a smoke detector for quality. Without it, you find out the agent got worse when a customer tells you.

Now the numbers. In Forrester's 2026 panel, agents without automated evaluations had a 47% production rollback rate over the prior year. Agents with full evaluation coverage had a 9% rollback rate. Same models. Often the same vendors. A roughly five-fold difference in whether the agent got yanked back out of production – explained almost entirely by whether anyone was checking its work automatically.

And almost nobody is. Only 38% of production agents run automated evaluations on every prompt, model, or tool change. So nearly two-thirds of the agents that did survive to production are running blind, one silent regression away from the rollback pile. That's the real story underneath the 88%. The pilots that never ship and the production agents that get rolled back are failing the same test: nobody built the instrument that tells you when the agent quietly got worse.

This is why "evaluation-first" isn't a slogan, it's a sequencing decision. The highest-performing teams build the evaluation harness before the agent logic. That sounds backwards until you realize the alternative is shipping something you have no way to measure and discovering its failures in production, on a customer.

What to do before the second pilot

If your first agent stalled, the worst response is to go shopping for a smarter model. The better response is to refuse to start the second pilot until you can answer four questions in plain language.

Who owns this agent, and do they hold the budget for it? What is the single workflow it does, and what is the binary test for "it worked"? What automated evaluation runs on every change, before that change reaches production? And where are the human checkpoints for the first 60 to 90 days, while it earns trust?

Notice that none of those are AI questions. They're delivery questions – the same ones that separated functional software teams from dysfunctional ones long before agents existed. Who owns it, what does done mean, how do we know it's still working, who's watching while it's new. The technology changed. The thing that determines whether it ships didn't.

The model is almost never why your agent died. It's the thing you'll blame, because blaming the model is cheaper than admitting nobody decided what the agent was for. The 12% that make it to production figured that out before they turned anything on. That's the whole moat, and it's available to anyone willing to do the boring part first.

Stuck on what to build next?

Trade months of debate for a direction you can defend – personas your designers can design for, problems your engineers can write epics against, a roadmap you build toward. Including what not to pursue, and why.

See how Product Strategy works

The 88% That Never Ship: Why Enterprise AI Agents Die in Pilot

The number, and what it actually counts

The cause is not the thing you'll blame

What the survivors have in common

Evaluation coverage is the number that predicts survival

What to do before the second pilot

More from Product Strategy

"Agentic Engineering" Hype Meets Delivery Reality

Run AI Adoption as a Measured Pilot, Not a Mandate

Code Fast, Ship Slow: What Your AI Coding Tools Are Hiding

Stuck on what to build next?