Code Fast, Ship Slow: What Your AI Coding Tools Are Hiding

AI coding tools make writing code feel dramatically faster while what actually ships barely moves. The fix isn't a better tool – it's measuring delivery.

By Matthew Stublefield

A team I worked with had its best month for merged pull requests in the company's history. The graph was beautiful – up and to the right, the kind of slide that ends up in a board deck with no further comment. So I asked a different question. Of all that merged code, how much had actually reached a customer?

The room got quiet.

That gap – between the chart that's going up and the thing the business actually paid for – is the whole story of AI in software right now. The tools are real. The speed at the keyboard is real. What's hiding behind it is the part nobody's instrumenting.

A June 2026 piece in BigDATAwire gave the pattern a name: the AI coding productivity paradox. Their one-line version is hard to improve on – "code generation got faster while delivery got harder." Around the same time, METR, a research nonprofit, surveyed 349 technical workers, including 87 software engineers, and found something quietly damning. People systematically overestimate how much time AI actually saves them. Not lie – overestimate. They feel faster than the work shows they are.

I should say where I'm standing. I've built a handful of small software tools this year with the same AI assistants everyone's arguing about. I'm not an engineer. I'm a product manager who got curious and kept going. So this isn't a "the tools are bad" essay. They're genuinely useful, and I'll keep using them. It's a "here's what the tools quietly don't show you" essay, which is a different and more useful thing.

The tool measures the tool

Think about what an AI coding assistant can actually count. Suggestions offered. Suggestions accepted. Tokens generated. Lines written. Completions per hour. All of that goes up when you adopt the tool, by design. That's the dashboard you get handed, and it lights up like a pinball machine.

Every one of those numbers measures activity at the keyboard. None of them measures value out the door.

This is an old trap wearing new clothes. We've spent years teaching teams that counting lines of code is a terrible measure of progress, that story points aren't velocity, that "busy" and "productive" are different words for a reason. Then a tool shows up that can generate code faster than anyone can read it, ships with its own flattering scoreboard, and we start quoting that scoreboard in standups. The tool grades its own homework, and the grade is always an A.

Where the time actually went

When I run a delivery diagnostic on a team, I'm not very interested in how fast anyone types. I'm interested in one gap: the distance between "someone started working on this" and "this is live and a customer can use it." On the team with the beautiful PR chart, that gap ran four to six weeks. Almost none of it was typing time.

It was review. It was testing. It was the third environment that didn't match the first two. It was the change that broke two services downstream because it shipped fast and got understood slow.

Now add AI to that picture and watch what happens. You generate more code, faster. Which means more code to review – often code the author didn't fully write and can't fully explain. More surface area to test. More that can quietly break something three steps away. The keyboard got quicker and the system got heavier. Local speed, global drag. You can absolutely make the first ten minutes of a task faster and the next ten days slower, and if the only thing you're watching is the first ten minutes, you'll swear the team got faster while the release calendar tells a different story.

That's not a knock on the engineers. It's a measurement failure. They're being scored on the wrong half of the work.

Measure the thing you actually sell

Here's the part that sounds boring and is, I'd argue, the whole game. You already know how to measure delivery. You've just been letting the tool vendors tell you what counts.

The durable scoreboard hasn't changed because AI showed up. Cycle time, measured honestly – from first commit to live in production, not from "coding started" to "PR opened." Change failure rate – how often a release breaks something. Time to restore when it does. Deployment frequency – how often real value actually reaches real people. These are the DORA-style outcomes teams have used for years, and they have one excellent property: an AI tool can't inflate them by being chatty. A token doesn't count until something ships and stays shipped.
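For readers who want to see how little machinery this scoreboard needs, here is a minimal sketch in Python. It assumes you can export one record per shipped change from your own tooling; the `Change` record, its field names, and the `delivery_metrics` helper are all hypothetical, not any vendor's API. It also makes two simplifying assumptions: each record counts as one deployment, and time to restore runs from the moment of deployment rather than from when the failure was detected.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import Optional


@dataclass
class Change:
    """One shipped change. All field names here are hypothetical."""
    first_commit: datetime               # when work on the change started
    deployed: datetime                   # when it went live in production
    caused_failure: bool = False         # did this release break something?
    restored: Optional[datetime] = None  # when service was restored, if it broke


def delivery_metrics(changes: list[Change], period_days: int) -> dict[str, float]:
    """Summarize delivery outcomes for the changes deployed in one period."""
    if not changes:
        raise ValueError("no deployed changes in this period")

    # Cycle time: first commit to live in production, in days.
    cycle_days = [(c.deployed - c.first_commit) / timedelta(days=1) for c in changes]

    # Change failure rate: share of releases that broke something.
    failures = [c for c in changes if c.caused_failure]

    # Time to restore, in hours (assumption: measured from the deployment).
    restore_hours = [
        (c.restored - c.deployed) / timedelta(hours=1)
        for c in failures
        if c.restored is not None
    ]

    return {
        "median_cycle_time_days": median(cycle_days),
        "change_failure_rate": len(failures) / len(changes),
        "median_time_to_restore_hours": median(restore_hours) if restore_hours else 0.0,
        "deploys_per_week": len(changes) / (period_days / 7),
    }


# Made-up example data: four changes shipped over a 12-week period.
changes = [
    Change(datetime(2025, 1, 1), datetime(2025, 1, 29)),
    Change(datetime(2025, 1, 6), datetime(2025, 2, 10),
           caused_failure=True, restored=datetime(2025, 2, 10, 6)),
    Change(datetime(2025, 1, 20), datetime(2025, 3, 3)),
    Change(datetime(2025, 2, 3), datetime(2025, 3, 17)),
]
metrics = delivery_metrics(changes, period_days=84)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

Nothing in that calculation knows or cares how the code was written. Suggestions accepted and tokens generated never enter it, which is the point.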

Run AI adoption against those numbers for a quarter and you'll learn something true. Maybe cycle time genuinely dropped, in which case, wonderful, you have evidence instead of vibes. Maybe code volume climbed and cycle time held flat, which means the bottleneck was never the typing and you've been speeding up the one part of the pipeline that wasn't slow. Either answer is worth having. Neither one is on the dashboard the tool gave you.
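The quarter-over-quarter reading described above can be sketched as a small comparison. This is an illustration under stated assumptions, not a statistical test: the `read_quarter` function, its labels, and the 10% tolerance for what counts as "held flat" are all hypothetical choices you would tune to your own team's noise.

```python
def read_quarter(merged_before: int, merged_after: int,
                 cycle_days_before: float, cycle_days_after: float,
                 tolerance: float = 0.10) -> str:
    """Compare code volume and cycle time across two quarters.

    The tolerance (default 10%) is an assumed threshold for treating
    a change as noise rather than signal.
    """
    if merged_before <= 0 or cycle_days_before <= 0:
        raise ValueError("baseline values must be positive")

    volume_change = (merged_after - merged_before) / merged_before
    cycle_change = (cycle_days_after - cycle_days_before) / cycle_days_before

    # Cycle time is checked first: it is the outcome the business feels.
    if cycle_change <= -tolerance:
        return "faster delivery"
    if cycle_change >= tolerance:
        return "slower delivery"
    # Cycle time held flat; did code volume climb anyway?
    if volume_change >= tolerance:
        return "more code, same delivery"
    return "no clear change"


# Made-up numbers: merged PRs up 75%, cycle time essentially unchanged.
flat = read_quarter(merged_before=120, merged_after=210,
                    cycle_days_before=35.0, cycle_days_after=34.0)

# Made-up numbers: same volume jump, but cycle time fell from 35 to 21 days.
faster = read_quarter(merged_before=120, merged_after=210,
                      cycle_days_before=35.0, cycle_days_after=21.0)

print(flat)
print(faster)
```

The ordering is a deliberate design choice: code volume only matters as context for a cycle time that didn't move, so it is consulted last.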

The honest version of the pitch

If you lead an engineering org, the temptation right now is to roll out AI tools, watch the usage dashboards climb, and report a productivity win upward. The dashboards will cooperate. They're built to.

The harder, more honest move is to decide before you roll anything out what "faster" would have to mean, in delivery terms, for the adoption to count – and then measure exactly that. Not editor speed. Not tokens. Whether more value reached customers, more reliably, in less time. If yes, scale it with confidence. If no, you've saved yourself from scaling an expensive illusion, which is its own kind of win.

I keep coming back to that quiet room and the beautiful chart. The chart wasn't wrong. The team really did merge more code than ever. It just wasn't measuring the thing anyone outside the engineering org actually cared about.

Your AI tools will always tell you how fast they are. That's their job. Knowing whether anything shipped is yours.

If your pull-request graph looks incredible and your release cadence doesn't, that gap isn't a mystery – it's a diagnostic, and it's worth running on purpose. That's the kind of work I do at Fieldway; if it sounds like your team, I'm at matthew@fieldway.org.
