Run AI Adoption as a Measured Pilot, Not a Mandate

A mandate to "use AI" measures compliance. A pilot measures truth. How to roll out AI coding tools as a real experiment – with a hypothesis, a baseline, and a

4 min readBy Matthew Stublefield
Google Site overview analytics for a blog.

There's an email going around a lot of engineering orgs right now. Some version of: leadership has decided AI is a priority, every team is expected to adopt the tools, and usage will be tracked. Maybe there's a dashboard. Maybe there's a number you're supposed to hit.

I understand the impulse. The pressure from above is real, the tools are genuinely interesting, and "we're moving on this" is a much better thing to tell a board than "we're studying it." But a usage mandate is one of the most expensive ways to learn nothing, and it's worth being honest about why before you send that email.

A mandate measures compliance

The moment you mandate usage and track it, you've told everyone exactly what the goal is: usage. Not better software. Not faster delivery. Usage. And usage is trivially easy to produce whether or not the tool is helping.

People are not dumb. If the metric is "did you use the AI," they will use the AI – on the tasks where it's easy to use it, in the ways that show up in the dashboard, regardless of whether it actually made the work better. You'll get a beautiful adoption curve and a completely uninformative one. The number goes up. You still have no idea if anything improved.

This is the same trap I wrote about with the broader productivity paradox: when you measure the tool's activity instead of the work's outcome, the tool always looks good, because looking good at activity is what the measurement rewards. A mandate just bakes that trap in at the org level and adds social pressure on top.

I've built software with these tools myself this year – I'm a product manager, not an engineer, and they let me ship things I couldn't have otherwise. So I'm not anti-adoption. I'm anti-pretending. A mandate is how you adopt without learning. A pilot is how you learn.

A pilot measures truth

Here's the whole difference. A mandate asks "are people using it?" A pilot asks "does it actually make our delivery better, and how would we know?" The second question is harder, slower, and the only one worth answering.

A real pilot doesn't need to be elaborate. It needs four things most rollouts skip.

A hypothesis specific enough to be wrong. Not "AI will make us more productive." Something like "this assistant will cut cycle time on our routine CRUD and test-writing work by a meaningful amount, without raising our change-failure rate." Now you've said something falsifiable. You can be proven wrong, which means you can also be proven right, which is the entire point.

A baseline you measured before you started. This is the step everyone forgets and the one that makes the whole thing real. Take a representative team and, before they touch the new tool, measure the delivery outcomes that actually matter: cycle time from first commit to production, change-failure rate, rework, time to restore when something breaks. You cannot tell whether a tool helped if you never wrote down where you started.

A time box and a real team. One team, or a few, for a defined window – long enough to get past the honeymoon and the awkwardness, short enough that you're not betting the year on a hunch. Pick people doing real work, not a showcase squad assembled to make the tool look good.

A decision you've agreed to in advance. Before the pilot starts, write down what result scales it and what result kills it. If cycle time drops and quality holds, you scale with evidence. If code volume climbs and delivery doesn't move, you've learned the bottleneck was never the typing – and you stop, without anyone having to lose face, because the kill criteria were agreed up front.

What the pilot protects you from

Two failures, mostly.

It protects you from scaling an expensive illusion. Rolling a tool out to two hundred engineers on the strength of a vibe is how you end up a year later with a big bill, a flat release calendar, and no way to tell whether the tool was the problem because you never measured anything. The pilot is cheap. The org-wide mistake is not.

It also protects you from the opposite error – dismissing a tool that actually works because the first loud anecdote was negative. A measured pilot doesn't care about the anecdote. It cares about your delivery numbers on your work. Maybe the tool genuinely cuts your cycle time. If so, you'll have the evidence to push past the skeptics instead of arguing about feelings.

That's the quiet superpower of running adoption as a diagnostic instead of a directive: you replace the argument with a measurement. Nobody has to win the debate about whether AI is good or bad for engineering in the abstract. You just find out what it does to your team, on your work, against the numbers you already use to judge delivery.

The one-sentence version

If you're going to bet on AI in your engineering org – and most of you should at least find out – bet like someone who wants the truth, not like someone who wants the credit. Run it as a pilot with a hypothesis, a baseline, a time box, and a decision you made before you got attached to the answer.

The mandate gets you a dashboard. The pilot gets you the truth. Only one of those is worth the rollout.

If you'd rather design that pilot with someone who measures delivery for a living than guess at it, that's the work I do at Fieldway – matthew@fieldway.org.

Want help running a sharper practice?

Fieldway works with boutique advisory firms to operate the systems behind the work — from intake to deliverable. Start with a conversation.

See how Fieldway helps