In May, Gartner published its first-ever assessment of the enterprise AI coding agent market, formalizing a category that did not exist as a named market segment two years ago and now runs to roughly ten billion dollars a year. The message between the lines was clear: of all the places enterprises have poured AI money, software development is where the returns look most real. So here is the question every engineering leader should sit with. If coding is the one domain where AI value is most provable, why can almost nobody prove it?
Most organizations buy seats, watch a vendor dashboard tick upward, and conclude things are working. The dashboard shows suggestions made, suggestions accepted, an acceptance rate climbing past 30%. It feels like proof. It is not. Acceptance rate is the single most misleading number in the entire AI coding conversation, and the gap between what it measures and what actually matters is where engineering budgets quietly lose their justification.
Coding really is different
The optimistic case for AI coding tools is genuine, and it deserves a fair hearing before the skepticism arrives. GitHub’s own controlled study found developers completing a programming task 55% faster with an assistant than without. The market reflects that promise: AI coding tools now represent roughly ten billion dollars in annual spend, and roughly 90% of the Fortune 100 have deployed GitHub Copilot in some form. Gartner’s decision to stand up a formal market assessment is itself a signal that coding has matured past experimentation into something boards expect to pay off.
That maturity is exactly why coding deserves better measurement than the rest of the AI portfolio, not worse. Gartner’s parallel research on AI in infrastructure and operations found that only 28% of those use cases fully succeed. Coding stands out as the exception, the place where the productivity story has the most evidence behind it. When you have found the one room with treasure in it, you do not measure your haul by counting how many times you opened a drawer.
But the dashboard is lying to you
The cleanest evidence that activity metrics mislead comes from a randomized controlled trial. METR studied experienced open-source developers working in codebases they knew well, and found they were 19% slower when using AI assistance. The detail that matters most for measurement: those same developers estimated they had been 20% faster. A nearly forty-point gap between perceived and actual productivity, in the population most enterprises are deploying these tools to. If your ROI case rests on developer self-report or on a feeling that the team is moving quicker, that is the gap you are standing on.
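The arithmetic behind that gap is worth spelling out. The sketch below uses only the two METR figures quoted above; the function name and the sign convention (positive means slower, negative means faster) are illustrative assumptions, not part of the study.

```python
def perception_gap(measured_time_change_pct: float,
                   perceived_time_change_pct: float) -> float:
    """Gap, in percentage points, between measured and self-reported change.

    Both inputs are percent changes in task completion time:
    positive means slower, negative means faster.
    """
    return measured_time_change_pct - perceived_time_change_pct


# Figures as quoted above: 19% slower when measured, 20% faster when self-estimated.
gap = perception_gap(measured_time_change_pct=19.0,
                     perceived_time_change_pct=-20.0)
print(f"Perception gap: {gap:.0f} percentage points")
```

The same subtraction works on your own data: pair a measured change in delivery time with what a developer survey says, and treat any large difference as a warning about self-reported ROI.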
The quality picture is just as sobering. GitClear’s analysis of 211 million changed lines of code found that copy-pasted and duplicated code blocks rose eightfold in a single year, code churn climbed, and the share of lines devoted to refactoring fell to under 10%. AI makes it trivial to add code and does nothing to encourage consolidating it. Google’s 2025 DORA research found the same tension from a different angle: AI adoption correlated positively with throughput but negatively with delivery stability, meaning the tools that help you ship faster can quietly erode the controls that keep what you ship from breaking. Acceptance rate captures none of this. A developer can accept every suggestion and ship slower, buggier software, and the dashboard will call that a win.
Activity versus value: the real metric problem
The reason vendor dashboards surface acceptance rate, lines generated, and seat utilization is that these are the metrics the vendor controls and optimizes for. They describe how much the tool was used, not what the use produced. That distinction is the whole game, and it is the same vanity-versus-value problem we mapped for finance leaders in the metrics that actually matter. An engineering org running on acceptance rate is measuring the proxy and ignoring the signal.
The signal lives in a different set of numbers. How does cycle time differ between AI-assisted pull requests and the rest? What is your cost per merged PR once you divide total tool spend across providers by the work actually shipped? How has defect density moved since rollout, and which teams are driving the change? Which developers have genuinely adopted the tools, and which licenses are sitting idle at $19 to $50 a head every month? Answering those questions requires connecting pull-request data, provider cost data, and engineering outcomes in one place, which is precisely what Coding IQ was built to do. It measures the value of AI coding tools rather than the activity, because activity was never the thing the CFO was paying for.
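To make two of those questions concrete, here is a minimal sketch of how cycle time and cost per merged PR could be computed once pull-request and cost data sit side by side. Everything in it is an assumption for illustration: the `PullRequest` fields, the provider names, the sample figures, and the choice to divide spend by AI-assisted merged PRs only. It is not a description of how Coding IQ or any vendor calculates these numbers.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import median
from typing import Optional


@dataclass
class PullRequest:
    opened_at: datetime
    merged_at: Optional[datetime]  # None if the PR was never merged
    ai_assisted: bool              # hypothetical flag; how you tag this is up to you


def median_cycle_time_hours(prs: list[PullRequest], ai_assisted: bool) -> Optional[float]:
    """Median open-to-merge time in hours for merged PRs in one cohort."""
    hours = [
        (pr.merged_at - pr.opened_at).total_seconds() / 3600
        for pr in prs
        if pr.merged_at is not None and pr.ai_assisted == ai_assisted
    ]
    return median(hours) if hours else None


def cost_per_merged_pr(spend_by_provider: dict[str, float],
                       prs: list[PullRequest]) -> Optional[float]:
    """Total tool spend across providers divided by AI-assisted PRs that merged."""
    merged = sum(1 for pr in prs if pr.ai_assisted and pr.merged_at is not None)
    return sum(spend_by_provider.values()) / merged if merged else None


# Hypothetical sample data, invented purely to exercise the functions.
prs = [
    PullRequest(datetime(2025, 1, 6, 9), datetime(2025, 1, 6, 17), True),   # 8 h
    PullRequest(datetime(2025, 1, 7, 9), datetime(2025, 1, 7, 21), True),   # 12 h
    PullRequest(datetime(2025, 1, 8, 9), None, True),                       # never merged
    PullRequest(datetime(2025, 1, 6, 9), datetime(2025, 1, 7, 9), False),   # 24 h
    PullRequest(datetime(2025, 1, 8, 9), datetime(2025, 1, 8, 19), False),  # 10 h
]
spend = {"provider_a": 60.0, "provider_b": 40.0}

ai_median = median_cycle_time_hours(prs, ai_assisted=True)
baseline_median = median_cycle_time_hours(prs, ai_assisted=False)
cost = cost_per_merged_pr(spend, prs)
print(ai_median, baseline_median, cost)
```

A comparison like this is observational, so a faster AI-assisted cohort is a prompt for investigation rather than proof of causation: the two groups of PRs may differ in size, team, or difficulty.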
What good measurement actually enables
This is not an argument that AI coding tools do not work. It is an argument that you cannot manage what you measure badly. The enterprises pulling real value from these tools are the ones that instrumented outcomes before scaling seats, the same discipline that separates winners across every category of AI investment in the broader ROI playbook. They can make decisions the acceptance-rate crowd cannot.
Consider the difference at a budget review. An engineering leader who can say that Cursor users close pull requests 28% faster than non-users at a cost of a few dollars per PR, while 40% of Copilot licenses sit unused, is making a business decision: scale the first, reclaim the second. A leader who can only report a 32% acceptance rate is reporting a vendor metric and hoping nobody asks what it bought. That is the position most VPs of engineering find themselves in, and it is an avoidable one. The instrumentation that closes the gap is the same vendor-neutral measurement layer that proves AI ROI across the rest of the stack, applied to the one domain where the returns are most worth proving. It is also the only honest way out of the trap NVIDIA documented when it found 30% of enterprises still cannot quantify AI ROI at all.
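The "reclaim the second" half of that decision comes down to simple bookkeeping. The sketch below shows one way to total idle seats and the monthly spend attached to them per tool. The `Seat` fields, the idle threshold, the tool names, and the sample data are all hypothetical, and the $19 seat price is simply the low end of the range mentioned earlier.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Seat:
    tool: str
    monthly_price_usd: float
    active_days_last_30: int


# Assumption: a seat with fewer active days than this in the last 30 counts as idle.
IDLE_THRESHOLD_DAYS = 1


def idle_summary(seats: list[Seat]) -> dict[str, dict[str, float]]:
    """Per tool: seat count, idle seat count, idle share, and idle monthly spend."""
    totals = defaultdict(lambda: {"seats": 0, "idle": 0, "idle_spend_usd": 0.0})
    for seat in seats:
        row = totals[seat.tool]
        row["seats"] += 1
        if seat.active_days_last_30 < IDLE_THRESHOLD_DAYS:
            row["idle"] += 1
            row["idle_spend_usd"] += seat.monthly_price_usd
    return {
        tool: {**row, "idle_share": row["idle"] / row["seats"]}
        for tool, row in totals.items()
    }


# Hypothetical seat data: five seats of one tool, two of another.
seats = [
    Seat("tool_a", 19.0, 0),
    Seat("tool_a", 19.0, 0),
    Seat("tool_a", 19.0, 12),
    Seat("tool_a", 19.0, 20),
    Seat("tool_a", 19.0, 7),
    Seat("tool_b", 40.0, 15),
    Seat("tool_b", 40.0, 22),
]

summary = idle_summary(seats)
for tool, row in summary.items():
    print(f"{tool}: {row['idle_share']:.0%} idle, ${row['idle_spend_usd']:.2f}/month reclaimable")
```

The threshold is the design choice that matters most here. A stricter definition of "active" will surface more reclaimable seats, so it should be agreed with team leads before anyone's license is pulled.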
Coding is where enterprise AI ROI is most real. That makes it the worst possible place to keep measuring the wrong thing. Acceptance rate will tell you your developers are clicking accept. It will never tell you whether your software is better, faster, or cheaper to ship, which is the only question your board is actually asking.
Is your coding-tool spend producing value, or just activity? Talk to an expert to see how Olakai’s Coding IQ ties AI coding tools to cycle time, defect rate, and cost per pull request, so you can scale what works and cut what doesn’t.
