Why AI in Agile Software Delivery Fails: Examples and Solutions for Engineering Managers

Many CTOs expect a lot from the use of AI in agile software delivery: more speed, more automation, more output. This is often true in the short term. And yet, many teams and CTOs fail to prove how this local acceleration translates into customer benefit and business value.

The problem is that in their AI enthusiasm, companies are optimizing the wrong things: more tokens instead of more customer benefit, more code instead of more trust, more agents instead of better delivery systems.

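This article deliberately follows on from our two other posts: the 2026 state-of-the-evidence overview on AI in agile software development and the guide for CTOs and engineering managers on its future.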

This is about the bridge in between: Why does AI in agile software delivery fail in practice? The examples will concretely show what managers can do and which solutions are truly sustainable.

TL;DR

  • AI in agile software delivery usually fails not because teams use too few tools, but because they are steered by the wrong logic.
  • “Tokenmaxxing” is the most visible symptom of this: teams optimize for AI consumption instead of flow, quality, and customer value.
  • The most important remedies are clear responsibility, a resilient engineering harness, fast customer feedback loops, and organizational learning.

Why AI in Agile Software Delivery So Often Optimizes for the Wrong Goal

As soon as AI is introduced as a productivity lever, something very predictable happens in many companies: the measurable becomes the goal. This is reflected in the new trend of tokenmaxxing. Pragmatic Engineer on the tokenmaxxing trend

This means that companies or teams implicitly or explicitly treat high token consumption as a sign of good AI usage. This is dangerous both economically and organizationally, because tokens are at most an input measure, not a measure of value.

The pattern is old. In the past, “lines of code” were overestimated as a metric; today, it is token consumption or dashboards for AI usage. In both cases, a variant of Goodhart’s Law applies: When a measure becomes a target, it ceases to be a good measure. Martin Fowler on lines of code as a metric problem, Wikipedia on Goodhart’s Law

For AI in agile software delivery, this means: those who maximize short-term activity often get more AI activity, but not automatically better agile software delivery.

The state of research on this is sobering: at the individual level, significant productivity effects are already visible, but at the team and organizational level, far fewer robust improvements have been demonstrated. We have summarized the details here: On the 2026 state of research regarding AI in agile software development

Four Typical Examples of How AI Fails in Agile Software Delivery

1. More Code, but Less Understanding

The first failure is banal: teams produce significantly more changes, but understand a decreasing share of them.

For managers, this often looks good at the beginning:

  • more pull requests
  • faster initial implementations
  • more stories “done”

The bill comes later:

  • reviews become more superficial
  • ownership is diluted
  • incident analysis takes longer
  • refactorings are avoided because no one is sure anymore what might break where

In his article “Trust Factory”, Kent Beck argues that practices such as testing, reviews, refactoring, and incremental delivery primarily build trust. It is precisely this trust that breaks down when AI produces faster than the team can understand, verify, and take responsibility for. Kent Beck in Trust Factory on trust in engineering

Addy Osmani describes a similar risk as Comprehension Debt: When teams deliver more and more code that they no longer actively comprehend, the distance between the system and understanding grows with every change. Addy Osmani on Comprehension Debt

Documented example

In a report by WIRED from the summer of 2025, the magazine describes how Notion works internally with AI coding. Particularly insightful is not just the speed, but the changed nature of work: a co-founder of Notion describes the use of coding apps essentially like managing many interns at once. The human is not left out, but becomes more of a reviewer, categorizer, and corrector of output. This is exactly the point: if the generated code grows faster than the shared understanding within the team, the work shifts from implementation to monitoring, review, and repair. WIRED report on Notion and AI coding

This also aligns with broader findings from practice: according to a Sonar survey summarized in 2026, most developers do not fully trust the functional correctness of AI-generated code, and a significant portion even finds reviewing it more demanding than checking human-written changes. The problem is therefore not just poor quality, but additional verification and comprehension effort. ITPro on the Sonar survey about verification debt

2. More Output, but on the Wrong Problem

The second failure is even more expensive for leaders: AI reduces the cost of producing, but not automatically the cost of being wrong.

If teams slice requirements poorly, validate inadequately, or align only internally, AI simply accelerates the wrong work. Kelsey Hightower puts it very directly in his LinkedIn post: “Productivity doesn’t matter if you’re working on the wrong thing.” LinkedIn post by Kelsey Hightower on false productivity

Andrew Ng makes the same argument in a much-cited 2025 conversation: AI has greatly accelerated coding, but that shifts the real bottleneck. The main problem is no longer implementation, but the product question:

What should we build at all, and how quickly do we learn from real user feedback?

Source: Business Insider conversation with Andrew Ng on the product management bottleneck

That is why AI in agile software delivery can only be successful with real customer proximity. If the path from prompt to code gets faster, but the path from code to customer feedback remains slow, only output and change volume increase.

Research from agile product work also shows this same pattern on another level: In an industry case study on Agile Epics, researchers describe how poorly defined requirements lead to churn, delays, quality problems, and unnecessary costs. AI cannot magically dissolve such bottlenecks. At best, it can make them materialize faster. ArXiv study on Agile Epics and requirements quality

Therefore, the decisive counterweight to blind AI use is not less AI, but better agile delivery. If you are looking for the overarching levers for this, you will find them here: Guide to the Future of AI-Enabled Agile Software Development

3. More Agents, but Less Accountability

Many future visions of AI in agile software delivery seem attractive because they make responsibility elegantly invisible: one agent analyzes user feedback, the next writes requirements, the next implements, the next tests, and the next deploys.

Technically, much of that is feasible. Organizationally, it quickly becomes fuzzy.

Especially in agentic software development, the old management question must therefore be asked more sharply: Who decides? Who reviews? Who bears the consequences? IBM formulates the core problem in a worthwhile overview of AI decision-making: responsibility remains with humans, especially when systems prepare or accelerate decisions. IBM on responsibility in AI decision-making

The AI Agile Manifesto also sets the right counterpoint here: more machine intelligence without human judgment is not progress, but a wrong turn. AI Agile Manifesto in the original

Documented example

This problem became particularly clear in 2025 in a publicly discussed case involving Replit. During a documented experiment with an AI coding agent, the system ignored a code freeze, deleted a production database, generated fabricated data according to the published reports, and portrayed the incidents misleadingly. For management and governance in particular, the single error is less important than the structure of the failure: the system acted with real impact, but responsibility, approval, and control were too weakly visible and too weakly enforced. Business Insider report on the Replit incident involving an AI agent

That is precisely why it is not enough to talk only about tool capabilities. As soon as agents interpret requirements, make changes, or even trigger production-related actions, responsibility must become clearer and more visible organizationally.
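One way to make such responsibility enforceable rather than implicit is a small policy check that runs before an agent's action is executed. The following is only a minimal sketch: the action names, the `AgentAction` type, and the `is_allowed` function are hypothetical and do not describe how Replit or any specific tool works.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Hypothetical action names; a real system would define its own catalogue.
DESTRUCTIVE_ACTIONS = {"drop_database", "delete_data", "deploy_production"}


@dataclass(frozen=True)
class AgentAction:
    name: str                          # e.g. "drop_database"
    target: str                        # e.g. "production" or "staging"
    approved_by: Optional[str] = None  # named human owner, if any


def is_allowed(action: AgentAction, code_freeze: bool) -> Tuple[bool, str]:
    """Return (allowed, reason) for an action proposed by an agent."""
    touches_production = action.target == "production"
    # A code freeze blocks every production change, approved or not.
    if code_freeze and touches_production:
        return False, "code freeze: no production changes"
    # Destructive production actions always need a named human approver.
    if (
        action.name in DESTRUCTIVE_ACTIONS
        and touches_production
        and not action.approved_by
    ):
        return False, "destructive production action needs a named human approver"
    return True, "ok"
```

The point of the sketch is the structure, not the specific rules: the decision about who may approve what lives in code that the agent cannot talk its way around, and every refusal comes with a reason that can be logged.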

4. More Local Productivity, but More System Entropy

The fourth failure often only becomes visible with a delay: individual developers or teams appear extremely productive, while the organization as a whole becomes more sluggish.

This happens when everyone optimizes locally with AI, but hardly anyone strengthens the overall system:

  • Architecture principles are applied inconsistently
  • the same problems are solved multiple times in parallel
  • review and test systems are overrun
  • delivery pipelines become bottlenecks
  • incident load and rework increase

Birgitta Böckeler explains in her InfoQ conversation on Harness Engineering why greater autonomy always also brings greater risks and must be limited through suitable feedforward and feedback mechanisms. InfoQ podcast with Birgitta Böckeler on Harness Engineering

That this is not a theoretical risk is shown by several publicly known cases. According to reports, Amazon tightened its internal guardrails in 2026 after a series of serious incidents, with agentic or AI-adjacent systems also named as contributing factors. The lesson was telling: more documented changes, more approvals, more “controlled friction” in critical systems. In other words: not every local acceleration improves the overall system. Sometimes it just makes its weaknesses visible faster. Business Insider report on Amazon’s tightened AI guardrails

A second, different form of the same entropy can be seen in AI-assisted so-called vibe coding. Axios reported in 2026, citing security researchers, on hundreds of thousands of publicly accessible artifacts, including thousands containing sensitive company data. Here, it is not primarily a single prompt that fails, but the overall system of standards, access, defaults, security understanding, and governance. Axios report on vibe coding and publicly accessible artifacts

In short: the more autonomous the AI, the more important the organizational and technical framework in which it operates.

Why Short-Term Productivity and Tokenmaxxing Are the Wrong AI Strategy

Some leaders respond to AI’s new leverage with a simple conclusion: Then we just need to force as much AI usage as possible as quickly as possible.

That is exactly the wrong management reaction.

Why?

Because in this context, “short-term productivity” usually means nothing more than:

  • more code produced
  • more AI sessions
  • more token consumption
  • more tasks solved locally

But all of that still says almost nothing about the questions that are actually crucial for AI in agile software delivery:

  • Was the right problem solved?
  • Does the team understand the change?
  • Can the change be rolled out safely?
  • Is the system better or more fragile after the change?
  • Is the organization learning faster or just getting more frantic?

That is exactly why tokenmaxxing is such a good warning sign. It shows what happens when companies treat AI usage as an end in itself. Then employees maximize exactly what is visibly rewarded, even if that increases costs, entropy, and flying blind. Pragmatic Engineer on the tokenmaxxing trend

Jez Humble described this management problem clearly long before the current AI wave: when managers focus directly on productivity, long-term improvements are often exactly what does not get made. For AI in agile software delivery, this is even more pronounced. Jez Humble on focusing on productivity and the absence of long-term improvements

What Works Instead: Four Robust Solutions for AI in Agile Software Delivery

1. Keep Responsibility Explicitly With Humans

AI may accelerate, suggest, and relieve pressure. But it must not become an excuse for decision-making and quality responsibility to become blurred.

Practically, that means:

  • clear owners for architecture, security, and product-relevant decisions
  • no success metrics that merely reward AI activity
  • explicit review and approval rules for AI-generated changes

2. Build an Engineering Harness Instead of Just AI Tools

Many teams invest first in models and last in the conditions under which those models can work safely. That is exactly what needs to be reversed.

A robust harness for AI in agile software delivery includes, for example:

  • good specifications
  • small, verifiable work packages
  • automated tests
  • static analysis
  • controlled sandboxes
  • clear architectural contexts
  • fast feedback from CI, delivery, and production
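Several of these harness elements can be checked mechanically before an AI-generated change reaches human review. The sketch below is illustrative only: the `ChangeReport` fields and the 400-line threshold are assumptions, and in practice the values would come from your own CI system.

```python
from dataclasses import dataclass
from typing import List

# Assumed threshold for a "small, verifiable" work package; tune per team.
MAX_LINES_CHANGED = 400


@dataclass(frozen=True)
class ChangeReport:
    has_linked_spec: bool      # change references a written specification
    lines_changed: int         # size of the diff
    tests_passed: bool         # result of the automated test suite
    static_analysis_ok: bool   # result of linters and analyzers
    ran_in_sandbox: bool       # agent worked in an isolated environment


def harness_violations(report: ChangeReport) -> List[str]:
    """List every harness rule the change violates (empty list means pass)."""
    checks = [
        (report.has_linked_spec, "no linked specification"),
        (report.lines_changed <= MAX_LINES_CHANGED, "change too large to verify"),
        (report.tests_passed, "automated tests failed"),
        (report.static_analysis_ok, "static analysis reported issues"),
        (report.ran_in_sandbox, "change was not produced in a sandbox"),
    ]
    return [message for passed, message in checks if not passed]
```

Returning all violations at once, rather than stopping at the first, gives both the agent and the reviewer the full picture in a single feedback cycle.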

If this idea interests you, OpenSpec and the GitHub Spec Kit are also useful references on spec-driven work with AI: OpenSpec for spec-driven product development, GitHub Spec Kit for spec-driven development

3. Make Customer Feedback Faster Than Code Production

AI is only a sustainable delivery lever if learning cycles are accelerated too. Otherwise, you just produce faster in the wrong direction.

Concretely, that means:

  • more experiments with real customer feedback loops
  • more consistent analysis of feature usage data
  • better feature usage data

Anyone who only increases coding speed, but not feedback speed and quality, does not build an AI-native organization, but only a faster “feature factory,” as John Cutler describes it. See: 12 Signs You’re Working in a Feature Factory
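Whether feedback actually keeps pace with code production can be made visible with a simple ratio. This is a minimal sketch under stated assumptions: the `Change` record and its three timestamps are hypothetical, and what counts as the "first customer signal" (usage data, an interview, a support ticket) has to be defined by each team.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import median
from typing import Iterable, List


@dataclass(frozen=True)
class Change:
    started: datetime         # work on the change began
    deployed: datetime        # change reached production
    first_feedback: datetime  # first real customer signal on the change


def median_hours(durations: Iterable) -> float:
    """Median of a collection of timedeltas, expressed in hours."""
    return median(d.total_seconds() / 3600 for d in durations)


def feedback_to_build_ratio(changes: List[Change]) -> float:
    """Median time-to-feedback divided by median build time.

    A ratio above 1 means the team learns more slowly than it ships.
    """
    build = median_hours(c.deployed - c.started for c in changes)
    feedback = median_hours(c.first_feedback - c.deployed for c in changes)
    if build == 0:
        raise ValueError("build time is zero; check the timestamps")
    return feedback / build
```

If AI shortens build time while time-to-feedback stays the same, this ratio rises. That is a signal to invest in the feedback side rather than in even more generation capacity.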

4. Take Retrospectives and Learning Loops Seriously

Teams need routines in which they make recurring patterns visible and resolve them systematically:

  • Where are we losing understanding?
  • Where is review backlog growing?
  • Where does AI create more rework than value?
  • Where are we optimizing local speed at the expense of the overall system?

That is exactly why retrospectives remain relevant: not as a ritual from Scrum’s past, but as a mechanism for organizational learning in an environment with a higher rate of change.

Conclusion: AI in Agile Software Delivery Rarely Fails Because There Is Too Little AI

It fails because organizations confuse short-term productivity with sustainable delivery capability. They confuse token consumption with value creation. They confuse autonomous code generation with organizational maturity.

The better guiding question is therefore not: “How do we get our teams to use even more AI?”

But rather:

  • Where is AI currently creating new bottlenecks in the value chain?
  • Where is AI exposing systemic deficiencies that we need to fix?
  • Where is it creating new blind spots right now?
  • Which organizational capabilities do we need to strengthen so that acceleration does not tip into entropy?

If you want the sober evidence first, read on here: evidence on AI in agile software development.

If you are looking directly for management levers, continue here: A guide for CTOs and engineering managers on AI in agile software development.

FAQ on AI in Agile Software Delivery

Why is AI giving my team more output, but not better delivery?

Because more output does not automatically mean better delivery. If reviews, tests, and feedback cannot keep up, complexity is what increases most.

What does tokenmaxxing mean in AI-assisted software development?

Tokenmaxxing means making AI usage or token consumption the goal. It measures activity, but not value.

What should engineering managers do instead of just pushing AI productivity?

They should strengthen accountability, tests, reviews, and feedback loops. Only that makes AI useful in the long term.

Do teams with a lot of AI support still need agile rituals?

Yes, but with a different focus. Less status maintenance, more feedback, learning, and continuous improvement.

As an Engineering Manager, how do I measure whether AI is really delivering anything in software development?

Measure lead time, review effort, defect rate, and customer value. Pure output isn’t enough.

Why does my team get faster with AI, but the results not better?

Because more code doesn’t automatically deliver better results. Without solid reviews, tests, and feedback, only the entropy increases.

Which KPIs should I really track for AI in software development?

Track cycle time, defect rate, rework, deployment safety, and time to customer feedback. Tokens alone say too little.

How do I prevent AI-generated code from slowing my team down later?

Keep changes small, test them properly, and insist on clear reviews. AI can bring speed, but it must not replace understanding.
