Avi Cavale of AI development platform Quarterback argues that autonomous agents fail because they have amnesia

I’ve been watching the autonomous coding agent demos with a mixture of fascination and frustration.
The demos are impressive. The agent picks up a GitHub issue, reads the codebase, writes the code, runs the tests, opens a PR. Twenty minutes of work automated. The future of software engineering.
Then you try it on your codebase. And it goes off the rails. Not because the model is stupid — it’s clearly capable. But it makes decisions that are technically valid and organisationally wrong. It uses offset pagination when your team standardised on cursor-based. It introduces a dependency that caused a cascading failure last quarter. It takes the same approach that someone already tried and abandoned because it broke the mobile API.
The industry’s diagnosis: models aren’t capable enough yet. Wait for the next generation. More reasoning. More capability. More intelligence.
I think the diagnosis is wrong.
The talented contractor on day one
Here’s how I think about it: an autonomous agent is a talented contractor who shows up on their first day with no context about your organisation. They can code. They’re smart. They can reason about complex problems.
But they don’t know your conventions. They don’t know your prior decisions. They don’t know the bugs you’ve already found and fixed. They don’t know the approaches that have been tried and failed. They’re starting from the same zero that a fresh Claude Code session starts from — which is to say, they know nothing except what they can read from the files.
A human contractor in this situation would ask questions. "What pagination approach do you use?" "Has anyone worked on this before?" "Are there any known issues with this module?"
The autonomous agent can’t ask these questions because it doesn’t know what it doesn’t know. And more importantly, even if it could ask, there’s no organisational knowledge base for it to consult. The knowledge exists only in people’s heads and stale documents.
More reasoning doesn’t fix this
The industry’s response is to give models more reasoning capability. Chain of thought. Planning steps. Self-reflection. These help the model think more carefully about the code it can see.
But they don’t help it know things it has never been told.
No amount of reasoning will help an agent discover that your team has a specific convention for retry logic. No chain of thought will reveal that a previous attempt at this fix was abandoned for a specific reason. No self-reflection will surface the known error pattern that makes one approach dangerous.
Reasoning operates on available information. If the information isn’t in context, better reasoning just generates more confident wrong answers.
What agents really need
After spending a lot of time on this, I’ve landed on a pretty simple conclusion: autonomous agents need memory. Not "memory" in the LLM sense of a bigger context window. Actual organisational memory:
Rules that are enforced, not suggested. Not a config file someone might forget to update. Mandatory constraints injected into every autonomous session at the infrastructure level. "All API changes must be backwards-compatible." "Tests must cover the error path." These should be there whether or not the person who set up the automation remembered to include them.
Decisions that are respected. The team has already made choices. Cursor-based pagination. Event sourcing for audit. Idempotency keys for payments. An agent that doesn’t know these decisions will re-decide them — probably differently — and create inconsistency across the codebase.
Error patterns that are avoided. Your codebase has known pitfalls. The N+1 query problem in the ORM. The race condition in the queue consumer. An agent with this knowledge avoids them proactively. One without it stumbles into them and wastes its execution budget debugging something that was already understood and solved.
History of prior work. What happened last time someone worked on this? Did a similar task succeed? Fail? Get abandoned? The agent should know before it starts, not discover halfway through that it’s repeating a failed approach.
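To make the four kinds of memory concrete, here is a minimal sketch of what such a knowledge layer might look like. Everything in it is hypothetical — the `OrgMemory` class, its fields, and the `context_for` method are illustrations, not any real framework's API. The idea is simply that rules, decisions, error patterns, and history are stored as structured data and assembled into mandatory context for every agent session.

```python
from dataclasses import dataclass, field

@dataclass
class OrgMemory:
    """Hypothetical store for the four kinds of organisational knowledge."""
    rules: list = field(default_factory=list)           # enforced constraints
    decisions: dict = field(default_factory=dict)       # settled choices, by topic
    error_patterns: list = field(default_factory=list)  # known pitfalls
    task_history: list = field(default_factory=list)    # prior attempts and outcomes

    def context_for(self, task: str) -> str:
        """Assemble the context to inject into an autonomous agent session."""
        lines = [f"Task: {task}", "", "Mandatory rules:"]
        lines += [f"- {rule}" for rule in self.rules]
        lines.append("Settled decisions (do not re-decide):")
        lines += [f"- {topic}: {choice}" for topic, choice in self.decisions.items()]
        lines.append("Known pitfalls to avoid:")
        lines += [f"- {pattern}" for pattern in self.error_patterns]
        lines.append("Relevant prior work:")
        lines += [f"- {entry}" for entry in self.task_history]
        return "\n".join(lines)

# Example contents, drawn from the scenarios described above.
memory = OrgMemory(
    rules=["All API changes must be backwards-compatible",
           "Tests must cover the error path"],
    decisions={"pagination": "cursor-based (offset pagination rejected)"},
    error_patterns=["N+1 queries in the ORM when eager loading is skipped"],
    task_history=["Earlier pagination fix abandoned: it broke the mobile API"],
)

prompt_context = memory.context_for("Add pagination to the orders endpoint")
print(prompt_context)
```

The point of the sketch is the injection step: this context is assembled by the infrastructure and prepended to every session, so the agent sees the settled pagination decision and the abandoned prior attempt before it writes a line of code, whether or not anyone remembered to mention them.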
The reliability gap isn’t about intelligence
This is the thing that keeps nagging me. The demos work because the demo tasks are self-contained. "Build a to-do app." "Add a dark mode toggle." These require no organisational context. The code is the entire story.
Production tasks are different. They exist in a web of prior decisions, team conventions, known constraints, and historical context. The code is maybe 30% of what you need to know. The other 70% is why the code is the way it is.
That’s why autonomous agent reliability drops off a cliff when you move from demos to real codebases. The model is capable of writing the code. It’s incapable of knowing the context that makes the code correct for your organisation.
And every model improvement — more parameters, better reasoning, higher capability — improves the 30% (code quality) without touching the 70% (organisational context). The gap between demo performance and production performance persists, regardless of how good the model gets.
The infrastructure nobody’s building
The autonomous agent market is focused on model capability, tool integration, and orchestration. Better models. Better tools. Better chains.
Almost nobody is building the knowledge layer that makes all three effective. And I think that’s the bottleneck. A capable model without organisational knowledge is a talented contractor on their first day. They can write code. They can’t write the right code for your organisation without the context that makes "right" meaningful.
The agent frameworks that will win aren’t the ones with the most tools or the most sophisticated reasoning. They’re the ones where the autonomous agent, on its hundredth task, is dramatically more reliable than on its first — because it has accumulated the organisational knowledge that makes its decisions correct.
That’s not a model problem. It’s a memory problem.
Avi Cavale is the founder of Quarterback, the AI development platform that learns how your team builds
Main image courtesy of iStockPhoto.com and napong rattanaraktiya

© 2025, Lyonsdown Limited. Business Reporter® is a registered trademark of Lyonsdown Ltd. VAT registration number: 830519543