
Reasoning could be the holy grail of AI adoption

Is explainability really the key to making LLMs trustworthy?


Serendipity often plays a key role in inventions. When Scottish physician and microbiologist Alexander Fleming returned from holiday, he noticed that mould growing on a Petri dish had killed the surrounding Staphylococcus bacteria. He went on to identify the mould spores that had accidentally drifted into his laboratory as Penicillium notatum, the source of penicillin.

 

Similar happy accidents lie behind other breakthroughs too, such as the discovery of radioactivity, LSD and superglue. Large language models (LLMs), the latest leap in AI, were not meticulously engineered either. Rather, they emerged from unsupervised training on a vast body of unstructured internet data.

 

OpenAI’s GPT-4, for example, learned the patterns of human language without explicit instructions. But because of this, how an input leads to an output remains largely opaque. Moreover, an LLM’s probabilistic nature means the same prompt can produce different, equally plausible answers.

 

GPT training had its “penicillin moment” too. Being language models, LLMs are notoriously poor at arithmetic. When two OpenAI researchers tried to train the model to add numbers, they initially failed, as the model simply memorised the training examples rather than learning the principle of addition.

 

Ready to give up, the researchers accidentally let the experiment run for much longer than planned. To their surprise, by the time they returned, the model had learned to add.

 

Although arithmetic is still not an LLM forte, the lesson stuck: with LLMs, extended training and a higher number of parameters and neurons can be expected to yield more accurate and reliable results.

 

When unreliability can undermine business value

 

Three years after ChatGPT’s release in November 2022 and following a rapid surge in the number of commercial users, enterprises are keen to harness its potential.

 

Publicly, the race between AI companies is framed as a sprint toward artificial general intelligence (AGI) – systems with human-level cognitive abilities. But for business, the real prize is trustworthy AI: high-performing systems that are transparent, reliable, safe and secure. Even the lowest recorded AI hallucination rate of 17 per cent is unacceptable for such systems, particularly when used in high-stakes industries such as healthcare or finance.

 

Also, the time employees spend fixing “workslop” – low-quality output knocked off with the help of generative AI – can offset a considerable amount of the technology’s efficiency gains, according to joint research by BetterUp Labs and Stanford Social Media Lab published in Harvard Business Review. In their survey of 1,150 US-based full-time employees across industries, 40 per cent reported having received workslop in the past month.

 

Respondents also reported spending an average of one hour and 56 minutes dealing with each instance of workslop.

 

Based on participants’ estimates of time spent and their self-reported salaries, the researchers found that these workslop incidents carry an invisible tax of $186 (£138) per employee per month. So the time workers lose reviewing, double-checking and correcting genAI outputs can – at least to some degree – defeat LLMs’ original purpose of accelerating workflows.

 

Explainable AI – the road to making genAI trustworthy

 

Inaccuracies in genAI output, often the result of hallucinations, are not the only barrier to wider enterprise adoption, however. Even if hallucinations were eliminated entirely, enterprises, especially highly regulated ones, would still require transparency.

 

Except for simple models whose internal logic can be inspected directly, interpretability – a complete understanding of the decision-making process – is beyond reach for machine learning models. This applies not only to complex neural networks but also to other, non-neural ML models that function as black boxes. With these opaque systems, all humans can get is a post-hoc – or retrospective – glimpse into how a particular prediction was made.

 

Explainability tools such as SHAP and LIME were designed to help. SHAP, borrowing from game theory, estimates how much each input feature contributes to a prediction while treating each feature as a player and the prediction as the payout.
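
To make this concrete, here is a minimal sketch of how SHAP is typically used on a tabular model in Python. The dataset and random-forest regressor are illustrative assumptions, standing in for whatever model actually needs explaining.

```python
# Minimal SHAP sketch (assumptions: scikit-learn's diabetes dataset and a
# random-forest regressor stand in for the model being explained).
import shap
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor

data = load_diabetes()
model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(data.data, data.target)

# TreeExplainer estimates Shapley values for tree ensembles: each feature is
# a "player" and the prediction is the "payout" shared out between them.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(data.data[:1])

# How much each feature pushed this single prediction up or down.
for name, value in zip(data.feature_names, shap_values[0]):
    print(f"{name}: {value:+.2f}")
```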

 

LIME creates a simple surrogate model to approximate the complex black box model around a single data point. This explanation is described as local because, unlike SHAP, it can’t explain the model’s behaviour globally across all data but focuses on just one prediction at a time.
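
A comparable sketch for LIME, again with an illustrative dataset and classifier: the surrogate is fitted only around the one row being explained.

```python
# Minimal LIME sketch (assumptions: a toy classifier; LIME fits a simple
# local surrogate around one prediction of the black-box model).
from lime.lime_tabular import LimeTabularExplainer
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

data = load_breast_cancer()
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(data.data, data.target)

explainer = LimeTabularExplainer(
    data.data,
    feature_names=list(data.feature_names),
    class_names=list(data.target_names),
    mode="classification",
)

# LIME perturbs the chosen row, queries the black box, and fits a weighted
# linear model that is only valid in this local neighbourhood.
explanation = explainer.explain_instance(
    data.data[0], model.predict_proba, num_features=5
)
for feature, weight in explanation.as_list():
    print(f"{feature}: {weight:+.3f}")
```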

 

While both methods are model-agnostic, they struggle when applied to the billions of parameters in LLMs.

Fittingly, the breakthrough that offered not just a peek but a profound look into the neural networks of LLMs was a “brain scan” performed by former OpenAI scientists at Anthropic – an AI research company with a mission to build reliable, interpretable and steerable (jargon for controllable) AI systems.

 

By subjecting the company’s Claude Sonnet, a mid-tier model, to its innovative scanning technique, Anthropic could see what combination of artificial neurons evoked a certain concept in the LLM.

 

The most memorable run of the experiment mapped out the set of neurons activated when Claude was “thinking” of the Golden Gate Bridge, including ones related to Alcatraz, California governor Gavin Newsom and Alfred Hitchcock’s film Vertigo, set in San Francisco.

 

But what the researchers saw behind the scenes was double-edged. By adjusting neuron activity, they could not just suppress harmful features but also amplify them. In one alarming case, the experiments showed how the model could be manipulated into racist or self-destructive responses by dialling up features related to these behaviours.  

 

Making ML models intrinsically interpretable

 

The obstacles to making powerful LLMs more transparent prompt a deeper question, though: can generative AI ever become fully explainable and trustworthy?

 

The answer of the proponents of symbolic AI – the dominant AI paradigm before the rise of machine learning – is an emphatic no.

 

American cognitive scientist Gary Marcus, a leading voice for the symbolic approach, argues that attempts to retrofit reasoning onto LLMs reveal the models’ fundamental limits.

 

The latest such attempt is the emergence of large reasoning models (LRMs). OpenAI’s o1 model, released in late 2024, used a “chain-of-thought” method to reason step by step instead of producing off-the-cuff answers.
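
o1’s reasoning traces are generated internally and largely hidden, but the underlying idea can be illustrated with the older prompting form of chain-of-thought. The sketch below uses the OpenAI Python client; the model name and prompt are assumptions for illustration, not OpenAI’s internal method.

```python
# Illustrative chain-of-thought *prompting* sketch, the explicit precursor to
# what reasoning models do internally. Model name is an assumption.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

question = "A shop sells pens at 3 for £2. How much do 12 pens cost?"

# Direct prompt: the model answers off the cuff.
direct = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": question}],
)

# Chain-of-thought prompt: the model is asked to show intermediate steps
# before committing to a final answer.
cot = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{
        "role": "user",
        "content": question + " Think step by step, then give the final answer.",
    }],
)

print(direct.choices[0].message.content)
print(cot.choices[0].message.content)
```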

 

At first glance, o1 seemed like a breakthrough. But closer inspection revealed that the new model is an unreliable witness to its own thought processes, and its reasoning is often fabricated.

 

Paradoxically, OpenAI also reported that its new reasoning models tend to hallucinate more than their predecessors: in-house benchmark tests recorded hallucination rates of 33 per cent for o3 and 48 per cent for o4-mini.

 

The credibility of LRMs suffered a blow in June 2025 when Apple published The Illusion of Thinking, a study showing that LRMs’ accuracy collapses on highly complex tasks and that their apparent reasoning is in fact a sophisticated form of pattern-matching.

 

A hybrid way forward?

 

But what comes next if it turns out LLMs really aren’t capable of more than mimicking reasoning?

 

Symbolic AI proponents argue for a hybrid approach. Instead of rivalry between the probabilistic machine learning and the rule-based symbolic strands of AI, they envision an all-hands-on-deck effort.

 

Probabilistic models excel at pattern recognition and language fluency, while symbolic systems offer logic, structure and interpretability. Combining these strengths, they argue, may be the only path to AGI.
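
In miniature, such a hybrid might look like the sketch below: a probabilistic model proposes an answer and a deterministic symbolic layer checks it before it is trusted. The ask_llm helper is hypothetical; only the symbolic check is concrete.

```python
# Minimal neuro-symbolic sketch: the LLM proposes, a symbolic rule verifies.
# ask_llm is a hypothetical stand-in for any LLM call.
import re

def ask_llm(prompt: str) -> str:
    # Hypothetical: a real system would call an LLM API here.
    return "The total is 27."

def verify_sum(claimed: str, numbers: list[int]) -> bool:
    """Symbolic check: extract the claimed total and recompute it exactly."""
    match = re.search(r"-?\d+", claimed)
    return match is not None and int(match.group()) == sum(numbers)

numbers = [12, 7, 8]
answer = ask_llm(f"What is the sum of {numbers}?")

if verify_sum(answer, numbers):
    print("Accepted:", answer)
else:
    # Fall back to the symbolic result rather than trusting the fluent answer.
    print("Rejected; symbolic result is", sum(numbers))
```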

 

Whether this integration happens through a formal paradigm shift or the gradual incorporation of reasoning tools into LLMs remains to be seen.

 

But inventions also have their life cycles. As the initial excitement about the shiny new tool subsides and actual and potential investors start asking about the return on their investment, AI must move on to the next phase, where trustworthiness and reliability are front and centre.

 

After all, although Alexander Fleming discovered penicillin, it was a team of researchers at the University of Oxford who turned it into a drug and developed the necessary methods for its mass production.
