It’s (not) thinking?
LLMs’ “simulated reasoning” abilities are a “brittle mirage,” researchers find
Chain-of-thought AI “degrades significantly” when asked to generalize beyond training.
https://arstechnica.com/ai/2025/08/researchers-find-llms-are-bad-at-logical-inference-good-at-fluent-nonsense/
***********************************
In recent months, the AI industry has started moving toward so-called simulated reasoning models that use a “chain of thought” process to work through tricky problems in multiple logical steps. At the same time, recent research has cast doubt on whether those models have even a basic understanding of general logical concepts or an accurate grasp of their own “thought process.” Similar research shows that these “reasoning” models can often produce incoherent, logically unsound answers when questions include irrelevant clauses or deviate even slightly from common templates found in their training data.
In a recent pre-print paper, researchers from the University of Arizona summarize this existing work as “suggest[ing] that LLMs are not principled reasoners but rather sophisticated simulators of reasoning-like text.” To pull on that thread, the researchers created a carefully controlled LLM environment in an attempt to measure just how well chain-of-thought reasoning works when presented with “out of domain” logical problems that don’t match the specific logical patterns found in their training data.
The results suggest that the seemingly large performance leaps made by chain-of-thought models are “largely a brittle mirage” that “become[s] fragile and prone to failure even under moderate distribution shifts,” the researchers write. “Rather than demonstrating a true understanding of text, CoT reasoning under task transformations appears to reflect a replication of patterns learned during training.”
No one trained me for this!
To test an LLM’s generalized reasoning capability in an objective, measurable way, the researchers created a specially controlled LLM training environment called DataAlchemy. This setup creates small models trained on examples of two extremely simple text transformations (a ROT cipher and a cyclical shift), followed by additional training that demonstrates those two functions performed in various orders and combinations.
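For concreteness, here is a minimal Python sketch of what two such primitive transformations and their compositions could look like. The exact alphabet, shift amounts, and composition scheme used in DataAlchemy are not spelled out in the article, so treat these details as illustrative assumptions rather than the paper’s actual setup.

```python
# Two toy text transformations of the kind described above, plus composition.
# Alphabet, shift amounts, and composition order are assumptions for illustration.

ALPHABET = "abcdefghijklmnopqrstuvwxyz"

def rot_cipher(text: str, shift: int = 3) -> str:
    """Replace each letter with the one `shift` positions later in the alphabet."""
    return "".join(
        ALPHABET[(ALPHABET.index(c) + shift) % len(ALPHABET)] if c in ALPHABET else c
        for c in text
    )

def cyclic_shift(text: str, positions: int = 1) -> str:
    """Rotate the whole string left by `positions` characters."""
    k = positions % len(text)
    return text[k:] + text[:k]

def compose(text: str, ops: list) -> str:
    """Apply a sequence of transformations in order."""
    for op in ops:
        text = op(text)
    return text

# A composition of the two primitives (ROT cipher, then cyclic shift).
print(compose("hello", [rot_cipher, cyclic_shift]))  # "hoork"

# A different composition (two ROT ciphers) that a model trained only on the
# first pattern would have to generalize to "out of domain".
print(compose("hello", [rot_cipher, rot_cipher]))    # "nkrru"
```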
These simplified models were then tested on a variety of tasks, some of which precisely or closely matched the function patterns in the training data and others that required function compositions that were either partially or fully “out of domain” for the training data. For instance, a model trained on data showing two cyclical shifts might be asked to perform a novel transformation involving two ROT ciphers, having only been shown a single example of each transformation on its own. The final answers and reasoning steps were compared to the desired answer using BLEU scores and Levenshtein distance for an objective measure of their accuracy.
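The scoring itself amounts to string comparison. The sketch below shows one way such a comparison could be computed, using a hand-rolled Levenshtein distance and nltk’s sentence-level BLEU; the researchers’ actual tokenization and tooling are not specified in the article, so the specifics here are assumptions.

```python
# Comparing a hypothetical model output against the ground-truth transformation
# using Levenshtein distance and a (character-level) BLEU score.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits turning string a into string b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution
            ))
        prev = curr
    return prev[-1]

reference = "hoork"    # desired answer (ground-truth composition)
hypothesis = "hoorz"   # hypothetical model output with one wrong character

print(levenshtein(hypothesis, reference))  # 1 edit away from correct

# Character-level BLEU with smoothing, since the strings are short.
bleu = sentence_bleu([list(reference)], list(hypothesis),
                     smoothing_function=SmoothingFunction().method1)
print(round(bleu, 3))
```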
As the researchers hypothesized, these basic models started to fail catastrophically when asked to generalize to novel sets of transformations that were not directly demonstrated in the training data. While the models would often try to generalize new logical rules based on similar patterns in the training data, this would quite often lead to the model laying out “correct reasoning paths, yet incorrect answer[s].” In other cases, the LLM would stumble onto correct answers paired with “unfaithful reasoning paths” that didn’t follow logically.
The researchers went on to test their controlled system using input text strings slightly shorter or longer than those found in the training data, or that required function chains of different lengths than those it was trained on. In both cases, the accuracy of the results “deteriorates as the [length] discrepancy increases,” thus “indicating the failure of generalization” in the models. Small discrepancies in the format of the test tasks that the model had not seen before (e.g., the introduction of letters or symbols not found in the training data) also caused performance to “degrade sharply” and “affect[ed] the correctness” of the model’s responses, the researchers found.
“A false aura of dependability”
Using supervised fine-tuning (SFT) to introduce even a small amount of relevant data to the training set can often lead to strong improvements in this kind of “out of domain” model performance. But the researchers say that this kind of “patch” for various logical tasks “should not be mistaken for achieving true generalization. ... Relying on SFT to fix every [out of domain] failure is an unsustainable and reactive strategy that fails to address the core issue: the model’s lack of abstract reasoning capability.”
Rather than showing the capability for generalized logical inference, these chain-of-thought models are “a sophisticated form of structured pattern matching” that “degrades significantly” when pushed even slightly outside of its training distribution, the researchers write. Further, the ability of these models to generate “fluent nonsense” creates “a false aura of dependability” that does not stand up to a careful audit.
As such, the researchers warn heavily against “equating [chain-of-thought]-style output with human thinking” especially in “high-stakes domains like medicine, finance, or legal analysis.” Current tests and benchmarks should prioritize tasks that fall outside of any training set to probe for these kinds of errors, while future models will need to move beyond “surface-level pattern recognition to exhibit deeper inferential competence,” they write.
LLM Hallucination Seems Like a Very Big Problem, Not a Mere Speedbump
and yet concerns about them are just handwaved away
https://freddiedeboer.substack.com/p/llm-hallucination-seems-like-a-very
****************************************************
...sharing some of my ongoing experiments with inducing hallucinations in LLMs... But if you’re interested, look at these two interactions I had with cutting-edge LLMs, ChatGPT’s GPT-5 and Gemini 2.5 Flash. I’ve found that I can easily get them to hallucinate by asking for quotes or citations related to highly specific questions. Rather than report back that they haven’t found anything, they will simply hallucinate nonexistent sources; when the hallucination is pointed out, they’ll apologize, insist that the next source or quote they give me is verified and real, and hallucinate again. It’s funny, but also disturbing, because our economy currently relies on the AI bubble to avoid falling into a brutal recession.
GPT-5 source hallucinations (Illustration at link)
I find the particular insistence that next time, the source will be real to be a good example of why this stuff is so potentially destructive. It’s not merely that these systems hallucinate; it’s that they radically overstate how trustworthy their outputs are to a public that has been so bathed in AI hype that many can’t help but naively assume that the computer is right about everything. OpenAI says that GPT-5 cuts down on hallucination problems, but a) I don’t trust Taco Bell when they say that the new quesarito is cheesarific and b) I don’t think synthetic benchmarks can tell us how common LLM hallucinations are, only the steady accumulation of millions of prompts and responses. And if you care to look, you can find dozens and dozens of embarrassing examples of GPT-5 hallucinations online already, even though it was just made publicly available last week.
Here’s Gemini. (Illustration at link)
You’ll note that there’s a weird combination of confession and pushback here: it really seems to want to insist that the source it gave is real, and it sends back a slightly modified but still hallucinated source on the second try. Then it comes up with a new, also-hallucinated source before finally giving up when the latest hallucination is pointed out to it. If you’d like a chuckle, check out where the link in the original hallucinated source actually goes.
People tend to act very blasé about this sort of thing and say that any LLM user should be skeptical about everything, verify everything, etc. But it’s simply the case that a ton of normies out there take everything an LLM tells them at face value; go do a search on Twitter or Reddit and you’ll find that to be true. And I think these repeated assurances that these sources are real and verified, when they in fact are hallucinations, should disturb you for that reason. My readers are savvy enough to know that an LLM saying that a quote comes from a “real, verifiable critique” or “I can confirm with a very high degree of certainty that this is a real and verifiable source” doesn’t mean anything. But ChatGPT has nearly a billion users. How many of them, do you think, have a healthy level of skepticism for everything these systems are saying?
Whenever I point out this sort of thing, LLM defenders pop up and say things like “You just have to verify everything! You just have to write your prompts very carefully!” To which I would say, that makes this whole endeavor vastly less useful and valuable, doesn’t it? If you have to have human verification for everything they do, you’re eliminating a vast portion of their comparative advantage; the whole point is to eliminate the human effort! And similarly, if you have to be some sort of prompt wizard to get reliable outputs from these systems, they become far, far less useful. Most people are not and will never be skilled at writing AI prompts. The whole idea was that these systems used natural language and could adapt to meet the user! Specialty tools for a small cadre of trained professionals are just a vastly different case than the promise of artificial intelligence that knows what the user wants better than the user does - socially, scientifically, communicatively, and especially financially.
For the record, this sort of thing demonstrates why human-like intelligence should still be a central goal. In this podcast episode, Kevin Roose and Casey Newton dismiss people who point out that LLMs do not reason or think in anything like the conventional understanding of reasoning or thinking. (They do a lot of dismissing in defense of LLMs, those dudes.) But this sort of thing is a perfect example of why conventional definitions of cognition matter. A lot of people find this sort of repeated error hard to grasp; if ChatGPT produced the hallucinated source, why is it also capable of immediately telling that the source is fake once I prompt it? Why did the system that could tell if a source is fake give me a fake source in the first place? Why are they “smarter” on the second prompts than on the first, smart enough to identify their own previous answer?
Well the answer is because LLMs do not think. They do not reason. They are not conscious. There is no being there to notice this problem. LLMs are fundamentally extremely sophisticated next-character prediction engines. They use their immense datasets and their billions of parameters to do one thing: generate outputs that are statistically/algorithmically likely to be perceived to satisfy the input provided by the user. And so ChatGPT and Gemini went looking for something, found that it was inaccessible to them, and invented responses that look like satisfactory responses, even though they weren’t. They could identify that these sources were hallucinated because I prompted them to. But there is no thinking mind that could notice that a source they named wasn’t real. There is no thinking mind in an LLM at all. Ask the LLMs and they’ll tell you! One of the few things I like about them is that they tend to be upfront about the fact that they’re Chinese rooms, just algorithms generating strings that seem likely to be what the user wants to see, mining existing text for patterns and associations and then building response strings that are algorithmically similar. That’s what they were designed to be and do. But now, we’re told, they’re more important than electricity and fire, more important than the Industrial Revolution.
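To make the “prediction engine” framing above concrete, here is a toy sketch of sampling a continuation from a probability distribution. The distribution is invented purely for illustration; real LLMs compute next-token probabilities with neural networks over billions of learned parameters, not a lookup table, but the point stands: the sampler ranks continuations by likelihood, not by truth.

```python
# Toy illustration of next-token prediction: the sampler has no notion of
# whether a continuation is true, only of how probable it is given the context.
# All probabilities below are invented for illustration.
import random

next_token_probs = {
    "According to": {
        "Smith (2019)": 0.40,        # plausible-sounding, possibly nonexistent citation
        "a recent study": 0.35,
        "the report": 0.20,
        "I could not find": 0.05,    # admitting ignorance is a low-probability pattern
    }
}

def sample_next(context: str) -> str:
    """Sample a continuation in proportion to its probability, not its accuracy."""
    dist = next_token_probs[context]
    tokens, weights = zip(*dist.items())
    return random.choices(tokens, weights=weights, k=1)[0]

print("According to", sample_next("According to"))
```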
Until and unless boosters like Newton and Roose and the rest of the media really grapple with these profound limitations, we’re at the mercy of a really dangerous hype cycle and stock market bubble. And no amount of handwaving away hallucinations or simply asserting that more compute will fix them is going to solve this vexing issue. The stakes are high. People won’t stop saying that ChatGPT is imminently going to replace your doctor. Well: do you want systems that are this confidently wrong to be prescribing medicine for your kids?