
When AI reasoning meets reality, the robot correctly applies the logical rule but mistakes the basketball for an orange. This is a perfect metaphor for how LLMs can simulate logical processes without possessing true understanding.
In recent months, the artificial intelligence community has been embroiled in a heated debate sparked by two influential research papers published by Apple. The first, “GSM-Symbolic” (October 2024), and the second, “The Illusion of Thinking” (June 2025), questioned the supposed reasoning abilities of Large Language Models and drew mixed reactions across the industry.
As already analyzed in our previous in-depth article on “The illusion of progress: simulating general artificial intelligence without achieving it”, the question of artificial reasoning touches on the very heart of what we consider intelligence in machines.
What Apple's Research Says
Apple researchers conducted a systematic analysis of Large Reasoning Models (LRMs)—models that generate detailed reasoning traces before providing an answer. The results were surprising and, for many, alarming.
The Tests Conducted
The study subjected the most advanced models to classic algorithmic puzzles such as:
Tower of Hanoi: A mathematical puzzle invented by the French mathematician Édouard Lucas in 1883
River Crossing Problems: Logical puzzles with specific constraints
GSM-Symbolic Benchmark: Variations of elementary-level mathematical problems

Testing reasoning with classic riddles: the problem of the farmer, the wolf, the goat, and the cabbage is one of the logic puzzles used in the Apple studies to assess the reasoning abilities of LLMs. The difficulty lies in finding the correct sequence of crossings while preventing the wolf from eating the goat, or the goat from eating the cabbage, when left alone together. A simple but effective test for distinguishing algorithmic understanding from pattern memorization.
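To see what an algorithmic solution looks like, here is a minimal brute-force solver for the puzzle; it is our own illustrative sketch, not code from either paper. A breadth-first search over the possible bank configurations finds the shortest sequence of crossings deterministically, with no pattern matching involved:

```python
from collections import deque

ITEMS = {"wolf", "goat", "cabbage"}

def safe(bank, farmer_here):
    # A bank is unsafe only when the farmer is absent and
    # wolf+goat or goat+cabbage are left together.
    if farmer_here:
        return True
    return not ({"wolf", "goat"} <= bank or {"goat", "cabbage"} <= bank)

def solve():
    # State: the set of items on the left bank, plus the farmer's position.
    start = (frozenset(ITEMS), "left")
    goal = (frozenset(), "right")
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        (left, farmer), path = queue.popleft()
        if (left, farmer) == goal:
            return path
        here = left if farmer == "left" else ITEMS - left
        # The farmer crosses alone or with one item from his bank.
        for cargo in [None, *here]:
            new_left = set(left)
            if cargo:
                (new_left.discard if farmer == "left" else new_left.add)(cargo)
            new_farmer = "right" if farmer == "left" else "left"
            new_right = ITEMS - new_left
            # Both banks must remain safe after the crossing.
            if safe(new_left, new_farmer == "left") and safe(new_right, new_farmer == "right"):
                state = (frozenset(new_left), new_farmer)
                if state not in seen:
                    seen.add(state)
                    queue.append((state, path + [(cargo or "nothing", new_farmer)]))

for cargo, side in solve():
    print(f"Farmer takes {cargo} to the {side} bank")
```

Running it prints the classic seven-crossing solution. An agent with genuine algorithmic understanding behaves like this search, succeeding on any solvable variant of the constraints, while a pattern matcher is tied to formulations it has already seen.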
Controversial Results
The results showed that even small changes in the formulation of the problems lead to significant variations in performance, suggesting a worrying fragility in reasoning. As reported in AppleInsider's coverage, “the performance of all models declines when only the numerical values in the GSM-Symbolic benchmark questions are altered.”
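The mechanism behind this finding is easy to picture. GSM-Symbolic builds variants of the same problem by swapping the names and numbers in a template; the toy sketch below is our own simplification, with invented names and ranges. A model that genuinely reasons should score the same on every instantiation, because the underlying arithmetic never changes:

```python
import random

# Toy illustration of the GSM-Symbolic idea (our simplification):
# one problem template, many surface instantiations.
TEMPLATE = ("{name} picks {k} baskets of apples. Each basket holds {n} apples. "
            "{name} eats {e} apples. How many apples are left?")

def make_variant(rng):
    name = rng.choice(["Sophie", "Liam", "Ava", "Noah"])
    k, n, e = rng.randint(2, 9), rng.randint(3, 12), rng.randint(1, 5)
    question = TEMPLATE.format(name=name, k=k, n=n, e=e)
    answer = k * n - e  # the ground truth is computed, not memorized
    return question, answer

rng = random.Random(42)
for _ in range(3):
    q, a = make_variant(rng)
    print(q, "->", a)
```

The performance drop Apple measured when only these surface values change is what suggests the models are matching familiar formulations rather than executing the computation.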
The Counteroffensive: “The Illusion of the Illusion of Thinking”
The AI community's response was swift. Alex Lawsen of Open Philanthropy, in collaboration with Anthropic's Claude Opus model (credited as co-author), published a detailed rebuttal titled “The Illusion of the Illusion of Thinking,” challenging the methodologies and conclusions of the Apple study.
The Main Objections
Ignored Output Limits: Many failures attributed to “reasoning collapse” were actually due to the models' output token limits (see the back-of-the-envelope sketch after this list).
Incorrect Evaluation: The automated scripts classified even partially correct outputs as total failures.
Impossible Problems: Some puzzles were mathematically unsolvable, but the models were penalized for not solving them.
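The first objection is easy to quantify. An n-disk Tower of Hanoi requires 2^n - 1 moves, so a complete move listing grows exponentially while the solving procedure itself stays constant-sized. The tokens-per-move figure below is our illustrative assumption, not a number from either paper:

```python
# Back-of-the-envelope: an n-disk Tower of Hanoi needs 2**n - 1 moves.
# TOKENS_PER_MOVE is an illustrative assumption for a line like
# "move disk 3 from A to C".
TOKENS_PER_MOVE = 10

for n in (8, 10, 12, 15):
    moves = 2**n - 1
    print(f"{n} disks: {moves:>6} moves, roughly {moves * TOKENS_PER_MOVE:>7} output tokens")
```

At 15 disks that is 32,767 moves, hundreds of thousands of output tokens at any plausible verbosity, beyond the output cap of most deployed models. A model can know the algorithm perfectly and still be physically unable to print the full trace.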
Confirmation Tests
When Lawsen repeated the tests using alternative methodologies—asking the models to generate recursive functions instead of listing all moves—the results were dramatically different. Models such as Claude, Gemini, and GPT correctly solved 15-disk Tower of Hanoi problems, well beyond the complexity where Apple reported zero successes.
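For context, the kind of compact answer Lawsen's methodology allows looks roughly like this textbook sketch of ours (not the verbatim output of any model). The complete procedure fits in a few lines, even though executing it for 15 disks emits all 32,767 moves:

```python
def hanoi(n, source, target, spare, moves):
    # Classic recursion: move n-1 disks out of the way, move the
    # largest disk, then move the n-1 disks back on top of it.
    if n == 0:
        return
    hanoi(n - 1, source, spare, target, moves)
    moves.append((n, source, target))
    hanoi(n - 1, spare, target, source, moves)

moves = []
hanoi(15, "A", "C", "B", moves)
print(len(moves))  # 32767 == 2**15 - 1
```

Producing this function demonstrates command of the algorithm; enumerating its output is a test of the token budget, not of reasoning.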
Gary Marcus: The Longtime Critic
Gary Marcus, a longtime critic of LLM reasoning abilities, embraced Apple's findings as confirmation of his two-decade-old thesis. According to Marcus, LLMs continue to struggle with “distribution shift” — the ability to generalize beyond training data — remaining “good solvers of problems that have already been solved.”
The LocalLlama Community
The discussion has also spread to specialized communities such as LocalLlama on Reddit, where developers and researchers debate the practical implications for open-source models and local deployment.
Beyond the Controversy: What It Means for Businesses
Strategic Implications
This debate is not purely academic. It has direct implications for:
AI Deployment in Production: How much can we trust models for critical tasks?
R&D Investments: Where should we focus resources for the next breakthrough?
Communication with Stakeholders: How can we manage realistic expectations about AI capabilities?
The Neuro-Symbolic Approach
As highlighted in several technical insights, there is an increasingly clear need for hybrid approaches that combine:
Neural networks for pattern recognition and language understanding
Symbolic systems for algorithmic reasoning and formal logic
A simple example: an AI assistant that helps with accounting. The language model understands when you ask, “How much did I spend on travel this month?” and extracts the relevant parameters (category: travel, period: this month). But the SQL statement that queries the database, computes the sum, and checks tax constraints? That is handled by deterministic code, not by the neural model.
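A sketch of that division of labor is below. Everything in it, from the table schema to the extracted-parameter format, is a hypothetical example of ours; in practice the extraction step would typically go through function calling or structured output. The model's only job is to produce parameters; a parameterized SQL query does the deterministic work:

```python
import sqlite3

# Hypothetical schema and data for the accounting example.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE expenses (category TEXT, amount REAL, spent_on TEXT)")
db.executemany("INSERT INTO expenses VALUES (?, ?, ?)", [
    ("travel", 420.50, "2025-06-03"),
    ("travel", 180.00, "2025-06-17"),
    ("office", 75.25, "2025-06-10"),
])

def sum_expenses(category, start, end):
    # Deterministic part: a parameterized query, not model-generated SQL.
    row = db.execute(
        "SELECT COALESCE(SUM(amount), 0) FROM expenses "
        "WHERE category = ? AND spent_on BETWEEN ? AND ?",
        (category, start, end),
    ).fetchone()
    return row[0]

# Pretend the language model parsed "How much did I spend on travel
# this month?" into these structured parameters.
params = {"category": "travel", "start": "2025-06-01", "end": "2025-06-30"}
print(sum_expenses(**params))  # 600.5
```

The neural side handles the ambiguity of language; the symbolic side guarantees that the sum is actually a sum.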
Timing and Strategic Context
It has not escaped observers that Apple's paper was published shortly before WWDC, raising questions about strategic motivations. As noted in the analysis by 9to5Mac, “the timing of Apple's paper—just before WWDC—raised some eyebrows. Was this a research milestone, or a strategic move to reposition Apple in the broader AI landscape?”
Lessons for the Future
For Researchers
Experimental Design: The importance of distinguishing between architectural limitations and implementation constraints
Rigorous Evaluation: The need for sophisticated benchmarks that separate cognitive capabilities from practical constraints
Methodological Transparency: The obligation to fully document experimental setups and limitations
For Companies
Realistic Expectations: Recognizing current limitations without giving up on future potential
Hybrid Approaches: Investing in solutions that combine the strengths of different technologies
Continuous Evaluation: Implementing testing systems that reflect real-world usage scenarios
The debate sparked by Apple's papers reminds us that we are still in the early stages of understanding artificial intelligence. As highlighted in our previous article, the distinction between simulation and authentic reasoning remains one of the most complex challenges of our time.
The real lesson is not whether LLMs can “reason” in the human sense of the word, but rather how we can build systems that leverage their strengths while compensating for their limitations. In a world where AI is already transforming entire industries, the question is no longer whether these tools are “intelligent,” but how to use them effectively and responsibly.
The future of enterprise AI will likely lie not in a single revolutionary approach, but in the intelligent orchestration of several complementary technologies. And in this scenario, the ability to critically and honestly assess the capabilities of our tools becomes a competitive advantage in itself.
For insights into your organization's AI strategy and the implementation of robust solutions, our team of experts is available for personalized consultations.
Sources and References:
GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models - Apple Machine Learning Research
The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models - Apple Machine Learning Research
Seven replies to the viral Apple reasoning paper - Gary Marcus
Apple's study proves that LLM-based AI models are flawed - AppleInsider