AI experts divided over Apple's research on large reasoning model accuracy

Apple’s observations in the paper may explain why the iPhone maker has been slow to embed AI across its products and operating systems
Avik Das, Bengaluru
4 min read | Last Updated: Jun 15 2025 | 10:42 PM IST
A recent study by tech giant Apple claiming that the accuracy of frontier large reasoning models (LRMs) declines as task complexity increases, and eventually collapses altogether, has led to differing views among experts in the artificial intelligence (AI) world.
 
The paper, titled ‘The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity’, was published by Apple last week.
 
Apple, in its paper, said it conducted experiments across diverse puzzles which show that such LRMs face a complete accuracy collapse beyond certain complexities. While their reasoning effort increases with problem complexity up to a point, it then declines despite an adequate token budget.
 
A token budget for large language models (LLM) refers to the practice of setting a limit on the number of tokens an LLM can use for a specific task.
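To illustrate the idea, here is a minimal sketch of how such a cap might be enforced around a model call. The call_llm function, its max_output_tokens parameter, and the crude token count below are hypothetical placeholders for illustration, not any specific vendor’s API.
 
    # Illustrative sketch only: enforcing a token budget on a single task.
    # `call_llm` is a hypothetical stand-in for a real model client; most vendor
    # APIs expose a similar cap, commonly named max_tokens or max_output_tokens.

    TOKEN_BUDGET = 4096  # total tokens the model may spend on this task

    def call_llm(prompt: str, max_output_tokens: int) -> str:
        """Placeholder for a real model call; just echoes the cap for the demo."""
        return f"[model response, capped at {max_output_tokens} tokens]"

    def solve_with_budget(prompt: str, budget: int = TOKEN_BUDGET) -> str:
        # A crude whitespace count stands in for a proper tokenizer here.
        prompt_tokens = len(prompt.split())
        remaining = budget - prompt_tokens
        if remaining <= 0:
            raise ValueError("Prompt alone exceeds the token budget")
        # The model may reason at length, but never beyond the remaining budget.
        return call_llm(prompt, max_output_tokens=remaining)

    print(solve_with_budget("Solve the Tower of Hanoi puzzle with seven disks."))
 
Apple’s point, in these terms, is that even when the remaining budget was comfortably large, the models in its experiments began spending less reasoning effort as the puzzles grew harder.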
 
The paper is co-authored by Samy Bengio, senior director of AI and ML research at Apple, who is also the brother of Yoshua Bengio, often referred to as one of the godfathers of AI.
 
Meanwhile, AI company Anthropic, backed by Amazon, countered Apple’s claims in a separate paper, saying that the “findings primarily reflect experimental design limitations rather than fundamental reasoning failures.”
 
“Their central finding has significant implications for AI reasoning research. However, our analysis reveals that these apparent failures stem from experimental design choices rather than inherent model limitations,” it said.
 
Mayank Gupta, founder of Swift Anytime, who is currently building an AI product in stealth, told Business Standard that both sides raise equally important points.
 
“What this tells me is that we’re still figuring out how to measure reasoning in LRMs the right way. The models are improving rapidly, but our evaluation tools haven’t caught up. We need tools that separate how well an LRM reasons from how well it generates output and that’s where the real breakthrough lies,” he said. 
 
Gary Marcus, a US academic who has become a voice of caution on the capabilities of AI models, said that in a best-case scenario these models can write Python code, supplementing their own weaknesses with outside symbolic code, but even this is not reliable. “What this means for business and society is that you can’t simply drop o3 or Claude into some complex problem and expect it to work reliably,” he wrote on his blog, Marcus on AI.
 
The Apple researchers conducted experiments comparing thinking and non-thinking model pairs across controlled puzzle environments. “The most interesting regime is the third regime where problem complexity is higher and the performance of both models have collapsed to zero. Results show that while thinking models delay this collapse, they also ultimately encounter the same fundamental limitations as their non-thinking counterparts,” they wrote.
 
Apple’s observations in the paper may explain why the iPhone maker has been slow to embed AI across its products and operating systems, a point on which it was criticised at the Worldwide Developers Conference (WWDC) last week. This approach is the opposite of the one adopted by Microsoft-backed OpenAI, Meta, and Google, which are spending billions to build more sophisticated frontier models that can solve more complex tasks.
 
There are, however, other voices who believe that Apple’s paper has its own limitations.
 
Ethan Mollick, associate professor at the Wharton School who studies the effects of AI on work, entrepreneurship, and education, said on X that while understanding the limits of reasoning models is useful, it is premature to say that LLMs are hitting a wall.
