Apple Study Highlights Major Limitations in Advanced AI Models
Recent research from Apple has shed light on significant challenges faced by advanced artificial intelligence systems, raising crucial questions about the tech industry’s push to build ever more sophisticated models. The findings, presented in a paper released over the weekend, reveal that large reasoning models (LRMs) suffer a “complete accuracy collapse” when confronted with highly complex problems.
Apple’s research indicates that standard AI models actually outperformed LRMs on low-complexity tasks, while both types of model suffered dramatic performance failures when confronted with complex ones. LRMs, which are designed to tackle difficult queries by systematically breaking problems down into manageable steps, exhibited a marked decline in reasoning ability as task complexity increased.
In its examination of puzzle-solving tasks, reportedly including the Tower of Hanoi and river-crossing puzzles, the study found that as LRMs approached their performance limits they tended to “reduce their reasoning effort,” a phenomenon the researchers deemed alarming. This finding has drawn attention from academic voices within the AI community, including Gary Marcus, a prominent critic of overestimating AI capabilities. Marcus characterized the Apple research as “pretty devastating,” expressing skepticism about the notion that large language models (LLMs), the foundation of tools like ChatGPT, might offer a straightforward path to artificial general intelligence (AGI) capable of benefitting society.
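To make the experimental setup concrete, the sketch below illustrates the kind of complexity sweep the study describes: the same puzzle family is posed at increasing sizes and accuracy is recorded per level. This is an illustration of the method rather than the paper’s actual harness, and the `query_model` and `check_solution` helpers are hypothetical placeholders.

```python
# A minimal sketch (not the paper's actual harness) of a complexity sweep:
# pose the same puzzle family at increasing sizes, record accuracy per level.
# `query_model` and `check_solution` are hypothetical placeholders for the
# model call and the puzzle verifier.

def query_model(prompt: str) -> str:
    """Hypothetical call to a reasoning model; returns its answer text."""
    raise NotImplementedError

def check_solution(size: int, answer: str) -> bool:
    """Hypothetical verifier for a puzzle instance of the given size."""
    raise NotImplementedError

def accuracy_by_complexity(sizes, trials=20):
    """Return {size: fraction correct} over repeated trials at each size."""
    results = {}
    for n in sizes:
        correct = 0
        for _ in range(trials):
            answer = query_model(f"Solve the {n}-element puzzle, showing your steps.")
            correct += check_solution(n, answer)
        results[n] = correct / trials
    return results  # the paper reports accuracy dropping to zero past a threshold
```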
Furthermore, the paper highlighted inefficiencies in how reasoning models use computational resources. On simpler problems they often found the correct solution early but continued “overthinking,” exploring incorrect alternatives and wasting compute. As tasks grew moderately more difficult, the models first pursued incorrect paths before eventually discovering the right answers. And on high-complexity challenges, these models often reached a point of “collapse,” failing to produce any valid solutions at all. In one instance, even when explicitly provided with a suitable algorithm, the models did not manage to arrive at the correct answer.
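The algorithm-in-hand failure is striking because the procedures involved are short and mechanical. As an illustration, below is the classic recursive Tower of Hanoi procedure, one of the puzzle types reported in coverage of the study; per the paper, models struggled to execute even a supplied algorithm of this kind reliably as the problem size grew.

```python
# The classic recursive Tower of Hanoi procedure, one of the puzzles
# reported in coverage of the study. Even with a correct algorithm like
# this supplied in the prompt, the paper found models failed to execute
# it reliably as the number of disks (and thus the move count) grew.

def hanoi(n: int, source: str, target: str, spare: str, moves: list) -> None:
    """Append the optimal move sequence for n disks to `moves`."""
    if n == 0:
        return
    hanoi(n - 1, source, spare, target, moves)  # clear n-1 disks onto the spare peg
    moves.append((source, target))              # move the largest remaining disk
    hanoi(n - 1, spare, target, source, moves)  # restack the n-1 disks on top

moves: list = []
hanoi(3, "A", "C", "B", moves)
print(len(moves), moves)  # 2**3 - 1 = 7 moves; length grows exponentially in n
```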
The researchers noted, “As they approach a critical threshold related to accuracy collapse, models paradoxically begin to scale back their reasoning efforts, even as problem difficulty increases.” The observation points to a fundamental scaling limit in the reasoning capabilities of current models.
In discussing the concept of “generalisable reasoning”—an AI’s ability to extend a specific conclusion to broader scenarios—the paper challenges established beliefs regarding LRM capabilities, indicating that current methodologies may be encountering foundational obstacles that inhibit generalisation.
Andrew Rogoyski from the Institute for People-Centred AI at the University of Surrey said the findings suggested the industry may have reached a dead end in its current approach to AGI.