Before Apple’s AI Went Haywire and Started Making Up Fake News, Its Engineers Warned of Deep Flaws With the Tech
And they released it anyway.
News Flash, Buddy
Apple’s recent stab at AI, Apple Intelligence, has largely been disappointing. In particular, its news summaries faced such widespread criticism for botching headlines and reporting false information that this week Apple paused the entire program until it can be fixed.
None of this should be surprising. Such AI “hallucinations” are a problem inherent to all large language models that nobody’s solved yet, if it’s even solvable at all. But releasing its own AI model sounds especially reckless when you consider that Apple engineers warned about the tech’s gaping deficiencies.
That warning came in a study released last October. The yet-to-be-peer-reviewed work, which tested the mathematical “reasoning” of some of the industry’s top LLMs, added to the consensus that AI models don’t actually reason.
“Instead,” the researchers concluded, “they attempt to replicate the reasoning steps observed in their training data.”
Math Is Hard
To test the AI models, the researchers had them attempt thousands of math problems from the widely used benchmark GSM8K dataset. A typical question is as follows: “James buys 5 packs of beef that are 4 pounds each. The price of beef is $5.50 per pound. How much did he pay?” Some questions are a tad more complicated, but it’s nothing that a well-educated middle schooler can’t solve.
The way the researchers exposed these gaps in the AI models was shockingly easy: they simply changed the numbers in the questions. This guards against data contamination. In other words, it ensures that the AIs haven’t seen any of these exact problems before in their training data, without actually making the problems any harder.
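To make the idea concrete, here is a minimal sketch of number-swapping, written in Python and modeled on the sample beef question quoted above. It is an illustration only, not the researchers' actual code: the template, the function names, and the number ranges are all assumptions made for this example.

```python
import random

# Hypothetical template based on the sample question quoted in the article.
TEMPLATE = (
    "James buys {packs} packs of beef that are {pounds} pounds each. "
    "The price of beef is ${price:.2f} per pound. How much did he pay?"
)


def original_question():
    """Return the quoted question and its answer (5 packs x 4 lb x $5.50)."""
    packs, pounds, price = 5, 4, 5.50
    question = TEMPLATE.format(packs=packs, pounds=pounds, price=price)
    return question, packs * pounds * price


def make_variant(rng):
    """Return a (question, answer) pair with fresh numbers.

    The reasoning steps stay the same (multiply three quantities);
    only the values change. The ranges below are arbitrary assumptions.
    """
    packs = rng.randint(2, 9)
    pounds = rng.randint(2, 9)
    price = rng.randint(2, 9) + rng.choice([0.0, 0.25, 0.5, 0.75])
    question = TEMPLATE.format(packs=packs, pounds=pounds, price=price)
    return question, packs * pounds * price


if __name__ == "__main__":
    question, answer = original_question()
    print(question, "->", answer)  # the original works out to 110.0

    rng = random.Random()
    for _ in range(3):
        variant, variant_answer = make_variant(rng)
        print(variant, "->", variant_answer)
```

Every variant takes exactly the same two multiplications to solve, so a system that truly reasons through the problem should score the same on the variants as on the original.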
This alone caused a minor but notable drop in accuracy in every single one of the 20 tested LLMs. But when the researchers took things a step further by also changing the names and adding irrelevant details (in a question about counting fruits, for instance, remarking that a handful of them were “smaller than usual”), the performance drop was, in the researchers’ own wording, “catastrophic”: as high as 65 percent.
The drops varied between models, but even the cleverest of the bunch, OpenAI’s o1-preview, plummeted by 17.5 percent. (Its predecessor, GPT-4o, fell by 32 percent.)