Photo by Giuseppe CUZZOCREA on Unsplash
AI is very good at appearing smart, and will often give you the right answer! But once it can't copy someone's homework word-for-word, it struggles — big time.
A couple of months ago, I penned a piece accusing Apple of promoting its AI in a way that either encourages people to lie and cheat or will simply make them stupid. To sum it up as briefly as possible, I wrote:
Not only does [Apple Intelligence] dismiss any need to actually know something, but it encourages its users to pretentiously act as if they do… In effect, Apple is almost explicitly saying that with its new product, one need not bother to put any effort into anything Apple Intelligence can do for them.
My warning was well founded. As it turns out, Apple Intelligence—as it is clumsily called—is not so intelligent after all. So much for Bella’s smugness. In fact, Apple’s LLM cannot perform a very basic linguistic function: summarize the news accurately. And this is not a problem exclusive to Apple. Virtually all AI, and LLMs perhaps worst of all, suffer from an intelligence problem.
Pulling the service
On January 16, various media outlets reported that Apple was pulling its AI service that summarizes news stories and headlines. Geoffrey Fowler, a tech columnist with the Washington Post, repeatedly pointed out errors in the AI’s output on his Bluesky account. He wrote:
More Apple Intelligence nonsense this morning. This time only 50% wrong.
That was a day after he posted:
It's wildly irresponsible that Apple doesn't turn off summaries for news apps until it gets a bit better at this AI thing.
In both cases, he provided screenshots as illustration. The BBC and other news agencies also complained, ultimately leading to the suspension of the service. In classic corporate speak, an Apple spokesperson said only, “We are working on improvements and will make them available in a future software update.”
Disappointment with the product is not just a phenomenon of the last week, however. In December, Frank Landymore at Futurism called it “a total dud with buyers.” He noted that 73 percent of respondents to a survey by the online smartphone marketplace SellCell rated the service “not very valuable” or said it “add[ed] little to no value.” Even then, Apple’s AI had already made the news for generating a completely fabricated headline about the alleged shooter of the UnitedHealthcare CEO.
It’s simple—AI cannot reason, so it screws up very basic things
No matter how much venture capitalists want to convince the public that AI is one software tweak away from going full Skynet, the fact is that it cannot reason in any meaningful way, at least not in the sense nonscientists tend to have in mind. A more accurate way to think about what LLMs do is that they calculate.
If you do an internet search for “Can AI reason?” you will find myriad answers. The debate rapidly becomes a philosophic or semantic one. Often, defenders of poorly performing LLMs will point to the quality of the training set as the culprit rather than to the nature of whatever constitutes reasoning.
The problem with that argument, however, is that AI—and LLMs in particular—must train on extraordinarily large datasets to function adequately. In doing so, LLMs inevitably ‘learn’ from practically everything out there on the internet, nonsense or not. (And you don’t need me to tell you how much of the internet is nonsense.)
It is worse than that, though. In October 2024, researchers illustrated with a relatively simple experiment that the inability to contextualize is a key deficiency in LLMs’ ability to ‘reason.’ They set LLMs to work on basic math problems that the models, though not built specifically for arithmetic, should still be able to solve.
The LLMs typically gave the right answer to basic challenges. When the researchers added a tiny bit of extraneous information to the problem presented, however, the LLMs routinely failed.
Here is how they described the results:
Overall, we find that models tend to convert statements to operations without truly understanding their meaning. For instance, a common case we observe is that models interpret statements about “discount” as “multiplication”, regardless of the context. This raises the question of whether these models have truly understood the mathematical concepts well enough. Consequently… there is a catastrophic performance decline across all tested models, with the Phi-3-mini model experiencing over a 65% drop, and even stronger models such as o1-preview showing significant declines.
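To make that failure mode concrete, here is a hypothetical illustration of the kind of perturbation the researchers describe. The problem, numbers, and wording below are my own, not taken from the paper:

```python
# A hypothetical illustration (my own, not an example from the paper) of the
# kind of perturbation described above: the arithmetic stays identical, but
# one irrelevant clause is added to the prompt.

baseline = (
    "A jacket costs $80. The store applies a 25% discount. "
    "What does the jacket cost now?"
)

perturbed = (
    "A jacket costs $80. The store applies a 25% discount. "
    "The jacket was designed in 2019. "  # extraneous detail
    "What does the jacket cost now?"
)

# The correct answer is the same for both prompts:
price = 80 * (1 - 0.25)
print(price)  # 60.0

# The reported failure mode: a model that pattern-matches "discount" to
# "multiply the numbers in the prompt" may fold the irrelevant number (2019)
# into its calculation, or otherwise change its answer, even though nothing
# that matters has changed.
```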
Another group of researchers reported a similar conclusion on the issue of logical reasoning. In (mostly) plain English, they wrote:
The findings in this study suggest, with statistical guarantee, that most LLMs still struggle with logical reasoning. While they may perform well on classic problems, their success largely depends on recognizing superficial patterns with strong token bias, thereby raising concerns about their actual reasoning and generalization abilities.
To understand this more deeply, check out this paper about what token bias is and why it matters. In short, text is broken into chunks called tokens, and each token is assigned a unique numeric ID. When the AI is presented with a task whose tokens it has seen often enough, and in consistent enough patterns, it performs well.
If certain critical tokens are absent or inconsistent, however, performance degrades. In the paper I linked to above, the researchers made changes to tokenized text in various tests, such as switching “Linda” to “Luna.” Because the AI’s output depended on predictions tied to the appearance of the token for Linda, its accuracy plummeted when Luna was substituted.
Put another way, the AI could not understand that Linda and Luna meant the same thing even when the context clearly indicated it, so it treated the tokens of each as different mathematical variables. Discrepancies like this decreased accuracy by more than 70% in some cases.
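To see why swapping a single name matters at all, here is a deliberately simplified toy sketch of tokenization. It is my own illustration, not the paper's setup, and nothing like a real subword tokenizer, but it captures the basic mismatch:

```python
# A toy tokenizer: split on whitespace and assign each new word a numeric ID.
# Real tokenizers work on subword pieces, but the point survives the
# simplification: "Linda" and "Luna" end up as different IDs.

vocab: dict[str, int] = {}  # word -> token ID, assigned on first appearance

def tokenize(text: str) -> list[int]:
    ids = []
    for word in text.lower().split():
        if word not in vocab:
            vocab[word] = len(vocab)
        ids.append(vocab[word])
    return ids

print(tokenize("linda is a bank teller"))  # [0, 1, 2, 3, 4]
print(tokenize("luna is a bank teller"))   # [5, 1, 2, 3, 4]

# To a human reader the two sentences are interchangeable. To a model keyed
# to token IDs, the first position differs (0 vs. 5), and any pattern learned
# around token 0 ("linda") carries no guarantee of transferring to token 5.
```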
Humans accommodate such errors effortlessly most of the time. Prime example: the fact that you can understand any of your kids’ text messages.
Photo by Jonathan Cosens Photography on Unsplash
Reasoning “like a human” is probably the wrong goal
I wrote a lengthy piece on why AI will probably never think like a human—indeed, can’t think like a human (scroll to the end of this essay for the link). The core argument is that we expect AI to function like a machine. That is, it must be logical, analytical, and—most importantly—accurate. But a lot of AI platforms—and virtually all LLMs—necessarily train on human inputs that make them eminently fallible:
AI has been designed to emulate thinking like a human, while adopting its worst elements—racism, misogyny, profiteering, lying—rather than its best. As long as people hold onto the belief that machines produce outputs [that are not influenced] by human deficiencies like bias, malevolence, or mistake, then far too many will accept as true any number of perversions.
What LLMs do is pattern recognition: they essentially spit out a result based on the statistical trajectory of the data they have previously observed. Having learned from an enormous body of human activity, they effectively attempt to predict what an ‘average’ human would produce next. This is why tokens are so important.
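A toy sketch of that idea, with the obvious caveat that real LLMs are neural networks with billions of parameters rather than a frequency table, might look like this:

```python
# Next-'token' prediction boiled down to a frequency count. The core move is
# the same as in a real LLM: pick a likely continuation given what came before.

from collections import Counter, defaultdict

corpus = (
    "the cat sat on the mat . "
    "the cat ate . "
    "the dog sat on the rug ."
).split()

# Count which word tends to follow which.
follows: dict[str, Counter] = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    follows[prev][nxt] += 1

def predict_next(word: str) -> str:
    """Return the most frequently observed continuation of `word`."""
    return follows[word].most_common(1)[0][0]

print(predict_next("the"))  # 'cat' (follows "the" twice; 'mat', 'dog', 'rug' once each)
print(predict_next("sat"))  # 'on'

# There is no understanding of cats or rugs here, only a tally of what an
# 'average' sentence in the training data did next.
```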
But determining an outcome based on a probabilistic calculation becomes nearly impossible as the dataset grows larger, especially when it involves text. The bigger the dataset, the more potential there is for redundant, erroneous, or conflicting information. One study illustrated this, showing that beyond a certain point the size of a training set is immaterial to how well an AI functions and may even degrade it.
Moreover, all LLMs perform particularly poorly when a question requires understanding concepts not explicitly covered in the training data (another aspect of the token bias problem). AIs have no way to contextualize from experience the way humans do, which many have suggested contributes to hallucinations (the term for fabricated outputs).
Humans also generally perform well in spatial, temporal, and causal reasoning. LLMs simply do not have any way to comprehend these sorts of calculations if their training sets do not specifically cover them. As the physicist and philosopher Ragnar Fjelland put it, “the real problem is that computers are not in the world, because they are not embodied.”
I proposed previously that such capabilities can only be acquired from existence and interaction in the real world and through “our ability to ‘put ourselves in the shoes’ of someone else.” One suggested way of doing this for AI is by putting it into robots, something I am not at all enthused about. You want Terminators? This is how we get Terminators.
The human user’s own expectations also affect the perceived abilities of the LLM, according to MIT researchers. One issue has to do with the versatility of LLMs. The researchers point out that many intended uses will lack any reasonable benchmark to evaluate the efficacy of the LLM’s performance. In other cases, a user may “misalign” his or her expectations based on an LLM’s excellence in one area. Adam Zewe of MIT News explained:
If someone sees that an LLM can correctly answer questions about matrix inversion, they might also assume it can ace questions about simple arithmetic. A model that is misaligned with this function—one that doesn’t perform well on questions a human expects it to answer correctly—could fail when deployed.
Not knowing what LLMs really are or what they are intended to do, we end up struggling to evaluate whether they do anything well. Furthermore, this opens the door for snake oil sellers to pretend that they are capable of things that research proves they are not. It is this series of disconnects that has led me to believe that many people will never accept that AI (in any form) is conscious. I discussed the issue in greater detail here:
Why AI Will Never be Conscious
I am going to make a bold prediction in this piece: Artificial Intelligence will never reach consciousness in any manifestation that is meaningfully comparable to ours. Furthermore, it will never be ‘alive’ or ‘sentient’ in any objective way.
What do we even mean?
When it comes to evaluating AI intelligence, reasoning, or consciousness, it seems we pigeonhole the debate by trying to make a concrete determination without having a concrete understanding of the concept. In other words, humans have trouble defining what we mean by “reasoning” (for instance), so how can we objectively identify whether a nonhuman entity is engaging in that act?
It is even hard to say for sure what we mean by “thinking like a human.” After all, there are eight billion people on earth, and while we might identify certain shared processes or traits, we nevertheless know for certain that people do not think alike. Read any political blog for evidence of that.
Without reaching some widely accepted agreement on what we mean when we are probing whether a machine thinks, reasons, or even lives, we will never be able to decide if it is doing any of those things. Indeed, we have lived alongside animals for millennia, yet there remains substantial disagreement on whether they are conscious or, if they are, in what sense.
If only we would tread more carefully
It is prudent to avoid relying on AI or LLMs as any sort of substitute for human capacity. They are effective, but very limited, tools. Just as we would not consider ourselves physically superior because we can pound a nail into hard wood with a hammer, we should not presume that our intellect is somehow enhanced by a computer program that cannot discern between Linda and Luna.
As long as Big Tech persists in formulating and selling AI as both a parallel and a supplement to human thinking, it will inevitably do more harm than good. Or, like blockchain, the hype will die and society will move on to the next shiny rock.
And just as using AI to bolster our individual intellects is a terrible idea, so too is it a bad idea to try to develop AI as machines that employ human-like thinking on proverbial steroids. As I wrote earlier:
It is not at all clear to me that we would even want AI to think like a human, either from a content perspective or functionally. It seems there are considerable dangers to providing machines humanlike ‘mental’ abilities, already manifested in less humanlike machines such as today’s LLMs. These include incorporating the biases and darker sides of humans that are only elevated by AI’s certain superior abilities.
Coupled with the ability to spread AI outputs with extraordinary precision, speed, or boundlessness, AI will become ever more hazardous, especially when in the wrong hands. And so far, no one has shown an effective enough way of avoiding these pitfalls.
See you Saturday.
If you enjoyed this article, consider giving it a like (it helps placate the all-controlling algorithm), or Buy Me a Coffee if you wish to support my work.
* Articles post on Wednesdays and Saturdays *