By Cade Metz & Karen Weise
Last month, an AI bot that handles tech support for Cursor, an up-and-coming tool for computer programmers, alerted several customers about a change in company policy. It said they were no longer allowed to use Cursor on more than one computer.
In angry posts to internet message boards, the customers complained. Some cancelled their Cursor accounts. And some got even angrier when they realised what had happened: The AI bot had announced a policy change that did not exist. “We have no such policy. You’re of course free to use Cursor on multiple machines,” the firm’s chief executive and co-founder, Michael Truell, wrote in a Reddit post. “Unfortunately, this is an incorrect response from a front-line AI support bot.”
More than two years after the arrival of ChatGPT, tech companies, office workers and everyday consumers are using AI bots for an increasingly wide array of tasks. But there is still no way of ensuring that these systems produce accurate information.
The newest and most powerful technologies — so-called reasoning systems from companies like OpenAI, Google and the Chinese start-up DeepSeek — are generating more errors, not fewer. As their math skills have notably improved, their handle on facts has gotten shakier. It is not entirely clear why. Today’s AI bots are based on complex mathematical systems that learn their skills by analysing enormous amounts of digital data. They do not — and cannot — decide what is true and what is false. Sometimes, they just make stuff up, a phenomenon some AI researchers call hallucinations. On one test, the hallucination rates of newer AI systems were as high as 79 per cent.
These systems use mathematical probabilities to guess the best response, not a strict set of rules defined by human engineers. So they make a certain number of mistakes. “Despite our best efforts, they will always hallucinate,” said Amr Awadallah, the chief executive of Vectara, a start-up that builds AI tools for businesses, and a former Google executive. “That will never go away.”
For several years, this phenomenon has raised concerns about the reliability of these systems. Though they are useful in some situations — like writing term papers, summarising office documents and generating computer code — their mistakes can cause problems.
The AI bots tied to search engines like Google and Bing sometimes generate search results that are laughably wrong. If you ask them for a good marathon on the West Coast, they might suggest a race in Philadelphia. If they tell you the number of households in Illinois, they might cite a source that does not include that information.
Those hallucinations may not be a big problem for many people, but they are a serious issue for anyone using the technology with court documents, medical information or sensitive business data.
“You spend a lot of time trying to figure out which responses are factual and which aren’t,” said Pratik Verma, co-founder and chief executive of Okahu, a company that helps businesses navigate the hallucination problem. “Not dealing with these errors properly basically eliminates the value of AI systems, which are supposed to automate tasks for you.”
For more than two years, companies like OpenAI and Google steadily improved their AI systems and reduced the frequency of these errors. But with the use of new reasoning systems, errors are rising. The latest OpenAI systems hallucinate at higher rates than the company's previous systems, according to its own tests.
The company found that o3 — its most powerful system — hallucinated 33 per cent of the time when running its PersonQA benchmark test, which involves answering questions about public figures. That is more than twice the hallucination rate of OpenAI’s previous reasoning system, called o1. The new o4-mini hallucinated at an even higher rate: 48 per cent. When running another test called SimpleQA, which asks more general questions, the hallucination rates for o3 and o4-mini were 51 per cent and 79 per cent respectively. The previous system, o1, hallucinated 44 per cent of the time.
Tests by independent companies and researchers indicate that hallucination rates are also rising for reasoning models from companies such as Google and DeepSeek.
Since late 2023, Awadallah’s company, Vectara, has tracked how often chatbots veer from the truth. The company asks these systems to perform a straightforward task that is readily verified: Summarise specific news articles. Even then, chatbots persistently invent information.
Vectara’s original research estimated that in this situation chatbots made up information at least 3 per cent of the time and sometimes as much as 27 per cent.
In the year and a half since, companies such as OpenAI and Google pushed those numbers down into the 1 or 2 per cent range. Others, such as the San Francisco start-up Anthropic, hovered around 4 per cent. But hallucination rates on this test have risen with reasoning systems. DeepSeek’s reasoning system, R1, hallucinated 14.3 per cent of the time. OpenAI’s o3 climbed to 6.8 per cent.
OpenAI said more research was needed to understand the cause of these rising hallucination rates.
©2025 The New York Times News Service