By OpenAI's own testing, its newest reasoning models, o3 and o4-mini, hallucinate significantly more often than o1.
First reported by TechCrunch, OpenAI's system card details results from PersonQA, an evaluation designed to test for hallucinations. On this evaluation, o3's hallucination rate is 33 percent and o4-mini's is 48 percent, almost half of the time. By comparison, o1's hallucination rate is 16 percent, meaning o3 hallucinated about twice as often.
The system card noted how o3 "tends to make more claims overall, leading to more accurate claims as well as more inaccurate/hallucinated claims." But OpenAI doesn't know the underlying cause, simply saying, "More research is needed to understand the cause of this result."
OpenAI's reasoning models are billed as more accurate than its non-reasoning models like GPT-4o and GPT-4.5 because they use more computation to "spend more time thinking before they respond," as described in the o1 announcement. Rather than largely relying on stochastic methods to provide an answer, the o-series models are trained to "refine their thinking process, try different strategies, and recognize their mistakes."
However, the system card for GPT-4.5, which was released in February, shows a 19 percent hallucination rate on the PersonQA evaluation. The same card also compares it to GPT-4o, which had a 30 percent hallucination rate.
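To put the PersonQA figures quoted above side by side, here is a minimal Python sketch that expresses each model's reported rate as a multiple of o1's. The percentages are the ones cited in this article; the dictionary and the `relative_to` helper are illustrative names, not anything from OpenAI's system cards.

```python
# PersonQA hallucination rates (percent) as quoted in this article.
personqa_rates = {
    "o1": 16,
    "o3": 33,
    "o4-mini": 48,
    "GPT-4.5": 19,
    "GPT-4o": 30,
}


def relative_to(baseline, rates):
    """Return each model's rate divided by the baseline model's rate."""
    base = rates[baseline]
    return {model: rate / base for model, rate in rates.items()}


# Hypothetical helper usage: compare every model against o1.
ratios = relative_to("o1", personqa_rates)

# Print from lowest to highest multiple of o1's rate.
for model, ratio in sorted(ratios.items(), key=lambda kv: kv[1]):
    print(f"{model}: {ratio:.2f}x o1's rate")
```

On these figures, o3 comes out at roughly 2.06 times o1's rate and o4-mini at exactly 3 times, which is where the "about twice as often" comparison comes from.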
In a statement to Mashable, an OpenAI spokesperson said, “Addressing hallucinations across all our models is an ongoing area of research, and we’re continually working to improve their accuracy and reliability.”
Evaluation benchmarks are tricky. They can be subjective, especially if developed in-house, and research has found flaws in their datasets and even how they evaluate models.
Plus, different organizations rely on different benchmarks and methods to test accuracy and hallucinations. HuggingFace's hallucination benchmark evaluates models on the "occurrence of hallucinations in generated summaries" from around 1,000 public documents, and it found much lower hallucination rates for major models on the market than OpenAI's evaluations did. GPT-4o scored 1.5 percent, GPT-4.5 preview 1.2 percent, and o3-mini-high with reasoning scored 0.8 percent. It's worth noting o3 and o4-mini weren't included in the current leaderboard.
That's all to say, even industry-standard benchmarks make it difficult to assess hallucination rates.
Then there's the added complexity that models tend to be more accurate when tapping into web search to source their answers. But in order to use ChatGPT search, OpenAI shares data with third-party search providers, and Enterprise customers using OpenAI models internally might not be willing to expose their prompts to that.
Regardless, if OpenAI is saying its brand-new o3 and o4-mini models hallucinate more often than its non-reasoning models, that might be a problem for its users.
UPDATE: Apr. 21, 2025, 1:16 p.m. EDT This story has been updated with a statement from OpenAI.