AI hallucinations are still real, BUT the studies are missing the point
There is a big difference between ChatGPT, the app, and OpenAI's underlying models, the powerhouse beneath it
I want you to imagine hiring an intern.
They’re fast. Incredibly fast. They write clean copy, pull together reports in minutes, and never complain. But there’s a catch: about 40% of the time, they just... make things up. Not always obviously wrong stuff, either. Think subtler things: a stat that sounds right but isn’t. A source that doesn’t exist. A claim that contradicts itself two paragraphs later.
And they deliver it all with the absolute confidence of a little boy explaining to his mom that he doesn’t need a jacket because he ‘never gets cold.’ (<--- Been there!)
Would you let that intern publish to your blog? Send client reports? Write your product descriptions?
A new study tested 600 prompts across six major AI platforms: ChatGPT, Claude, Gemini, Perplexity, Grok, and Copilot (which is essentially ChatGPT), and had humans grade every response. The best performer was ChatGPT, at 59.7% fully correct. The worst was Grok, at 39.6%.
Let me say that differently: the best AI available right now gets it wrong four out of ten times.
Interestingly, Claude scored slightly lower on fully correct answers (55.1%) but had the lowest error rate of any model at just 6.2%. The difference? When Claude doesn’t know something, he (or “it?”) tends to skip it rather than fake it.
That’s actually the behavior I want.
Give me silence over a confident lie any day.
👉But the studies are not really addressing the core issue.
These studies measure using ChatGPT and the like in their native apps: the UI you use when you chat with these tools. However, AI applications sit on top of these underlying models and can therefore deliver a very different experience.
I COULD generate a piece of slop content for a client, and it would cost me pennies (when you use the backend of these models, they charge you by the token).
However, we aim for human-quality content and might spend many tokens to create it. Why? Because we know not to trust the raw output from the underlying models.
Instead, and perhaps this is a good thing for all our human readers, the only way to reliably produce human-quality, error-free content is to mimic human behavior: create a process that researches, checks, outlines, writes, edits, writes some more, adds search ranking factors... That’s what a true platform does, rather than simply using ChatGPT or a $100/month ChatGPT wrapper.
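To make the idea concrete, here is a minimal sketch of what "a process" means in code. The stage names and functions below are hypothetical, invented for illustration only; in a real system each stage would wrap a separate model call with its own prompt and its own checks.

```python
# Hypothetical multi-pass content pipeline, sketched with stub functions.
# In a real system each stage would call a model/API; here the stages just
# tag the text so you can see every pass the content goes through.

def research(topic: str) -> str:
    # Gather and verify source material before any writing happens.
    return f"notes on [{topic}]"

def outline(notes: str) -> str:
    # Turn verified notes into a structure for the piece.
    return f"outline from [{notes}]"

def draft(outline_text: str) -> str:
    # First written pass, based on the outline, not on a raw prompt.
    return f"draft based on [{outline_text}]"

def edit(draft_text: str) -> str:
    # A separate editing pass that checks claims instead of trusting
    # the raw first draft.
    return f"edited [{draft_text}]"

def run_pipeline(topic: str) -> str:
    """Chain the stages: research -> outline -> draft -> edit."""
    result = topic
    for stage in (research, outline, draft, edit):
        result = stage(result)
    return result

print(run_pipeline("AI hallucinations"))
```

Each extra pass spends more tokens, which is exactly the trade-off described above: pennies for one raw generation, many tokens for a process.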
So when you see the stats, think of them less as a report card on AI and more as a report card on using AI without a process.
Dax is the CEO & Co-Founder at FOMO.ai. Take our new free course, ‘How To Win With AI Search’.