You’re in a multiple-choice exam. You open the paper and read the first question. You know the answer, so you’re off to a good start. But you read the next question and have no idea what the answer is.
Your options are:
Leave the answer blank
Take a guess
Which of those options gives you the best chance of maximising your marks?
If you leave the answer blank you’re going to get no marks. But if you guess, you might just be right.
This is what LLMs do (except with a whole lot of mathematics behind it).
LLMs are evaluated with binary scoring, which is also how many of our ‘human’ exams are marked. In other words, the reward for a correct answer is 1, while a wrong answer or a response of ‘I don’t know’ scores 0.
To maximise its score, the model is better off returning an answer that might be right than returning ‘I don’t know’. This means it will sometimes be wrong. These confident-sounding but incorrect answers are what we call ‘hallucinations’.
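To see the incentive in numbers, here’s a quick sketch. The four-option question and the 1/0 scoring are just illustrative assumptions:

```python
# Expected marks under binary scoring: 1 for a correct answer,
# 0 for a wrong answer or a blank / "I don't know".
p_guess_correct = 1 / 4  # assumed: pure guess on a four-option question

expected_if_guessing = p_guess_correct * 1 + (1 - p_guess_correct) * 0  # 0.25
expected_if_blank = 0.0                                                 # always 0

print(f"Guess: {expected_if_guessing:.2f} marks on average")
print(f"Blank: {expected_if_blank:.2f} marks on average")
# Guessing never scores worse than leaving it blank, so a score-maximising
# test-taker (or LLM) always guesses.
```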
There are a few methods that reduce hallucinations:
Retrieval Augmented Generation (RAG). This is when you supply the LLM with a dataset or knowledge store to retrieve information from. For example, if I give it a dataset containing my family’s birthdays and ask it for my parents’ birthdays, it can retrieve the right answer rather than guess (there’s a minimal sketch of this after the list).
Context. If we ask for information about football, this could go in a few directions. Are we looking for information about soccer, American football, Gaelic football, rugby, Aussie Rules? Being specific will narrow the room for hallucinations.
Confidence targets. For example, in your system message you could state: ‘Answer only if you are > t confident, since mistakes are penalised t/(1-t) points, correct answers receive 1 point, and an answer of “I don’t know” receives 0 points’ (t is the confidence threshold, so a higher number means a stricter threshold). Basically, this tells the LLM to be, say, 75%, 90% or 100% certain before giving an answer (the arithmetic behind this is sketched after the list).
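To make the retrieval idea concrete, here’s a minimal sketch in Python. The toy ‘birthday’ knowledge store, the keyword-overlap retrieval and the prompt template are all assumptions for illustration; real RAG systems use embeddings and a vector store, and the final prompt would be sent to whichever LLM you’re using:

```python
import re

# Toy knowledge store (assumed data, purely for illustration).
KNOWLEDGE_STORE = [
    "Mum's birthday is 14 March.",
    "Dad's birthday is 2 November.",
    "My sister's birthday is 29 July.",
]

def tokens(text: str) -> set[str]:
    """Lowercase word tokens; good enough for a toy example."""
    return set(re.findall(r"[a-z']+", text.lower()))

def retrieve(question: str, store: list[str], top_k: int = 2) -> list[str]:
    """Naive keyword-overlap retrieval; real systems use embeddings / vector search."""
    q = tokens(question)
    ranked = sorted(store, key=lambda doc: len(q & tokens(doc)), reverse=True)
    return ranked[:top_k]

def build_prompt(question: str) -> str:
    """Bundle the retrieved facts with the question so the model answers from them."""
    context = "\n".join(retrieve(question, KNOWLEDGE_STORE))
    return (
        "Answer using only the context below. "
        "If the answer isn't in the context, say 'I don't know'.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )

print(build_prompt("When is Dad's birthday?"))
```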
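And here’s why the t/(1-t) penalty does what we want: under that scoring rule, answering only has a positive expected score when the model’s confidence is above t. A minimal sketch of the arithmetic, assuming t = 0.75 (so a wrong answer costs 0.75/0.25 = 3 points):

```python
# Expected score of answering vs abstaining under the confidence-target rule:
#   correct answer: +1, wrong answer: -t/(1-t), "I don't know": 0
t = 0.75  # assumed confidence threshold for this example

def expected_score_if_answering(p: float, t: float) -> float:
    """p is the model's probability of actually being correct."""
    penalty = t / (1 - t)
    return p * 1 + (1 - p) * -penalty

for p in (0.60, 0.75, 0.90):
    print(f"confidence {p:.2f}: answer -> {expected_score_if_answering(p, t):+.2f}, abstain -> 0.00")
# confidence 0.60: answer -> -0.60  (worse than saying "I don't know")
# confidence 0.75: answer -> +0.00  (break-even, exactly at the threshold)
# confidence 0.90: answer -> +0.60  (worth answering)
```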
So, while most LLM evaluations encourage guessing, you can keep them honest by giving better context, using retrieval methods, and setting confidence thresholds.
Try one of these next time you work with an AI model and see if the ‘exam score’ improves.
This comes from OpenAI’s most recent research paper, which you can read here.
- Jonny