Lesson 5: Critical Thinking

Vision Hallucinations

When models lie with confidence

The Problem: Vision models hallucinate with perfect confidence — they see objects that don't exist, swap attributes, confuse spatial relationships, complete obscured text, and default to expected quantities. Without knowing these failure modes, you cannot build reliable applications.

The Solution: Five Types of Vision Hallucinations

A hallucination is any output a model presents as fact that isn't actually supported by its input. Vision models don't just make mistakes — they confidently fabricate details that look perfectly plausible. A model might "see" a cat that's really a cushion pattern, swap left and right, invent text on a partially obscured sign, or default to expected quantities instead of actually counting. What makes this dangerous is that the model shows no signal of uncertainty: a fabricated answer and a correct one are phrased with exactly the same confidence.

Why it happens

A vision-language model is fundamentally a next-token predictor conditioned on image patches plus your prompt. It is trained to produce the most probable continuation, not to report calibrated confidence. When the pixels are ambiguous (a blurry sign, a cluttered shelf, an occluded object), the strongest prior often comes from language statistics rather than the image. The model has seen the phrase "a dozen eggs" far more often than it has carefully counted them, so it leans on that prior. The same mechanism that lets it caption a photo fluently also lets it confabulate when evidence is thin. This is why the five failure modes (object, attribute, spatial, OCR, and counting) cluster around exactly the cases where the visual signal is weakest.
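The pull between a language prior and weak visual evidence can be illustrated with a toy Bayesian calculation. This is only an analogy under made-up numbers: a real model does not compute an explicit prior and likelihood, and the names `language_prior`, `sharp_evidence`, and `blurry_evidence` are hypothetical.

```python
def posterior(prior, likelihood):
    """Combine a prior and a likelihood over the same answers (Bayes' rule)."""
    unnormalized = {answer: prior[answer] * likelihood[answer] for answer in prior}
    total = sum(unnormalized.values())
    return {answer: weight / total for answer, weight in unnormalized.items()}


# Assumed language prior: "12" is by far the most familiar egg count.
language_prior = {"9": 0.1, "12": 0.9}

# Assumed sharp image: the pixels strongly favor the true count of 9.
sharp_evidence = {"9": 0.99, "12": 0.01}

# Assumed blurry image: the pixels barely distinguish the two answers.
blurry_evidence = {"9": 0.55, "12": 0.45}

sharp = posterior(language_prior, sharp_evidence)
blurry = posterior(language_prior, blurry_evidence)

print("sharp image  ->", max(sharp, key=sharp.get))    # evidence wins: 9
print("blurry image ->", max(blurry, key=blurry.get))  # prior wins: 12
```

With strong evidence the true count wins despite the prior. With weak evidence the familiar answer wins, which is the pattern described above.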

Worked example

Show a model a photo of a parking lot with seven cars and ask "How many cars are here?" A common failure is a fast, round answer like "about a dozen" that ignores the actual scene. The fix is to force grounding: prompt it with "Count each car one by one and list its color before giving a total. If any car is partly hidden and you are unsure, say so." Asking the model to enumerate evidence and to explicitly flag uncertainty tends to reduce both counting and object hallucinations. The model did not become smarter; you removed the shortcut to a plausible-sounding guess. Treat any single-glance answer about quantities, text, or left/right as unverified until grounded.
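The grounding prompt from this example can be generated and its answer cross-checked with a few lines of code. The sketch below is a minimal illustration: the helper names `build_grounded_count_prompt` and `check_enumeration`, the requested output format (numbered lines plus a final `Total:` line), and the sample response are all assumptions, not part of any model API.

```python
import re


def build_grounded_count_prompt(noun: str) -> str:
    """Build a prompt that asks for per-item evidence before a total."""
    return (
        f"Count each {noun} one by one and list its color before giving a total. "
        f"If any {noun} is partly hidden and you are unsure, say so. "
        # Assumed output format so the answer can be checked mechanically.
        "Format: one numbered line per item, then a final line 'Total: <n>'."
    )


def check_enumeration(response: str) -> dict:
    """Compare the number of enumerated items with the stated total."""
    items = re.findall(r"^[ \t]*\d+[.)][ \t]+(.+)$", response, flags=re.MULTILINE)
    total_match = re.search(r"^[ \t]*Total:[ \t]*(\d+)", response, flags=re.MULTILINE)
    stated = int(total_match.group(1)) if total_match else None
    return {
        "listed": len(items),
        "stated_total": stated,
        "consistent": stated == len(items),
        "unsure_items": sum("unsure" in item.lower() for item in items),
    }


# Hypothetical model response for the seven-car parking lot.
sample_response = """1. red
2. white
3. black
4. silver
5. blue
6. white
7. gray, partly hidden, unsure
Total: 7"""

report = check_enumeration(sample_response)
print(build_grounded_count_prompt("car"))
print(report)
```

A mismatch between `listed` and `stated_total`, or any `unsure_items`, is a signal to send the image for a second look rather than trust the total.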

Think of each failure mode as a confident witness giving wrong testimony in court. The five types are:

1. Object hallucination: Model "sees" objects that don't exist — a cat from a cushion pattern, a person from a shadow
2. Attribute hallucination: Wrong color, size, or count — swaps attributes between adjacent objects
3. Spatial hallucination: Left/right and front/back confusion — the most common spatial error in vision models
4. OCR hallucination: Completes obscured or partial text with plausible but incorrect content
5. Counting hallucination: Defaults to expected quantities (12 eggs in a carton) instead of actually counting
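The five types above can be kept as a small lookup table that pairs each one with a follow-up verification prompt. The sketch below is one possible arrangement. The prompt wordings are hypothetical examples written for illustration, and `HallucinationType` and `VERIFICATION_PROMPTS` are assumed names.

```python
from enum import Enum


class HallucinationType(Enum):
    OBJECT = "object"
    ATTRIBUTE = "attribute"
    SPATIAL = "spatial"
    OCR = "ocr"
    COUNTING = "counting"


# Hypothetical follow-up prompts, one per failure mode.
VERIFICATION_PROMPTS = {
    HallucinationType.OBJECT: (
        "List every object you mentioned and describe the visible evidence "
        "for each. If you cannot point to evidence, answer 'not visible'."
    ),
    HallucinationType.ATTRIBUTE: (
        "For each object, state its color and size separately, and say "
        "'unsure' if a neighboring object makes the attribute hard to tell."
    ),
    HallucinationType.SPATIAL: (
        "Describe positions from the viewer's perspective. State explicitly "
        "which object is on the left, on the right, and in front."
    ),
    HallucinationType.OCR: (
        "Transcribe only the characters that are fully legible. Mark hidden "
        "or blurry characters with '?' instead of completing the word."
    ),
    HallucinationType.COUNTING: (
        "Count the items one by one, listing each before giving a total. "
        "Do not assume the typical quantity."
    ),
}


def verification_prompt(kind: HallucinationType) -> str:
    """Return the follow-up prompt for a suspected failure mode."""
    return VERIFICATION_PROMPTS[kind]


print(verification_prompt(HallucinationType.COUNTING))
```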

Where This Matters Most

Quality Assurance: In manufacturing inspection, the model may "see" cracks that are just shadows, so hallucinated defects need to be detected.
Medical Imaging: The model might hallucinate tumors from image artifacts or noise, so false positives must be prevented.
Autonomous Driving: This is safety-critical: the model must neither hallucinate pedestrians nor miss real obstacles.
Legal Document Review: Hallucinated text in contracts has legal consequences, so fabricated clauses or amounts must be caught.

Fun Fact: Research from OpenAI (2025) argues that standard training and evaluation reward confident guessing over admitting uncertainty, so models learn to "bluff". When a model says "I see a red car on the left" as if it were certain, its internal confidence might be much lower, for example only 60% (an illustrative figure). This is why explicit verification prompts are so important.
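One way to see why guessing can be rewarded is a small expected-score calculation. The scoring rules here are assumptions chosen for illustration (1 point for a correct answer, 0 for abstaining, and an optional penalty for a wrong answer); they are not taken from the cited research.

```python
def expected_score(p_correct: float, wrong_penalty: float = 0.0) -> float:
    """Expected score of answering: +1 if right, -wrong_penalty if wrong."""
    return p_correct * 1.0 - (1.0 - p_correct) * wrong_penalty


ABSTAIN_SCORE = 0.0  # assumed score for answering "I am not sure"
p = 0.6              # the illustrative 60% internal confidence

binary = expected_score(p)                         # wrong answers cost nothing
penalized = expected_score(p, wrong_penalty=2.0)   # wrong answers cost 2 points

print("binary grading:    guess beats abstain?", binary > ABSTAIN_SCORE)
print("penalized grading: guess beats abstain?", penalized > ABSTAIN_SCORE)
```

Under the binary rule, guessing at 60% confidence always beats abstaining, so a system optimized for that score has no reason to say "unsure". With a penalty for wrong answers, abstaining becomes the better choice at the same confidence.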

Frequently asked questions

What is a vision model hallucination?

A hallucination is a detail the model states as fact that isn't actually in the image — a non-existent object, the wrong color, swapped left/right, invented text on a sign, or a made-up count. It's dangerous because the fabricated answer is phrased with exactly the same confidence as a correct one.

Why do vision models hallucinate?

A vision-language model predicts the most probable text continuation from image patches plus your prompt; it isn't reporting calibrated confidence. When pixels are ambiguous (blur, occlusion, tiny text), it leans on language priors — like the common phrase 'a dozen eggs' — instead of analyzing the image, producing a plausible guess.

How do I reduce hallucinations with vision models?

Force grounding: ask the model to enumerate evidence one item at a time (count each object, name each color), explicitly allow an 'unsure / not visible' answer, and require it to flag occluded or blurry regions. Treat any single-glance answer about quantity, text, or direction as unverified until backed by step-by-step reasoning.
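Allowing an explicit "unsure / not visible" answer is easier to enforce when the model is asked for structured output. The sketch below assumes the model was instructed to reply with a JSON object containing `status`, `value`, and `evidence` fields. That schema and the names `GroundedAnswer`, `parse_grounded_answer`, and `is_verified` are hypothetical, not a standard.

```python
import json
from dataclasses import dataclass
from typing import List, Optional

ALLOWED_STATUS = {"answer", "unsure", "not_visible"}


@dataclass
class GroundedAnswer:
    status: str            # "answer", "unsure", or "not_visible"
    value: Optional[str]   # the claimed answer, if any
    evidence: List[str]    # itemized observations backing the answer


def parse_grounded_answer(raw: str) -> GroundedAnswer:
    """Parse the assumed JSON reply and reject unknown status values."""
    data = json.loads(raw)
    status = data.get("status")
    if status not in ALLOWED_STATUS:
        raise ValueError(f"unexpected status: {status!r}")
    return GroundedAnswer(status, data.get("value"), list(data.get("evidence", [])))


def is_verified(answer: GroundedAnswer) -> bool:
    """Treat an answer as verified only if it comes with itemized evidence."""
    return answer.status == "answer" and len(answer.evidence) > 0


# Hypothetical replies: a single-glance answer and a grounded one.
glance = parse_grounded_answer('{"status": "answer", "value": "12", "evidence": []}')
grounded = parse_grounded_answer(
    '{"status": "answer", "value": "2", "evidence": ["red car, left", "blue car, right"]}'
)
hidden = parse_grounded_answer('{"status": "not_visible", "value": null, "evidence": []}')

print(is_verified(glance), is_verified(grounded), is_verified(hidden))
```

Here a single-glance answer with no evidence stays unverified, and an honest "not visible" is accepted as a valid reply instead of being forced into a guess.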

What are the types of visual hallucinations?

There are five modes: object (sees things that aren't there), attribute (wrong color/size or swapped attributes), spatial (left/right, front/back confusion), OCR (completes obscured text), and counting (defaults to expected quantities instead of counting). They cluster exactly where the visual signal is weakest.

Try it yourself

Technique comparison (pre-recorded responses)

Task (Beginner, Analysis)

Detect a counting hallucination in a vision model response

Without technique

Prompt

How many eggs are in this carton? [Photo: egg carton with 9 eggs out of 12 slots]

Response

There are 12 eggs in the carton.

Tokens:280/12

Time:1500ms

With technique

Prompt

Think step by step. Divide the carton into a 2×6 grid. For each cell, left to right, top to bottom:
- Cell [row, col]: EGG / EMPTY
After listing all cells, count the total number of eggs.
[Photo: egg carton with 9 eggs out of 12 slots]

Response

Grid breakdown 2×6:

[1,1]: EGG [1,2]: EGG [1,3]: EMPTY [1,4]: EGG [1,5]: EGG [1,6]: EGG
[2,1]: EGG [2,2]: EMPTY [2,3]: EGG [2,4]: EMPTY [2,5]: EGG [2,6]: EGG

Eggs: 9. Empty slots: 3. Total: 9 eggs out of 12 slots.

👁️ Breaking the carton into a grid for systematic enumeration

🧠 Checking each cell individually instead of guessing the total

🔢 Summing results: 9 eggs, 3 empty slots

Tokens:350/120

Time:3200ms

Why this works

Models default to "typical" quantities (12 eggs in a carton). Chain-of-thought prompting over an explicit grid forces the model to check each slot individually before it reports a total.
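A grid-style response like the one above can also be audited in code, so the stated total is never trusted on its own. The following is a minimal sketch: `audit_grid_response` is a hypothetical helper, and it assumes the response uses the `[row,col]: LABEL` layout and the `Eggs: <n>` summary line shown in the demo.

```python
import re

CELL_PATTERN = re.compile(r"\[(\d+),\s*(\d+)\]:\s*([A-Z]+)")


def audit_grid_response(response: str, rows: int = 2, cols: int = 6) -> dict:
    """Recount the cells in a grid response and compare with the stated total."""
    cells = {(int(r), int(c)): label for r, c, label in CELL_PATTERN.findall(response)}
    expected = {(r, c) for r in range(1, rows + 1) for c in range(1, cols + 1)}
    eggs = sum(1 for label in cells.values() if label == "EGG")
    stated_match = re.search(r"Eggs:\s*(\d+)", response)
    stated = int(stated_match.group(1)) if stated_match else None
    return {
        "eggs": eggs,
        "empty": sum(1 for label in cells.values() if label == "EMPTY"),
        "all_cells_listed": set(cells) == expected,  # no cell skipped or invented
        "stated_total": stated,
        "consistent": stated == eggs,
    }


# The pre-recorded response from the demo above.
demo_response = """Grid breakdown 2×6:

[1,1]: EGG [1,2]: EGG [1,3]: EMPTY [1,4]: EGG [1,5]: EGG [1,6]: EGG
[2,1]: EGG [2,2]: EMPTY [2,3]: EGG [2,4]: EMPTY [2,5]: EGG [2,6]: EGG

Eggs: 9. Empty slots: 3. Total: 9 eggs out of 12 slots."""

audit = audit_grid_response(demo_response)
print(audit)  # eggs: 9, empty: 3, all_cells_listed: True, consistent: True
```

The check only confirms that the response is internally consistent. It cannot tell whether each cell label matches the photo, so it complements the grounding prompt and does not replace a visual spot check.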
