Vision Hallucinations
When models lie with confidence
The Problem: Vision models hallucinate with perfect confidence — they see objects that don't exist, swap attributes, confuse spatial relationships, complete obscured text, and default to expected quantities. Without knowing these failure modes, you cannot build reliable applications.
The Solution: Five Types of Vision Hallucinations
A hallucination is any output a model presents as fact that isn't actually supported by its input. Vision models don't just make mistakes — they confidently fabricate details that look perfectly plausible. A model might "see" a cat that's really a cushion pattern, swap left and right, invent text on a partially obscured sign, or default to expected quantities instead of actually counting. What makes this dangerous is that the model shows no signal of uncertainty: a fabricated answer and a correct one are phrased with exactly the same confidence.
Why it happens
A vision-language model is fundamentally a next-token predictor conditioned on image patches plus your prompt. It is trained to produce the most probable continuation, not to report calibrated confidence. When the pixels are ambiguous (a blurry sign, a cluttered shelf, an occluded object), the strongest prior often comes from language statistics rather than from the image. The model has seen the phrase "a dozen eggs" far more often than it has carefully counted eggs, so it leans on that prior. The same mechanism that lets it caption a photo fluently also lets it confabulate when evidence is thin. This is why the five failure modes (object, attribute, spatial, OCR, and counting) cluster around exactly the cases where the visual signal is weakest.
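The tug-of-war between language prior and visual evidence can be sketched as a toy calculation. This is only an analogy, not how a vision-language model actually computes its output: the candidate counts, the probabilities, and the names `LANGUAGE_PRIOR`, `CLEAR_IMAGE`, and `BLURRY_IMAGE` are all invented for illustration.

```python
def posterior(prior, likelihood):
    """Combine a prior and a likelihood over candidate answers (Bayes' rule)."""
    unnormalized = {k: prior[k] * likelihood[k] for k in prior}
    total = sum(unnormalized.values())
    return {k: v / total for k, v in unnormalized.items()}


def most_likely(dist):
    """Return the candidate answer with the highest probability."""
    return max(dist, key=dist.get)


# Toy numbers, invented for illustration: text about egg cartons
# overwhelmingly mentions "a dozen", so 12 dominates the prior.
LANGUAGE_PRIOR = {9: 0.05, 10: 0.10, 12: 0.85}

# Sharp photo: the visual evidence strongly favors 9 eggs.
CLEAR_IMAGE = {9: 0.90, 10: 0.08, 12: 0.02}

# Blurry photo: the visual evidence barely separates the candidates.
BLURRY_IMAGE = {9: 0.40, 10: 0.30, 12: 0.30}

clear_answer = most_likely(posterior(LANGUAGE_PRIOR, CLEAR_IMAGE))
blurry_answer = most_likely(posterior(LANGUAGE_PRIOR, BLURRY_IMAGE))

print("clear image  ->", clear_answer)   # 9: the evidence overrides the prior
print("blurry image ->", blurry_answer)  # 12: the prior wins
```

With these toy numbers the blurry image still slightly favors 9, yet the answer flips to 12 because the prior dominates. That is the pattern described above: the weaker the visual signal, the more the "expected" answer takes over.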
Worked example
Show a model a photo of a parking lot with seven cars and ask "How many cars are here?" A common failure is a fast, round answer like "about a dozen" that ignores the actual scene. The fix is to force grounding. Prompt it with: "Count each car one by one and list its color before giving a total. If any car is partly hidden and you are unsure, say so." Asking the model to enumerate evidence and to flag uncertainty explicitly tends to reduce both counting and object hallucinations, not because the model became smarter, but because you removed the shortcut to a plausible-sounding guess. Treat any single-glance answer about quantities, text, or left/right as unverified until grounded.
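A minimal sketch of this grounding step in code, assuming you send the prompt to whichever vision model you use and receive plain text back. The helper names, the requested "Total: N" output format, and the sample response are hypothetical and not part of any specific API.

```python
import re


def build_grounded_count_prompt(object_name: str) -> str:
    """Build the grounding prompt from the worked example for any object type."""
    return (
        f"Count each {object_name} one by one and list its color before giving a total. "
        f"If any {object_name} is partly hidden and you are unsure, say so. "
        "Use one numbered line per item, then a final line 'Total: N'."
    )


def check_enumeration(response: str):
    """Return (listed_items, stated_total, consistent) for a model response.

    A response counts as consistent only if it states a total and that total
    equals the number of numbered lines it actually listed.
    """
    items = re.findall(r"^[ \t]*\d+[.)][ \t]+\S", response, flags=re.MULTILINE)
    match = re.search(r"Total:\s*(\d+)", response)
    stated = int(match.group(1)) if match else None
    return len(items), stated, stated is not None and stated == len(items)


# Hypothetical grounded response for the seven-car parking lot.
sample = """1. white
2. red
3. black
4. silver
5. blue
6. white
7. gray (partly hidden, unsure)
Total: 7"""

print(build_grounded_count_prompt("car"))
print(check_enumeration(sample))            # (7, 7, True)
print(check_enumeration("About a dozen."))  # (0, None, False)
```

The check does not prove the count is right. It only confirms that the model enumerated its evidence and that its total matches its own list, so a bare "about a dozen" is flagged as unverified.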
Think of the model as a confident witness giving wrong testimony in court. The five types are:
1. Object hallucination: The model "sees" objects that don't exist, such as a cat in a cushion pattern or a person in a shadow.
2. Attribute hallucination: The model reports the wrong color, size, or count, or swaps attributes between adjacent objects.
3. Spatial hallucination: The model confuses left/right or front/back, the most common spatial errors in vision models.
4. OCR hallucination: The model completes obscured or partial text with plausible but incorrect content.
5. Counting hallucination: The model defaults to an expected quantity (12 eggs in a carton) instead of actually counting.
Where This Matters Most
- Quality Assurance: Catch hallucinated defects in manufacturing inspection; a model may "see" cracks that are just shadows.
- Medical Imaging: Prevent false positives; a model might hallucinate tumors from image artifacts or noise.
- Autonomous Driving: Safety-critical; the model must not hallucinate pedestrians or miss real obstacles.
- Legal Document Review: Prevent fabricated clauses or amounts; hallucinated text in a contract has legal consequences.
Fun Fact: Research from OpenAI (2025) argues that standard training and evaluation reward models for "bluffing" rather than for expressing uncertainty. A model can state "I see a red car on the left" as flat fact even when its internal confidence is far lower, for example around 60%. This is why explicit verification prompts are so important.
Frequently asked questions
What is a vision model hallucination?
A hallucination is a detail the model states as fact that isn't actually in the image — a non-existent object, the wrong color, swapped left/right, invented text on a sign, or a made-up count. It's dangerous because the fabricated answer is phrased with exactly the same confidence as a correct one.
Why do vision models hallucinate?
A vision-language model predicts the most probable text continuation from image patches plus your prompt; it isn't reporting calibrated confidence. When pixels are ambiguous (blur, occlusion, tiny text), it leans on language priors — like the common phrase 'a dozen eggs' — instead of analyzing the image, producing a plausible guess.
How do I reduce hallucinations with vision models?
Force grounding: ask the model to enumerate evidence one item at a time (count each object, name each color), explicitly allow an 'unsure / not visible' answer, and require it to flag occluded or blurry regions. Treat any single-glance answer about quantity, text, or direction as unverified until backed by step-by-step reasoning.
What are the types of visual hallucinations?
There are five modes: object (sees things that aren't there), attribute (wrong color/size or swapped attributes), spatial (left/right, front/back confusion), OCR (completes obscured text), and counting (defaults to expected quantities instead of counting). They cluster exactly where the visual signal is weakest.
Try it yourself
Detect a counting hallucination in a vision model response
Single-glance response (hallucinated): There are 12 eggs in the carton.
Grounded response, with a 2×6 grid breakdown:
- [1,1]: EGG [1,2]: EGG [1,3]: EMPTY [1,4]: EGG [1,5]: EGG [1,6]: EGG
- [2,1]: EGG [2,2]: EMPTY [2,3]: EGG [2,4]: EMPTY [2,5]: EGG [2,6]: EGG
Eggs: 9. Empty slots: 3. Total: 9 eggs out of 12 slots.
Models default to "typical" quantities (12 eggs in a carton). A chain-of-thought prompt that walks the grid cell by cell forces the model to count each slot individually.
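The comparison in this demo can be automated. The sketch below assumes the model returns its grid breakdown in exactly the `[row,col]: EGG` / `[row,col]: EMPTY` format shown above; the function name `verify_count` and the returned dictionary keys are hypothetical.

```python
import re

CELL = re.compile(r"\[(\d+),(\d+)\]:\s*(EGG|EMPTY)")
CLAIM = re.compile(r"There are (\d+) eggs")


def verify_count(claim_text: str, grid_text: str, rows: int = 2, cols: int = 6) -> dict:
    """Compare a single-glance claim against a cell-by-cell grid breakdown."""
    cells = {(int(r), int(c)): label for r, c, label in CELL.findall(grid_text)}
    if len(cells) != rows * cols:
        # An incomplete grid cannot be used as evidence for a count.
        raise ValueError(f"expected {rows * cols} cells, parsed {len(cells)}")
    counted = sum(1 for label in cells.values() if label == "EGG")
    match = CLAIM.search(claim_text)
    claimed = int(match.group(1)) if match else None
    return {
        "claimed": claimed,
        "counted": counted,
        "empty": rows * cols - counted,
        "hallucinated": claimed is not None and claimed != counted,
    }


claim = "There are 12 eggs in the carton."
grid = """
[1,1]: EGG [1,2]: EGG [1,3]: EMPTY [1,4]: EGG [1,5]: EGG [1,6]: EGG
[2,1]: EGG [2,2]: EMPTY [2,3]: EGG [2,4]: EMPTY [2,5]: EGG [2,6]: EGG
"""

result = verify_count(claim, grid)
print(result)
# {'claimed': 12, 'counted': 9, 'empty': 3, 'hallucinated': True}
```

Requiring a complete grid before trusting the count is a deliberate choice: a breakdown with missing cells is itself a sign that the model skipped part of the image.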