Vision LLMs
GPT-4V, Claude Vision
The Problem: AI can read text, but what about images? How can we make AI "see" and understand visual content like photos and screenshots?
The Solution: Teaching AI to See
Vision capabilities allow LLMs to process and understand images alongside text. A standard text-only model can only read words; a vision model can also "look" at a photo, a screenshot, or a scanned document and reason about what it contains. It's like describing a photo to someone over the phone — except the AI now sees the photo itself, so it can answer questions about it, summarize a chart, or transcribe text from a sign.
How it works
Under the hood, the image is not magically "understood" pixel by pixel. First it is cut into a grid of small fixed-size squares called patches (for example, sixteen-by-sixteen pixels each). A vision encoder, usually a Transformer, turns every patch into a vector of numbers, an embedding. Through self-attention, the patches compare themselves to one another ("this patch is part of a face, that one is part of a hat, so they belong together"), building a representation of the whole scene. These visual tokens are then projected into the same embedding space as text tokens, so the language model can reason over words and image content jointly. This is the core idea behind the Vision Transformer (ViT) image encoders used in models like CLIP; the vision variants of GPT-4o, Claude, and Gemini are proprietary, but are generally understood to follow a similar pattern.
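The patch-cutting step can be illustrated with a few lines of plain Python. This is only a sketch: the function name `split_into_patches` and the toy 4×4 "image" are made up for illustration, and a real encoder works on tensors with colour channels rather than nested lists.

```python
def split_into_patches(image, patch_size):
    """Split a 2D grid of pixel values into non-overlapping square patches.

    `image` is a list of rows. Patches are returned in row-major order,
    each one flattened into a single list of pixel values.
    """
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image dimensions must be divisible by patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = [
                image[row][col]
                for row in range(top, top + patch_size)
                for col in range(left, left + patch_size)
            ]
            patches.append(patch)
    return patches


# A toy 4x4 grayscale "image" cut into 2x2 patches.
toy_image = [
    [0, 1, 2, 3],
    [4, 5, 6, 7],
    [8, 9, 10, 11],
    [12, 13, 14, 15],
]
toy_patches = split_into_patches(toy_image, patch_size=2)
print(len(toy_patches))  # 4
print(toy_patches[0])    # [0, 1, 4, 5]
```

In a real model each flattened patch would then be multiplied by a learned projection matrix to produce its embedding; that step is omitted here.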
When to use it and what to watch for
Reach for vision when the answer lives in an image: generating alt-text for accessibility, reading invoices and forms, extracting data from charts, or answering "what is wrong in this screenshot?". But keep two tradeoffs in mind. First, cost and latency scale with resolution — more pixels mean more patches, more tokens, and a bigger bill, so downscale images that don't need fine detail. Second, vision models still hallucinate: they may confidently miscount objects or misread blurry text. A concrete example: ask a model "how many people are in this photo?" on a crowded image and it might answer "about 8" when there are 11 — close, but wrong. For anything where exact counts or exact characters matter, verify the output rather than trusting it blindly.
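The advice to downscale can be sketched as a small helper that computes target dimensions. The name `fit_within` and the 1024-pixel cap are illustrative assumptions, not values recommended by any provider; the actual resampling would be done by an imaging library such as Pillow.

```python
def fit_within(width, height, max_side):
    """Return (new_width, new_height) so the longer side is at most max_side.

    The aspect ratio is preserved and images are never upscaled.
    """
    longest = max(width, height)
    if longest <= max_side:
        return width, height
    scale = max_side / longest
    return max(1, round(width * scale)), max(1, round(height * scale))


print(fit_within(4032, 3024, 1024))  # (1024, 768)
print(fit_within(800, 600, 1024))    # (800, 600)
```

Pick the cap by testing: if the model starts misreading small text at a given size, the image has been shrunk too far.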
Step by step, for a 224×224 input image:
- 1. Input image: A 224×224 image enters the Vision Transformer (ViT)
- 2. Split into patches: The image is divided into 196 patches (a 14×14 grid) of 16×16 pixels each, like cutting a photo into a grid
- 3. Encode to visual tokens: Each patch becomes a visual token, 196 tokens in total, which play the role that words play in a sentence
- 4. Self-attention: Tokens attend to each other: "this patch has a face, that patch has a hat, so they belong together"
- 5. Merge with text: Visual tokens join the text tokens, and the model reasons over both to answer questions or describe the scene
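Higher resolution = more patches = more tokens = higher cost. At 16×16 pixels per patch with no downscaling, a 512×512 image produces 1,024 patches (32×32), and a 4K frame (3840×2160) produces 32,400, well past 10,000 tokens. Hosted APIs usually resize or tile images before encoding, so the token counts they bill can differ from this raw patch count, as the Image Token Costs figures in this lesson show.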
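The arithmetic behind these numbers is simple enough to check directly. The sketch below assumes one token per 16×16 patch and rounds partial patches up (as if the image were padded); `estimate_visual_tokens` is a hypothetical helper, not a provider's billing formula.

```python
import math


def estimate_visual_tokens(width, height, patch_size=16):
    """Count patches, assuming one token per patch and padding at the edges."""
    return math.ceil(width / patch_size) * math.ceil(height / patch_size)


print(estimate_visual_tokens(224, 224))    # 196
print(estimate_visual_tokens(512, 512))    # 1024
print(estimate_visual_tokens(3840, 2160))  # 32400
```

Token count grows with the image area, so halving both width and height cuts the estimate to roughly a quarter.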
Where Is This Used?
- Image Description: Generating alt-text for accessibility
- Document Analysis: Reading charts, forms, and screenshots
- Visual Q&A: Answering questions about photos
- Content Moderation: Detecting inappropriate images
Fun Fact: Modern vision models can read text in images (OCR), understand memes, analyze charts, and even describe art style! They combine visual understanding with language knowledge for powerful multimodal reasoning.
Want to optimize image costs? See the full cost calculator in the Multimodal Costs lesson.
Common Failure Modes
- Model says 'a cat on the couch' but it's actually a cushion pattern
- Model counts 12 eggs but there are only 11
- Model reads 'OPEN' on a sign that actually says 'OPER'
How Vision LLMs Work (diagram)
Accepted image formats: JPEG, PNG, WebP, GIF
Image Token Costs
- Low res (512×512): ~85 tokens
- Medium (768×768): ~170 tokens
- High res (2048×2048): ~1,500 tokens
Best Practices
- Size: Resize images to save tokens if details are not critical
- Format: Use JPEG for photos, PNG for screenshots with text
- Prompt: Be specific about what you want to know about the image
- Multiple images: Number or describe images in your prompt
Frequently asked questions
How does a neural network actually "see" an image?
It doesn't process pixels one by one. The image is first cut into a grid of fixed-size patches (e.g. 16×16 pixels), then a vision encoder — usually a Transformer — turns each patch into an embedding (a vector of numbers). Through self-attention the patches compare themselves to one another and build a representation of the whole scene. These visual tokens are placed in the same space as text tokens, so the language model can reason over words and pixels together.
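To make "patches compare themselves to one another" concrete, here is a minimal sketch of scaled dot-product self-attention in plain Python. It is deliberately simplified: queries, keys, and values are all the raw patch embeddings (a real Transformer applies learned projections and multiple heads), and the three 2-dimensional embeddings are invented for illustration.

```python
import math


def softmax(scores):
    """Turn raw scores into weights that are positive and sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Scaled dot-product self-attention with identity projections."""
    dim = len(tokens[0])
    outputs = []
    for query in tokens:
        # Similarity of this token to every token, itself included.
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in tokens
        ]
        weights = softmax(scores)
        # Each output is a weighted average of all token vectors.
        mixed = [
            sum(w * value[i] for w, value in zip(weights, tokens))
            for i in range(dim)
        ]
        outputs.append(mixed)
    return outputs


# Two similar patches and one dissimilar patch (made-up embeddings).
patch_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
attended = self_attention(patch_embeddings)
for row in attended:
    print([round(x, 3) for x in row])
```

The first two patches point in nearly the same direction, so they weight each other more heavily than the third; this is the mechanism by which related patches end up "belonging together".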
How is a vision model different from a regular text-only LLM?
A text-only LLM can only read and write text. A vision model also accepts images — photos, screenshots, scanned documents — and can answer questions about them, describe their contents, read text from them (OCR), and analyze charts. It is essentially the same language model plus a vision encoder that converts the image into tokens the model can reason over.
Why is processing images more expensive than text?
Cost and latency scale with resolution: more pixels mean more patches, more visual tokens, and a bigger bill. At 16×16 pixels per patch, a 512×512 image yields about 1,024 patches, while a 4K image can exceed 10,000. If you don't need fine detail, downscale the image before sending it to save tokens and money.
Can vision models make mistakes or hallucinate?
Yes. Vision models can confidently but incorrectly count objects, confuse similar items, and misread blurry or small text. For example, on a crowded photo a model might answer "about 8" people when there are actually 11. For anything where exact counts or exact characters matter — invoices, forms, meter readings — verify the output instead of trusting it blindly.
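One cheap way to verify extracted numbers is to check them against each other. The sketch below assumes the model was asked to return an invoice as structured data; the field names (`line_items`, `amount_cents`, `total_cents`) and the sample values are hypothetical.

```python
def invoice_total_matches(line_items, stated_total_cents):
    """Return True when the line item amounts add up to the stated total."""
    computed = sum(item["amount_cents"] for item in line_items)
    return computed == stated_total_cents


# Hypothetical structured output from a vision model, already parsed from JSON.
extracted = {
    "line_items": [
        {"description": "Widget", "amount_cents": 1999},
        {"description": "Shipping", "amount_cents": 500},
    ],
    "total_cents": 2999,  # a misread digit: the line items sum to 2499
}

if not invoice_total_matches(extracted["line_items"], extracted["total_cents"]):
    print("Totals disagree, send to manual review")
```

A consistency check like this catches only errors that break the arithmetic; a misread that happens to stay consistent still needs a human or a second extraction pass.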
Try it yourself
Goal: get a useful description of an interior photo instead of a generic one.
Generic answer:
The image shows a kitchen. It is bright and looks modern. There are cabinets, a countertop, a sink, and a window. There are items on the table.
Structured answer:
1. Style: Scandinavian minimalism (white cabinets, natural wood, no decorative clutter, functionality first).
2. Palette: Base of white (#FAFAFA) plus light oak. Accents in matte black hardware and green plants.
3. Materials: Countertop in solid oak (or quality veneer), cabinets in matte enamel, floor in light wood-look laminate.
4. Lighting: Large window (north-facing?) giving soft diffused light. Additional: LED spots under the upper cabinets.
5. Improvements: Add a backsplash (the wall is unprotected near the sink), replace the overhead light with a pendant over the island for coziness.
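The five headings in the structured answer map directly onto a prompt that names an expert role and asks numbered, specific questions. The helper below is a sketch of that idea; `build_image_prompt` and the exact wording are assumptions, and the image itself would be attached through whichever API you use.

```python
def build_image_prompt(role, aspects):
    """Compose a prompt with an expert role and numbered, specific questions."""
    lines = [f"You are {role}. Analyze the attached image."]
    lines.append("Answer each point separately:")
    for number, aspect in enumerate(aspects, start=1):
        lines.append(f"{number}. {aspect}")
    return "\n".join(lines)


prompt = build_image_prompt(
    role="an experienced interior designer",
    aspects=["Style", "Palette", "Materials", "Lighting", "Improvements"],
)
print(prompt)
```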
"Describe this image" gives a useless result. Specific questions + expert role turn a description into professional analysis with actionable recommendations.