Vision LLMs
GPT-4V, Claude Vision
The Problem: AI can read text, but what about images? How can we make AI "see" and understand visual content like photos and screenshots?
The Solution: Teaching AI to See
Vision capabilities allow LLMs to process and understand images alongside text. Instead of relying on a person to describe a photo over the phone, the model can now look at the photo itself and tell you what is in it. Images are converted into embeddings the model can reason about, enabling tasks like OCR and visual Q&A.
Here is how an image travels through a vision model, step by step:
1. Input image: A 224×224 image enters the Vision Transformer (ViT)
2. Split into patches: The image is divided into 196 patches of 16×16 pixels each, like cutting a photo into a grid
3. Encode to visual tokens: Each patch becomes a visual token (196 tokens total), like words in a sentence for the model
4. Self-attention: Tokens attend to each other: "this patch has a face, that patch has a hat, they belong together"
5. Merge with text: Visual tokens join text tokens, and the model reasons over both to answer questions or describe the scene
Higher resolution = more patches = more tokens = higher cost. A 512×512 image produces ~1024 tokens. A 4K image can exceed 10,000 tokens.
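The patch arithmetic above can be sketched directly: the token count is just the number of 16×16 patches that fit in the image. This assumes a plain ViT with no tiling or token merging, which real providers often add on top.

```python
def patch_token_count(width: int, height: int, patch_size: int = 16) -> int:
    """Visual tokens a plain ViT produces: one per patch_size x patch_size patch."""
    return (width // patch_size) * (height // patch_size)

print(patch_token_count(224, 224))    # 196
print(patch_token_count(512, 512))    # 1024
print(patch_token_count(3840, 2160))  # 32400 -- a 4K frame
```

Real APIs usually downscale or tile large images before encoding, so the tokens you are actually billed for can differ from this raw count.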
Where Is This Used?
- Image Description: Generating alt-text for accessibility
- Document Analysis: Reading charts, forms, and screenshots
- Visual Q&A: Answering questions about photos
- Content Moderation: Detecting inappropriate images
Fun Fact: Modern vision models can read text in images (OCR), understand memes, analyze charts, and even describe art style! They combine visual understanding with language knowledge for powerful multimodal reasoning.
Want to optimize image costs? See the full cost calculator in the Multimodal Costs lesson.
Common Failure Modes
- Model says 'a cat on the couch' but it's actually a cushion pattern
- Model counts 12 eggs but there are only 11
- Model reads 'OPEN' on a sign that actually says 'OPER'
How Vision LLMs Work
Supported image formats typically include JPEG, PNG, WebP, and GIF.
Image Token Costs

| Detail level | Approx. tokens | Typical resolution |
|--------------|----------------|--------------------|
| Low res      | ~85            | 512×512            |
| Medium       | ~170           | 768×768            |
| High res     | ~1500          | 2048×2048          |
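The tiers in the table translate into a simple cost estimate. Note that the price per 1K tokens below is a hypothetical placeholder, not a real provider rate; check your provider's current pricing.

```python
# Token counts per detail tier, taken from the table above.
IMAGE_TOKEN_TIERS = {
    "low": 85,      # ~512x512
    "medium": 170,  # ~768x768
    "high": 1500,   # ~2048x2048
}

def image_cost_usd(detail: str, price_per_1k_tokens: float = 0.01) -> float:
    """Estimated cost of one image at a given detail tier.
    price_per_1k_tokens is an illustrative placeholder value."""
    return IMAGE_TOKEN_TIERS[detail] / 1000 * price_per_1k_tokens

print(f"{image_cost_usd('high'):.4f}")  # 0.0150
```

At these numbers, one high-detail image costs roughly as much as ~1500 text tokens, which is why resizing matters at scale.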
Best Practices
- Size: Resize images to save tokens if details are not critical
- Format: Use JPEG for photos, PNG for screenshots with text
- Prompt: Be specific about what you want to know about the image
- Multiple images: Number or describe images in your prompt
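The Size tip above can be sketched as a small helper that computes downscaled dimensions before upload; `max_side` is an illustrative default here, not a provider limit.

```python
def fit_within(width: int, height: int, max_side: int = 768) -> tuple[int, int]:
    """Scale dimensions so the longer side is at most max_side,
    preserving aspect ratio; never upscale."""
    scale = min(1.0, max_side / max(width, height))
    return round(width * scale), round(height * scale)

print(fit_within(4032, 3024))  # (768, 576) -- a typical phone photo
print(fit_within(640, 480))    # (640, 480) -- already small, left untouched
```

Resizing a 4032×3024 phone photo down to 768×576 before sending it can cut the visual-token count by an order of magnitude when fine detail is not needed.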
Goal: get a useful description of an interior photo instead of a generic one.

Generic prompt ("Describe this image"):

The image shows a kitchen. It is bright and looks modern. There are cabinets, a countertop, a sink, and a window. There are items on the table.

Specific, expert-role prompt:
1. Style: Scandinavian minimalism — white cabinets, natural wood, no decorative clutter, functionality-first.
2. Palette: Base — white (#FAFAFA) + light oak. Accents — matte black hardware, green plants.
3. Materials: Countertop — solid oak (or quality veneer), cabinets — matte enamel, floor — light wood-look laminate.
4. Lighting: Large window (north-facing?) — soft diffused light. Additional: LED spots under upper cabinets.
5. Improvements: Add backsplash (wall is unprotected near sink), replace overhead light with pendant over island for coziness.
A bare "Describe this image" gives a near-useless result. Specific questions plus an expert role turn a description into a professional analysis with actionable recommendations.