Lesson 1

Vision LLMs

GPT-4V, Claude Vision

The Problem: AI can read text, but what about images? How can we make AI "see" and understand visual content like photos and screenshots?

The Solution: Teaching AI to See

Vision capabilities allow LLMs to process and understand images alongside text. A standard text-only model can only read words; a vision model can also "look" at a photo, a screenshot, or a scanned document and reason about what it contains. It's like describing a photo to someone over the phone — except the AI now sees the photo itself, so it can answer questions about it, summarize a chart, or transcribe text from a sign.

How it works

Under the hood, the image is not magically "understood" pixel by pixel. First it is cut into a grid of small fixed-size squares called patches (for example, sixteen-by-sixteen pixels each). A vision encoder — usually a Transformer — turns every patch into a vector of numbers, an embedding. Through self-attention, the patches compare themselves to one another ("this patch is part of a face, that one is part of a hat — they belong together"), building a representation of the whole scene. These visual tokens are then placed in the same space as text tokens, so the language model can reason over words and pixels jointly. This is the core idea behind models like CLIP and the vision variants of GPT-4o, Claude, and Gemini.

When to use it and what to watch for

Reach for vision when the answer lives in an image: generating alt-text for accessibility, reading invoices and forms, extracting data from charts, or answering "what is wrong in this screenshot?". But keep two tradeoffs in mind. First, cost and latency scale with resolution — more pixels mean more patches, more tokens, and a bigger bill, so downscale images that don't need fine detail. Second, vision models still hallucinate: they may confidently miscount objects or misread blurry text. A concrete example: ask a model "how many people are in this photo?" on a crowded image and it might answer "about 8" when there are 11 — close, but wrong. For anything where exact counts or exact characters matter, verify the output rather than trusting it blindly.

Think of it like describing a photo over the phone:

1. Input image: A 224×224 image enters the Vision Transformer (ViT)
2. Split into patches: Image is divided into 196 patches of 16×16 pixels each — like cutting a photo into a grid
3. Encode to visual tokens: Each patch becomes a visual token — 196 tokens total, like words in a sentence for the model
4. Self-attention: Tokens attend to each other: "this patch has a face, that patch has a hat — they belong together"
5. Merge with text: Visual tokens join text tokens. The model reasons over both to answer questions or describe the scene

Higher resolution = more patches = more tokens = higher cost. A 512×512 image produces ~1024 tokens. A 4K image can exceed 10,000 tokens.

Where Is This Used?

Image Description: Generating alt-text for accessibility
Document Analysis: Reading charts, forms, and screenshots
Visual Q&A: Answering questions about photos
Content Moderation: Detecting inappropriate images

Fun Fact: Modern vision models can read text in images (OCR), understand memes, analyze charts, and even describe art style! They combine visual understanding with language knowledge for powerful multimodal reasoning.

Try It Yourself!

Use the interactive example below to see how AI can analyze and describe different types of images.

Want to optimize image costs? See the full cost calculator in the Multimodal Costs lesson.

Common Failure Modes

Model says 'a cat on the couch' but it's actually a cushion pattern
Model counts 12 eggs but there are only 11
Model reads 'OPEN' on a sign that actually says 'OPER'

Learn all 5 types of vision hallucinations →

Vision Language Models

How Vision LLMs Work

Image Input

Visual Encoder

Image Tokens

LLM Processing

JPEG, PNG, WebP, GIF

Image Token Costs

Low res

~85 tokens

512x512

Medium

~170 tokens

768x768

High res

~1500 tokens

2048x2048

Vision Model Comparison

GPT-5(OpenAI)

$5.00/1M tokens

Best visionVideo understandingComplex reasoning

Claude Sonnet 4(Anthropic)

$3.00/1M tokens

Image analysisDocument understandingCode from screenshots

GPT-4o(OpenAI)

$2.50/1M tokens

Image descriptionOCRObject detectionCharts/diagrams

Gemini 2.0 Pro(Google)

$1.25/1M tokens

Image + Video2M contextMulti-frame analysis

Qwen2-VL 72B(Alibaba (Open))

Free (self-hosted)

Open sourceSelf-hostedGood quality

Popular Use Cases

📄Document OCR & extraction

🛍️Product image analysis

🏥Medical image interpretation

📱UI/UX screenshot analysis

📊Chart & graph understanding

♿Accessibility descriptions

Best Practices

Size: Resize images to save tokens if details are not critical
Format: Use JPEG for photos, PNG for screenshots with text
Prompt: Be specific about what you want to know about the image
Multiple images: Number or describe images in your prompt

Frequently asked questions

How does a neural network actually "see" an image?

It doesn't process pixels one by one. The image is first cut into a grid of fixed-size patches (e.g. 16×16 pixels), then a vision encoder — usually a Transformer — turns each patch into an embedding (a vector of numbers). Through self-attention the patches compare themselves to one another and build a representation of the whole scene. These visual tokens are placed in the same space as text tokens, so the language model can reason over words and pixels together.

How is a vision model different from a regular text-only LLM?

A text-only LLM can only read and write text. A vision model also accepts images — photos, screenshots, scanned documents — and can answer questions about them, describe their contents, read text from them (OCR), and analyze charts. It is essentially the same language model plus a vision encoder that converts the image into tokens the model can reason over.

Why is processing images more expensive than text?

Cost and latency scale with resolution: more pixels mean more patches, more visual tokens, and a bigger bill. A 512×512 image produces around 1024 tokens, while a 4K image can exceed 10,000 tokens. If you don't need fine detail, downscale the image before sending it to save tokens and money.

Can vision models make mistakes or hallucinate?

Yes. Vision models can confidently but incorrectly count objects, confuse similar items, and misread blurry or small text. For example, on a crowded photo a model might answer "about 8" people when there are actually 11. For anything where exact counts or exact characters matter — invoices, forms, meter readings — verify the output instead of trusting it blindly.

Try it yourself

Interactive demo of this technique

Technique Comparison

Demo Mode

Pre-recorded responses

TaskBeginnerAnalysis

Get a useful description of an interior photo instead of a generic one

Without technique

Prompt

Describe this image. [Photo: Scandinavian-style kitchen]

Response

The image shows a kitchen. It is bright and looks modern. There are cabinets, a countertop, a sink, and a window. There are items on the table.

Tokens:270/45

Time:1800ms

Quality:

With Vision

With technique

Prompt

Analyze this kitchen photo as an interior designer: 1. Style: what style, by what features? 2. Color palette: main colors and accents 3. Materials: countertop, cabinets, flooring 4. Lighting: type, direction, mood 5. What could be improved? [Photo: Scandinavian-style kitchen]

Response

1. Style: Scandinavian minimalism — white cabinets, natural wood, no decorative clutter, functionality-first.

2. Palette: Base — white (#FAFAFA) + light oak. Accents — matte black hardware, green plants.

3. Materials: Countertop — solid oak (or quality veneer), cabinets — matte enamel, floor — light wood-look laminate.

4. Lighting: Large window (north-facing?) — soft diffused light. Additional: LED spots under upper cabinets.

5. Improvements: Add backsplash (wall is unprotected near sink), replace overhead light with pendant over island for coziness.

👁️"Describe this image" → model gives surface-level description: "kitchen, bright, has cabinets"

🧠Specific questions (style, palette, materials) guide the analysis deeper

🧠Role "interior designer" activates professional vocabulary and expert perspective

✅Structured prompt = structured response. Each point = separate analysis aspect

Tokens:310/200

Time:3200ms

Quality:

Why this works

"Describe this image" gives a useless result. Specific questions + expert role turn a description into professional analysis with actionable recommendations.

1 / 2

Practice Challenges

Create a free account to solve challenges

6 AI-verified challenges for this lesson

Related lessons:Embeddings Image Analysis

This lesson is part of a structured LLM course.

My Learning Path

Lesson 1

Vision LLMs

GPT-4V, Claude Vision

The Problem: AI can read text, but what about images? How can we make AI "see" and understand visual content like photos and screenshots?

The Solution: Teaching AI to See

How it works

When to use it and what to watch for

Think of it like describing a photo over the phone:

1. Input image: A 224×224 image enters the Vision Transformer (ViT)
2. Split into patches: Image is divided into 196 patches of 16×16 pixels each — like cutting a photo into a grid
3. Encode to visual tokens: Each patch becomes a visual token — 196 tokens total, like words in a sentence for the model
4. Self-attention: Tokens attend to each other: "this patch has a face, that patch has a hat — they belong together"
5. Merge with text: Visual tokens join text tokens. The model reasons over both to answer questions or describe the scene

Higher resolution = more patches = more tokens = higher cost. A 512×512 image produces ~1024 tokens. A 4K image can exceed 10,000 tokens.

Where Is This Used?

Image Description: Generating alt-text for accessibility
Document Analysis: Reading charts, forms, and screenshots
Visual Q&A: Answering questions about photos
Content Moderation: Detecting inappropriate images

Try It Yourself!

Use the interactive example below to see how AI can analyze and describe different types of images.

Want to optimize image costs? See the full cost calculator in the Multimodal Costs lesson.

Common Failure Modes

Model says 'a cat on the couch' but it's actually a cushion pattern
Model counts 12 eggs but there are only 11
Model reads 'OPEN' on a sign that actually says 'OPER'

Learn all 5 types of vision hallucinations →

Vision Language Models

How Vision LLMs Work

Image Input

Visual Encoder

Image Tokens

LLM Processing

JPEG, PNG, WebP, GIF

Image Token Costs

Low res

~85 tokens

512x512

Medium

~170 tokens

768x768

High res

~1500 tokens

2048x2048

Vision Model Comparison

GPT-5(OpenAI)

$5.00/1M tokens

Best visionVideo understandingComplex reasoning

Claude Sonnet 4(Anthropic)

$3.00/1M tokens

Image analysisDocument understandingCode from screenshots

GPT-4o(OpenAI)

$2.50/1M tokens

Image descriptionOCRObject detectionCharts/diagrams

Gemini 2.0 Pro(Google)

$1.25/1M tokens

Image + Video2M contextMulti-frame analysis

Qwen2-VL 72B(Alibaba (Open))

Free (self-hosted)

Open sourceSelf-hostedGood quality

Popular Use Cases

📄Document OCR & extraction

🛍️Product image analysis

🏥Medical image interpretation

📱UI/UX screenshot analysis

📊Chart & graph understanding

♿Accessibility descriptions

Best Practices

Size: Resize images to save tokens if details are not critical
Format: Use JPEG for photos, PNG for screenshots with text
Prompt: Be specific about what you want to know about the image
Multiple images: Number or describe images in your prompt