Image Analysis
Practical applications
The Problem: Beyond just describing images, we need AI that can deeply analyze visual content — identify patterns, compare images, and provide expert insights.
The Solution: From Pixels to Structured Data
While Vision Basics covers general image understanding (describing photos, visual Q&A), image analysis focuses on extracting structured, machine-usable data from documents, charts, medical images, and technical diagrams. It's the difference between describing a painting and reading a patient's X-ray: the goal isn't a pleasant sentence, it's a set of fields you can store, validate, and act on. Results are returned as structured output (JSON, tables) so a downstream program can consume them without parsing free-form prose.
How it works under the hood
A modern vision model doesn't run a separate OCR engine and then a language model. The image is split into patches, each patch is turned into an embedding, and those visual tokens are fed into the same transformer that processes text. Because the model reasons over pixels and your instructions jointly, it can read a number off a chart axis, associate it with the right bar, and place it into the JSON field you asked for, all in one pass. This is also why the prompt matters so much: a vague request like "describe this" leaves the model to guess what structure you want, while an explicit schema ({"invoice_no": ..., "total": ..., "line_items": [...]}) tells it exactly which fields to fill. Cost grows with the number of visual tokens, which scales roughly with pixel count (width × height), so higher-resolution images are more expensive to process.
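To make the cost point concrete, here is a rough sketch of how the patch count grows with image size. It assumes a plain grid of square patches with a hypothetical patch size of 16 pixels and one token per patch; real models resize, tile, or pool images in their own ways, so the numbers are illustrative only.

```python
import math


def estimate_visual_tokens(width: int, height: int, patch_size: int = 16) -> int:
    """Rough number of patches (visual tokens) for an image.

    Assumption: one token per patch_size x patch_size patch, with partial
    patches at the right and bottom edges counted as full patches.
    """
    if width <= 0 or height <= 0 or patch_size <= 0:
        raise ValueError("width, height and patch_size must be positive")
    cols = math.ceil(width / patch_size)
    rows = math.ceil(height / patch_size)
    return cols * rows


small = estimate_visual_tokens(512, 512)
large = estimate_visual_tokens(1024, 1024)
print(small, large)  # 1024 4096
```

Under this assumption, doubling both sides of the image quadruples the token estimate, which is why it usually pays to downscale a scan to the smallest size that keeps the text legible.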
When to use it, and the main pitfall
Reach for image analysis whenever the "source of truth" lives in a picture rather than clean text: scanned invoices, lab reports, dashboards, ID cards, engineering schematics. The main pitfall is misplaced trust: vision models can hallucinate plausible-looking values (a total that isn't printed, a date that "looks right"), and they are notoriously weak at precise counting and at tiny, low-contrast text. The defence is grounding: ask the model to copy values verbatim, to return null when a field is illegible, and to tag each field as [VERIFIED] or [UNVERIFIED]. Worked example: give a model a blurry receipt and the prompt "Extract merchant, date, and total as JSON; if any field is unreadable, set it to null and add a `confidence` 0–1." Instead of inventing a plausible date, a well-grounded model returns {"merchant": "Acme Cafe", "date": null, "total": 42.0, "confidence": 0.6}. That null plus the low confidence is exactly the signal that routes the document to a human instead of silently corrupting your database.
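The routing rule from the worked example can be sketched in a few lines. This is a minimal illustration, not a production policy: the function name `needs_human_review`, the required-field list, and the 0.8 threshold are all assumptions chosen for the example.

```python
import json

REQUIRED_FIELDS = ("merchant", "date", "total")
CONFIDENCE_THRESHOLD = 0.8  # hypothetical cut-off; tune it on your own data


def needs_human_review(raw_reply: str) -> bool:
    """Return True when the extraction should be checked by a person."""
    try:
        data = json.loads(raw_reply)
    except json.JSONDecodeError:
        return True  # unparseable output is never auto-accepted
    if not isinstance(data, dict):
        return True
    # A null (or missing) required field means the model could not read it.
    if any(data.get(field) is None for field in REQUIRED_FIELDS):
        return True
    confidence = data.get("confidence")
    if isinstance(confidence, bool) or not isinstance(confidence, (int, float)):
        return True  # missing or malformed confidence is treated as low
    return confidence < CONFIDENCE_THRESHOLD


reply = '{"merchant": "Acme Cafe", "date": null, "total": 42.0, "confidence": 0.6}'
print(needs_human_review(reply))  # True
```

The check deliberately fails closed: anything it cannot parse or score goes to a human rather than into the database.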
Think of it like a specialist reading an X-ray:
- 1. Identify document type: Is it a chart, a form, a medical scan, or a receipt? The prompt strategy differs for each
- 2. OCR + layout parsing: Extract text while preserving structure — columns, headers, table cells, not just raw text
- 3. Structured extraction: Ask for JSON output: {"patient": "...", "diagnosis": "...", "medications": [...]}
- 4. Validation & grounding: Mark extracted data as [VERIFIED] or [UNVERIFIED] — LLMs can hallucinate entity names from documents
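The four steps above can be folded into one prompt template. The sketch below is only one way to do it: `build_extraction_prompt` and the per-type hints in `TYPE_HINTS` are hypothetical names and wording, and the document type is assumed to be known already (from a classifier or an earlier model call).

```python
import json

# Hypothetical per-type hints; adjust the wording to your own documents.
TYPE_HINTS = {
    "receipt": "Read line items top to bottom and keep each price with its item.",
    "form": "Pair each printed label with the value written next to it.",
    "chart": "Read axis labels and units before reading data points.",
    "medical_scan": "Report only findings and measurements that are visible.",
}


def build_extraction_prompt(doc_type: str, schema: dict) -> str:
    """Assemble a prompt covering type, layout, schema and grounding."""
    hint = TYPE_HINTS.get(doc_type)
    if hint is None:
        raise ValueError(f"unknown document type: {doc_type!r}")
    return "\n".join([
        f"This image is a {doc_type.replace('_', ' ')}. {hint}",            # step 1
        "Preserve layout: keep table rows, columns and headers together.",  # step 2
        "Return JSON matching this schema:",                                # step 3
        json.dumps(schema, indent=2),
        "Copy values verbatim and use null for anything illegible.",        # step 4
        "For each field, also report a status of [VERIFIED] or [UNVERIFIED].",
    ])


prompt = build_extraction_prompt(
    "form", {"patient": "string", "diagnosis": "string", "medications": ["string"]}
)
print(prompt)
```

Keeping the hints in a dictionary makes the "prompt strategy differs for each type" idea explicit and easy to extend.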
Where Is This Used?
- Document Processing: Extract names, dates, amounts from scanned contracts, invoices, receipts — with structured JSON output
- Chart & Graph Reading: Interpret bar charts, line graphs, pie charts — extract data points and trends
- Medical Report Analysis: Parse lab results, radiology reports — extract diagnosis, measurements, recommendations
- Technical Diagrams: Read architecture diagrams, flowcharts, circuit schematics — describe components and connections
Fun Fact: Vision models can sometimes spot things humans miss! In some medical imaging studies, AI systems have flagged early-stage cancers that radiologists initially overlooked. The combination of AI + human review is often more accurate than either alone.
Try It Yourself!
Use the interactive example below to perform detailed analysis on different types of images and see the depth of AI understanding.
Prompt Quality Matters
Generic prompt
"Describe this image"
Result:
This is a medical form with patient information and test results.
Structured prompt
"Extract from this medical form: 1) Patient name 2) Date 3) All test results as JSON {test, value, unit, reference_range}"
Result:
{"patient": "Jane Doe", "date": "2025-01-15", "results": [{"test": "Glucose", "value": 95, "unit": "mg/dL", "reference_range": "70-100"}]}
For advanced OCR techniques (table extraction, multi-page documents, and handwritten text), see Document Understanding.
Confidence Markers
Always ask the model to mark extracted data with confidence levels. This helps catch hallucinated values.
For each extracted field, mark as:
[VERIFIED]: clearly visible in the image
[UNVERIFIED]: partially visible or inferred
[NOT_FOUND]: not present in the image
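Once the model tags its fields, the markers can be split out mechanically. The sketch below assumes the model was asked to answer one field per line in the form `field: value [MARKER]`; that line format and the `split_by_marker` helper are assumptions made for this example, not a standard output format.

```python
import re

# Assumed line format: "<field>: <value> [MARKER]"
LINE_PATTERN = re.compile(
    r"^\s*(?P<field>[^:]+?)\s*:\s*(?P<value>.*?)\s*"
    r"\[(?P<marker>VERIFIED|UNVERIFIED|NOT_FOUND)\]\s*$"
)


def split_by_marker(model_output: str) -> dict:
    """Group extracted fields by their confidence marker."""
    buckets = {"VERIFIED": {}, "UNVERIFIED": {}, "NOT_FOUND": {}}
    for line in model_output.splitlines():
        match = LINE_PATTERN.match(line)
        if match:  # lines that do not follow the format are ignored
            buckets[match["marker"]][match["field"]] = match["value"] or None
    return buckets


sample = """\
merchant: Acme Cafe [VERIFIED]
total: 42.00 [UNVERIFIED]
date: [NOT_FOUND]
"""
result = split_by_marker(sample)
print(result)
```

Anything that lands in the UNVERIFIED or NOT_FOUND bucket is a natural candidate for manual review.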
Frequently asked questions
How is image analysis different from describing a photo?
Describing a photo gives free-form text ("a coffee cup on a desk"), while image analysis extracts structured data — specific fields as JSON or a table that you can store, validate, and process programmatically. The goal isn't a nice sentence, it's a machine-usable set of values: merchant, date, total, lab results.
How do I make a vision model return data as JSON?
Give an explicit schema in the prompt: list the fields and their format, e.g. "Extract from this receipt: merchant, date (YYYY-MM-DD), and total as JSON {merchant, date, total}". The more concrete the structure, the less the model guesses. Many APIs also offer a JSON mode or structured-output feature that constrains the reply to valid JSON and, when you supply a schema, to that schema's shape. This guarantees the format only, not that the extracted values are correct.
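Even with an explicit schema, it is worth checking the reply before using it. The following is a minimal sketch for the receipt prompt above; `parse_receipt_reply` is a hypothetical helper, and it assumes the reply is a bare JSON object with no surrounding prose.

```python
import json
from datetime import datetime


def parse_receipt_reply(raw_reply: str) -> dict:
    """Parse a reply that should be a JSON object {merchant, date, total}.

    Raises ValueError if the reply is not valid JSON, lacks a field, or
    uses a type or date format other than the one the prompt asked for.
    """
    data = json.loads(raw_reply)  # json.JSONDecodeError subclasses ValueError
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    missing = {"merchant", "date", "total"} - data.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    date = data["date"]
    if date is not None:
        if not isinstance(date, str):
            raise ValueError("date must be a string or null")
        datetime.strptime(date, "%Y-%m-%d")  # raises ValueError if it does not parse
    total = data["total"]
    if total is not None and (
        isinstance(total, bool) or not isinstance(total, (int, float))
    ):
        raise ValueError("total must be a number or null")
    return data


parsed = parse_receipt_reply(
    '{"merchant": "Acme Cafe", "date": "2025-01-15", "total": 42.0}'
)
print(parsed["total"])  # 42.0
```

Null values pass the check on purpose: a null is a legitimate answer for an illegible field and should be handled by review, not rejected as malformed.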
Why does the model invent values that aren't in the image?
That's hallucination: the model fills in a "plausible" answer, especially when text is blurry, tiny, or low-contrast, and vision models are weak at precise counting. The defence is grounding: ask it to copy values verbatim, return null for illegible fields, and tag each field as [VERIFIED] or [UNVERIFIED].
Can I trust AI to extract data from medical or financial documents?
Only as a first draft with mandatory human review on critical fields. The best practice is AI + review: the model extracts data and attaches a confidence level, and any low-confidence or null fields are routed to a human. This human-in-the-loop combination is often more accurate than either alone and helps keep errors out of your database.
Try it yourself
OCR and document understanding: raw text extraction vs structured analysis
GROCERY STORE 24
MILK 2.5% 89.90
BORODINSKY BREAD 65.00
RUSSIAN CHEESE 450G 389.00
BANANAS 1.2KG 95.88
EGGS C1 10PCS 109.00
TOTAL 748.78
CARD PAYMENT
THANK YOU FOR YOUR PURCHASE
1. Store: Grocery Store 24
2. Items:
[
{"name": "Milk 2.5%", "quantity": 1, "unit": "pcs", "price_per_unit": 89.90, "total": 89.90},
{"name": "Borodinsky Bread", "quantity": 1, "unit": "pcs", "price_per_unit": 65.00, "total": 65.00},
{"name": "Russian Cheese", "quantity": 1, "unit": "450g", "price_per_unit": 389.00, "total": 389.00},
{"name": "Bananas", "quantity": 1.2, "unit": "kg", "price_per_unit": 79.90, "total": 95.88},
{"name": "Eggs C1", "quantity": 10, "unit": "pcs", "price_per_unit": 10.90, "total": 109.00}
]
3. Totals: {"subtotal": 748.78, "tax": null, "total": 748.78, "payment_method": "card"}
4. Date: Not visible on the receipt.
5. Verification: 89.90 + 65.00 + 389.00 + 95.88 + 109.00 = 748.78 ✓ Matches.
OCR without structure is just text. A prompt with JSON schema + verification (sum = total?) turns a document photo into data ready for system import.
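The sum check is easy to automate after extraction. This sketch reuses the line totals from the receipt above and compares them with the printed total using `Decimal`, since adding prices as binary floats can be off by a tiny fraction; the helper name `line_items_match_total` is an assumption for the example.

```python
from decimal import Decimal


def line_items_match_total(items: list, total) -> bool:
    """Check that the line totals add up to the printed total, to the cent."""
    # str() first, so each float is converted via its short printed form.
    line_sum = sum((Decimal(str(item["total"])) for item in items), Decimal("0"))
    return line_sum == Decimal(str(total))


items = [
    {"name": "Milk 2.5%", "total": 89.90},
    {"name": "Borodinsky Bread", "total": 65.00},
    {"name": "Russian Cheese", "total": 389.00},
    {"name": "Bananas", "total": 95.88},
    {"name": "Eggs C1", "total": 109.00},
]
print(line_items_match_total(items, 748.78))  # True
```

A mismatch does not say which value is wrong, only that the document needs another look, so it works best as a trigger for human review.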