Document Understanding
From scan to structured data
The Problem: Traditional OCR gives you raw text with broken formatting, split words, and misread characters. Turning that into structured, validated data requires a second step — and LLMs excel at it.
The Solution: From Pixels to Validated Data
Document understanding is the task of turning a page — an invoice, a contract, a lab report — into structured, machine-readable data while preserving its meaning. Traditional OCR only solves half of the problem: it extracts characters from pixels but loses structure. Columns merge, tables break, multi-line addresses collapse, and a smudged digit becomes the wrong number. The output is a flat wall of text where you can no longer tell which value is the total and which is a line item. LLM-based document understanding goes further: it reads, corrects, and structures the content in one reasoning pass. The model understands that "$I2.99" is almost certainly "$12.99", that "TechParts L LC" is "TechParts LLC", and that the bold number at the bottom is the grand total, not just another row.
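To make the "$I2.99" correction concrete, here is a minimal rule-based sketch. It is a deliberately crude stand-in, not what the LLM does: the model corrects from context, while this only swaps a few look-alike letters inside dollar amounts. The function name `fix_amounts` and the confusion table are assumptions made for illustration.

```python
import re

# Hypothetical table of letters that OCR commonly confuses with digits.
CONFUSIONS = str.maketrans({"I": "1", "l": "1", "O": "0", "S": "5"})

def fix_amounts(text):
    """Replace look-alike letters, but only inside $-prefixed amounts."""
    return re.sub(
        r"\$[0-9IlOS.,]+",
        lambda match: match.group(0).translate(CONFUSIONS),
        text,
    )

print(fix_amounts("$I2.99"))
print(fix_amounts("Subtota1 $4S.00"))
```

Note that the second call repairs the amount but leaves "Subtota1" untouched: fixing words safely needs the kind of context a fixed substitution table does not have.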
How it works
The pipeline runs in stages: raw text extraction (OCR or, in newer vision-language models, reading the page image directly as visual tokens), then LLM correction and structuring, then schema-based extraction into a target shape like JSON. The crucial last stage is validation. Because an LLM will happily guess a missing field, every extracted value gets a confidence marker — each field is tagged [VERIFIED] or [UNVERIFIED] — so downstream accounting or compliance systems know what to trust and what a human should double-check. You also cross-check arithmetic (do the line items sum to the stated total?) to catch silent hallucinations before they reach a database.
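The arithmetic cross-check can be sketched in a few lines. This is an illustration under assumptions: the function name, the one-cent tolerance, and the use of string amounts are choices made here, not part of the lesson. `Decimal` is used so that currency values compare exactly.

```python
from decimal import Decimal

def line_items_match_total(line_totals, stated_total, tolerance=Decimal("0.01")):
    """Return True when the line items sum to the stated total within a tolerance."""
    computed = sum((Decimal(t) for t in line_totals), Decimal("0"))
    return abs(computed - Decimal(stated_total)) <= tolerance

# Three line totals that do add up, then the same lines against a wrong total.
ok = line_items_match_total(["649.50", "350.00", "700.00"], "1699.50")
bad = line_items_match_total(["649.50", "350.00", "700.00"], "1799.50")
print(ok, bad)
```

A failed check does not say which value is wrong, only that at least one of them should not be trusted yet.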
When to use it, and the tradeoffs
Reach for this approach when documents are semi-structured and varied: thousands of invoices in dozens of vendor layouts, where writing a rigid template parser for each one is hopeless. The flexibility is the win: one prompt handles formats you have never seen. The tradeoffs are real, though. LLMs cost more per page than classic OCR, add latency, and can fabricate plausible-but-wrong values, so they are risky for high-stakes fields without validation. As a concrete example, feed in a scanned receipt where OCR returns "Subtota1 $4S.00 ... Tota1 $48.60". The model corrects the garbled characters to "Subtotal $45.00" and "Total $48.60" and infers the missing tax line as $3.60, the difference between the two. It marks the subtotal and total as [VERIFIED], since both were read from the page and reconcile once the tax is included, and flags the inferred tax as [UNVERIFIED] for a human to confirm.
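The receipt example can be written out as a small tagging routine. This is a sketch under assumptions: the function name `tag_receipt` and the rule "an inferred value is always [UNVERIFIED]" are illustrative choices, not a prescribed implementation.

```python
from decimal import Decimal

VERIFIED, UNVERIFIED = "VERIFIED", "UNVERIFIED"

def tag_receipt(subtotal, total, tax=None):
    """Tag receipt fields; a tax value that had to be inferred is never VERIFIED."""
    subtotal, total = Decimal(subtotal), Decimal(total)
    if tax is None:
        # Tax was not readable on the page: infer it and flag it for review.
        return {
            "subtotal": (subtotal, VERIFIED),
            "total": (total, VERIFIED),
            "tax": (total - subtotal, UNVERIFIED),
        }
    tax = Decimal(tax)
    # All three values were read: trust them only if they reconcile.
    status = VERIFIED if subtotal + tax == total else UNVERIFIED
    return {
        "subtotal": (subtotal, status),
        "total": (total, status),
        "tax": (tax, status),
    }

receipt = tag_receipt("45.00", "48.60")
print(receipt["tax"])
```

When all three values are present but do not add up, the sketch downgrades every field, because the check cannot tell which one was misread.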
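Think of it like a paralegal reading contracts for a law firm: they work from a poor photocopy, fix the obvious misprints, fill in the firm's standard form, and flag anything they are unsure of for a partner to check. The same four steps apply here: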
- 1. Raw OCR extraction: Scan the document image and extract all text — expect errors: split words, misread characters, broken layout
- 2. LLM structuring & correction: LLM reads the raw OCR, fixes errors (split words, misread digits), and organizes into logical sections
- 3. Schema-based extraction: Apply a target JSON schema to extract specific fields — company, amounts, dates, line items
- 4. Validation & confidence marking: Tag each field as [VERIFIED] or [UNVERIFIED] — cross-check totals, flag inferred values, catch hallucinations
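Step 3 depends on having an explicit target shape to check the model's output against. The sketch below uses a plain dictionary of expected types rather than a full JSON Schema validator. The field names follow the invoice example in this lesson, and `schema_problems` is a hypothetical helper.

```python
# Hypothetical target shape: field name -> accepted Python type(s).
INVOICE_SCHEMA = {
    "company": str,
    "invoice_number": str,
    "date": str,
    "items": list,
    "grand_total": (int, float),
}

def schema_problems(record, schema=INVOICE_SCHEMA):
    """List missing or wrongly typed fields in a parsed model response."""
    problems = []
    for name, expected in schema.items():
        if name not in record:
            problems.append(f"missing: {name}")
        elif not isinstance(record[name], expected):
            problems.append(f"wrong type: {name}")
    return problems

complete = {
    "company": "TechParts LLC",
    "invoice_number": "INV-2024-0847",
    "date": "2024-03-15",
    "items": [],
    "grand_total": 1869.45,
}
incomplete = {k: v for k, v in complete.items() if k != "date"}
print(schema_problems(complete), schema_problems(incomplete))
```

A shape check like this only catches structural problems. Whether the values themselves are right is the job of step 4.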
Where Is This Used?
- Invoice & Receipt Processing: Extract vendor, amounts, line items, tax from scanned invoices — output structured JSON for accounting systems
- Contract Analysis: Parse clauses, dates, parties, obligations from legal documents with confidence markers for uncertain readings
- Medical Records: Extract diagnoses, medications, lab values from handwritten or printed medical forms
- Technical Drawings & Schematics: Read dimensions, labels, part numbers from engineering drawings and circuit diagrams
Fun Fact: OCR-free models like mPLUG-DocOwl2 and Docopilot (CVPR 2025) skip the OCR step entirely — they process the document image directly as visual tokens. This eliminates OCR errors at the source, though they still need grounding prompts to avoid hallucinating content.
Frequently asked questions
How is LLM document understanding different from regular OCR?
Plain OCR only recognizes characters in an image and returns flat text with no structure: columns merge, tables break, and smudged digits get misread. LLM document understanding goes further — it reads that raw text, fixes recognition errors, restores structure, and extracts the fields you need into a format like JSON against a target schema. In short, OCR answers 'what characters are here', while the LLM answers 'what these characters mean and how to organize them into fields'.
How do you extract structured data (JSON) from an invoice or PDF with an LLM?
The pipeline usually has four steps: 1) raw text extraction via OCR or a vision-language model that reads the page directly; 2) LLM correction — the model fixes split words and misread digits; 3) schema-based extraction — you define a target JSON shape (vendor, amounts, dates, line items) and the model fills it; 4) validation — cross-check the arithmetic and tag each field [VERIFIED] or [UNVERIFIED]. The key is always providing an explicit schema and verifying totals, otherwise the model may fabricate missing values.
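One way to "always provide an explicit schema" is to embed it in the prompt itself. The sketch below only assembles the prompt text and makes no model call. The prompt wording, the `INVOICE_SHAPE` layout, and the function name are assumptions for illustration.

```python
import json

# Hypothetical target shape; field names follow the list in the answer above.
INVOICE_SHAPE = {
    "vendor": "string",
    "date": "YYYY-MM-DD",
    "amounts": {"subtotal": "number", "tax": "number", "total": "number"},
    "line_items": [{"name": "string", "qty": "number", "total": "number"}],
}

def build_extraction_prompt(ocr_text, shape=INVOICE_SHAPE):
    """Assemble a prompt that pins the model to an explicit output shape."""
    return "\n".join([
        "Correct OCR errors (split words, misread digits) in the text below.",
        "Return only JSON matching this shape:",
        json.dumps(shape, indent=2),
        "Use null for any value you cannot read. Do not guess.",
        "OCR text:",
        ocr_text,
    ])

prompt = build_extraction_prompt("TechParts L LC I NV-2024-0847 2024-03-15")
print(prompt)
```

Asking for null instead of a guess gives the validation step something unambiguous to act on.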
Can an LLM make mistakes reading documents, and how do you control for it?
Yes. An LLM will happily guess a missing or unreadable field and emit a plausible-but-wrong value — a hallucination. You control this with confidence markers: each field is tagged [VERIFIED] (read and reconciled during cross-checks) or [UNVERIFIED] (inferred or doubtful). You also validate arithmetic — for example, do the line items sum to the stated total. For high-stakes fields like amounts or IDs, [UNVERIFIED] values are routed to a human for confirmation.
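Routing can be as simple as filtering on the tag. The sketch below assumes each field is stored as a (value, status) pair and that the set of high-stakes field names is configured by hand. Both are illustrative assumptions, and the "note" field is made up for the example.

```python
# Hypothetical set of fields that must never reach a database unconfirmed.
HIGH_STAKES = {"subtotal", "tax", "total"}

def fields_for_review(tagged, high_stakes=HIGH_STAKES):
    """Return the names of high-stakes fields that are still UNVERIFIED."""
    return sorted(
        name
        for name, (_value, status) in tagged.items()
        if status == "UNVERIFIED" and name in high_stakes
    )

tagged = {
    "subtotal": ("45.00", "VERIFIED"),
    "total": ("48.60", "VERIFIED"),
    "tax": ("3.60", "UNVERIFIED"),
    "note": ("Thank you", "UNVERIFIED"),  # low stakes, not routed
}
review_queue = fields_for_review(tagged)
print(review_queue)
```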
When should you use an LLM for document processing instead of a template parser?
LLMs win when documents are semi-structured and varied — thousands of invoices across many vendor layouts where writing a rigid template for each format is hopeless. One prompt handles formats you have never seen. The downsides: LLMs cost more per page, add latency, and can err without validation. If you have a single stable format with fixed layout, a classic template parser is cheaper, faster, and more predictable.
Try it yourself
Extract structured data from a document scan instead of raw text
Raw OCR input:
TechParts L LC I NV-2024-0847 2024-03-15 Wid get A 50 649.50 Connector B 100 350.00 Cable Set C 25 700.00 Tax 1,869.45
Structured output:
{
  "company": "TechParts LLC", // [VERIFIED]
  "invoice_number": "INV-2024-0847", // [VERIFIED]
  "date": "2024-03-15", // [VERIFIED]
  "items": [
    {"name": "Widget A", "qty": 50, "unit_price": 12.99, "total": 649.50},
    {"name": "Connector B", "qty": 100, "unit_price": 3.50, "total": 350.00},
    {"name": "Cable Set C", "qty": 25, "unit_price": 28.00, "total": 700.00}
  ],
  "tax": 169.95, // [VERIFIED]
  "tax_rate": 0.10, // [UNVERIFIED, inferred]
  "grand_total": 1869.45 // [VERIFIED]
}
JSON schema + correction instruction + confidence markers turn raw OCR into production-ready data. The model fixes errors and honestly marks assumptions.
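Several values in the structured output (unit prices, tax, tax rate) do not appear in the raw OCR text, so they have to be derived from the numbers that do. The sketch below recomputes them with `Decimal`. The variable names are illustrative, and the item names are the corrected ones from the output.

```python
from decimal import Decimal

# Quantities, line totals, and grand total as they appear in the raw OCR text.
lines = [
    ("Widget A", 50, Decimal("649.50")),
    ("Connector B", 100, Decimal("350.00")),
    ("Cable Set C", 25, Decimal("700.00")),
]
grand_total = Decimal("1869.45")

# Derive everything the OCR text did not carry.
unit_prices = {name: total / qty for name, qty, total in lines}
subtotal = sum((total for _, _, total in lines), Decimal("0"))
tax = grand_total - subtotal
tax_rate = tax / subtotal

print(unit_prices, subtotal, tax, tax_rate)
```

All derived values agree with the structured output, which is the kind of reconciliation the confidence markers rely on.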