Information Extraction
Structured data from text
The Problem: Important facts are buried in unstructured text like emails, documents, and web pages. How can we automatically pull out the data we need?
The Solution: A Detective Finding Clues
Information Extraction (IE) uses LLMs to identify and pull specific pieces of data out of unstructured text (the messy paragraphs in emails, PDFs, contracts, and web pages) and turn them into clean, machine-readable records. For example, from the raw text "The contract between Acme Corp and Bob Smith dated Jan 15 2024 for $50,000..." the model produces structured JSON: {"party_a": "Acme Corp", "party_b": "Bob Smith", "date": "2024-01-15", "amount": 50000}. The two classic subtasks are named entity recognition (tagging the spans that are entities: people, organizations, dates, amounts) and relation extraction (connecting those entities: who signed what, who works where). The goal is almost always a structured output that downstream code can use directly, with no human re-typing.
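To make "output that downstream code can use directly" concrete, here is a minimal sketch that loads the contract JSON from the example into a typed record. The `ContractRecord` class and `parse_contract` function are hypothetical names chosen for illustration, and the reply string is hard-coded rather than produced by a model.

```python
import datetime as dt
import json
from dataclasses import dataclass


@dataclass
class ContractRecord:
    party_a: str
    party_b: str
    date: dt.date
    amount: int


def parse_contract(model_reply: str) -> ContractRecord:
    """Turn a JSON reply into a typed record (hypothetical helper)."""
    data = json.loads(model_reply)
    return ContractRecord(
        party_a=data["party_a"],
        party_b=data["party_b"],
        # fromisoformat raises ValueError if the date is not YYYY-MM-DD.
        date=dt.date.fromisoformat(data["date"]),
        amount=int(data["amount"]),
    )


# Hard-coded stand-in for a model reply, copied from the example above.
reply = '{"party_a": "Acme Corp", "party_b": "Bob Smith", "date": "2024-01-15", "amount": 50000}'
record = parse_contract(reply)
print(record)
```

Parsing into a typed record surfaces malformed replies early, at the boundary, instead of deep inside downstream code.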
How it works
In practice you give the model two things: an explicit output schema (the exact field names, types, and which fields are required) and the source text. With a strong instruction-tuned model you can often do this zero-shot, but accuracy usually improves when you add a few examples (few-shot prompting) that cover the edge cases: a missing phone number, two people with the same name, an amount written as "fifty thousand" instead of digits. Many modern APIs let you constrain the output at decoding time. JSON mode typically guarantees syntactically valid JSON, while schema-constrained ("structured") output additionally holds the response to your field names and types; both remove a whole class of parsing bugs. Compared with older rule-based systems (regular expressions and hand-written patterns), the LLM generalizes to phrasings it has never seen and handles context, but it costs more per document and is harder to fully predict.
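As a rough sketch of "schema plus source text plus a few examples", the snippet below assembles a prompt string. The field names, example sentences, and `build_prompt` function are all made up for illustration, and no model API is called; how the prompt is sent depends on the provider.

```python
import json

# Hypothetical schema: field name -> (type description, required?)
SCHEMA = {
    "name": ("string", True),
    "phone": ("string or null", False),
    "amount": ("integer or null", False),
}

# Invented few-shot examples covering two edge cases from the text:
# a missing phone number and an amount written in words.
FEW_SHOT = [
    ("Call Dana Lee about the order.",
     {"name": "Dana Lee", "phone": None, "amount": None}),
    ("Ravi Rao owes fifty thousand dollars.",
     {"name": "Ravi Rao", "phone": None, "amount": 50000}),
]


def build_prompt(text: str) -> str:
    """Assemble instructions, schema, examples, and the new text."""
    lines = [
        "Extract the following fields and reply with JSON only.",
        "Use null for any field the text does not state.",
        "",
        "Fields:",
    ]
    for field, (type_desc, required) in SCHEMA.items():
        flag = "required" if required else "optional"
        lines.append(f"- {field}: {type_desc} ({flag})")
    for example_text, example_output in FEW_SHOT:
        lines += ["", f"Text: {example_text}", f"JSON: {json.dumps(example_output)}"]
    lines += ["", f"Text: {text}", "JSON:"]
    return "\n".join(lines)


prompt = build_prompt("Mia Chen paid 1200 dollars, phone 555-0100.")
print(prompt)
```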
Pitfalls and a worked example
The biggest risk is hallucination: when a field is absent, the model may invent a plausible value instead of leaving it null. The fix is to (1) tell it explicitly to return null for anything not stated, and (2) validate every extracted field against the source text before trusting it. Use IE when the data is recurring and high-volume (thousands of invoices, resumes, support tickets) and a few-percent error rate is acceptable with human review on the uncertain cases.
As a concrete example, feed the model the email line "Hi, please reschedule my appointment from Tuesday to next Friday at 3pm, my reference is #A-4471" with the schema fields action, old_date, new_date, time, and reference. A good prompt returns {"action": "reschedule", "old_date": "Tuesday", "new_date": "next Friday", "time": "15:00", "reference": "A-4471"}. Notice that it normalized "3pm" to 24-hour time but kept the relative dates as written, because resolving "next Friday" to a calendar date needs context the text alone doesn't provide.
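The worked example has the model normalize "3pm" to 24-hour time. One option, sketched here under the assumption that times arrive in a simple 12-hour form, is to do or double-check that normalization in code so the result is deterministic. The `normalize_time` helper is hypothetical and deliberately returns None for anything it cannot parse rather than guessing.

```python
import re
from typing import Optional

# Matches forms like "3pm", "3:30 pm", "12 A.M." (assumed input shapes).
_TIME_RE = re.compile(r"\s*(\d{1,2})(?::(\d{2}))?\s*([ap])\.?m\.?\s*", re.IGNORECASE)


def normalize_time(raw: str) -> Optional[str]:
    """Convert a 12-hour time to 'HH:MM'; return None if it cannot be parsed."""
    match = _TIME_RE.fullmatch(raw)
    if match is None:
        return None
    hour = int(match.group(1))
    minute = int(match.group(2) or 0)
    if not (1 <= hour <= 12 and 0 <= minute <= 59):
        return None
    meridiem = match.group(3).lower()
    if meridiem == "p" and hour != 12:
        hour += 12  # 1pm..11pm -> 13..23
    elif meridiem == "a" and hour == 12:
        hour = 0    # 12am is midnight
    return f"{hour:02d}:{minute:02d}"


print(normalize_time("3pm"))
print(normalize_time("sometime later"))
```

Returning None instead of a best guess mirrors the rule given to the model: an unstated or unparseable value stays null.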
Think of it like a detective searching for clues:
1. Identify entity types: decide what to extract (people, organizations, dates, amounts, addresses)
2. Define output schema (JSON): specify exact field names, types, and nesting structure for the output
3. Provide examples with edge cases: show how to handle missing fields, ambiguous entities, and multi-value fields
4. AI extracts entities: the model reads the text and fills the JSON schema with found values
5. Validate against source: cross-check each extracted field and mark it as VERIFIED (found in text) or UNVERIFIED (inferred)
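The validation step can be sketched as a verbatim lookup of each extracted value in the source text. This is a simplified illustration: `verify_fields` is a hypothetical helper, the MISSING label for null fields is an added assumption, and a plain substring check will flag normalized values (such as a reformatted time) as UNVERIFIED even when they are correct.

```python
def verify_fields(source: str, record: dict) -> dict:
    """Label each field by whether its value appears verbatim in the source."""
    haystack = source.lower()
    report = {}
    for field, value in record.items():
        if value is None:
            report[field] = "MISSING"      # assumed label for null fields
        elif str(value).lower() in haystack:
            report[field] = "VERIFIED"     # found in text
        else:
            report[field] = "UNVERIFIED"   # inferred or normalized
    return report


source = ("Hi, please reschedule my appointment from Tuesday to next Friday "
          "at 3pm, my reference is #A-4471")
record = {
    "action": "reschedule",
    "old_date": "Tuesday",
    "new_date": "next Friday",
    "time": "15:00",
    "reference": "A-4471",
}
report = verify_fields(source, record)
for field, status in report.items():
    print(f"{field}: {status}")
```

Here the time field comes back UNVERIFIED because "15:00" never appears in the email; fields like that are the natural candidates for human review.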
Where Is This Used?
- Resume Parsing: Extracting skills, experience, contact info
- Invoice Processing: Pulling amounts, dates, vendor details
- Medical Records: Finding diagnoses, medications, dates
- Contract Analysis: Identifying terms, parties, obligations
Fun Fact: LLMs can extract information in complex relationships too! From "John works at Acme, which was founded in 2010", a model can combine two stated facts (John works at Acme; Acme was founded in 2010) and conclude that John's employer was founded in 2010, even though no sentence says so directly.
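The inference in the example can be pictured as joining two extracted relations. The sketch below hard-codes the two triples (subject, predicate, object) that an extractor might produce for that sentence; the predicate names and the `employer_founding_year` helper are hypothetical.

```python
# Relations as they might be extracted from
# "John works at Acme, which was founded in 2010".
triples = [
    ("John", "works_at", "Acme"),
    ("Acme", "founded_in", "2010"),
]


def employer_founding_year(person, triples):
    """Follow works_at, then founded_in; return (employer, year) or None."""
    employers = [obj for subj, pred, obj in triples
                 if subj == person and pred == "works_at"]
    for employer in employers:
        for subj, pred, obj in triples:
            if subj == employer and pred == "founded_in":
                return employer, obj
    return None


print(employer_founding_year("John", triples))
```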
Try It Yourself!
Try extracting specific information from the sample text below.
Apple Inc. announced that CEO Tim Cook will present the new iPhone at their headquarters in Cupertino on September 12, 2024. The company expects revenue of $90 billion.
- NER: identifies named entities (Person, Org, Location, Date, Money) in unstructured text.
- Relations: finds how entities connect, for example works_for, located_in, owns.
- LLMs: can do both zero-shot. No task-specific training data is needed; just describe what to extract.
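For the sample text above, the NER and relation outputs might look like the data below. The annotations are written by hand for illustration (they are not model output), and the `locate` helper is a hypothetical way to attach character offsets so each entity can be traced back to its span.

```python
text = ("Apple Inc. announced that CEO Tim Cook will present the new iPhone at "
        "their headquarters in Cupertino on September 12, 2024. "
        "The company expects revenue of $90 billion.")

# Hand-written annotations; in practice these would come from the model.
entities = [
    ("Apple Inc.", "Org"),
    ("Tim Cook", "Person"),
    ("Cupertino", "Location"),
    ("September 12, 2024", "Date"),
    ("$90 billion", "Money"),
]
relations = [
    ("Tim Cook", "works_for", "Apple Inc."),
    ("Apple Inc.", "located_in", "Cupertino"),
]


def locate(text, entities):
    """Attach start/end character offsets to each entity's first occurrence."""
    spans = []
    for surface, label in entities:
        start = text.find(surface)
        if start == -1:
            raise ValueError(f"entity not found in text: {surface!r}")
        spans.append({"text": surface, "label": label,
                      "start": start, "end": start + len(surface)})
    return spans


spans = locate(text, entities)
for span in spans:
    print(span)
```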
Frequently asked questions
What is Named Entity Recognition (NER)?
NER identifies and classifies named entities in text into predefined categories like person, organization, location, date, and monetary value. LLMs can perform NER zero-shot, without task-specific training data.
Can LLMs extract data into JSON without fine-tuning?
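Yes. With a well-structured prompt describing the desired schema, LLMs can extract entities and output valid JSON. JSON mode or function calling makes the format far more reliable, and where the API supports schema-constrained output the reply is also held to your field names and types. Validate the parsed result either way.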
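Function-calling and structured-output interfaces commonly take a JSON Schema style description of the expected object; the exact request format depends on the provider and is not shown here. The sketch below defines such a schema for the contract example and a small hand-rolled checker (`check_reply`, hypothetical) that covers only required fields and basic types, far less than a full JSON Schema validator.

```python
import json

# Schema for the contract example; nullable fields allow "null".
CONTRACT_SCHEMA = {
    "type": "object",
    "properties": {
        "party_a": {"type": "string"},
        "party_b": {"type": "string"},
        "date": {"type": ["string", "null"]},
        "amount": {"type": ["integer", "null"]},
    },
    "required": ["party_a", "party_b", "date", "amount"],
}

_PY_TYPES = {"string": str, "integer": int, "null": type(None)}


def _matches(value, type_name):
    # bool is a subclass of int in Python, so exclude it from "integer".
    if type_name == "integer":
        return isinstance(value, int) and not isinstance(value, bool)
    return isinstance(value, _PY_TYPES[type_name])


def check_reply(reply: str, schema: dict) -> list:
    """Return a list of problems; an empty list means these basic checks passed."""
    try:
        data = json.loads(reply)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top-level value is not an object"]
    problems = [f"missing field: {name}"
                for name in schema["required"] if name not in data]
    for name, spec in schema["properties"].items():
        if name not in data:
            continue
        allowed = spec["type"] if isinstance(spec["type"], list) else [spec["type"]]
        if not any(_matches(data[name], t) for t in allowed):
            problems.append(f"wrong type for {name}")
    return problems


good = '{"party_a": "Acme Corp", "party_b": "Bob Smith", "date": "2024-01-15", "amount": 50000}'
bad = '{"party_a": "Acme Corp", "party_b": "Bob Smith", "date": null, "amount": "50000"}'
print(check_reply(good, CONTRACT_SCHEMA))
print(check_reply(bad, CONTRACT_SCHEMA))
```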
How do I handle nested and overlapping entities?
Use hierarchical extraction: first identify top-level entities, then extract attributes for each. For overlapping entities, specify priority rules in your prompt.
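Priority rules can also be applied after extraction, in code. The sketch below assumes spans arrive as start/end offsets with a label, and uses an invented priority order and example sentence; `resolve_overlaps` is a hypothetical helper that keeps the higher-priority span when two overlap, breaking ties by length.

```python
# Hypothetical priority: a lower number wins when two spans overlap.
PRIORITY = {"ORG": 0, "PERSON": 1, "LOCATION": 2}


def resolve_overlaps(spans):
    """Keep non-overlapping spans, preferring higher priority, then longer spans."""
    ranked = sorted(
        spans,
        key=lambda s: (PRIORITY[s["label"]], -(s["end"] - s["start"])),
    )
    kept = []
    for span in ranked:
        # A span is accepted only if it touches none of the spans already kept.
        if all(span["end"] <= k["start"] or span["start"] >= k["end"] for k in kept):
            kept.append(span)
    return sorted(kept, key=lambda s: s["start"])


text = "Bank of America opened a branch in Austin."
spans = [
    {"start": 0, "end": 15, "label": "ORG"},        # "Bank of America"
    {"start": 8, "end": 15, "label": "LOCATION"},   # "America", nested in the ORG
    {"start": 35, "end": 41, "label": "LOCATION"},  # "Austin"
]
resolved = resolve_overlaps(spans)
for span in resolved:
    print(text[span["start"]:span["end"]], span["label"])
```

If nested entities should be kept rather than dropped, the same overlap test can instead attach the inner span as an attribute of the outer one, which matches the hierarchical approach described above.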
What are LLM advantages over rule-based extraction?
LLMs handle ambiguity, context, and unseen patterns without writing rules. They generalize across domains and languages. Rule-based systems are faster and more predictable but brittle when text varies.
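To illustrate the trade-off, here is a small rule-based extractor for the contract sentence used earlier in this lesson. The pattern is written for this one phrasing only, and the second input is an invented paraphrase; the point is that the rule is fast and predictable on the text it was written for and returns nothing when the wording changes.

```python
import re

# Pattern tailored to one phrasing: "contract between A and B dated Mon D YYYY for $N".
PATTERN = re.compile(
    r"contract between (?P<party_a>.+?) and (?P<party_b>.+?) "
    r"dated (?P<date>\w{3} \d{1,2} \d{4}) for \$(?P<amount>[\d,]+)"
)


def extract_with_rules(text):
    """Return the extracted fields, or None when the pattern does not match."""
    match = PATTERN.search(text)
    if match is None:
        return None
    fields = match.groupdict()
    fields["amount"] = int(fields["amount"].replace(",", ""))
    return fields


original = "The contract between Acme Corp and Bob Smith dated Jan 15 2024 for $50,000..."
paraphrase = "Acme Corp and Bob Smith signed an agreement worth fifty thousand dollars."
print(extract_with_rules(original))
print(extract_with_rules(paraphrase))
```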
Try it yourself
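Task: extract key entities from a news article.
Without defined categories or an output format, the result is a vague list: the text mentions Tesla, Elon Musk, Berlin, and Robert Habeck, and also talks about factory construction and jobs.
With categories, attributes, and a JSON output format, the result is structured: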
[
{"entity": "Elon Musk", "category": "PERSON", "attributes": {"role": "Tesla CEO"}},
{"entity": "Robert Habeck", "category": "PERSON", "attributes": {"role": "German Economy Minister"}},
{"entity": "Tesla", "category": "ORG", "attributes": {"type": "automaker"}},
{"entity": "Berlin", "category": "LOCATION", "attributes": {"type": "city", "country": "Germany"}},
{"entity": "Germany", "category": "LOCATION", "attributes": {"type": "country"}},
{"entity": "March 15, 2025", "category": "DATE", "attributes": {"iso": "2025-03-15"}},
{"entity": "5 billion euros", "category": "MONEY", "attributes": {"amount": 5000000000, "currency": "EUR"}},
{"entity": "10,000", "category": "NUMBER", "attributes": {"value": 10000, "context": "jobs"}}
]
Defining categories with attributes and output format (JSON) transforms a vague entity list into structured data suitable for further processing.
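As a small illustration of "suitable for further processing", the sketch below loads a few of the entities from the list above and groups them by category. Only four entries are copied to keep it short, and the grouping and the EUR sum are just examples of downstream use.

```python
import json
from collections import defaultdict

# A subset of the entity list shown above, as a JSON string.
entities_json = """
[
  {"entity": "Elon Musk", "category": "PERSON", "attributes": {"role": "Tesla CEO"}},
  {"entity": "Robert Habeck", "category": "PERSON", "attributes": {"role": "German Economy Minister"}},
  {"entity": "Tesla", "category": "ORG", "attributes": {"type": "automaker"}},
  {"entity": "5 billion euros", "category": "MONEY", "attributes": {"amount": 5000000000, "currency": "EUR"}}
]
"""

entities = json.loads(entities_json)

# Group entity names by category.
by_category = defaultdict(list)
for item in entities:
    by_category[item["category"]].append(item["entity"])

# Sum the EUR amounts, using the normalized numeric attribute.
total_eur = sum(
    item["attributes"]["amount"]
    for item in entities
    if item["category"] == "MONEY" and item["attributes"]["currency"] == "EUR"
)

print(dict(by_category))
print(total_eur)
```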