Information Extraction
Structured data from text
The Problem: Important facts are buried in unstructured text like emails, documents, and web pages. How can we automatically pull out the data we need?
The Solution: A Detective Finding Clues
Information Extraction uses LLMs to identify and pull specific pieces of data from unstructured text. For example, from the raw text "The contract between Acme Corp and Bob Smith dated Jan 15 2024 for $50,000..." the model extracts structured JSON: { "party_a": "Acme Corp", "party_b": "Bob Smith", "date": "2024-01-15", "amount": 50000 }. A key subtask is named entity recognition (NER), and results are most useful when the model returns them in a structured format such as JSON.
Think of it like a detective searching for clues:
1. Identify entity types: decide what to extract, such as people, organizations, dates, amounts, and addresses
2. Define output schema (JSON): specify exact field names, types, and nesting structure for the output
3. Provide examples with edge cases: show how to handle missing fields, ambiguous entities, and multi-value fields
4. AI extracts entities: the model reads the text and fills the JSON schema with the values it finds
5. Validate against source: cross-check each extracted field and mark it as VERIFIED (found in the text) or UNVERIFIED (inferred)
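The validation step above can be sketched in a few lines of Python. This is a minimal illustration, not a definitive implementation. It assumes the model was asked to return, for each field, both a normalized value and the exact quote it was taken from. The "evidence" key, the "currency" field, and the hard-coded MODEL_REPLY are hypothetical and stand in for a real model response.

```python
import json

SOURCE = "The contract between Acme Corp and Bob Smith dated Jan 15 2024 for $50,000..."

# Hypothetical model reply: each field carries a normalized value plus the
# source quote it came from. The "evidence" key is an assumption of this sketch.
MODEL_REPLY = """
{
  "party_a":  {"value": "Acme Corp",  "evidence": "Acme Corp"},
  "party_b":  {"value": "Bob Smith",  "evidence": "Bob Smith"},
  "date":     {"value": "2024-01-15", "evidence": "Jan 15 2024"},
  "amount":   {"value": 50000,        "evidence": "$50,000"},
  "currency": {"value": "USD",        "evidence": ""}
}
"""

def verify(reply_json: str, source: str) -> dict:
    """Mark a field VERIFIED only if its evidence quote appears verbatim in the source."""
    report = {}
    for name, field in json.loads(reply_json).items():
        quote = field.get("evidence", "")
        status = "VERIFIED" if quote and quote in source else "UNVERIFIED"
        report[name] = {"value": field["value"], "status": status}
    return report

report = verify(MODEL_REPLY, SOURCE)
for name, item in report.items():
    print(f"{name}: {item['value']!r} -> {item['status']}")
```

Checking the quote rather than the value matters because normalized values such as "2024-01-15" rarely appear verbatim in the text. Here the inferred currency has no supporting quote, so it is flagged as UNVERIFIED.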
Where Is This Used?
- Resume Parsing: Extracting skills, experience, contact info
- Invoice Processing: Pulling amounts, dates, vendor details
- Medical Records: Finding diagnoses, medications, dates
- Contract Analysis: Identifying terms, parties, obligations
Fun Fact: LLMs can also extract information that spans several relationships! From "John works at Acme, which was founded in 2010", a model can work out that John's employer was founded in 2010, even though the text never links John to that date directly.
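One way to picture this kind of inference is as a join over relation triples. The sketch below assumes the model has already returned two (subject, relation, object) triples. The triples are hand-written here, and the relation names works_for and founded_in are illustrative choices rather than a fixed vocabulary.

```python
# Hand-written stand-in for triples a model might extract from:
# "John works at Acme, which was founded in 2010"
triples = [
    ("John", "works_for", "Acme"),
    ("Acme", "founded_in", "2010"),
]

def employer_founding_year(person, triples):
    """Chain works_for -> founded_in to answer a question no single triple states."""
    employers = [obj for subj, rel, obj in triples
                 if subj == person and rel == "works_for"]
    for employer in employers:
        for subj, rel, obj in triples:
            if subj == employer and rel == "founded_in":
                return employer, obj
    return None  # no chain found for this person

result = employer_founding_year("John", triples)
print(result)
```

An LLM performs this chaining implicitly while reading. Storing explicit triples makes the same reasoning reproducible in ordinary code.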
Try It Yourself!
The sample text below shows what can be extracted from a short news-style passage.
Apple Inc. announced that CEO Tim Cook will present the new iPhone at their headquarters in Cupertino on September 12, 2024. The company expects revenue of $90 billion.
- NER: identifies named entities (Person, Org, Location, Date, Money) in unstructured text.
- Relations: finds how entities connect, for example works_for, located_in, owns.
- LLMs: can do this zero-shot. No task-specific training data is needed; you just describe what to extract.
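A zero-shot request is just a plain-language description of the task with no labeled examples. The sketch below only builds the prompt string and does not call any model. The function name build_zero_shot_prompt, the exact wording of the instructions, and the requested reply shape are assumptions made for illustration.

```python
TEXT = (
    "Apple Inc. announced that CEO Tim Cook will present the new iPhone at their "
    "headquarters in Cupertino on September 12, 2024. The company expects revenue "
    "of $90 billion."
)

ENTITY_TYPES = ["PERSON", "ORG", "LOCATION", "DATE", "MONEY"]
RELATION_TYPES = ["works_for", "located_in", "owns"]

def build_zero_shot_prompt(text, entity_types, relation_types):
    """Describe the task in words only; no labeled examples are included."""
    return "\n".join([
        "Extract entities and relations from the text below.",
        f"Allowed entity types: {', '.join(entity_types)}.",
        f"Allowed relation types: {', '.join(relation_types)}.",
        'Reply with JSON only, shaped as {"entities": [...], "relations": [...]}.',
        "Use null for anything the text does not state.",
        "",
        f"Text: {text}",
    ])

prompt = build_zero_shot_prompt(TEXT, ENTITY_TYPES, RELATION_TYPES)
print(prompt)
```

Listing the allowed types explicitly tends to keep the labels consistent across documents, which makes the output easier to process later.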
Extract key entities from a news article
The text mentions: Tesla, Elon Musk, Berlin, Robert Habeck. It also talks about factory construction and jobs.
[
{"entity": "Elon Musk", "category": "PERSON", "attributes": {"role": "Tesla CEO"}},
{"entity": "Robert Habeck", "category": "PERSON", "attributes": {"role": "German Economy Minister"}},
{"entity": "Tesla", "category": "ORG", "attributes": {"type": "automaker"}},
{"entity": "Berlin", "category": "LOCATION", "attributes": {"type": "city", "country": "Germany"}},
{"entity": "Germany", "category": "LOCATION", "attributes": {"type": "country"}},
{"entity": "March 15, 2025", "category": "DATE", "attributes": {"iso": "2025-03-15"}},
{"entity": "5 billion euros", "category": "MONEY", "attributes": {"amount": 5000000000, "currency": "EUR"}},
{"entity": "10,000", "category": "NUMBER", "attributes": {"value": 10000, "context": "jobs"}}
]
Defining categories with attributes and an output format (JSON) transforms a vague entity list into structured data suitable for further processing.
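To illustrate what "suitable for further processing" can mean, here is a small sketch that loads a subset of the JSON above, groups entities by category, and sums the monetary amounts. The helper name group_by_category is hypothetical, and the sketch assumes every item has the "entity", "category", and "attributes" keys shown in the example.

```python
import json
from collections import defaultdict

# Subset of the structured output shown above.
ENTITIES_JSON = """
[
  {"entity": "Elon Musk", "category": "PERSON", "attributes": {"role": "Tesla CEO"}},
  {"entity": "Robert Habeck", "category": "PERSON", "attributes": {"role": "German Economy Minister"}},
  {"entity": "Tesla", "category": "ORG", "attributes": {"type": "automaker"}},
  {"entity": "5 billion euros", "category": "MONEY", "attributes": {"amount": 5000000000, "currency": "EUR"}}
]
"""

def group_by_category(entities):
    """Collect entity names under their category label, preserving input order."""
    grouped = defaultdict(list)
    for item in entities:
        grouped[item["category"]].append(item["entity"])
    return dict(grouped)

entities = json.loads(ENTITIES_JSON)
grouped = group_by_category(entities)

# Numeric attributes can be used directly, with no text parsing required.
money_total = sum(e["attributes"]["amount"] for e in entities if e["category"] == "MONEY")

print(grouped)
print(money_total)
```

Neither operation would be practical on the free-text summary, because the categories and the numeric amount would first have to be recovered from prose.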