Information Extraction

Structured data from text

The Problem: Important facts are buried in unstructured text like emails, documents, and web pages. How can we automatically pull out the data we need?

The Solution: A Detective Finding Clues

Information Extraction (IE) uses LLMs to identify and pull specific pieces of data out of unstructured text — the messy paragraphs in emails, PDFs, contracts, and web pages — and turn them into clean, machine-readable records. For example, from the raw text "The contract between Acme Corp and Bob Smith dated Jan 15 2024 for $50,000..." the model produces structured JSON: { party_a: "Acme Corp", party_b: "Bob Smith", date: "2024-01-15", amount: 50000 }. The two classic subtasks are named entity recognition (tagging the spans that are entities — people, organizations, dates, amounts) and relation extraction (connecting those entities — who signed what, who works where). The goal is almost always a structured output that downstream code can use directly, with no human re-typing.

How it works

In practice you give the model two things: an explicit output schema (the exact field names, types, and which fields are required) and the source text. With a strong instruction-tuned model you can often do this zero-shot, but accuracy usually improves when you add a few examples that cover the edge cases: a missing phone number, two people with the same name, an amount written as "fifty thousand" instead of digits. Many modern APIs also let you enforce the output format at decoding time. Plain JSON mode typically guarantees only that the response is syntactically valid JSON, while constrained or "structured" output decoding can additionally force it to match your schema. Either way this removes a whole class of parsing bugs, but it does not make the extracted values correct. Compared with older rule-based systems (regular expressions and hand-written patterns), the LLM generalizes to phrasings it has never seen and handles context, but it costs more per document and its behavior is harder to predict.
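The two inputs described above (schema plus source text, with a few edge-case examples) can be assembled into a prompt roughly as follows. This is a minimal sketch: the `SCHEMA` layout, the `FEW_SHOT` examples, and the `build_prompt` helper are all hypothetical names chosen for illustration, and no particular model API is assumed.

```python
import json

# Hypothetical schema for the contract example; the field names mirror the text above.
SCHEMA = {
    "party_a": {"type": "string", "required": True},
    "party_b": {"type": "string", "required": True},
    "date": {"type": "string (YYYY-MM-DD)", "required": False},
    "amount": {"type": "number", "required": False},
}

# Invented few-shot examples covering two edge cases:
# missing fields, and an amount written in words.
FEW_SHOT = [
    ("Acme Corp hired Bob Smith.",
     {"party_a": "Acme Corp", "party_b": "Bob Smith", "date": None, "amount": None}),
    ("Jane Roe will pay Initech fifty thousand dollars on March 3 2023.",
     {"party_a": "Jane Roe", "party_b": "Initech", "date": "2023-03-03", "amount": 50000}),
]


def build_prompt(text: str) -> str:
    """Combine instructions, schema, few-shot examples, and the source text."""
    parts = [
        "Extract the fields below from the text and reply with JSON only.",
        "Use null for any field the text does not state.",
        "Schema: " + json.dumps(SCHEMA),
    ]
    for source, expected in FEW_SHOT:
        parts.append("Text: " + source)
        parts.append("JSON: " + json.dumps(expected))  # None is serialized as null
    parts.append("Text: " + text)
    parts.append("JSON:")
    return "\n".join(parts)


prompt = build_prompt(
    "The contract between Acme Corp and Bob Smith dated Jan 15 2024 for $50,000..."
)
print(prompt)
```

The resulting string would then be sent to whichever model you use; if the API offers schema-constrained output, the same schema can usually be passed there instead of being embedded in the prompt.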

Pitfalls and a worked example

The biggest risk is hallucination: when a field is absent, the model may invent a plausible value instead of leaving it null. The fix is to (1) tell it explicitly to return null for anything not stated, and (2) validate every extracted field against the source text before trusting it.

Use IE when the data is recurring and high-volume (thousands of invoices, resumes, support tickets) and a few-percent error rate is acceptable with human review on the uncertain cases.

As a concrete example, feed the model the email line "Hi, please reschedule my appointment from Tuesday to next Friday at 3pm, my reference is #A-4471" with a schema { action, old_date, new_date, time, reference }. A good prompt returns { action: "reschedule", old_date: "Tuesday", new_date: "next Friday", time: "15:00", reference: "A-4471" }. Notice that it normalized "3pm" to 24-hour time but kept the relative dates as written, because resolving "next Friday" to a calendar date needs context the text alone doesn't provide.
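To make the last point concrete, here is a small sketch of the two normalizations in the worked example. Converting "3pm" needs nothing but the string, while resolving "next Friday" needs a reference date supplied from outside the text. Both helper names are hypothetical, and the reading of "next Friday" as "the first Friday strictly after the reference date" is an assumption (some speakers mean the Friday of the following week).

```python
import re
from datetime import date, timedelta
from typing import Optional

WEEKDAYS = ["monday", "tuesday", "wednesday", "thursday", "friday", "saturday", "sunday"]


def normalize_time(text: str) -> Optional[str]:
    """Convert a 12-hour time such as '3pm' or '11:30 am' to 'HH:MM', else None."""
    match = re.fullmatch(
        r"\s*(\d{1,2})(?::(\d{2}))?\s*([ap])\.?m\.?\s*", text, re.IGNORECASE
    )
    if not match:
        return None
    hour = int(match.group(1))
    minute = int(match.group(2) or 0)
    if not (1 <= hour <= 12 and 0 <= minute <= 59):
        return None
    # 12am -> 00, 12pm -> 12, 3pm -> 15
    hour = hour % 12 + (12 if match.group(3).lower() == "p" else 0)
    return f"{hour:02d}:{minute:02d}"


def resolve_weekday(name: str, reference: date) -> date:
    """Return the first date strictly after `reference` that falls on weekday `name`.

    Assumption: this is only one possible reading of phrases like 'next Friday'.
    """
    target = WEEKDAYS.index(name.lower())
    days_ahead = (target - reference.weekday() - 1) % 7 + 1
    return reference + timedelta(days=days_ahead)


print(normalize_time("3pm"))
# The reference date is the missing context: the email text itself never states it.
print(resolve_weekday("Friday", date(2024, 1, 15)))
```

Because the answer changes with the reference date, it is usually safer to let the model return the phrase as written and resolve it in code, where the message timestamp is known.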

Think of it like a detective searching for clues:

1. Identify entity types: decide what to extract (people, organizations, dates, amounts, addresses).
2. Define the output schema (JSON): specify exact field names, types, and nesting structure.
3. Provide examples with edge cases: show how to handle missing fields, ambiguous entities, and multi-value fields.
4. The model extracts entities: it reads the text and fills the schema with the values it finds.
5. Validate against the source: cross-check each extracted field and mark it VERIFIED (found in the text) or UNVERIFIED (inferred).
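Step 5 can be approximated with a plain substring check. The sketch below is one simple way to do it, not a complete validator: `verify_fields` is a hypothetical helper, the comparison is case-insensitive verbatim matching only, and the extra `MISSING` label for null fields is an addition that is not part of the list above.

```python
def verify_fields(record: dict, source: str) -> dict:
    """Label each field VERIFIED if its value appears verbatim in the source text."""
    haystack = source.casefold()
    report = {}
    for field, value in record.items():
        if value is None:
            report[field] = "MISSING"      # assumption: extra label for null fields
        elif str(value).casefold() in haystack:
            report[field] = "VERIFIED"     # found in the text
        else:
            report[field] = "UNVERIFIED"   # inferred or normalized, needs review
    return report


source = (
    "Hi, please reschedule my appointment from Tuesday to next Friday at 3pm, "
    "my reference is #A-4471"
)
record = {
    "action": "reschedule",
    "old_date": "Tuesday",
    "new_date": "next Friday",
    "time": "15:00",
    "reference": "A-4471",
}

report = verify_fields(record, source)
for field, status in report.items():
    print(f"{field}: {status}")
```

Note that the normalized time "15:00" comes back UNVERIFIED because the source says "3pm". That is the intended behavior here: anything the model transformed or inferred is routed to a second check rather than trusted silently.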

Where Is This Used?

Resume Parsing: Extracting skills, experience, contact info
Invoice Processing: Pulling amounts, dates, vendor details
Medical Records: Finding diagnoses, medications, dates
Contract Analysis: Identifying terms, parties, obligations

Fun Fact: LLMs can also extract relationships that span several entities. From "John works at Acme, which was founded in 2010", a model can conclude that John's employer was founded in 2010, even though the sentence never links John to that date directly.
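One common way to represent such relationships is as (subject, predicate, object) triples, which downstream code can then join. The sketch below assumes the model has already returned the two triples for the sentence above; the `Relation` type and the `employer_founding_year` helper are hypothetical.

```python
from typing import NamedTuple, Optional


class Relation(NamedTuple):
    subject: str
    predicate: str
    obj: str


# Assumed extraction result for:
# "John works at Acme, which was founded in 2010"
relations = [
    Relation("John", "works_for", "Acme"),
    Relation("Acme", "founded_in", "2010"),
]


def employer_founding_year(person: str, relations: list[Relation]) -> Optional[str]:
    """Join works_for with founded_in to answer a two-hop question."""
    employers = {
        r.obj for r in relations
        if r.subject == person and r.predicate == "works_for"
    }
    for r in relations:
        if r.predicate == "founded_in" and r.subject in employers:
            return r.obj
    return None


print(employer_founding_year("John", relations))
```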

Try It Yourself!

Use the interactive example below to extract specific information from different types of text documents.

Business News (entity 1 of 5)

Apple Inc. announced that CEO Tim Cook will present the new iPhone at their headquarters in Cupertino on September 12, 2024. The company expects revenue of $90 billion.

What type of entity is "Apple Inc."?

Options: Person, Organization, Location, Date, Money

Key Insight

• NER: identifies named entities (Person, Org, Location, Date, Money) in unstructured text.
• Relations: finds how entities connect, for example works_for, located_in, owns.
• LLMs: can often do both zero-shot. No task-specific training data is needed; you describe what to extract, and add examples when accuracy matters.

Frequently asked questions

What is Named Entity Recognition (NER)?

NER identifies and classifies named entities in text into predefined categories such as person, organization, location, date, and monetary value. LLMs can often perform NER zero-shot, without task-specific training data.

Can LLMs extract data into JSON without fine-tuning?

Yes. With a well-structured prompt describing the desired schema, LLMs can extract entities and will usually output valid JSON, although a prompt alone does not guarantee it. For stronger format guarantees, use JSON mode or function calling where the API supports them.
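When the format is not enforced by the API, it helps to parse defensively. The following sketch checks a reply against a hypothetical field list for the contract example; `REQUIRED`, `OPTIONAL`, and `parse_extraction` are illustrative names, and the checks cover only presence and basic types.

```python
import json

# Hypothetical field lists for the contract example.
REQUIRED = {"party_a": str, "party_b": str}
OPTIONAL = {"date": str, "amount": (int, float)}


def parse_extraction(raw: str) -> dict:
    """Parse a model reply and check it against the expected fields.

    Raises ValueError for malformed JSON (json.JSONDecodeError is a subclass)
    and for missing, mistyped, or unexpected fields.
    """
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    for field, kind in REQUIRED.items():
        if not isinstance(data.get(field), kind):
            raise ValueError(f"missing or invalid required field: {field}")
    for field, kind in OPTIONAL.items():
        value = data.get(field)
        # null is allowed for optional fields; note that bool also passes an int check
        if value is not None and not isinstance(value, kind):
            raise ValueError(f"invalid type for optional field: {field}")
    unknown = set(data) - set(REQUIRED) - set(OPTIONAL)
    if unknown:
        raise ValueError(f"unexpected fields: {sorted(unknown)}")
    return data


ok = parse_extraction(
    '{"party_a": "Acme Corp", "party_b": "Bob Smith", "date": "2024-01-15", "amount": 50000}'
)
print(ok)
```

A reply that fails these checks can be retried or sent to human review instead of being written to the database.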

How do I handle nested and overlapping entities?

Use hierarchical extraction: first identify top-level entities, then extract attributes for each. For overlapping entities, specify priority rules in your prompt.
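Priority rules stated in the prompt can also be enforced again in post-processing. The sketch below is one possible approach, assuming the model returns character spans: it keeps the highest-priority span among any that overlap, breaking ties by length. The `Span` type, the `PRIORITY` order, and the example offsets are all assumptions for illustration.

```python
from typing import NamedTuple


class Span(NamedTuple):
    start: int   # inclusive character offset
    end: int     # exclusive character offset
    label: str


# Hypothetical priority order: a lower number wins when spans overlap.
PRIORITY = {"ORG": 0, "PERSON": 1, "LOCATION": 2}


def resolve_overlaps(spans: list[Span]) -> list[Span]:
    """Keep non-overlapping spans, preferring higher priority, then longer spans."""
    ranked = sorted(
        spans,
        key=lambda s: (PRIORITY.get(s.label, len(PRIORITY)), -(s.end - s.start)),
    )
    kept: list[Span] = []
    for span in ranked:
        # Accept the span only if it does not overlap anything already kept.
        if all(span.end <= k.start or span.start >= k.end for k in kept):
            kept.append(span)
    return sorted(kept, key=lambda s: s.start)


text = "Bank of America opened an office in Austin."
spans = [
    Span(0, 15, "ORG"),        # "Bank of America"
    Span(8, 15, "LOCATION"),   # "America", nested inside the ORG span
    Span(36, 42, "LOCATION"),  # "Austin"
]

for span in resolve_overlaps(spans):
    print(span.label, text[span.start:span.end])
```

If nested entities should be preserved rather than dropped, the same overlap test can be used to attach the inner span as an attribute of the outer one instead.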

What are LLM advantages over rule-based extraction?

LLMs handle ambiguity, context, and unseen patterns without writing rules. They generalize across domains and languages. Rule-based systems are faster and more predictable but brittle when text varies.

Technique Comparison

Task (Beginner, Analysis): Extract key entities from a news article

Without technique

Prompt

What entities are mentioned in the text? "Tesla, led by Elon Musk, announced on March 15, 2025, the construction of a new factory in Berlin worth 5 billion euros. According to German Economy Minister Robert Habeck, the project will create 10,000 jobs."

Response

The text mentions: Tesla, Elon Musk, Berlin, Robert Habeck. It also talks about factory construction and jobs.

Tokens: 85/40

Time: 420 ms

With Data Extraction

With technique

Prompt

Extract named entities from the text and return the result as JSON.

Categories to extract:
- PERSON: name, role/title
- ORG: organization name, type
- LOCATION: name, type (city/country)
- DATE: date, ISO format
- MONEY: amount, currency
- NUMBER: value, context

Text: "Tesla, led by Elon Musk, announced on March 15, 2025, the construction of a new factory in Berlin worth 5 billion euros. According to German Economy Minister Robert Habeck, the project will create 10,000 jobs."

Return array: [{entity, category, attributes}]. Do not add entities not present in the text.

Response

[
  {"entity": "Elon Musk", "category": "PERSON", "attributes": {"role": "Tesla CEO"}},
  {"entity": "Robert Habeck", "category": "PERSON", "attributes": {"role": "German Economy Minister"}},
  {"entity": "Tesla", "category": "ORG", "attributes": {"type": "automaker"}},
  {"entity": "Berlin", "category": "LOCATION", "attributes": {"type": "city", "country": "Germany"}},
  {"entity": "Germany", "category": "LOCATION", "attributes": {"type": "country"}},
  {"entity": "March 15, 2025", "category": "DATE", "attributes": {"iso": "2025-03-15"}},
  {"entity": "5 billion euros", "category": "MONEY", "attributes": {"amount": 5000000000, "currency": "EUR"}},
  {"entity": "10,000", "category": "NUMBER", "attributes": {"value": 10000, "context": "jobs"}}
]

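👁️ Listing categories (PERSON, ORG, DATE...) sets a taxonomy, which makes the extraction more complete

🧠 Attributes in each category (role, type, ISO format) add depth to each extracted entity

🔍 The constraint "do not add entities not present" reduces hallucinated entities, although inferred attributes (such as "automaker" in the response above) still need checking against the source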

Tokens: 165/220

Time: 950 ms

Why this works

Defining categories with attributes and output format (JSON) transforms a vague entity list into structured data suitable for further processing.
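As an illustration of "suitable for further processing", the sketch below loads a shortened copy of the response above and uses it directly in code. The grouping and the EUR total are only example uses, and the variable names are hypothetical.

```python
import json
from collections import defaultdict

# Shortened copy of the structured response shown above.
response = """
[
  {"entity": "Elon Musk", "category": "PERSON", "attributes": {"role": "Tesla CEO"}},
  {"entity": "Robert Habeck", "category": "PERSON", "attributes": {"role": "German Economy Minister"}},
  {"entity": "Tesla", "category": "ORG", "attributes": {"type": "automaker"}},
  {"entity": "5 billion euros", "category": "MONEY", "attributes": {"amount": 5000000000, "currency": "EUR"}}
]
"""

entities = json.loads(response)

# Group entity names by category.
by_category = defaultdict(list)
for item in entities:
    by_category[item["category"]].append(item["entity"])

# Sum all monetary amounts reported in EUR.
total_eur = sum(
    item["attributes"]["amount"]
    for item in entities
    if item["category"] == "MONEY" and item["attributes"].get("currency") == "EUR"
)

print(dict(by_category))
print(total_eur)
```

None of this is possible with the free-text answer from the first prompt without writing another parser for it.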
