Lesson 2

Context Windows

Memory limits

The Problem: AI forgets the beginning of long conversations and can't process unlimited text. What limits AI memory, and how do you work within it?

The Solution: Understand Working Memory

The context window is the maximum amount of text an LLM can process in a single request — its working memory. Think of it like your computer's RAM: everything the model needs to actively reason about — instructions, the whole conversation so far, retrieved documents, and your latest question — has to fit inside this one fixed space. The size is measured in tokens (roughly ¾ of a word in English, often less in Russian), and managing your token budget is the core skill for staying within limits.
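A rough budget check can make the token arithmetic concrete. The sketch below is only an illustration: it uses the ¾-of-a-word rule of thumb from the paragraph above instead of a real tokenizer, and the names `estimate_tokens`, `fits_in_window`, and `reply_reserve` are hypothetical helpers, not part of any library.

```python
import math

# Rule of thumb from the text: one token is roughly 3/4 of an English word.
# This is a heuristic; a real tokenizer will give different counts.
WORDS_PER_TOKEN = 0.75


def estimate_tokens(text: str) -> int:
    """Rough token estimate based on a whitespace word count."""
    words = len(text.split())
    return math.ceil(words / WORDS_PER_TOKEN)


def fits_in_window(parts: list[str], window: int, reply_reserve: int) -> bool:
    """True if all prompt parts plus a reserved reply budget fit in the window."""
    used = sum(estimate_tokens(p) for p in parts)
    return used + reply_reserve <= window


system_prompt = "You are a helpful support assistant."
history = "Hello! Tell me about your refund policy. " * 50
question = "Can I still return an item after 30 days?"

parts = [system_prompt, history, question]
print("estimated prompt tokens:", sum(estimate_tokens(p) for p in parts))
print("fits in a 4,096-token window:", fits_in_window(parts, 4096, 500))
```

Reserving part of the window for the reply matters because the window has to cover both the prompt and the model's answer.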

How it actually works

An LLM has no memory between requests. On every single turn, the entire prompt — the system prompt, all earlier messages, and the new input — is re-sent and re-read from scratch. The model is not "remembering" your chat; the application is replaying the full transcript each time. Because attention compares every token with every other token, compute cost grows roughly with the square of the length, which is why providers cap the window and why long prompts get slow and expensive. When the transcript would exceed the limit, the app must drop or compress something — usually the oldest messages first (FIFO), which is exactly why an assistant "forgets" how a long conversation began.
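The drop-oldest-first behavior described above can be sketched in a few lines. This is a minimal illustration, not how any particular application implements it: the `Message` class and `trim_fifo` function are hypothetical, token counts are assumed to be pre-computed, and keeping the system prompt while dropping chat messages is a common design choice assumed here.

```python
from dataclasses import dataclass


@dataclass
class Message:
    role: str    # "system", "user", or "assistant"
    text: str
    tokens: int  # assumed to be pre-counted by a tokenizer


def trim_fifo(messages: list[Message], limit: int) -> list[Message]:
    """Drop the oldest non-system messages until the transcript fits the limit."""
    system = [m for m in messages if m.role == "system"]
    rest = [m for m in messages if m.role != "system"]
    total = sum(m.tokens for m in messages)
    while rest and total > limit:
        dropped = rest.pop(0)  # oldest message goes first (FIFO)
        total -= dropped.tokens
    return system + rest


transcript = [
    Message("system", "System prompt...", 150),
    Message("user", "Hello! Tell me about...", 50),
    Message("assistant", "Sure! Here is information...", 200),
    Message("user", "And what about pricing?", 60),
]

# On every turn the application re-sends whatever survives trimming.
for m in trim_fifo(transcript, limit=420):
    print(m.role, m.tokens)
```

With a limit of 420 tokens the first user message is the one that disappears, which is the "forgetting" a user notices in a long chat.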

Tradeoffs, pitfalls, and a worked example

Bigger windows are not automatically better. "Needle in a haystack" tests show that recall is strongest at the very start and very end of a long context and weakest in the middle (the "lost in the middle" effect), so stuffing 200K tokens of background can actually hurt accuracy while inflating cost and latency. Concretely: imagine a 100-page PDF (~70K tokens) plus a 40-message support chat. Dumping all of it into one prompt is slow, pricey, and dilutes the signal. Instead you'd retrieve only the 3–4 relevant passages, summarize the older chat into a few lines, and keep the user's actual question near the end — fitting the same task into ~6K focused tokens. Put stable, reusable content (the system prompt, fixed instructions) at the top so prompt caching can reuse it cheaply across turns.
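The ordering advice in the worked example can be sketched as a small prompt builder. This is one possible layout under stated assumptions: `build_prompt` and its section labels are hypothetical, and the summary and passages are assumed to come from separate summarization and retrieval steps that are not shown.

```python
def build_prompt(system: str, summary: str, passages: list[str], question: str) -> str:
    """Assemble a focused prompt: stable content first, the question last."""
    parts = [system]  # stable prefix, identical on every turn (cache-friendly)
    if summary:
        parts.append("Summary of earlier conversation:\n" + summary)
    for i, passage in enumerate(passages, start=1):
        parts.append(f"Excerpt {i}:\n{passage}")
    parts.append("Question: " + question)  # the actual question stays near the end
    return "\n\n".join(parts)


prompt = build_prompt(
    system="You are a support assistant. Answer only from the excerpts.",
    summary="The user asked about refunds and was told the standard policy.",
    passages=["Refunds are accepted within 30 days.", "Gift cards are non-refundable."],
    question="Can I return a gift card?",
)
print(prompt)
```

Keeping the system text byte-for-byte identical at the top is what lets a provider's prompt caching reuse it, while the changing parts sit after it.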

Think of it like computer RAM:

1. Limited size: 8K, 32K, 128K, 200K tokens depending on model
2. Includes everything: System prompt + conversation history + current message
3. FIFO when full: The application typically drops the oldest content first when the limit is reached
4. Cost scales with size: More tokens = more expensive
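Because the whole transcript is re-sent on every turn, the total number of input tokens processed over a conversation grows much faster than the transcript itself. The sketch below illustrates that arithmetic with a hypothetical helper, `cumulative_input_tokens`; it assumes no trimming or caching and treats each turn as adding a fixed number of tokens.

```python
def cumulative_input_tokens(system_tokens: int, turn_tokens: list[int]) -> int:
    """Total input tokens processed when the full transcript is re-sent each turn."""
    total = 0
    history = system_tokens
    for added in turn_tokens:
        history += added  # the new messages join the transcript
        total += history  # the whole transcript is sent again
    return total


turns = [250] * 10  # assumption: every turn adds 250 tokens
final_transcript = 150 + sum(turns)
print("final transcript size:", final_transcript)
print("tokens processed across all turns:", cumulative_input_tokens(150, turns))
```

In this example the final transcript is 2,650 tokens, but 15,250 input tokens are processed in total across the ten turns.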

Managing Context

Summarization: Compress old conversation into summaries
Selective Inclusion: Only include relevant prior messages
RAG: Pull in relevant docs dynamically instead of storing everything
Chunking: Break long documents into processable pieces
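Two of these strategies, chunking and selective inclusion, can be sketched together. The example is deliberately naive: `chunk_words` and `top_chunks` are hypothetical helpers, chunks are measured in words instead of tokens, and relevance is scored by plain word overlap, whereas a real RAG setup would typically use embeddings.

```python
def chunk_words(text: str, size: int, overlap: int = 0) -> list[str]:
    """Split text into chunks of `size` words; neighbors share `overlap` words."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks


def top_chunks(chunks: list[str], query: str, k: int) -> list[str]:
    """Return the k chunks sharing the most words with the query."""
    terms = set(query.lower().split())

    def score(chunk: str) -> int:
        return len(terms & set(chunk.lower().split()))

    return sorted(chunks, key=score, reverse=True)[:k]


sections = [
    "revenue grew fifteen percent year over year",
    "critical vulnerability found in authorization module",
    "appendix tables and glossary",
]
print(chunk_words("a b c d e f g", size=3, overlap=1))
print(top_chunks(sections, "critical vulnerability", k=1))
```

Only the selected chunks go into the prompt; everything else stays outside the window until a question makes it relevant.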

Fun Fact: Context windows have grown from about 2K–4K tokens (GPT-3-era models) to 200K+ tokens (Claude 3) in just a few years! But "needle in a haystack" tests show that recall can degrade in very long contexts, so bigger isn't always better.


Key Insight

When context overflows, old messages are "forgotten". That's why it's important to: 1) choose a model with enough context, 2) compress history, 3) keep important info closer to the end.

Frequently asked questions

What is an LLM context window?

The context window is the maximum amount of text, measured in tokens, that a model can process in a single request. It must hold everything at once: the system prompt, the full conversation history, any retrieved documents, and your current question. Anything that doesn't fit is invisible to the model — like data that won't fit in RAM.

Why does AI forget the start of a long conversation?

LLMs have no memory between requests; the entire transcript is re-sent every turn. When the total exceeds the window limit, the application drops or compresses the oldest messages first (FIFO) to stay within budget, so the assistant loses the earliest parts of a long chat.

How many tokens fit in a context window?

It varies by model: historically about 2K–4K for GPT-3-era models, 8K–32K for early GPT-4, 128K for many current models, and 200K+ for Claude 3. Always check the specific model's documentation, since limits change with each release, and remember the window must cover both your prompt and the model's reply.

Is a bigger context window always better?

No. 'Needle in a haystack' tests reveal weaker recall for information in the middle of a long context (the 'lost in the middle' effect), and longer prompts raise cost and latency. It's often more accurate and cheaper to retrieve only the relevant chunks via RAG and summarize old history than to stuff everything in.


Processing a long document — information loss when context window overflows

Without technique

Prompt

[Document: 45-page corporate report, entire text pasted]
...(first 40 pages about finances)...
...(page 43: "Critical vulnerability found in authorization module")...
...(pages 44-45: appendices)...
What critical issues are described in the report?

Response

The report describes company financial metrics, revenue growth, and development plans. No critical issues were found.

Tokens: 12000/30

Time: 2800 ms

With technique

Prompt

Report split into sections. Key excerpts:
[Section: Finance — summary] Revenue: +15% YoY, net profit: $23M.
[Section: Security — full text] P. 43: Critical vulnerability found in authorization module. 12,000 accounts affected. Patch released Jan 15, fully deployed Jan 18. Recommendation: conduct follow-up audit in Q2.
[Section: Appendices — skipped as irrelevant]
What critical issues are described in the report?

Response

Critical issue: vulnerability in authorization module (p. 43). 12,000 accounts affected. Patch fully deployed Jan 18. Follow-up audit recommended for Q2.

👁️ When the full document was pasted, the critical info on p. 43 was "drowned" in the middle

🧠 "Lost in the middle" effect: models recall the middle of a long context poorly

🧠 Splitting into sections and summarizing the irrelevant parts saves tokens and keeps the model focused

✅ Strategic context management matters more than "more context = better"

Tokens: 180/42

Time: 320 ms

Why this works

More context is not always better. Strategic document chunking with summarization of irrelevant parts beats "paste everything and pray."


Related lessons: Tokenization, RAG

This lesson is part of a structured LLM course.
