Context Windows
Memory limits
The Problem: AI forgets the beginning of long conversations and can't process unlimited text. What limits AI memory, and how do you work within it?
The Solution: Understand Working Memory
The context window is the maximum amount of text an LLM can process in a single request — its working memory. Think of it like your computer's RAM: everything the model needs to actively reason about — instructions, the whole conversation so far, retrieved documents, and your latest question — has to fit inside this one fixed space. The size is measured in tokens (roughly ¾ of a word in English, often less in Russian), and managing your token budget is the core skill for staying within limits.
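As a rough illustration of the ¾-word rule of thumb above, the sketch below estimates a token count from a word count. The function name `estimate_tokens` and the default ratio are assumptions for this example; real tokenizers are model-specific and will give different numbers, especially for non-English text and code.

```python
import math


def estimate_tokens(text: str, words_per_token: float = 0.75) -> int:
    """Rough token estimate from a whitespace word count.

    Assumes one token is about 0.75 English words; this is a heuristic,
    not the output of any real tokenizer.
    """
    words = len(text.split())
    return math.ceil(words / words_per_token)


print(estimate_tokens("one two three"))  # 4
```

For anything where the budget is tight, count tokens with the tokenizer that matches your model rather than relying on this estimate.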
How it actually works
An LLM has no memory between requests. On every single turn, the entire prompt — the system prompt, all earlier messages, and the new input — is re-sent and re-read from scratch. The model is not "remembering" your chat; the application is replaying the full transcript each time. Because attention compares every token with every other token, compute cost grows roughly with the square of the length, which is why providers cap the window and why long prompts get slow and expensive. When the transcript would exceed the limit, the app must drop or compress something — usually the oldest messages first (FIFO), which is exactly why an assistant "forgets" how a long conversation began.
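The FIFO behavior described above can be sketched as follows. This is a minimal illustration, not how any particular chat application is implemented: `trim_history_fifo` is a hypothetical helper, and `count_tokens` is a stand-in that counts whitespace-separated words instead of real tokens.

```python
from collections import deque


def count_tokens(text: str) -> int:
    # Stand-in for a real tokenizer: counts whitespace-separated words.
    return len(text.split())


def trim_history_fifo(system_prompt: str, history: list[str],
                      new_message: str, max_tokens: int) -> list[str]:
    """Drop the oldest messages until the whole prompt fits the budget."""
    # The system prompt and the new message are always kept.
    fixed = count_tokens(system_prompt) + count_tokens(new_message)
    if fixed > max_tokens:
        raise ValueError("system prompt and new message alone exceed the budget")

    kept = deque(history)
    total = fixed + sum(count_tokens(m) for m in kept)
    while kept and total > max_tokens:
        # Oldest message goes first (FIFO).
        total -= count_tokens(kept.popleft())
    return list(kept)


history = ["a b c", "d e", "f"]
print(trim_history_fifo("you are helpful", history, "g h", max_tokens=8))
# ['d e', 'f']
```

The earliest message is the one that disappears, which mirrors why the start of a long chat is the first thing an assistant loses.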
Tradeoffs, pitfalls, and a worked example
Bigger windows are not automatically better. "Needle in a haystack" tests show that recall is strongest at the very start and very end of a long context and weakest in the middle (the "lost in the middle" effect), so stuffing 200K tokens of background can actually hurt accuracy while inflating cost and latency. Concretely: imagine a 100-page PDF (~70K tokens) plus a 40-message support chat. Dumping all of it into one prompt is slow, pricey, and dilutes the signal. Instead you'd retrieve only the 3–4 relevant passages, summarize the older chat into a few lines, and keep the user's actual question near the end — fitting the same task into ~6K focused tokens. Put stable, reusable content (the system prompt, fixed instructions) at the top so prompt caching can reuse it cheaply across turns.
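One way to picture the layout from the worked example is a small prompt builder: stable content first, then the compressed history and retrieved passages, with the user's question last. The function `build_prompt` and the section labels are hypothetical, chosen only for this sketch; the summary and passages are assumed to come from earlier summarization and retrieval steps.

```python
def build_prompt(system_prompt: str, chat_summary: str,
                 passages: list[str], question: str) -> str:
    """Assemble a focused prompt in a cache-friendly, recall-friendly order."""
    # Stable, reusable content goes first so a prompt cache can reuse it.
    parts = [system_prompt]
    # Older chat is represented by a short summary, not the full transcript.
    parts.append("Summary of earlier conversation:\n" + chat_summary)
    # Only the few retrieved passages, not the whole document.
    for i, passage in enumerate(passages, start=1):
        parts.append(f"Passage {i}:\n{passage}")
    # The actual question sits at the end, where recall tends to be strong.
    parts.append("Question: " + question)
    return "\n\n".join(parts)


prompt = build_prompt(
    system_prompt="You are a support assistant.",
    chat_summary="The user cannot log in after a password reset.",
    passages=["Resetting a password signs out all active sessions."],
    question="Why was I signed out on my phone?",
)
print(prompt)
```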
Think of it like computer RAM:
- Limited size: 8K, 32K, 128K, or 200K tokens, depending on the model
- Includes everything: system prompt + conversation history + current message
- FIFO when full: the application typically drops the oldest content first once the limit is reached
- Cost scales with size: more tokens means higher cost and latency
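Because everything shares one window, a simple budget check helps before sending a request. The sketch below is illustrative: `remaining_tokens` is a hypothetical helper, the numbers are made up, and it assumes the model's reply also counts against the window, so some space is reserved for it.

```python
def remaining_tokens(window: int, system: int, history: int,
                     message: int, reply_reserve: int) -> int:
    """Tokens left after everything that shares the window is counted.

    A negative result means the request would overflow the window.
    """
    used = system + history + message + reply_reserve
    return window - used


# Hypothetical numbers for an 8K-token model.
left = remaining_tokens(window=8_000, system=500, history=6_000,
                        message=300, reply_reserve=1_000)
print(left)  # 200
```

A small or negative result is the signal to trim, summarize, or retrieve more selectively before the request is sent.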
Managing Context
- Summarization: Compress old conversation into summaries
- Selective Inclusion: Only include relevant prior messages
- RAG: Pull in relevant docs dynamically instead of storing everything
- Chunking: Break long documents into processable pieces
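Two of the strategies above, chunking and selective inclusion, can be sketched in a few lines. This is a toy version under stated assumptions: `chunk_words` and `top_chunks` are hypothetical helpers, chunks are measured in words instead of tokens, and relevance is scored by plain word overlap, whereas real RAG systems usually rank chunks with embeddings.

```python
def chunk_words(text: str, chunk_size: int = 200, overlap: int = 20) -> list[str]:
    """Split text into overlapping chunks of at most chunk_size words."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks


def top_chunks(chunks: list[str], query: str, k: int = 3) -> list[str]:
    """Return the k chunks sharing the most words with the query."""
    query_words = set(query.lower().split())
    ranked = sorted(
        chunks,
        key=lambda c: len(query_words & set(c.lower().split())),
        reverse=True,
    )
    return ranked[:k]


print(chunk_words("a b c d e f g h i j", chunk_size=4, overlap=1))
# ['a b c d', 'd e f g', 'g h i j']
```

The overlap keeps a sentence that straddles a chunk boundary from being cut off from its context in both chunks.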
Fun Fact: Context windows have grown from about 2K tokens (the original GPT-3) to 200K+ tokens (Claude 3) in just a few years! But "needle in a haystack" tests show that recall can degrade in very long contexts, so bigger isn't always better.
When context overflows, old messages are "forgotten". That's why it's important to: 1) choose a model with enough context, 2) compress history, 3) keep important info closer to the end.
Frequently asked questions
What is an LLM context window?
The context window is the maximum amount of text, measured in tokens, that a model can process in a single request. It must hold everything at once: the system prompt, the full conversation history, any retrieved documents, and your current question. Anything that doesn't fit is invisible to the model — like data that won't fit in RAM.
Why does AI forget the start of a long conversation?
LLMs have no memory between requests; the entire transcript is re-sent every turn. When the total exceeds the window limit, the application drops or compresses the oldest messages first (FIFO) to stay within budget, so the assistant loses the earliest parts of a long chat.
How many tokens fit in a context window?
It varies by model: historically about 2K for the original GPT-3, 4K for GPT-3.5, 8K–32K for early GPT-4, 128K for many current models, and 200K+ for Claude 3. Always check the specific model's documentation, since limits change with each release, and remember the window must cover both your prompt and the model's reply.
Is a bigger context window always better?
No. 'Needle in a haystack' tests reveal weaker recall for information in the middle of a long context (the 'lost in the middle' effect), and longer prompts raise cost and latency. It's often more accurate and cheaper to retrieve only the relevant chunks via RAG and summarize old history than to stuff everything in.
Processing a long document — information loss when context window overflows
The report describes company financial metrics, revenue growth, and development plans. No critical issues were found.
Critical issue: vulnerability in authorization module (p. 43). 12,000 accounts affected. Patch fully deployed Jan 18. Follow-up audit recommended for Q2.
More context is not always better. Strategic document chunking with summarization of irrelevant parts beats "paste everything and pray."
This lesson is part of a structured LLM course.