Attention Mechanism
What to focus on
The Problem: When you read "The bank by the river was steep", you instantly know "bank" means a riverbank, not a financial institution. But how can a model that processes words as numbers understand that the same word means different things in different contexts?
The Solution: Attention — The Model Learns to Focus
The attention mechanism allows the model to look at all the other words while processing each word, deciding which ones matter most for understanding the current word. It's like a spotlight that illuminates the relevant parts of a sentence.
For the word "bank", the attention mechanism will focus on "river" and "steep" — and understand it's a riverbank. In another sentence ("I went to the bank to deposit money"), it would focus on "deposit" and "money" — and understand it's a financial institution.
Attention is the key building block of the Transformer architecture, which powers all modern LLMs. Before attention, each word was represented by a fixed embedding that didn't change based on context.
Think of it like a librarian helping you find the right book:
1. Query — "What am I looking for?": The current word creates a "question" vector. For the word "bank", the query is essentially: "What context should I pay attention to?"
2. Key — "What do I have to offer?": Every other word creates a "label" vector describing its content. "River" says: "I'm about nature and water." "Money" says: "I'm about finance."
3. Match — Compare the Query with all Keys: The model calculates how well each Key matches the Query. A high match means a high attention score. "River" gets a high score when "bank" is about terrain.
4. Value — Gather the relevant information: Each word also has a "content" vector (Value). The model takes a weighted combination of all Values based on the attention scores, drawing mostly from the high-scoring words.
5. Multi-Head — Multiple perspectives at once: One attention "head" might focus on grammar (subject-verb), another on meaning (river-bank), a third on position (nearby words). GPT-3 uses 96 heads in each layer!
This is called self-attention because the model attends to its own input — each word asks about all other words in the same text.
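The five steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not a production implementation: the projection matrices are random here just to show the shapes and data flow, whereas a trained model learns them.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Scaled dot-product self-attention over a sequence of word vectors X."""
    Q = X @ W_q                          # queries: "what am I looking for?"
    K = X @ W_k                          # keys:    "what do I have to offer?"
    V = X @ W_v                          # values:  "what content do I carry?"
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # match every query with every key
    weights = softmax(scores, axis=-1)   # attention weights; each row sums to 1
    return weights @ V, weights          # weighted combination of the values

rng = np.random.default_rng(0)
seq_len, d_model = 5, 8                  # toy sizes: 5 words, 8 dimensions
X = rng.normal(size=(seq_len, d_model))
W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))
out, weights = self_attention(X, W_q, W_k, W_v)
print(out.shape)                         # one context-aware vector per word
```

Note that the whole sequence is processed with a handful of matrix multiplications — there is no word-by-word loop, which is exactly why attention parallelizes so well on GPUs.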
Why Is Attention So Important?
- Context understanding: the same word gets different representations depending on surrounding words — "bank" becomes "riverbank" or "financial bank"
- Long-range connections: attention can connect words far apart in text — a pronoun "she" at position 100 can attend to a name at position 5
- Parallel processing: unlike older models (RNNs) that read word by word, attention processes ALL words at once on GPUs — making training much faster
Fun Fact: The famous paper "Attention Is All You Need" (2017) showed that you don't need complex recurrent networks — attention alone is enough to build state-of-the-art language models. This single idea spawned GPT, BERT, T5, and eventually ChatGPT and Claude!
Try It Yourself!
Below is an interactive visualization. See how different words "pay attention" to each other — hover over a word to see its attention pattern!
👁️ How does AI understand context?
AI looks at other words in the sentence to understand the meaning of each word.
💡 Key insights:
- Attention is a mechanism that allows AI to "look at" different parts of the text.
- Attention strength shows how important one word is for understanding another.
- It's like reading a sentence and remembering the key words.
Deep Dive: How Attention Works
Query, Key, Value — an Analogy
Imagine a library. You come with a question (Query): "I need a book about space." Each book on the shelf has a label (Key): "Astronomy," "Cooking," "Physics." The attention mechanism compares your question with each label to find the most relevant books. Then it takes the content (Value) of relevant books and combines them for an answer.
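The library analogy can be written out directly: compare one query against each key's label, turn the match scores into weights with a softmax, and use the weights to decide how much of each book's content to take. The 2-dimensional "topic" vectors below are invented purely for illustration.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Invented 2-d "topic" vectors: axis 0 = science-ness, axis 1 = cooking-ness
query = np.array([1.0, 0.0])                 # "I need a book about space"
keys = {
    "Astronomy": np.array([0.9, 0.1]),
    "Cooking":   np.array([0.1, 0.9]),
    "Physics":   np.array([0.8, 0.2]),
}
# Match the query against every key (book label)
scores = np.array([query @ k for k in keys.values()])
weights = softmax(scores)                    # how relevant each book is
for name, w in zip(keys, weights):
    print(f"{name}: {w:.2f}")                # Astronomy gets the most weight
```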
Why Multiple Heads (Multi-Head)?
A single attention head sees text from only one angle. It's like having only one librarian who searches only by topic. Multi-head attention is several "librarians" working simultaneously: one searches by topic, another by author, a third by publication year. Together they provide a more complete picture of word relationships: grammatical, semantic, and positional.
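A rough sketch of the multiple-librarians idea: each head runs the same attention computation over its own smaller projection of the input, and the heads' outputs are concatenated back into one vector per word. As before, the random projections stand in for learned weights.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, n_heads):
    """Each head attends over its own lower-dimensional view of X."""
    seq_len, d_model = X.shape
    assert d_model % n_heads == 0
    d_head = d_model // n_heads
    rng = np.random.default_rng(42)
    head_outputs = []
    for _ in range(n_heads):
        # Each head gets its own (here: random, illustrative) projections —
        # one "librarian" searching by its own criterion
        W_q, W_k, W_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))
        Q, K, V = X @ W_q, X @ W_k, X @ W_v
        w = softmax(Q @ K.T / np.sqrt(d_head))
        head_outputs.append(w @ V)
    # Concatenate the heads' views back into one vector per word
    return np.concatenate(head_outputs, axis=-1)

X = np.random.default_rng(1).normal(size=(5, 8))   # 5 words, 8 dimensions
out = multi_head_attention(X, n_heads=4)
print(out.shape)                                   # still one vector per word
```

In a real Transformer a final learned projection mixes the concatenated head outputs; it is omitted here to keep the sketch short.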
Try it yourself
Interactive demo of this technique
Extract the key requirement from a long technical specification
Main requirements for the task management system:
- Task creation and assignment
- Calendar integration
- Email notifications
- Mobile app
- Analytics dashboard
- PDF export
- Multi-language
- Dark theme
Key architectural requirement: OFFLINE-FIRST.
This choice drives everything else: you need a local database (SQLite/IndexedDB), CRDT-based or custom conflict resolution for sync, and a Service Worker for PWA support. The other features (calendar, PDF, notifications) are standard on any stack, but offline sync fundamentally changes the choice of architecture.
Transformer attention tends to weight the start and end of a prompt more heavily than the middle (the primacy/recency effect). Place critical information in those positions, not in the middle.
This lesson is part of a structured LLM course.