Attention Mechanism
What to focus on
The Problem: When you read "The bank by the river was steep", you instantly know "bank" means a riverbank, not a financial institution. But how can a model that processes words as numbers understand that the same word means different things in different contexts?
The Solution: Attention — The Model Learns to Focus
The attention mechanism allows the model to look at all the other words while processing each word, deciding which ones matter most for understanding the current word. It's like a spotlight that illuminates the relevant parts of a sentence.
For the word "bank", the attention mechanism will focus on "river" and "steep" — and understand it's a riverbank. In another sentence ("I went to the bank to deposit money"), it would focus on "deposit" and "money" — and understand it's a financial institution.
Attention is the key building block of the Transformer architecture, which powers all modern LLMs. Before attention, each word was represented by a fixed embedding that didn't change based on context.
Think of it like a librarian helping you find the right book:
1. Query — "What am I looking for?": The current word creates a "question" vector. For the word "bank", the query is essentially: "What context should I pay attention to?"
2. Key — "What do I have to offer?": Every other word creates a "label" vector describing its content. "River" says: "I'm about nature and water." "Money" says: "I'm about finance."
3. Match — Compare the Query with all Keys: The model calculates how well each Key matches the Query. A high match means a high attention score. "River" gets a high score when "bank" is about terrain.
4. Value — Gather the relevant information: Each word also has a "content" vector (Value). The model takes a weighted combination of all Values based on the attention scores, drawing mostly from the high-scoring words.
5. Multi-Head — Multiple perspectives at once: One attention "head" might focus on grammar (subject-verb), another on meaning (river-bank), a third on position (nearby words). GPT-3 uses 96 heads in each layer!
This is called self-attention because the model attends to its own input — each word asks about all other words in the same text.
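The five steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not a production implementation: the projection matrices are random here just to show the shapes and data flow, whereas a trained model learns them.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Scaled dot-product self-attention over a sequence of word vectors X."""
    Q = X @ W_q                          # queries: "what am I looking for?"
    K = X @ W_k                          # keys:    "what do I have to offer?"
    V = X @ W_v                          # values:  "what content do I carry?"
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # match every query with every key
    weights = softmax(scores, axis=-1)   # attention weights; each row sums to 1
    return weights @ V, weights          # weighted combination of the values

rng = np.random.default_rng(0)
seq_len, d_model = 5, 8                  # toy sizes: 5 words, 8 dimensions
X = rng.normal(size=(seq_len, d_model))
W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))
out, weights = self_attention(X, W_q, W_k, W_v)
print(out.shape)                         # one context-aware vector per word
```

Note that the whole sequence is processed with a handful of matrix multiplications — there is no word-by-word loop, which is exactly why attention parallelizes so well on GPUs.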
Why Is Attention So Important?
- Context understanding: the same word gets different representations depending on surrounding words — "bank" becomes "riverbank" or "financial bank"
- Long-range connections: attention can connect words far apart in text — a pronoun "she" at position 100 can attend to a name at position 5
- Parallel processing: unlike older models (RNNs) that read word by word, attention processes ALL words at once on GPUs — making training much faster
Fun Fact: The famous paper "Attention Is All You Need" (2017) showed that you don't need complex recurrent networks — attention alone is enough to build state-of-the-art language models. This single idea spawned GPT, BERT, T5, and eventually ChatGPT and Claude!
Try It Yourself!
Below is an interactive visualization. See how different words "pay attention" to each other — hover over a word to see its attention pattern!
👁️ How does AI understand context?
AI looks at other words in the sentence to understand the meaning of each word.
💡 Key insights:
- Attention is a mechanism that allows AI to "look at" different parts of the text.
- Attention strength shows how important one word is for understanding another.
- It's like reading a sentence and remembering the key words.
Deep Dive: How Attention Works
Query, Key, Value — an Analogy
Imagine a library. You come with a question (Query): "I need a book about space." Each book on the shelf has a label (Key): "Astronomy," "Cooking," "Physics." The attention mechanism compares your question with each label to find the most relevant books. Then it takes the content (Value) of relevant books and combines them for an answer.
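The library analogy can be written out directly: compare one query against each key's label, turn the match scores into weights with a softmax, and use the weights to decide how much of each book's content to take. The 2-dimensional "topic" vectors below are invented purely for illustration.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Invented 2-d "topic" vectors: axis 0 = science-ness, axis 1 = cooking-ness
query = np.array([1.0, 0.0])                 # "I need a book about space"
keys = {
    "Astronomy": np.array([0.9, 0.1]),
    "Cooking":   np.array([0.1, 0.9]),
    "Physics":   np.array([0.8, 0.2]),
}
# Match the query against every key (book label)
scores = np.array([query @ k for k in keys.values()])
weights = softmax(scores)                    # how relevant each book is
for name, w in zip(keys, weights):
    print(f"{name}: {w:.2f}")                # Astronomy gets the most weight
```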
Why Multiple Heads (Multi-Head)?
A single attention head sees text from only one angle. It's like having only one librarian who searches only by topic. Multi-head attention is several "librarians" working simultaneously: one searches by topic, another by author, a third by publication year. Together they provide a more complete picture of word relationships: grammatical, semantic, and positional.
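A rough sketch of the multiple-librarians idea: each head runs the same attention computation over its own smaller projection of the input, and the heads' outputs are concatenated back into one vector per word. As before, the random projections stand in for learned weights.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, n_heads):
    """Each head attends over its own lower-dimensional view of X."""
    seq_len, d_model = X.shape
    assert d_model % n_heads == 0
    d_head = d_model // n_heads
    rng = np.random.default_rng(42)
    head_outputs = []
    for _ in range(n_heads):
        # Each head gets its own (here: random, illustrative) projections —
        # one "librarian" searching by its own criterion
        W_q, W_k, W_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))
        Q, K, V = X @ W_q, X @ W_k, X @ W_v
        w = softmax(Q @ K.T / np.sqrt(d_head))
        head_outputs.append(w @ V)
    # Concatenate the heads' views back into one vector per word
    return np.concatenate(head_outputs, axis=-1)

X = np.random.default_rng(1).normal(size=(5, 8))   # 5 words, 8 dimensions
out = multi_head_attention(X, n_heads=4)
print(out.shape)                                   # still one vector per word
```

In a real Transformer a final learned projection mixes the concatenated head outputs; it is omitted here to keep the sketch short.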
Try it yourself
Interactive demo of this technique
Extract the key requirement from a long technical specification
Main requirements for the task management system:
- Task creation and assignment
- Calendar integration
- Email notifications
- Mobile app
- Analytics dashboard
- PDF export
- Multi-language
- Dark theme
Key architectural requirement: OFFLINE-FIRST.
This choice drives everything else: you need a local database (SQLite/IndexedDB), CRDT-based or custom conflict resolution for sync, and a Service Worker for PWA support. The other features (calendar, PDF, notifications) are standard on any stack, but offline sync fundamentally changes the choice of architecture.
Transformer attention tends to weight the start and end of a prompt more heavily than the middle (the primacy/recency effect). Place critical information in those positions, not in the middle.
This lesson is part of a structured LLM course.