Tokenization
How AI reads text
The Problem: Computers only understand numbers, not text. How can we feed a sentence like "I love cats" into a neural network that only works with numbers?
The Solution: A Dictionary for AI
Tokenization is the process of breaking text into small pieces called tokens and assigning each piece a number. Think of it as creating a dictionary where each entry (word, part of a word, or even a single character) gets a unique ID number.
Modern LLMs don't split by whole words (too many unique words!) or by individual characters (sentences become too long!). Instead, they use subword tokenization — a smart middle ground that splits text into frequent word parts. The most popular algorithm for this is called Byte Pair Encoding (BPE). After tokenization, each token is converted into an embedding — a numerical vector the model can work with.
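The whole pipeline — text to tokens to IDs to vectors — can be sketched in a few lines. The vocabulary, IDs, and vector values below are invented for illustration; real models use vocabularies of tens of thousands of tokens and learned embeddings:

```python
import random

# Toy vocabulary: each token maps to a unique ID (all values here are made up).
vocab = {"I": 0, " love": 1, " cats": 2}

# A toy embedding table: one small random vector per token ID.
random.seed(0)
embeddings = [[random.random() for _ in range(4)] for _ in vocab]

# Text -> tokens -> IDs -> vectors: the form a neural network actually consumes.
tokens = ["I", " love", " cats"]
ids = [vocab[t] for t in tokens]
vectors = [embeddings[i] for i in ids]
print(ids)  # [0, 1, 2]
```

In a real model the embedding table is learned during training, but the lookup step is exactly this simple: an ID indexes into a table of vectors.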
Think of it like building a phrase book from scratch:
1. Start with characters: split all training text into individual characters: "cat" → ["c", "a", "t"]. This is the initial vocabulary.
2. Count adjacent pairs: scan the whole training text and find which pairs of adjacent tokens appear most often. For example, "t"+"h" might be the most common pair in English.
3. Merge the most frequent pair: create a new token from this pair: "t"+"h" → "th". Add "th" to the vocabulary. Now "the" is ["th", "e"] instead of ["t", "h", "e"].
4. Repeat thousands of times: keep merging the most frequent pairs until the vocabulary reaches the target size (roughly 50,000 tokens for GPT-2, about 100,000 for GPT-4). Common words like "the" become single tokens, while rare words get split into parts.
5. The result is a smart vocabulary: "unhappiness" → ["un", "happiness"], because the tokenizer recognizes common prefixes and roots, while a rare word like "ChatGPT" splits into known pieces: ["Chat", "G", "PT"].
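The steps above can be sketched as a toy BPE trainer. This is a minimal illustration on a tiny corpus, not a production tokenizer (real implementations work on bytes and handle word boundaries carefully):

```python
from collections import Counter

def bpe_train(corpus: list[str], num_merges: int) -> list[tuple[str, str]]:
    """Learn BPE merges from a toy corpus (steps 1-4 above)."""
    # Step 1: split every word into individual characters.
    words = [list(word) for word in corpus]
    merges = []
    for _ in range(num_merges):
        # Step 2: count every pair of adjacent tokens across the corpus.
        pairs = Counter()
        for w in words:
            for a, b in zip(w, w[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        # Step 3: merge the most frequent pair into a single new token.
        best = max(pairs, key=pairs.get)
        merges.append(best)
        merged_words = []
        for w in words:
            new_w, i = [], 0
            while i < len(w):
                if i + 1 < len(w) and (w[i], w[i + 1]) == best:
                    new_w.append(w[i] + w[i + 1])
                    i += 2
                else:
                    new_w.append(w[i])
                    i += 1
            merged_words.append(new_w)
        words = merged_words  # Step 4: repeat with the updated corpus.
    return merges

# "t"+"h" is the most frequent adjacent pair in this toy corpus,
# so it becomes the first merge.
corpus = ["the", "the", "this", "that", "cat"]
merges = bpe_train(corpus, num_merges=3)
print(merges[0])  # ('t', 'h')
```

Real tokenizers run this loop tens of thousands of times over gigabytes of text, but the core idea is exactly this counting-and-merging loop.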
This is why tokenization is so important — the quality of token splitting directly affects how well the model understands and generates text!
Why Does This Matter?
- Token limits: when you hear "4K context" or "128K context" — that's measured in tokens, not words. One word ≈ 1-3 tokens.
- Pricing: API pricing is per token. The same meaning costs different amounts in different languages because tokenization differs!
- Multilingual: BPE works with bytes, so it handles any language — Chinese, Arabic, emoji. But less common languages use more tokens per word.
- Code: programming keywords like "function" or "return" are often single tokens, while variable names get split.
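Since pricing is per token, estimating cost is simple arithmetic. The per-token prices below are assumptions for illustration only, not real API rates:

```python
# Hypothetical prices, for illustration only (check your provider's pricing page).
PRICE_PER_1K_INPUT = 0.01   # dollars per 1,000 input tokens (assumed)
PRICE_PER_1K_OUTPUT = 0.03  # dollars per 1,000 output tokens (assumed)

def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Rough cost of one API request given its token counts."""
    return (input_tokens / 1000) * PRICE_PER_1K_INPUT + \
           (output_tokens / 1000) * PRICE_PER_1K_OUTPUT

# A modest request: 1,000 input tokens, 500 output tokens.
print(round(estimate_cost(1000, 500), 4))  # 0.025
```

Output tokens usually cost more than input tokens, which is why verbose responses add up faster than long prompts.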
Fun Fact: GPT-4's tokenizer has about 100,000 tokens in its vocabulary. The word "programming" is one token, but "supercalifragilisticexpialidocious" gets split into 11 tokens! And a single family emoji like 👨‍👩‍👧‍👦 can use up to 18 tokens.
Try It Yourself!
Below is an interactive tokenizer. Try typing different texts and see how they get split into tokens!
🧩 How does AI read text?
AI doesn't understand letters directly. It breaks text into pieces — tokens.
💡 Key insight:
- A token isn't necessarily a word. It can be part of a word or even a punctuation mark.
- Common words = 1 token. Rare words = multiple tokens.
- More tokens = a more expensive AI request.
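These insights follow directly from how a trained tokenizer works: it replays its learned merge list on new text. A sketch, using a hypothetical three-entry merge list (real lists have tens of thousands of entries):

```python
def bpe_encode(word: str, merges: list[tuple[str, str]]) -> list[str]:
    """Tokenize a word by replaying learned merges in order (a sketch)."""
    tokens = list(word)
    for a, b in merges:
        out, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and tokens[i] == a and tokens[i + 1] == b:
                out.append(a + b)  # apply this merge wherever the pair occurs
                i += 2
            else:
                out.append(tokens[i])
                i += 1
        tokens = out
    return tokens

# Hypothetical merge list covering a few common pieces.
merges = [("t", "h"), ("th", "e"), ("u", "n")]
print(bpe_encode("the", merges))      # ['the']
print(bpe_encode("thunder", merges))  # ['th', 'un', 'd', 'e', 'r']
```

A common word covered by the merges collapses to one token; a rarer word falls apart into whatever pieces the vocabulary happens to contain, which is exactly why rare words cost more tokens.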
Example task: get a short product description for a product card.
Before (generic result): "XSound Pro 500 are innovative wireless headphones with excellent sound, long battery life and modern design. Perfect for music, work and travel."
After (trimmed prompt, better result): "XSound Pro 500: wireless headphones with active noise cancellation for complete music immersion. 30 hours on a single charge. Bluetooth 5.3 provides a stable, lag-free connection. Soft ear cushions and a foldable design mean comfort all day long."
Understanding tokenization saves costs: removing filler from the prompt cut input from 95 to 38 tokens (−60%) and produced a better response. Every token should carry meaning.
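The quoted savings check out with one line of arithmetic:

```python
# Token counts from the example above.
before_tokens, after_tokens = 95, 38
reduction = (before_tokens - after_tokens) / before_tokens
print(f"{reduction:.0%}")  # 60%
```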
This lesson is part of a structured LLM course.