Tokenization
How AI reads text
The Problem: Computers only understand numbers, not text. How can we feed a sentence like "I love cats" into a neural network that only works with numbers?
The Solution: A Dictionary for AI
Tokenization is the process of breaking text into small pieces called tokens and assigning each piece a number. Think of it as creating a dictionary where each entry (word, part of a word, or even a single character) gets a unique ID number.
Modern LLMs don't split by whole words (too many unique words!) or by individual characters (sentences become too long!). Instead, they use subword tokenization — a smart middle ground that splits text into frequent word parts. The most popular algorithm for this is called Byte Pair Encoding (BPE). After tokenization, each token is converted into an embedding — a numerical vector the model can work with.
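The whole pipeline — text to tokens to IDs to vectors — can be sketched in a few lines. The vocabulary, IDs, and vector values below are invented for illustration; real models use vocabularies of tens of thousands of tokens and learned embeddings:

```python
import random

# Toy vocabulary: each token maps to a unique ID (all values here are made up).
vocab = {"I": 0, " love": 1, " cats": 2}

# A toy embedding table: one small random vector per token ID.
random.seed(0)
embeddings = [[random.random() for _ in range(4)] for _ in vocab]

# Text -> tokens -> IDs -> vectors: the form a neural network actually consumes.
tokens = ["I", " love", " cats"]
ids = [vocab[t] for t in tokens]
vectors = [embeddings[i] for i in ids]
print(ids)  # [0, 1, 2]
```

In a real model the embedding table is learned during training, but the lookup step is exactly this simple: an ID indexes into a table of vectors.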
Think of it like building a phrase book from scratch:
1. Start with characters: split all training text into individual characters: "cat" → ["c", "a", "t"]. This is the initial vocabulary.
2. Count adjacent pairs: scan the whole training text and find which pairs of adjacent tokens appear most often. For example, "t"+"h" might be the most common pair in English.
3. Merge the most frequent pair: create a new token from this pair: "t"+"h" → "th". Add "th" to the vocabulary. Now "the" is ["th", "e"] instead of ["t", "h", "e"].
4. Repeat thousands of times: keep merging the most frequent pairs until the vocabulary reaches the target size (roughly 50,000 tokens for GPT-2, about 100,000 for GPT-4). Common words like "the" become single tokens, while rare words get split into parts.
5. The result is a smart vocabulary: "unhappiness" → ["un", "happiness"], because the tokenizer recognizes common prefixes and roots, while a rare word like "ChatGPT" splits into known pieces: ["Chat", "G", "PT"].
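The steps above can be sketched as a toy BPE trainer. This is a minimal illustration on a tiny corpus, not a production tokenizer (real implementations work on bytes and handle word boundaries carefully):

```python
from collections import Counter

def bpe_train(corpus: list[str], num_merges: int) -> list[tuple[str, str]]:
    """Learn BPE merges from a toy corpus (steps 1-4 above)."""
    # Step 1: split every word into individual characters.
    words = [list(word) for word in corpus]
    merges = []
    for _ in range(num_merges):
        # Step 2: count every pair of adjacent tokens across the corpus.
        pairs = Counter()
        for w in words:
            for a, b in zip(w, w[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        # Step 3: merge the most frequent pair into a single new token.
        best = max(pairs, key=pairs.get)
        merges.append(best)
        merged_words = []
        for w in words:
            new_w, i = [], 0
            while i < len(w):
                if i + 1 < len(w) and (w[i], w[i + 1]) == best:
                    new_w.append(w[i] + w[i + 1])
                    i += 2
                else:
                    new_w.append(w[i])
                    i += 1
            merged_words.append(new_w)
        words = merged_words  # Step 4: repeat with the updated corpus.
    return merges

# "t"+"h" is the most frequent adjacent pair in this toy corpus,
# so it becomes the first merge.
corpus = ["the", "the", "this", "that", "cat"]
merges = bpe_train(corpus, num_merges=3)
print(merges[0])  # ('t', 'h')
```

Real tokenizers run this loop tens of thousands of times over gigabytes of text, but the core idea is exactly this counting-and-merging loop.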
This is why tokenization is so important — the quality of token splitting directly affects how well the model understands and generates text!
Why Does This Matter?
- Token limits: when you hear "4K context" or "128K context" — that's measured in tokens, not words. One word ≈ 1-3 tokens.
- Pricing: API pricing is per token. The same meaning costs different amounts in different languages because tokenization differs!
- Multilingual: BPE works with bytes, so it handles any language — Chinese, Arabic, emoji. But less common languages use more tokens per word.
- Code: programming keywords like "function" or "return" are often single tokens, while variable names get split.
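Since pricing is per token, estimating cost is simple arithmetic. The per-token prices below are assumptions for illustration only, not real API rates:

```python
# Hypothetical prices, for illustration only (check your provider's pricing page).
PRICE_PER_1K_INPUT = 0.01   # dollars per 1,000 input tokens (assumed)
PRICE_PER_1K_OUTPUT = 0.03  # dollars per 1,000 output tokens (assumed)

def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Rough cost of one API request given its token counts."""
    return (input_tokens / 1000) * PRICE_PER_1K_INPUT + \
           (output_tokens / 1000) * PRICE_PER_1K_OUTPUT

# A modest request: 1,000 input tokens, 500 output tokens.
print(round(estimate_cost(1000, 500), 4))  # 0.025
```

Output tokens usually cost more than input tokens, which is why verbose responses add up faster than long prompts.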
Fun Fact: GPT-4's tokenizer has about 100,000 tokens in its vocabulary. The word "programming" is one token, but "supercalifragilisticexpialidocious" gets split into 11 tokens! And a single family emoji like 👨‍👩‍👧‍👦 can use up to 18 tokens.
Try It Yourself!
Below is an interactive tokenizer. Try typing different texts and see how they get split into tokens!
🧩 How does AI read text?
AI doesn't understand letters directly. It breaks text into pieces — tokens.
💡 Key insight:
- A token isn't necessarily a word. It can be part of a word or even a punctuation mark.
- Common words = 1 token. Rare words = multiple tokens.
- More tokens = a more expensive AI request.
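These insights follow directly from how a trained tokenizer works: it replays its learned merge list on new text. A sketch, using a hypothetical three-entry merge list (real lists have tens of thousands of entries):

```python
def bpe_encode(word: str, merges: list[tuple[str, str]]) -> list[str]:
    """Tokenize a word by replaying learned merges in order (a sketch)."""
    tokens = list(word)
    for a, b in merges:
        out, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and tokens[i] == a and tokens[i + 1] == b:
                out.append(a + b)  # apply this merge wherever the pair occurs
                i += 2
            else:
                out.append(tokens[i])
                i += 1
        tokens = out
    return tokens

# Hypothetical merge list covering a few common pieces.
merges = [("t", "h"), ("th", "e"), ("u", "n")]
print(bpe_encode("the", merges))      # ['the']
print(bpe_encode("thunder", merges))  # ['th', 'un', 'd', 'e', 'r']
```

A common word covered by the merges collapses to one token; a rarer word falls apart into whatever pieces the vocabulary happens to contain, which is exactly why rare words cost more tokens.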
Example task: get a short product description for a product card.
Before (generic result): "XSound Pro 500 are innovative wireless headphones with excellent sound, long battery life and modern design. Perfect for music, work and travel."
After (trimmed prompt, better result): "XSound Pro 500: wireless headphones with active noise cancellation for complete music immersion. 30 hours on a single charge. Bluetooth 5.3 provides a stable, lag-free connection. Soft ear cushions and a foldable design mean comfort all day long."
Understanding tokenization saves costs: removing filler from the prompt cut input from 95 to 38 tokens (−60%) and produced a better response. Every token should carry meaning.
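The quoted savings check out with one line of arithmetic:

```python
# Token counts from the example above.
before_tokens, after_tokens = 95, 38
reduction = (before_tokens - after_tokens) / before_tokens
print(f"{reduction:.0%}")  # 60%
```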
This lesson is part of a structured LLM course.