Tokenization
How AI reads text
The Problem: Computers only understand numbers, not text. How can we feed a sentence like "I love cats" into a neural network that only works with numbers?
The Solution: A Dictionary for AI
Tokenization is the process of breaking text into small pieces called tokens and assigning each piece a number. Think of it as creating a dictionary where each entry (word, part of a word, or even a single character) gets a unique ID number.
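The dictionary idea can be sketched in a few lines of Python. The vocabulary entries and ID numbers below are invented for illustration; a real tokenizer learns its vocabulary from training data.

```python
# Toy vocabulary: every known piece of text gets a unique integer ID.
# These entries and IDs are made up for illustration only.
vocab = {"I": 0, " love": 1, " cats": 2, " dogs": 3}

# Reverse mapping, used to turn IDs back into text.
id_to_token = {token_id: token for token, token_id in vocab.items()}


def encode(pieces):
    """Map a list of text pieces to their ID numbers."""
    return [vocab[piece] for piece in pieces]


def decode(ids):
    """Map ID numbers back to text and join the pieces."""
    return "".join(id_to_token[i] for i in ids)


ids = encode(["I", " love", " cats"])
print(ids)          # [0, 1, 2]
print(decode(ids))  # I love cats
```

Note that the space is stored as part of the token (" love"), which is one common convention and lets decoding be a plain join.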
Modern LLMs don't split by whole words (too many unique words!) or by individual characters (sentences become too long!). Instead, they use subword tokenization — a smart middle ground that splits text into frequent word parts. The most popular algorithm for this is called Byte Pair Encoding (BPE). After tokenization, each token is converted into an embedding — a numerical vector the model can work with.
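As a rough sketch of the last step, an embedding is a table lookup: each token ID selects one row of numbers. The sizes and the random values below are placeholders chosen for this example; in a real model the table is learned during training and is far larger.

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

VOCAB_SIZE = 6     # hypothetical tiny vocabulary
EMBEDDING_DIM = 4  # real models use hundreds or thousands of dimensions

# One row of EMBEDDING_DIM numbers per token ID.
# Random placeholders stand in for learned values.
embedding_table = [
    [random.uniform(-1.0, 1.0) for _ in range(EMBEDDING_DIM)]
    for _ in range(VOCAB_SIZE)
]


def embed(token_ids):
    """Look up the vector for each token ID."""
    return [embedding_table[token_id] for token_id in token_ids]


vectors = embed([0, 1, 2])
print(len(vectors), len(vectors[0]))  # 3 4
```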
BPE builds its vocabulary step by step. Think of it like building a phrase book from scratch:
- 1. Start with characters: Split all training text into individual characters: "cat" → ["c", "a", "t"]. This is the initial vocabulary.
- 2. Count adjacent pairs: Look at the whole training text and find which pairs of adjacent tokens appear most often. For example, "t"+"h" might be the most common pair in English.
- 3. Merge the most frequent pair: Create a new token from this pair: "t"+"h" → "th". Add "th" to the vocabulary. Now "the" is ["th", "e"] instead of ["t", "h", "e"].
- 4. Repeat thousands of times: Keep merging the most frequent pairs until the vocabulary reaches the target size (e.g., roughly 50,000 tokens for GPT-2 or roughly 100,000 for GPT-4). Common words like "the" become single tokens, while rare words get split into parts.
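- 5. Result: a frequency-based vocabulary: A word like "unhappiness" might be split into ["un", "happiness"], because frequent pieces often line up with common prefixes and roots, even though BPE knows nothing about grammar. A rarer string like "ChatGPT" might be split into known pieces such as ["Chat", "G", "PT"]. The exact splits depend on the tokenizer.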
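The steps above can be sketched as a small training loop. This is a simplified illustration, not the implementation any particular model uses: it works on characters rather than bytes, splits the corpus on whitespace, and ignores details such as word-boundary markers. The function names and the tiny corpus are made up for the example.

```python
from collections import Counter


def get_pair_counts(words):
    """Count adjacent symbol pairs, weighted by how often each word occurs."""
    counts = Counter()
    for symbols, freq in words.items():
        for pair in zip(symbols, symbols[1:]):
            counts[pair] += freq
    return counts


def merge_pair(words, pair):
    """Replace every occurrence of `pair` with one merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def train_bpe(corpus, num_merges):
    """Learn up to `num_merges` merge rules from a whitespace-split corpus."""
    # Step 1: split every word into characters.
    words = dict(Counter(tuple(word) for word in corpus.split()))
    merges = []
    for _ in range(num_merges):
        counts = get_pair_counts(words)       # Step 2: count adjacent pairs
        if not counts:
            break                             # nothing left to merge
        best = max(counts, key=counts.get)    # most frequent pair
        words = merge_pair(words, best)       # Step 3: merge it everywhere
        merges.append(best)                   # Step 4: repeat
    return merges, words


merges, words = train_bpe("the the the this that cat", num_merges=2)
print(merges)  # [('t', 'h'), ('th', 'e')]
print(words)
```

In this toy corpus "t"+"h" is the most frequent pair, so it is merged first, and "th"+"e" follows. After two merges the word "the" has become a single token, while "cat" is still three characters.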
This is why tokenization is so important — the quality of token splitting directly affects how well the model understands and generates text!
Why Does This Matter?
- Token limits: when you hear "4K context" or "128K context" — that's measured in tokens, not words. One word ≈ 1-3 tokens.
- Pricing: API pricing is per token. The same meaning costs different amounts in different languages because tokenization differs!
- Multilingual: byte-level BPE works on UTF-8 bytes, so it can represent any language or symbol: Chinese, Arabic, emoji. But languages that are underrepresented in the tokenizer's training data usually need more tokens per word.
- Code: programming keywords like "function" or "return" are often single tokens, while variable names get split.
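To see why byte-level tokenizers can handle any script, it helps to look at the raw UTF-8 bytes. The sketch below only counts bytes; it is not a tokenizer. The byte count is the starting sequence length before any merges are applied, so text whose bytes rarely get merged ends up with more tokens.

```python
def utf8_len(text):
    """Number of bytes needed to store `text` in UTF-8."""
    return len(text.encode("utf-8"))


for text in ["cat", "猫", "🐈"]:
    raw = text.encode("utf-8")
    print(repr(text), "characters:", len(text), "bytes:", len(raw), list(raw))

# 'cat' is 3 characters and 3 bytes.
# '猫' is 1 character but 3 bytes.
# '🐈' is 1 character but 4 bytes.
```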
Fun Fact: GPT-4's tokenizer has about 100,000 tokens in its vocabulary. The word "programming" is one token, but "supercalifragilisticexpialidocious" gets split into 11 tokens! And a single family emoji like 👨👩👧👦 can use up to 18 tokens, because it is really several emoji joined together by invisible characters.
💡 Key insight:
- A token isn't necessarily a word. It can be part of a word or even a punctuation mark.
- Common words are usually 1 token. Rare words are usually split into multiple tokens.
- More tokens = a more expensive AI request.
Frequently asked questions
What is tokenization in LLMs?
Tokenization is the process of breaking down text into smaller units called tokens. Large Language Models use algorithms like Byte-Pair Encoding (BPE) to split text into subword units, which allows them to handle any word, including rare or misspelled ones.
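A minimal sketch of how a trained BPE tokenizer might split an unseen or misspelled word: it starts from single characters and applies the learned merges in the order they were learned. The merge list here is hypothetical, and real tokenizers work on bytes and use faster algorithms, but the fallback behavior is the same idea.

```python
def apply_merges(word, merges):
    """Split `word` into tokens by replaying learned merges in order."""
    symbols = list(word)  # start from individual characters
    for pair in merges:
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        symbols = out
    return symbols


# Hypothetical merge rules, listed in the order they were learned.
merges = [("t", "h"), ("th", "e"), ("i", "n"), ("in", "g")]

print(apply_merges("the", merges))    # ['the']
print(apply_merges("thing", merges))  # ['th', 'ing']
print(apply_merges("teh", merges))    # ['t', 'e', 'h']
```

The misspelled "teh" matches no merge rule, so it falls back to single characters. It costs more tokens than "the", but it can still be represented.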
Why do different models tokenize text differently?
Each model uses its own tokenizer, trained on its own dataset. For example, GPT-4 uses the cl100k_base tokenizer with a vocabulary of ~100K tokens, while Claude uses a different tokenizer and vocabulary. This affects how efficiently each model processes text in different languages.
How many tokens is one word?
It varies! Common English words are usually 1 token (e.g., 'the', 'is'). Longer or less common words may be 2-3 tokens. Technical terms, non-English words, or special characters often use more tokens. On average, 1 token ≈ 0.75 words in English.
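The 0.75 words-per-token rule of thumb can be turned into a quick estimator. This is only a rough approximation for English prose; the function name is made up for this sketch, and an exact count requires running the model's actual tokenizer.

```python
import math

WORDS_PER_TOKEN = 0.75  # rule of thumb for English text


def estimate_tokens(text):
    """Rough token estimate from a whitespace word count."""
    word_count = len(text.split())
    return math.ceil(word_count / WORDS_PER_TOKEN)


print(estimate_tokens("I love cats"))  # 3 words -> 4 tokens (estimate)
```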
Why does tokenization matter for AI?
Tokenization directly impacts model efficiency, cost, and context window usage. More tokens = higher API costs and faster context limit consumption. Understanding tokenization helps optimize prompts and predict costs accurately.
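Once a token count is known, cost and context usage are simple arithmetic. The sketch below uses a hypothetical price and context size; both function names and all numbers are assumptions for illustration, so check your provider's documentation for real values.

```python
def estimate_cost(num_tokens, price_per_million_tokens):
    """Cost of processing `num_tokens` at a given price per million tokens."""
    return num_tokens / 1_000_000 * price_per_million_tokens


def fits_in_context(prompt_tokens, max_output_tokens, context_window):
    """Check that the prompt plus the reserved output fits in the window."""
    return prompt_tokens + max_output_tokens <= context_window


# Hypothetical numbers: 500,000 input tokens at 2.00 per million tokens.
print(estimate_cost(500_000, 2.0))        # 1.0

# A 3,000-token prompt with 1,000 tokens reserved for the answer
# fits in a 4,096-token window; a 3,500-token prompt does not.
print(fits_in_context(3000, 1000, 4096))  # True
print(fits_in_context(3500, 1000, 4096))  # False
```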
Try it yourself
Task: get a short product description for a product card.
Response to the original prompt: XSound Pro 500 are innovative wireless headphones with excellent sound, long battery life and modern design. Perfect for music, work and travel.
Response to the trimmed prompt: XSound Pro 500: wireless headphones with active noise cancellation for complete music immersion. 30 hours on a single charge. Bluetooth 5.3 provides stable lag-free connection. Soft ear cushions and foldable design for comfort all day long.
Understanding tokenization saves costs: removing filler from the prompt cut input from 95 to 38 tokens (−60%) and produced a better response. Every token should carry meaning.
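The −60% figure follows directly from the two token counts. A small helper (the name is made up for this sketch) makes the arithmetic explicit and shows how the saving adds up over many requests; the request volume is a hypothetical number.

```python
def reduction_percent(tokens_before, tokens_after):
    """Percentage by which the token count shrank."""
    return (tokens_before - tokens_after) / tokens_before * 100


saved_per_request = 95 - 38
print(saved_per_request)                 # 57
print(round(reduction_percent(95, 38)))  # 60

# Hypothetical volume: 10,000 requests with the trimmed prompt.
print(saved_per_request * 10_000)        # 570000 input tokens saved
```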