Tokenization Deep Dive
How LLMs break text into tokens, why BPE is the dominant algorithm, and the practical implications for cost, context limits, and multilingual performance. Includes hands-on token counting with tiktoken and cross-model comparisons.
Quick Reference
- Tokens are subword units -- neither characters nor whole words, but something in between
- BPE (Byte Pair Encoding) merges frequent character pairs iteratively to build a vocabulary
- 1 token ~4 characters in English, but varies wildly across languages and content types
- Token count directly determines API cost and context window usage
- Different models use different tokenizers -- GPT-5 and Claude have different token counts for the same text
- JSON, code, and non-English text are significantly more token-expensive than plain English prose
What Are Tokens?
LLMs don't see text the way you do. Before any processing happens, your input is split into tokens -- subword units that the model treats as atomic symbols. The word 'unhappiness' might become ['un', 'happiness'] or ['un', 'happ', 'iness'] depending on the tokenizer. Common words like 'the' or 'is' are single tokens, while rare words get split into multiple pieces. This subword approach balances vocabulary size against the ability to represent any text.
Character-level tokenization would give tiny vocabularies but extremely long sequences (the model sees one letter at a time). Word-level tokenization can't handle misspellings, new words, or agglutinative languages. Subword tokenization is the Goldilocks solution: a manageable vocabulary (32K-100K entries) with the ability to represent any text by combining pieces.
| Input | GPT-5 tokens | Token breakdown |
|---|---|---|
| Hello world | 2 | ['Hello', ' world'] |
| authentication | 1 | ['authentication'] |
| supercalifragilistic | 5 | ['super', 'cal', 'ifrag', 'il', 'istic'] |
| {"key": "value"} | 7 | ['{"', 'key', '":', ' "', 'value', '"', '}'] (approx) |
| 你好世界 | 4 | Each CJK character is typically 1-2 tokens |
BPE: Byte Pair Encoding
Byte Pair Encoding is the most widely used tokenization algorithm. It starts with individual bytes (or characters) as the base vocabulary, then iteratively merges the most frequently co-occurring pair into a new token. This process repeats until the vocabulary reaches a target size. The result is a vocabulary where common subwords ('ing', 'tion', 'the') are single tokens while rare sequences get broken into smaller pieces.
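The merge loop can be sketched in a few lines. This is a toy illustration of the classic word-level BPE training procedure, not any production tokenizer (real implementations work on bytes and pre-tokenized text):

```python
from collections import Counter

def train_bpe(words, num_merges):
    """Toy BPE trainer. `words` maps word -> corpus frequency."""
    # Start with each word as a tuple of single characters.
    vocab = {tuple(w): f for w, f in words.items()}
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent pair becomes a new token
        merges.append(best)
        # Rewrite every word, fusing each occurrence of the best pair.
        new_vocab = {}
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] = freq
        vocab = new_vocab
    return merges, vocab

merges, vocab = train_bpe({"low": 5, "lower": 2, "newest": 6, "widest": 3}, 10)
print(merges)  # frequent pairs like ('e', 's') and ('es', 't') merge first
```

Frequent suffixes such as 'est' emerge as single tokens after a few merges, which is exactly how 'ing' and 'tion' end up in real vocabularies.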
OpenAI uses a BPE variant that operates on bytes (not characters), so it can handle any Unicode text without unknown tokens. SentencePiece (used by Llama) adds a unigram model for probabilistic tokenization. The differences are subtle but affect multilingual performance.
Why Tokenization Matters in Practice
Tokenization is not just an implementation detail -- it directly affects your API costs, context window usage, and application behavior. Understanding token economics is essential for building cost-effective LLM applications.
| Content type | Tokens per 1K chars (English) | Cost multiplier vs prose |
|---|---|---|
| Plain English prose | ~250 | 1x (baseline) |
| Python code | ~350 | ~1.4x |
| JSON data | ~400 | ~1.6x |
| XML / HTML | ~450 | ~1.8x |
| Minified JSON (no spaces) | ~500 | ~2x |
| Chinese / Japanese text | ~500-700 | ~2-2.8x |
| Base64 encoded data | ~600 | ~2.4x |
Structured data formats like JSON are significantly more token-expensive than natural language. Every brace, quote, colon, and comma is a separate token. When designing prompts that include structured data, consider whether you actually need JSON in the input, or if a more compact representation would work.
- Cost: OpenAI charges per token. GPT-5.4 at $2.00/1M input tokens means token efficiency directly affects your bill
- Context limits: A 128K context window holds ~96K words of English but only ~55K words of code or JSON
- Latency: More tokens = longer processing time. Output tokens are especially slow (generated one at a time)
- Multilingual fairness: Chinese text uses 2-3x more tokens than English for the same meaning, making it 2-3x more expensive
Counting Tokens in Practice
For OpenAI models, tiktoken gives exact counts offline. Anthropic does not publish its tokenizer publicly, so for estimation use the rule of thumb that 1 token is approximately 3.5 characters of English text; for precise counts, read the usage object in the API response, which reports input_tokens and output_tokens. Always build in a buffer -- estimate high and leave 10-20% headroom.
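That rule of thumb can be wrapped in a small pre-flight check. A sketch, assuming the 3.5-characters-per-token heuristic and a 15% safety margin (both values are tunable assumptions, not tokenizer facts):

```python
def estimate_tokens(text: str, chars_per_token: float = 3.5,
                    headroom: float = 0.15) -> int:
    """Rough token estimate for English text, padded with a safety margin.

    chars_per_token and headroom are heuristics: always verify against
    the usage fields your provider returns.
    """
    base = len(text) / chars_per_token
    return int(base * (1 + headroom)) + 1  # round up; never under-budget

def fits_in_context(prompt: str, context_window: int,
                    reserved_for_output: int) -> bool:
    """Check whether a prompt plausibly fits, leaving room for the reply."""
    return estimate_tokens(prompt) + reserved_for_output <= context_window

print(estimate_tokens("Hello world"))  # -> 4 for this 11-character string
```

Reserving output tokens up front matters: a prompt that "fits" but leaves no room for the response still fails.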
Common Tokenization Pitfalls
- Token boundaries break words unpredictably: 'ChatGPT' might be ['Chat', 'G', 'PT'] -- the model doesn't 'see' it as one word
- Trailing whitespace matters: ' Hello' and 'Hello' are different tokens. This affects prompt formatting
- Numbers are tokenized digit-by-digit or in small groups: '123456' becomes ['123', '456'], which is why LLMs struggle with arithmetic
- Special tokens (BOS, EOS, system markers) consume context but are invisible to you -- account for ~10-20 tokens of overhead per message
- Emoji and special Unicode can be surprisingly expensive: a single emoji might be 2-4 tokens
LLMs tokenize numbers unpredictably. The number '12345' might become tokens ['123', '45'] in one model and ['12', '345'] in another. This is a fundamental reason why LLMs are unreliable at arithmetic -- they don't process numbers as numbers, but as arbitrary text fragments. Always use code execution for math.
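The effect is easy to simulate. The sketch below mimics one common scheme -- grouping digits in threes from the left, an assumption for illustration rather than a spec for any particular model -- to show how a one-digit change shifts every token boundary:

```python
def mock_number_tokens(s: str, group: int = 3) -> list[str]:
    """Hypothetical digit tokenizer: split a digit string into fixed-size
    groups from the left. Real tokenizers vary; this is only illustrative."""
    return [s[i:i + group] for i in range(0, len(s), group)]

print(mock_number_tokens("123456"))   # ['123', '456']
# One extra leading digit shifts every boundary, so the model sees
# completely different fragments for a nearly identical number:
print(mock_number_tokens("0123456"))  # ['012', '345', '6']
```

Digit-aligned columns, carries, and place value all dissolve under this fragmentation, which is why delegating arithmetic to code execution is the reliable path.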
When optimizing for cost:
1. Prefer natural language over JSON in prompts where possible
2. Remove unnecessary whitespace and formatting from large data payloads
3. Summarize or chunk large documents instead of sending them whole
4. Cache system prompts when the provider supports it (OpenAI prompt caching, Anthropic prompt caching)
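The whitespace point (2) is the cheapest win. A sketch using only the standard library -- character count stands in for token count here, since fewer characters generally means fewer tokens, though the exact ratio is content-dependent:

```python
import json

# Illustrative payload; any nested JSON shows the same effect.
payload = {"user": {"name": "Ada", "roles": ["admin", "editor"], "active": True}}

pretty = json.dumps(payload, indent=2)                # human-readable, token-heavy
compact = json.dumps(payload, separators=(",", ":"))  # no space after ',' or ':'

print(len(pretty), len(compact))
print(f"saved {1 - len(compact) / len(pretty):.0%} of characters")
```

The compact form parses to the identical object, so nothing is lost -- the savings compound when the payload appears in every request.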
Best Practices
Do
- Count tokens before sending requests to avoid context window overflows
- Budget for token overhead: system prompts, message formatting, and special tokens add up
- Use provider-specific token counting tools (tiktoken for OpenAI, API usage fields for others)
- Design data formats with token efficiency in mind -- compact representations save cost and context
Don’t
- Don't assume 1 word = 1 token -- the ratio varies from 0.5 to 3+ depending on content
- Don't send raw JSON blobs when a natural language summary would suffice
- Don't ignore multilingual token costs -- non-English text can be 2-3x more expensive
- Don't rely on LLMs for arithmetic or precise string manipulation -- tokenization makes both unreliable
- Don't hardcode token limits -- they change as models are updated
Key Takeaways
- Tokens are the atomic units LLMs operate on -- every cost, speed, and context limit is measured in tokens.
- BPE builds a vocabulary by iteratively merging frequent byte pairs, creating subword units that balance coverage and efficiency.
- JSON and code are 1.4-2x more token-expensive than English prose -- design your prompts accordingly.
- Different models use different tokenizers, so token counts vary across providers for the same text.
- Number tokenization is fundamentally broken in LLMs -- never trust a model to do arithmetic.
Video on this topic
Why LLMs can't count: tokenization explained