LLM Foundations/How LLMs Work
Beginner14 min

Tokenization Deep Dive

How LLMs split text into tokens, why BPE is the dominant algorithm, and the engineering decisions that follow: cross-provider cost math, pre-flight token counting (including Anthropic's Token Counting API), content-type multipliers, and the production traps that blow budgets.

Quick Reference

  • Tokens are subword units — neither characters nor whole words, but something in between
  • BPE builds a vocabulary by iteratively merging the most frequent adjacent byte pairs
  • 1 token ~ 4 characters in English via o200k_base (200K-vocab), but varies by language and content
  • Token count determines API cost, context window usage, and latency
  • Different providers use different tokenizers — never assume counts transfer across models
  • Claude Opus 4.7 ships a new tokenizer that uses up to 35% more tokens on the same input
  • Prompt caching is the highest-leverage cost lever — cache hits cost 0.1× the input token price

What Are Tokens (and Why Not Words or Characters)?

LLMs don't see text. Before any processing happens, your input is split into tokens — subword units the model treats as atomic symbols. The word 'unhappiness' might become ['un', 'happiness'] or ['un', 'happi', 'ness'] depending on the tokenizer. Common words like 'the' or 'is' are single tokens; rare words get split into pieces. Every cost figure, context limit, and rate limit you'll ever hit is measured in tokens, not words or characters.

Why subwords instead of characters or whole words?

Character-level tokenization produces tiny vocabularies but very long sequences — the model processes one letter at a time, making long-range dependencies harder to learn. Word-level tokenization can't handle misspellings, new words, or agglutinative languages without an 'unknown' fallback. Subword tokenization is the practical middle ground: a vocabulary of 32K–200K tokens can represent any text by combining subword pieces, and common words remain single tokens so sequences stay short.

InputApprox tokens (o200k_base)Notes
Hello world2Common words → single tokens each
authentication1Common enough to be one token
supercalifragilistic5Rare compound → split into pieces
{"key": "value"}7Each brace, quote, colon is a token
你好世界4CJK: typically 1 token per character
These breakdowns are for GPT models only

The table above uses OpenAI's o200k_base encoding. Anthropic's tokenizer is not public — Claude may split the same words differently. For Claude, use the Token Counting API (see 'Counting Tokens Before You Send') to get exact counts. Never assume token counts transfer across providers.

BPE: How Tokenizers Learn

Byte Pair Encoding starts with individual bytes as the base vocabulary, then iteratively merges the most frequently co-occurring adjacent pair into a new token. This repeats until the vocabulary reaches the target size. The result: common subwords ('ing', 'tion', 'the') become single tokens; rare sequences get broken into smaller pieces that were merged earlier.

BPE Training: Iterative Pair MergingStep 0 — Startlow·low·newEvery character is a separate tokenStep 1 — Merge 'l' + 'o' → 'lo' (most frequent pair, 2×)low·low·new9 tokens → merged pair highlightedStep 2 — Merge 'lo' + 'w' → 'low' (appears 2×)low·low·new7 tokens → 'low' is now a single tokenStep 3 — Merge 'e' + 'w' → 'ew'low·low·new6 tokensStep 4 — Merge 'n' + 'ew' → 'new'low·low·new5 tokens — corpus fully compressed

BPE merges the most frequent adjacent pair each iteration — common subwords become single tokens

Simplified BPE training — building intuition for the merge process
Modern BPE variants

OpenAI's o200k_base operates on bytes (not characters), so it handles any Unicode text without unknown tokens. The vocabulary grew from cl100k_base (100K) to o200k_base (200K) with GPT-4o, improving multilingual compression. SentencePiece (used by Llama) adds a unigram model for probabilistic tokenization. Meta's BLT (Byte Latent Transformer, Dec 2024) eliminates tokenization entirely by operating on raw byte patches — it matches GPT-level performance at scale and removes entire categories of tokenization failure modes, though it remains research-stage.

Vocabulary Sizes and Tokenizer Versioning

Tokenizer vocabulary size has grown significantly as models scaled. GPT-2 used a 50K-token vocabulary (p50k_base). GPT-3.5/4 moved to 100K (cl100k_base). GPT-4o and the entire GPT-5 family use 200K (o200k_base). Larger vocabularies compress more common subwords into single tokens, improving efficiency for common languages. The tradeoff: larger vocabularies mean larger embedding matrices and more parameters.

TokenizerVocab sizeModelsNotes
p50k_base50,237GPT-2, text-davinci-003Legacy
cl100k_base100,277GPT-3.5-turbo, GPT-4Still used for some fine-tuning
o200k_base200,019GPT-4o, GPT-5 family, o1, o3Current default for new OpenAI models
o200k_harmony200,019GPT-5 variantsExtended BPE for improved multilingual coverage
Claude Opus 4.7 ships a new tokenizer — budget for it

Anthropic's Claude Opus 4.7 uses a new tokenizer that may use up to 35% more tokens on the same input compared to earlier Claude models. If you built cost projections against Claude Opus 3 or 4.x and are upgrading to Opus 4.7, revalidate your token budgets with the Token Counting API before launching. A 35% increase at $5/MTok on 1B tokens/month is $175K/year.

tiktoken and model name resolution

tiktoken's encoding_for_model() resolves model names to encodings. For current models use tiktoken.get_encoding('o200k_base') directly — it's more reliable than model name lookup, which has open issues for newer model variants. GPT-4o, GPT-5, and o-series models all use o200k_base.

Token Economics: The Math That Hits Your Bill

Every token you send or receive costs money. The providers charge per million tokens (MTok). Here are current prices (verified April 2026) for the models in this course — use these as starting points, not gospel:

ModelInput ($/MTok)Output ($/MTok)Cache hit ($/MTok)
Claude Opus 4.7$5.00$25.00$0.50
Claude Sonnet 4.6$3.00$15.00$0.30
Claude Haiku 4.5$1.00$5.00$0.10
GPT-5.4 (short ctx)$2.50$15.00~$0.50 (est.)
GPT-5 (base)$0.625$5.00N/A
Output tokens cost 5–10× more than input tokens

Output tokens are generated one at a time (auto-regressive decoding) and cost disproportionately more. A 500-token response to a 2,000-token prompt costs roughly the same as another 2,500-input-tokens at Claude Opus 4.7 pricing. Control output length — constrain responses, use JSON with explicit field lists, and avoid asking the model to 'explain step by step' when a direct answer will do.

Prompt caching cuts the effective cost of repeated context by 90%. At Claude Opus 4.7 rates, 1M cached input tokens cost $0.50 instead of $5.00. For a chatbot with a 2K-token system prompt receiving 100K requests/day, caching that prompt saves roughly $300/day at Opus 4.7 pricing. Token economics and caching are inseparable — any serious cost analysis should model both.

Real project

A team built a document analysis pipeline that processed 10K documents/day at an average of 8K tokens each (80M tokens/day). They were using Claude Opus 4.7 for all documents at $5/MTok input — $400/day just in input costs. After two changes — routing simple classification to Haiku 4.5 ($1/MTok) and caching a shared 3K-token system prompt across all requests — their daily input cost dropped to under $80/day without any change in output quality for the complex tasks.

Learn this in → Model routing + prompt caching together are typically worth 5–10× more than any prompt compression trick.

Counting Tokens Before You Send

Pre-flight token counting lets you enforce context window budgets, route requests to cheaper models, and catch prompt bloat before it hits your bill. Both major providers support this — the approaches differ.

OpenAI — tiktoken (local, free, fast)
Anthropic Token Counting API — exact, free, all models

Anthropic publishes a dedicated /v1/messages/count_tokens endpoint. It accepts the exact same payload as /v1/messages (system prompt, messages, tools, images, PDFs) and returns the token count before any billing occurs. It is free to use and rate-limited separately from message creation. Use this instead of character-count heuristics — the 3.5 chars/token rule of thumb was never accurate and is now especially wrong for Opus 4.7's new tokenizer.

Anthropic — Token Counting API (pre-flight, free)
Anthropic Token Counting API — TypeScript
Token count is an estimate, not a guarantee

Anthropic's token counting response is documented as an estimate — the actual token count used in a message may differ slightly due to server-side system optimizations. You are not billed for system-added tokens. Build in 5% headroom when using token counts for hard budget gates.

What's Expensive: Content Type Token Multipliers

Token density varies by content type. The ratios below are measured via tiktoken o200k_base on representative 100-character samples. They provide directional guidance — your actual ratios will differ based on specific content. Claude Opus 4.7's new tokenizer may add ~35% on top of these counts.

Token Density per 100 Charactersmeasured via tiktoken o200k_base — relative multiplier vs English proseEnglish prose25 tokbaselinePython code34 tok1.4×JSON (formatted)42 tok1.7×XML / HTML48 tok1.9×Chinese / Japanese55 tok2.2×Minified JSON58 tok2.3×Base64 data68 tok2.7×Claude Opus 4.7's new tokenizer may add ~35% on top of these counts — always measure with the Token Counting API

Content type is a bigger cost driver than most engineers expect — prose first, JSON last

JSON is expensive — rethink structured inputs

JSON's token cost comes from its scaffolding: every brace, quote, colon, and comma is a separate token. A 1,000-token JSON payload carries roughly 300–400 tokens of pure punctuation overhead. When feeding structured data into a prompt, consider: (1) natural language summaries instead of raw JSON, (2) CSV or pipe-delimited formats for tabular data, (3) stripping null and default fields before sending.

Multilingual fairness: non-English users pay more

A Chinese speaker asking the same question as an English speaker will typically use 2–3× more tokens. At identical per-token pricing, this is a real cost and latency disadvantage. When building multilingual products, factor this into cost models, context window sizing, and user-facing rate limits — the 'same' limit is not the same in practice across languages.

Tokenization Failure Modes

  • Token boundaries break words unpredictably: 'ChatGPT' might be ['Chat', 'G', 'PT'] — the model doesn't see it as one word, which affects tasks like spell checking and named entity recognition
  • Trailing whitespace matters: ' Hello' and 'Hello' tokenize differently. Inconsistent whitespace in prompt templates causes silent tokenization drift
  • Numbers split into digit groups: '123456' → ['123', '456'] in many tokenizers. The specific split is tokenizer-dependent and can change across model versions
  • Special tokens (BOS, EOS, tool-use markers) consume context window space invisibly — Claude adds ~346 tokens of system overhead when tools are provided
  • Emoji and special Unicode: a single emoji may be 2–4 tokens; combining sequences (skin tone modifiers) can be even more
  • Tokenizer versioning is a production concern: upgrading a model may change token counts for the same prompt, breaking hardcoded length checks
The number tokenization trap

LLMs tokenize numbers as arbitrary byte sequences, not as numeric values. '12345' might become ['123', '45'] in one model and ['1', '2345'] in another. The model processes these as opaque text fragments, which is why LLMs fail at multi-digit arithmetic without tool use. For any computation involving numbers — arithmetic, sorting, comparison — always delegate to code execution. Prompt engineering cannot fix a tokenization-level limitation.

Real project

A production chatbot failed at date arithmetic because it asked the LLM to calculate 'how many days between 2024-03-15 and 2024-11-30'. The model got the right answer 80% of the time but failed on edge cases involving month-end boundaries. The fix was a single tool call to a date library — the LLM describes what to compute, the tool computes it. Zero failures after.

Learn this in → If a task requires arithmetic or string comparison, give the model a calculator — don't optimize the prompt.

Token-Aware Design: Prompt Caching and Format Choices

The two highest-leverage token cost optimizations are prompt caching and content format. Prompt caching is structural — it requires no prompt changes. Format optimization is about what you put in the prompt.

Prompt caching — structure your prompts for maximum cache hits

Both Anthropic and OpenAI support prompt caching. Cache hits cost ~0.1× the standard input price. To maximize hits: (1) put stable content first — system prompt, instructions, reference documents; (2) put variable content last — user messages, dynamic context; (3) use explicit cache_control markers with Anthropic's API or the automatic caching parameter. A well-structured prompt for a RAG pipeline can achieve 80–90% cache hit rates on the system prompt and document context, cutting effective input costs by 8–9×.

Anthropic prompt caching — mark stable content for cache
  • Prefer natural language over JSON for prompt inputs when possible — saves 1.4–2× on token count
  • Strip null fields, default values, and schema metadata from JSON payloads before sending
  • For tabular data: CSV or pipe-delimited formats use 30–40% fewer tokens than JSON
  • Chunk large documents instead of sending them whole — each chunk can be cached independently
  • When streaming is the goal, output token count is fixed by the task — compress inputs instead
  • Remove redundant whitespace and repeated line breaks from data payloads (not from prose prompts where formatting aids comprehension)

Beyond BPE: Byte-Level Models

BPE has served well since 2016, but tokenization introduces failure modes that can't be patched away: number arithmetic, character manipulation, cross-language cost inequity, and sensitivity to whitespace. Meta's Byte Latent Transformer (BLT, December 2024) is the most credible alternative to date. BLT processes raw byte sequences without a tokenizer, dynamically grouping bytes into patches based on data complexity. At equivalent compute, BLT matches Llama 3 on standard benchmarks while using up to 50% fewer inference FLOPs on low-entropy text. Weights are available for 1B and 7B models. Production-scale deployment is still forthcoming, but the research makes clear that tokenization is an architectural choice, not a permanent constraint.

Practical implication for engineers

Tokenization-free models won't change the economics you need to manage today — you still need to count tokens and manage context budgets for every current production system. But be aware that the 'characters to token' mental model will become obsolete as byte-level architectures reach production scale. Design your token-counting and cost infrastructure around APIs (like Anthropic's count_tokens endpoint) rather than static heuristics — those will survive the transition.

Best Practices

Best Practices

Do

  • Use the Anthropic Token Counting API (/v1/messages/count_tokens) for pre-flight budget enforcement — it's free and exact
  • Use tiktoken.get_encoding('o200k_base') for OpenAI GPT-4o/5 models — more reliable than encoding_for_model() for newer variants
  • Structure prompts with stable content first (system prompt, documents) and variable content last to maximize prompt cache hits
  • Revalidate token budgets when upgrading Claude models — Opus 4.7's new tokenizer uses up to 35% more tokens than prior versions
  • Route simple tasks to cheaper models (Haiku, GPT-5 base) and reserve expensive models for complex reasoning
  • Leave 10–20% context window headroom — tool-use system prompts, message formatting, and special tokens add invisible overhead
  • Measure actual token ratios for your specific content — the 4 chars/token rule is a starting point, not a budget figure
  • Use explicit cache_control markers in Anthropic API calls for any system prompt or document exceeding 1K tokens

Don’t

  • Don't assume token counts transfer across providers — Claude and GPT tokenize the same text differently
  • Don't use character count heuristics for Claude token estimation — the 3.5 chars/token rule is inaccurate and worsens with Opus 4.7
  • Don't send raw JSON blobs when a natural language summary or CSV equivalent would work — JSON carries 1.7× the token overhead
  • Don't trust LLMs for arithmetic, sorting, or string comparison — tokenization makes these fundamentally unreliable without tool use
  • Don't hardcode token limits — they change as models update and tokenizer vocabularies evolve
  • Don't ignore multilingual token costs — CJK text uses 2–3× more tokens than English for the same semantic content
  • Don't put variable content before stable content in prompts — it defeats prompt caching and multiplies input costs
  • Don't skip pre-flight token counting for user-provided inputs — a malicious or accidental prompt injection can exhaust your context budget

Key Takeaways

  • Tokens are the atomic unit LLMs operate on — every cost, latency, and context limit is measured in tokens, not words.
  • BPE builds vocabulary by merging frequent byte pairs; o200k_base (200K vocab) is the current standard for OpenAI models.
  • Claude Opus 4.7 ships a new tokenizer using up to 35% more tokens — revalidate cost budgets before upgrading.
  • JSON and code are 1.4–2.7× more token-expensive than English prose — content format is a significant cost lever.
  • Anthropic's Token Counting API is free and exact — use it instead of character-count heuristics for pre-flight budget enforcement.
  • Prompt caching (cache hits at 0.1× input price) is the highest-leverage cost optimization available — structure prompts to maximize it.

Video on this topic

Why LLMs can't count: tokenization explained

tiktok