Tokens are not words
The most common misconception about LLM tokenization is that tokens equal words. They don't. A token is a chunk of text that the model's vocabulary recognizes as a single unit. It might be a whole word, part of a word, a punctuation mark, or even a space character.
Take the word tokenization. Under the cl100k_base encoding used by GPT-4 and ChatGPT, it splits into two tokens: token and ization. The model never sees the full string "tokenization" as one thing — it sees two separate integer IDs that represent those two chunks.
Paste "tokenization" into Token Compare and watch it split into exactly 2 colored segments. Then try "token" alone — just 1 token.
How Byte Pair Encoding (BPE) works
GPT-4 uses a tokenization algorithm called Byte Pair Encoding, originally developed for data compression. Here is the core idea in three steps:
- Start with individual bytes. Every byte of text (letters, spaces, punctuation) starts as its own token.
- Merge the most frequent pairs. The algorithm scans a massive training corpus and merges the two tokens that appear together most often into a single new token. "t" + "h" becomes "th", then "th" + "e" becomes "the", and so on.
- Repeat until the vocabulary is full. This process continues until the vocabulary reaches the target size. For cl100k_base, that's 100,277 tokens.
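The three steps above can be sketched in a few lines of Python. This is a toy illustration of the merge loop on a four-word corpus, not the production tokenizer (real BPE operates on bytes and a corpus of billions of words):

```python
from collections import Counter

def bpe_train(corpus, num_merges):
    """Toy BPE: repeatedly merge the most frequent adjacent pair.

    `corpus` is a list of words; each word starts as a tuple of
    single characters (step 1). Returns the learned merge rules.
    """
    words = Counter(tuple(word) for word in corpus)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent pair across the corpus (step 2).
        pairs = Counter()
        for word, freq in words.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Replace every occurrence of the winning pair with one merged token.
        new_words = Counter()
        for word, freq in words.items():
            merged, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    merged.append(word[i] + word[i + 1])
                    i += 2
                else:
                    merged.append(word[i])
                    i += 1
            new_words[tuple(merged)] += freq
        words = new_words
    return merges

# Repeat until the vocabulary is "full" (step 3) -- here, just 3 merges.
rules = bpe_train(["the", "the", "then", "that"], 3)
print(rules)  # first merge is ('t', 'h'), then ('th', 'e')
```

On this tiny corpus the algorithm discovers "th" first and then "the", mirroring the example in step 2 above.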
The result is a vocabulary where common English words and subwords are single tokens, while rare words get split into multiple tokens. This is why "cat" is 1 token but "concatenate" is 3 tokens (con, caten, ate).
The cl100k_base encoding
Not all LLMs use the same tokenizer. Different models have different vocabularies, different merge rules, and different ways of handling whitespace and special characters. Here are the tokenizers used by major models:
| Encoding | Models | Vocab size |
|---|---|---|
| cl100k_base | GPT-4, GPT-4 Turbo, GPT-3.5-turbo, ChatGPT | 100,277 |
| o200k_base | GPT-4o, GPT-4o mini | ~200,000 |
| p50k_base | text-davinci-003, Codex | 50,281 |
| r50k_base | GPT-3, older models | 50,257 |
Token Compare uses cl100k_base, which covers GPT-4 and ChatGPT — the most widely used models today. If you're building with GPT-4o using the newer o200k_base, token counts will be slightly different because the vocabulary is larger and merges are more aggressive.
Why some text costs more tokens than others
Several factors determine how many tokens a given piece of text uses:
Word frequency in training data
Common English words like "the", "is", "you", "have" are almost always single tokens because the BPE merging process absorbed them early. Rare words, technical jargon, and proper nouns often split into 2-5 tokens each.
Non-English languages
The cl100k_base vocabulary was trained primarily on English text. Non-Latin scripts (Chinese, Japanese, Arabic, Hindi) use many more tokens per character than English. A Chinese character that represents a complete word might require 2-3 tokens. Japanese text with kanji, hiragana, and katakana can be particularly expensive.
Numbers and special characters
Numbers are tokenized digit by digit in many cases. The string "2024" might be 1 token, but "202400" might be 2 tokens. URLs are notoriously token-heavy because they contain many uncommon character combinations. A typical URL like https://example.com/api/v2/users/123 can easily consume 15-20 tokens.
Whitespace and formatting
Whitespace matters in tokenization. Leading spaces before words are often merged with the word itself: " hello" (space + hello) is a different token than "hello". This is why reformatting code or changing indentation can subtly alter token counts.
| Text | Token count | Notes |
|---|---|---|
| Hello, world! | 4 | Hello / , / (space)world / ! |
| The cat sat on the mat. | 7 | All common words = 1 token each |
| Supercalifragilistic | 6 | Rare word splits heavily |
| def calculate_fibonacci(n): | 8 | Code has varied token density |
| 你好世界 | 6 | Chinese characters are token-expensive |
What tokens mean for API costs and context windows
LLM APIs charge per token. OpenAI's pricing applies to both input tokens (your prompt) and output tokens (the model's response). As of early 2026, GPT-4 Turbo costs roughly $10 per million input tokens and $30 per million output tokens. At those prices, a 1,000-token prompt sent 10,000 times per day costs around $100 per day just in input tokens.
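The arithmetic above is worth wiring into a helper when you're estimating budgets. A minimal sketch, using the example rate quoted above ($10 per million input tokens; check current pricing before relying on it):

```python
def daily_input_cost_usd(prompt_tokens, calls_per_day, price_per_million=10.0):
    """Daily input-token spend: tokens per call x calls x price per million."""
    return prompt_tokens * calls_per_day * price_per_million / 1_000_000

# A 1,000-token prompt sent 10,000 times per day:
print(daily_input_cost_usd(1_000, 10_000))  # 100.0 (dollars/day, input only)
```

Output tokens would be costed the same way at their own (higher) rate.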
Context windows are also measured in tokens. GPT-4 Turbo's 128,000-token context window sounds large, but if your system prompt is 2,000 tokens, your user message is 5,000 tokens, and you want a 3,000-token response, you're already at 10,000 tokens per conversation turn.
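Budgeting a turn like the one above is a simple subtraction, but it's easy to forget that the reserved output space counts against the window too. A small sketch (the figures are the examples from the text):

```python
def remaining_context(window, system_tokens, user_tokens, max_output):
    """Tokens left in the context window after one turn's fixed costs.

    The response budget (max_output) must be reserved up front, since
    the model's output shares the same window as the input.
    """
    return window - (system_tokens + user_tokens + max_output)

# GPT-4 Turbo's 128,000-token window with the turn described above:
print(remaining_context(128_000, 2_000, 5_000, 3_000))  # 118000
```

In a multi-turn chat, prior messages eat into that remainder on every turn.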
A system prompt that says "Please be helpful, accurate, thorough, concise, and professional" uses approximately 12 tokens. Rewriting it as "Be helpful and direct" uses 4 tokens. At scale, these small savings compound quickly.
Token efficiency as a metric
Token Compare shows a "token efficiency" percentage, which measures unique tokens as a proportion of total tokens. A low efficiency score means your text has lots of token repetition — often a sign that a prompt is more verbose than it needs to be. A high efficiency score means most tokens carry unique information.
This metric is most useful when comparing two versions of the same prompt: the version with higher token efficiency typically packs more information into fewer tokens.
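The metric described above is straightforward to compute once you have a list of token IDs from any tokenizer. A minimal sketch:

```python
def token_efficiency(token_ids):
    """Unique tokens as a fraction of total tokens (0.0 to 1.0)."""
    if not token_ids:
        return 0.0
    return len(set(token_ids)) / len(token_ids)

# A repetitive prompt scores lower than a varied one:
print(token_efficiency([1, 2, 3, 1, 2, 3, 1, 2]))  # 0.375
print(token_efficiency([1, 2, 3, 4, 5, 6, 7, 8]))  # 1.0
```

Note that some repetition is normal: articles, spaces, and punctuation repeat in any natural text, so compare scores between prompt versions rather than chasing 1.0.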
Key takeaways
- Tokens are subword units, not words — one word can be multiple tokens
- GPT-4 and ChatGPT use cl100k_base, which has a 100,277-token vocabulary
- English text is token-efficient; non-Latin scripts, numbers, and URLs are expensive
- Both API costs and context window limits are denominated in tokens
- Paste any text into Token Compare to see live token visualization and counts