Tokens are not words
The most common misconception about LLM tokenization is that tokens equal words. They don't. A token is a chunk of text that the model's vocabulary recognizes as a single unit. It might be a whole word, part of a word, a punctuation mark, or even a space character.
Take the word tokenization. Under the cl100k_base encoding used by GPT-4 and ChatGPT, it splits into two tokens: token and ization. The model never sees the full string "tokenization" as one thing — it sees two separate integer IDs that represent those two chunks.
Paste "tokenization" into Token Compare and watch it split into exactly 2 colored segments. Then try "token" alone — just 1 token.
How Byte Pair Encoding (BPE) works
GPT-4 uses a tokenization algorithm called Byte Pair Encoding, originally developed for data compression. Here is the core idea in three steps:
- Start with individual bytes. Every byte of text (letters, spaces, punctuation) starts as its own token.
- Merge the most frequent pairs. The algorithm scans a massive training corpus and merges the two tokens that appear together most often into a single new token. "t" + "h" becomes "th", then "th" + "e" becomes "the", and so on.
- Repeat until the vocabulary is full. This process continues until the vocabulary reaches the target size. For cl100k_base, that's 100,277 tokens.
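The three steps above can be sketched in a few lines of Python. This is a toy illustration of the merge loop on a four-word corpus, not the production tokenizer (real BPE operates on bytes and a corpus of billions of words):

```python
from collections import Counter

def bpe_train(corpus, num_merges):
    """Toy BPE: repeatedly merge the most frequent adjacent pair.

    `corpus` is a list of words; each word starts as a tuple of
    single characters (step 1). Returns the learned merge rules.
    """
    words = Counter(tuple(word) for word in corpus)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent pair across the corpus (step 2).
        pairs = Counter()
        for word, freq in words.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Replace every occurrence of the winning pair with one merged token.
        new_words = Counter()
        for word, freq in words.items():
            merged, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    merged.append(word[i] + word[i + 1])
                    i += 2
                else:
                    merged.append(word[i])
                    i += 1
            new_words[tuple(merged)] += freq
        words = new_words
    return merges

# Repeat until the vocabulary is "full" (step 3) -- here, just 3 merges.
rules = bpe_train(["the", "the", "then", "that"], 3)
print(rules)  # first merge is ('t', 'h'), then ('th', 'e')
```

On this tiny corpus the algorithm discovers "th" first and then "the", mirroring the example in step 2 above.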
The result is a vocabulary where common English words and subwords are single tokens, while rare words get split into multiple tokens. This is why "cat" is 1 token but "concatenate" is 3 tokens (con, caten, ate).
The cl100k_base encoding
Not all LLMs use the same tokenizer. Different models have different vocabularies, different merge rules, and different ways of handling whitespace and special characters. Here are the tokenizers used by major models:
| Encoding | Models | Vocab size |
|---|---|---|
| cl100k_base | GPT-4, GPT-4 Turbo, GPT-3.5-turbo, ChatGPT | 100,277 |
| o200k_base | GPT-4o, GPT-4o mini | ~200,000 |
| p50k_base | text-davinci-003, Codex | 50,281 |
| r50k_base | GPT-3, older models | 50,257 |
Token Compare uses cl100k_base, which covers GPT-4 and ChatGPT — the most widely used models today. If you're building with GPT-4o using the newer o200k_base, token counts will be slightly different because the vocabulary is larger and merges are more aggressive.
Why some text costs more tokens than others
Several factors determine how many tokens a given piece of text uses:
Word frequency in training data
Common English words like "the", "is", "you", "have" are almost always single tokens because the BPE merging process absorbed them early. Rare words, technical jargon, and proper nouns often split into 2-5 tokens each.
Non-English languages
The cl100k_base vocabulary was trained primarily on English text. Non-Latin scripts (Chinese, Japanese, Arabic, Hindi) use many more tokens per character than English. A Chinese character that represents a complete word might require 2-3 tokens. Japanese text with kanji, hiragana, and katakana can be particularly expensive.
Numbers and special characters
Numbers are tokenized digit by digit in many cases. The string "2024" might be 1 token, but "202400" might be 2 tokens. URLs are notoriously token-heavy because they contain many uncommon character combinations. A typical URL like https://example.com/api/v2/users/123 can easily consume 15-20 tokens.
Whitespace and formatting
Whitespace matters in tokenization. Leading spaces before words are often merged with the word itself: " hello" (space + hello) is a different token than "hello". This is why reformatting code or changing indentation can subtly alter token counts.
| Text | Token count | Notes |
|---|---|---|
| Hello, world! | 4 | Hello / , / (space)world / ! |
| The cat sat on the mat. | 7 | All common words = 1 token each |
| Supercalifragilistic | 6 | Rare word splits heavily |
| def calculate_fibonacci(n): | 8 | Code has varied token density |
| 你好世界 | 6 | Chinese characters are token-expensive |
What tokens mean for API costs and context windows
LLM APIs charge per token. OpenAI's pricing applies to both input tokens (your prompt) and output tokens (the model's response). As of early 2026, GPT-4 Turbo costs roughly $10 per million input tokens and $30 per million output tokens. At those prices, a 1,000-token prompt sent 10,000 times per day costs around $100 per day just in input tokens.
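The arithmetic above is worth wiring into a helper when you're estimating budgets. A minimal sketch, using the example rate quoted above ($10 per million input tokens; check current pricing before relying on it):

```python
def daily_input_cost_usd(prompt_tokens, calls_per_day, price_per_million=10.0):
    """Daily input-token spend: tokens per call x calls x price per million."""
    return prompt_tokens * calls_per_day * price_per_million / 1_000_000

# A 1,000-token prompt sent 10,000 times per day:
print(daily_input_cost_usd(1_000, 10_000))  # 100.0 (dollars/day, input only)
```

Output tokens would be costed the same way at their own (higher) rate.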
Context windows are also measured in tokens. GPT-4 Turbo's 128,000-token context window sounds large, but if your system prompt is 2,000 tokens, your user message is 5,000 tokens, and you want a 3,000-token response, you're already at 10,000 tokens per conversation turn.
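Budgeting a turn like the one above is a simple subtraction, but it's easy to forget that the reserved output space counts against the window too. A small sketch (the figures are the examples from the text):

```python
def remaining_context(window, system_tokens, user_tokens, max_output):
    """Tokens left in the context window after one turn's fixed costs.

    The response budget (max_output) must be reserved up front, since
    the model's output shares the same window as the input.
    """
    return window - (system_tokens + user_tokens + max_output)

# GPT-4 Turbo's 128,000-token window with the turn described above:
print(remaining_context(128_000, 2_000, 5_000, 3_000))  # 118000
```

In a multi-turn chat, prior messages eat into that remainder on every turn.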
A system prompt that says "Please be helpful, accurate, thorough, concise, and professional" uses approximately 12 tokens. Rewriting it as "Be helpful and direct" uses 4 tokens. At scale, these small savings compound quickly.
Token efficiency as a metric
Token Compare shows a "token efficiency" percentage, which measures unique tokens as a proportion of total tokens. A low efficiency score means your text has lots of token repetition — often a sign that a prompt is more verbose than it needs to be. A high efficiency score means most tokens carry unique information.
This metric is most useful when comparing two versions of the same prompt: the version with higher token efficiency typically packs more information into fewer tokens.
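The metric described above is straightforward to compute once you have a list of token IDs from any tokenizer. A minimal sketch:

```python
def token_efficiency(token_ids):
    """Unique tokens as a fraction of total tokens (0.0 to 1.0)."""
    if not token_ids:
        return 0.0
    return len(set(token_ids)) / len(token_ids)

# A repetitive prompt scores lower than a varied one:
print(token_efficiency([1, 2, 3, 1, 2, 3, 1, 2]))  # 0.375
print(token_efficiency([1, 2, 3, 4, 5, 6, 7, 8]))  # 1.0
```

Note that some repetition is normal: articles, spaces, and punctuation repeat in any natural text, so compare scores between prompt versions rather than chasing 1.0.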
Key takeaways
- Tokens are subword units, not words — one word can be multiple tokens
- GPT-4 and ChatGPT use cl100k_base, which has a 100,277-token vocabulary
- English text is token-efficient; non-Latin scripts, numbers, and URLs are expensive
- Both API costs and context window limits are denominated in tokens
- Paste any text into Token Compare to see live token visualization and counts