LLM Learning Portal

Tokenization: Byte-Pair Encoding

How Byte-Pair Encoding Works

BPE is an algorithm that creates an efficient token vocabulary through these steps:

  1. Start with basic units

    Begin with individual characters or bytes as the initial tokens

  2. Count pairs

    Count how often each pair of adjacent tokens appears in your corpus

  3. Merge most common pair

    Create a new token by combining the most frequent pair

  4. Repeat

    Continue counting and merging until the vocabulary reaches the desired size (e.g., 50,000 tokens)

BPE balances vocabulary size and sequence length, making it ideal for LLMs.
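
The merge loop can be sketched in a few lines of Python. The snippet below is a minimal illustration of the four steps above; the function name, the "_" word-boundary marker, and the data structures are choices made for this example, not any particular library's implementation.

from collections import Counter

def train_bpe(corpus, num_merges):
    # Step 1: start from characters, with "_" marking a word boundary.
    words = Counter(tuple("_" + word) for word in corpus.split())
    merges = []
    for _ in range(num_merges):
        # Step 2: count every adjacent pair of tokens across the corpus.
        pairs = Counter()
        for symbols, freq in words.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        # Step 3: merge the most frequent pair into a single new token.
        best = max(pairs, key=pairs.get)
        merges.append(best)
        merged = best[0] + best[1]
        new_words = Counter()
        for symbols, freq in words.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(merged)
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_words[tuple(out)] += freq
        words = new_words
        # Step 4: repeat until num_merges is reached (a stand-in for the vocabulary limit).
    return merges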

BPE in Action

Example Sequence (Simplified):

Training on: "the lower lower lowest"

1. Start with characters:

t h e _ l o w e r _ l o w e r _ l o w e s t

2. First merge (most common pair is 'l' + 'o'):

t h e _ lo w e r _ lo w e r _ lo w e s t

3. Second merge ('lo' + 'w'):

t h e _ low e r _ low e r _ low e s t

4. Continue merging until the vocabulary limit is reached...

the _ lower _ lower _ lowest
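
Running the train_bpe sketch from the previous section on this tiny corpus produces the same kind of merge sequence. Several pairs tie for the highest frequency here, so the exact merge order depends on how ties are broken; this sketch happens to pick '_' + 'l' before 'l' + 'o'.

merges = train_bpe("the lower lower lowest", num_merges=6)
print(merges)
# With this sketch's tie-breaking, the learned merges are:
# [('_', 'l'), ('_l', 'o'), ('_lo', 'w'), ('_low', 'e'), ('_lowe', 'r'), ('_', 't')]
# The frequent word "lower" quickly collapses into a single token, while the
# rarer "lowest" still needs several tokens at this vocabulary size.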

Common Token Examples

Token Type     | Examples                  | Observations
Common Words   | "the", "of", "and", "to"  | Single token for efficiency
Subwords       | "sub", "word", "ization"  | Common prefixes and suffixes
Rare/Complex   | "un" + "common" + "ly"    | Split into multiple tokens
Special Tokens | <s>, </s>, <pad>          | Control tokens for the model

Understanding tokenization helps explain why LLMs sometimes struggle with character-level tasks like spelling or character counting.
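
To see these patterns in a real vocabulary, you can inspect how an existing BPE tokenizer splits a few strings. The snippet below uses the tiktoken library and its "cl100k_base" encoding as one example (assuming tiktoken is installed); the exact splits depend on the vocabulary, so treat the output as illustrative.

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # one widely used BPE vocabulary

for text in ["the", "tokenization", "uncommonly", "strawberry"]:
    ids = enc.encode(text)
    pieces = [enc.decode([i]) for i in ids]
    print(text, "->", pieces)

# Common words tend to map to a single token, while rarer words are split into
# several subword pieces that do not line up with individual characters, which
# is one reason spelling and character counting are hard for LLMs.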