LLM Learning Portal

Tokenization: Byte-Pair Encoding

How Byte-Pair Encoding Works

BPE is an algorithm that creates an efficient token vocabulary through these steps:

  1. Start with basic units

    Begin with individual characters or bytes as the initial tokens

  2. Count pairs

    Count how often each pair of adjacent tokens appears in your corpus

  3. Merge most common pair

    Create a new token by combining the most frequent pair

  4. Repeat

    Continue counting and merging until the vocabulary reaches the desired size (e.g., 50,000 tokens)

BPE balances vocabulary size and sequence length, making it ideal for LLMs.
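
The merge loop can be sketched in a few lines of Python. The snippet below is a minimal illustration of the four steps above; the function name, the "_" word-boundary marker, and the data structures are choices made for this example, not any particular library's implementation.

from collections import Counter

def train_bpe(corpus, num_merges):
    # Step 1: start from characters, with "_" marking a word boundary.
    words = Counter(tuple("_" + word) for word in corpus.split())
    merges = []
    for _ in range(num_merges):
        # Step 2: count every adjacent pair of tokens across the corpus.
        pairs = Counter()
        for symbols, freq in words.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        # Step 3: merge the most frequent pair into a single new token.
        best = max(pairs, key=pairs.get)
        merges.append(best)
        merged = best[0] + best[1]
        new_words = Counter()
        for symbols, freq in words.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(merged)
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_words[tuple(out)] += freq
        words = new_words
        # Step 4: repeat until num_merges is reached (a stand-in for the vocabulary limit).
    return merges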

BPE in Action

Example Sequence (Simplified):

Training on: "the lower lower lowest"

1. Start with characters:

t h e _ l o w e r _ l o w e r _ l o w e s t

2. First merge (most common pair is 'l' + 'o'):

t h e _ lo w e r _ lo w e r _ lo w e s t

3. Second merge ('lo' + 'w'):

t h e _ low e r _ low e r _ low e s t

4. Continue merging until the vocabulary limit is reached...

the _ lower _ lower _ lowest
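
Running the train_bpe sketch from the previous section on this tiny corpus produces the same kind of merge sequence. Several pairs tie for the highest frequency here, so the exact merge order depends on how ties are broken; this sketch happens to pick '_' + 'l' before 'l' + 'o'.

merges = train_bpe("the lower lower lowest", num_merges=6)
print(merges)
# With this sketch's tie-breaking, the learned merges are:
# [('_', 'l'), ('_l', 'o'), ('_lo', 'w'), ('_low', 'e'), ('_lowe', 'r'), ('_', 't')]
# The frequent word "lower" quickly collapses into a single token, while the
# rarer "lowest" still needs several tokens at this vocabulary size.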

Common Token Examples

Token Type     | Examples                  | Observations
Common Words   | "the", "of", "and", "to"  | Single token for efficiency
Subwords       | "sub", "word", "ization"  | Common prefixes and suffixes
Rare/Complex   | "un" + "common" + "ly"    | Split into multiple tokens
Special Tokens | <s>, </s>, <pad>          | Control tokens for the model

Understanding tokenization helps explain why LLMs sometimes struggle with character-level tasks like spelling or character counting.
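
To see these patterns in a real vocabulary, you can inspect how an existing BPE tokenizer splits a few strings. The snippet below uses the tiktoken library and its "cl100k_base" encoding as one example (assuming tiktoken is installed); the exact splits depend on the vocabulary, so treat the output as illustrative.

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # one widely used BPE vocabulary

for text in ["the", "tokenization", "uncommonly", "strawberry"]:
    ids = enc.encode(text)
    pieces = [enc.decode([i]) for i in ids]
    print(text, "->", pieces)

# Common words tend to map to a single token, while rarer words are split into
# several subword pieces that do not line up with individual characters, which
# is one reason spelling and character counting are hard for LLMs.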