BPE (Byte Pair Encoding) is an algorithm that builds an efficient token vocabulary through these steps:
1. Begin with individual characters or bytes as the initial tokens
2. Identify the most frequent adjacent token pair in your corpus
3. Create a new token by merging that pair, replacing every occurrence of it
4. Repeat steps 2-3 until reaching the desired vocabulary size (e.g., 50,000 tokens)
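The loop above can be sketched in a few lines of Python. This is a minimal character-level trainer, not a production implementation: it ignores byte-level fallback and word-boundary markers that real tokenizers add, and the function names (`train_bpe`, `merge_pair`) are illustrative, not from any particular library.

```python
from collections import Counter

def get_pair_counts(words):
    """Count adjacent token pairs across all words, weighted by word frequency."""
    counts = Counter()
    for word, freq in words.items():
        for pair in zip(word, word[1:]):
            counts[pair] += freq
    return counts

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged token."""
    merged = pair[0] + pair[1]
    new_words = {}
    for word, freq in words.items():
        tokens, i = [], 0
        while i < len(word):
            if i < len(word) - 1 and (word[i], word[i + 1]) == pair:
                tokens.append(merged)   # fuse the pair into one token
                i += 2
            else:
                tokens.append(word[i])
                i += 1
        new_words[tuple(tokens)] = freq
    return new_words

def train_bpe(corpus, num_merges):
    """Learn `num_merges` BPE merges from a whitespace-split corpus."""
    words = Counter(tuple(w) for w in corpus.split())
    merges = []
    for _ in range(num_merges):
        counts = get_pair_counts(words)
        if not counts:
            break
        best = max(counts, key=counts.get)  # most frequent pair wins
        merges.append(best)
        words = merge_pair(words, best)
    return merges
```

On the corpus used below, `train_bpe("the lower lower lowest", 2)` learns the merges `('l', 'o')` and then `('lo', 'w')`.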
Example Sequence (Simplified):
Training on: "the lower lower lowest"
1. Start with characters: t h e | l o w e r | l o w e r | l o w e s t
2. First merge (most common pair is 'l' + 'o'): t h e | lo w e r | lo w e r | lo w e s t
3. Second merge ('lo' + 'w' → 'low'): t h e | low e r | low e r | low e s t
4. Continue merging until reaching vocabulary limit...
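Once merges are learned, encoding a new word simply replays them in order. A minimal sketch, assuming the two merges from the walkthrough above ('l' + 'o', then 'lo' + 'w'); `apply_merges` is an illustrative name, not a library function:

```python
def apply_merges(word, merges):
    """Tokenize one word by replaying learned merges in their learned order."""
    tokens = list(word)
    for a, b in merges:
        i = 0
        while i < len(tokens) - 1:
            if tokens[i] == a and tokens[i + 1] == b:
                tokens[i:i + 2] = [a + b]  # collapse the pair in place
            else:
                i += 1
    return tokens

# Merges as in the walkthrough above: 'l' + 'o', then 'lo' + 'w'.
merges = [("l", "o"), ("lo", "w")]
print(apply_merges("lowest", merges))  # ['low', 'e', 's', 't']
```

Note that "lowest" was never seen as a whole word during training, yet it still tokenizes cleanly into 'low' plus leftover characters; this is how BPE handles unseen words without an out-of-vocabulary token.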
| Token Type | Examples | Observations |
|---|---|---|
| Common words | `the`, `of`, `and`, `to` | Single token for efficiency |
| Subwords | `sub`, `word`, `ization` | Common prefixes and suffixes |
| Rare/complex words | `un` + `common` + `ly` | Split into multiple tokens |
| Special tokens | `<s>`, `</s>`, `<pad>` | Control tokens for the model |
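The splits in the table can be illustrated with a simplified greedy longest-match lookup against a fixed vocabulary. Note this is a WordPiece-style approximation used here only for illustration, not the exact BPE merge-replay procedure, and the vocabulary below is a toy assumption:

```python
def greedy_tokenize(word, vocab):
    """Split a word by repeatedly taking the longest prefix found in vocab."""
    tokens = []
    i = 0
    while i < len(word):
        # Try the longest remaining substring first, fall back to one character.
        for j in range(len(word), i, -1):
            piece = word[i:j]
            if piece in vocab or j == i + 1:
                tokens.append(piece)
                i = j
                break
    return tokens

# Toy vocabulary containing the subwords from the table above.
vocab = {"the", "of", "and", "to", "sub", "word", "ization", "un", "common", "ly"}
print(greedy_tokenize("uncommonly", vocab))  # ['un', 'common', 'ly']
```

A common word like "the" stays a single token, while a rarer word like "uncommonly" falls apart into the subword pieces listed in the table.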