
Tokenization: Basics


What is Tokenization?

Tokenization is the process of converting raw text into a sequence of numerical tokens that a neural network can process.

The Tokenization Process:

Raw Text → Tokens → Token IDs
  • Tokens are the "atoms" of text the model operates on
  • Words are often split into multiple tokens
  • The tokenizer has a fixed vocabulary of possible tokens
  • GPT-4's vocabulary contains roughly 100,000 tokens
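
A minimal sketch of this pipeline, assuming OpenAI's tiktoken library is installed (it implements the cl100k_base encoding used by GPT-4; any BPE tokenizer exposes similar calls):

    # pip install tiktoken
    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")  # GPT-4's encoding
    print(enc.n_vocab)                          # vocabulary size, ~100,000

    ids = enc.encode("Raw text goes in.")       # Raw Text -> Token IDs
    print(ids)                                  # a short list of integers
    print(enc.decode(ids))                      # IDs decode back to the text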

Tokenization Example

Original Text:

Machine learning is fascinating!

Tokenized:

Machine | learning | is | fascinating | !

Token IDs:

[3782] [1243] [318] [8674] [0]
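
You can inspect a sentence like this yourself by decoding each ID back to its text piece. A sketch, again assuming tiktoken; the exact pieces and IDs depend on which tokenizer produced the numbers above, so your output may differ:

    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")
    for tid in enc.encode("Machine learning is fascinating!"):
        # decode_single_token_bytes returns the raw bytes of one token
        print(tid, enc.decode_single_token_bytes(tid))

Note that leading spaces (" learning", " is") belong to the tokens themselves: a common word usually maps to a single token that includes the space before it.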

More Complex Example:

Supercalifragilisticexpialidocious

Tokenized:

Super | cal | ifrag | ilistic | expial | idoc | ious
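
The same loop makes the subword splitting of a rare word visible (the exact pieces vary by tokenizer, so they may not match the split shown above):

    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")
    word = "Supercalifragilisticexpialidocious"
    pieces = [enc.decode_single_token_bytes(t).decode("utf-8")  # safe: pure ASCII
              for t in enc.encode(word)]
    print(pieces)  # several subword fragments, e.g. ['Super', 'cal', ...]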

Why Not Just Use Characters or Words?

Character-level
  Pros:
  • Small vocabulary (a few hundred symbols)
  • Can handle any text
  Cons:
  • Very long sequences
  • Less efficient for training
  • Lacks semantic units

Word-level
  Pros:
  • Semantically meaningful
  • Shorter sequences
  Cons:
  • Massive vocabulary (millions of words)
  • Can't handle new words
  • Memory inefficient

Subword (BPE)
  Pros:
  • Balanced vocabulary size
  • Handles common words efficiently
  • Can represent new words
  Cons:
  • More complex implementation
  • Requires careful training

Modern LLMs use subword tokenization approaches like Byte-Pair Encoding (BPE) to balance vocabulary size and sequence length.
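
To make the BPE idea concrete, here is a toy training sketch of the classic algorithm (an illustration, not any production tokenizer): start from characters, then repeatedly merge the most frequent adjacent pair of symbols into a new vocabulary entry.

    from collections import Counter

    def pair_counts(vocab):
        """Count adjacent symbol pairs, weighted by word frequency."""
        pairs = Counter()
        for word, freq in vocab.items():
            symbols = word.split()
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        return pairs

    def apply_merge(pair, vocab):
        """Rewrite every word, fusing each occurrence of `pair` into one symbol."""
        new_vocab = {}
        for word, freq in vocab.items():
            symbols, out, i = word.split(), [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_vocab[" ".join(out)] = freq
        return new_vocab

    # Toy corpus; each word starts as a sequence of single characters.
    corpus = ["low"] * 5 + ["lower"] * 2 + ["newest"] * 6 + ["widest"] * 3
    vocab = Counter(" ".join(w) for w in corpus)

    merges = []
    for _ in range(10):                   # merge budget = how much the vocab grows
        pairs = pair_counts(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        vocab = apply_merge(best, vocab)
        merges.append(best)

    print(merges)  # learned merges, e.g. ('e', 's'), ('es', 't'), ('l', 'o'), ...

Each merge both adds one vocabulary entry and shortens the tokenized corpus, which is exactly the vocabulary-size vs. sequence-length trade-off described above.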