
Tokenization: Basics


What is Tokenization?

Tokenization is the process of converting raw text into a sequence of numerical tokens that a neural network can process.

The Tokenization Process:

Raw Text → Tokens → Token IDs
  • Tokens are the "atoms" of text the model operates on
  • Words are often split into multiple tokens
  • The tokenizer has a fixed vocabulary of possible tokens
  • GPT-4's vocabulary contains roughly 100,000 tokens
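
A minimal sketch of this pipeline, assuming OpenAI's tiktoken library is installed (it implements the cl100k_base encoding used by GPT-4; any BPE tokenizer exposes similar calls):

    # pip install tiktoken
    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")  # GPT-4's encoding
    print(enc.n_vocab)                          # vocabulary size, ~100,000

    ids = enc.encode("Raw text goes in.")       # Raw Text -> Token IDs
    print(ids)                                  # a short list of integers
    print(enc.decode(ids))                      # IDs decode back to the text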

Tokenization Example

Original Text:

Machine learning is fascinating!

Tokenized:

Machine | learning | is | fascinating | !

Token IDs:

[3782] [1243] [318] [8674] [0]
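
You can inspect a sentence like this yourself by decoding each ID back to its text piece. A sketch, again assuming tiktoken; the exact pieces and IDs depend on which tokenizer produced the numbers above, so your output may differ:

    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")
    for tid in enc.encode("Machine learning is fascinating!"):
        # decode_single_token_bytes returns the raw bytes of one token
        print(tid, enc.decode_single_token_bytes(tid))

Note that leading spaces (" learning", " is") belong to the tokens themselves: a common word usually maps to a single token that includes the space before it.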

More Complex Example:

Supercalifragilisticexpialidocious

Tokenized:

Super | cal | ifrag | ilistic | expial | idoc | ious
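
The same loop makes the subword splitting of a rare word visible (the exact pieces vary by tokenizer, so they may not match the split shown above):

    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")
    word = "Supercalifragilisticexpialidocious"
    pieces = [enc.decode_single_token_bytes(t).decode("utf-8")  # safe: pure ASCII
              for t in enc.encode(word)]
    print(pieces)  # several subword fragments, e.g. ['Super', 'cal', ...]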

Why Not Just Use Characters or Words?

Character-level
  Pros:
  • Small vocabulary (a few hundred symbols)
  • Can handle any text
  Cons:
  • Very long sequences
  • Less efficient for training
  • Lacks semantic units

Word-level
  Pros:
  • Semantically meaningful
  • Shorter sequences
  Cons:
  • Massive vocabulary (millions of words)
  • Can't handle new words
  • Memory inefficient

Subword (BPE)
  Pros:
  • Balanced vocabulary size
  • Handles common words efficiently
  • Can represent new words
  Cons:
  • More complex implementation
  • Requires careful training

Modern LLMs use subword tokenization approaches like Byte-Pair Encoding (BPE) to balance vocabulary size and sequence length.
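
To make the BPE idea concrete, here is a toy training sketch of the classic algorithm (an illustration, not any production tokenizer): start from characters, then repeatedly merge the most frequent adjacent pair of symbols into a new vocabulary entry.

    from collections import Counter

    def pair_counts(vocab):
        """Count adjacent symbol pairs, weighted by word frequency."""
        pairs = Counter()
        for word, freq in vocab.items():
            symbols = word.split()
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        return pairs

    def apply_merge(pair, vocab):
        """Rewrite every word, fusing each occurrence of `pair` into one symbol."""
        new_vocab = {}
        for word, freq in vocab.items():
            symbols, out, i = word.split(), [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_vocab[" ".join(out)] = freq
        return new_vocab

    # Toy corpus; each word starts as a sequence of single characters.
    corpus = ["low"] * 5 + ["lower"] * 2 + ["newest"] * 6 + ["widest"] * 3
    vocab = Counter(" ".join(w) for w in corpus)

    merges = []
    for _ in range(10):                   # merge budget = how much the vocab grows
        pairs = pair_counts(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        vocab = apply_merge(best, vocab)
        merges.append(best)

    print(merges)  # learned merges, e.g. ('e', 's'), ('es', 't'), ('l', 'o'), ...

Each merge both adds one vocabulary entry and shortens the tokenized corpus, which is exactly the vocabulary-size vs. sequence-length trade-off described above.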