Tokenization is the process of splitting text into discrete units called tokens and mapping each token to an integer ID that a neural network can process.
Original Text: "Machine learning is fascinating!"

Tokenized: a subword tokenizer splits the sentence into pieces; common words like these typically come through as single tokens.

Token IDs: each piece is then mapped to an integer index in the tokenizer's vocabulary. The exact pieces and IDs depend on which tokenizer you use, and the sketch below makes both easy to inspect.
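Here is a minimal sketch using Hugging Face's `transformers` library with the GPT-2 tokenizer (an assumption; the original does not name the tokenizer behind its output):

```python
# pip install transformers
from transformers import AutoTokenizer

# GPT-2's byte-pair-encoding (BPE) tokenizer; any pretrained tokenizer works here.
tokenizer = AutoTokenizer.from_pretrained("gpt2")

text = "Machine learning is fascinating!"

# Split the text into subword tokens (strings).
tokens = tokenizer.tokenize(text)
print(tokens)  # 'Ġ' in the output marks a token that begins with a space

# Map each token string to its integer ID in the vocabulary.
ids = tokenizer.convert_tokens_to_ids(tokens)
print(ids)

# Round-trip: the IDs decode back to the original text.
print(tokenizer.decode(ids))
```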
More Complex Example: "Supercalifragilisticexpialidocious"

Tokenized: a word this rare is almost never a single vocabulary entry, so a subword tokenizer breaks it into several smaller fragments.
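Continuing under the same assumption (GPT-2's tokenizer), the rare word visibly fragments:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

# The word is not in the vocabulary as a whole, so BPE falls back
# to smaller learned merges.
pieces = tokenizer.tokenize("Supercalifragilisticexpialidocious")
print(len(pieces), pieces)  # several subword fragments, not one token
```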
Approach | Pros | Cons |
---|---|---|
Character-level | Tiny vocabulary; no out-of-vocabulary words | Very long sequences; each token carries little meaning |
Word-level | Short sequences; tokens align with meaningful units | Huge vocabulary; unseen or misspelled words become unknown tokens |
Subword (BPE) | Moderate vocabulary; rare words split into known pieces | Splits can be unintuitive; merges depend on the training corpus |
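To see these trade-offs concretely, here is a hedged comparison of sequence lengths for the three approaches on the same sentence (character- and word-level splitting are done by hand; the subword row again assumes GPT-2's BPE):

```python
from transformers import AutoTokenizer

text = "Machine learning is fascinating!"

# Character-level: tiny vocabulary, but one token per character.
char_tokens = list(text)

# Word-level: short sequences, but any unseen word would be out-of-vocabulary.
word_tokens = text.split()

# Subword (BPE): a middle ground between the two extremes.
subword_tokens = AutoTokenizer.from_pretrained("gpt2").tokenize(text)

for name, toks in [("character", char_tokens),
                   ("word", word_tokens),
                   ("subword", subword_tokens)]:
    print(f"{name:>9}: {len(toks):3d} tokens")
```

The character-level row will always be the longest and the word-level row the shortest; subword tokenization sits in between, which is why it is the default choice for modern language models.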