After tokenization, several additional processing steps prepare data for the neural network:
- Token IDs: lists of integer token IDs form the primary input to the model.
- Positions: position information is added to each token so the model knows sequence order.
- Context limit: sequences are capped at a maximum length, typically 8K-128K tokens.
- Embeddings: each token is represented as a dense vector of roughly 1,024-4,096 dimensions (see the sketch after this list).
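To make these steps concrete, here is a minimal Python sketch of input preparation; the constants, token IDs, and the helper name prepare_input are illustrative assumptions, not values from any particular model:

```python
import numpy as np

MAX_SEQ_LEN = 8192   # example context limit; real models range from ~8K to 128K tokens
EMBED_DIM = 1024     # example embedding width; real models use ~1024-4096 dimensions

def prepare_input(token_ids):
    """Truncate to the context limit and attach a position index to each token."""
    token_ids = token_ids[:MAX_SEQ_LEN]
    positions = np.arange(len(token_ids))   # position information for each token
    return np.array(token_ids), positions

ids, pos = prepare_input([42361, 318, 257, 1332])  # made-up token IDs
print(ids)   # [42361   318   257  1332]
print(pos)   # [0 1 2 3]
```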
Token IDs are converted to dense vector representations (embeddings) before processing:
Token ID (e.g., 42361) → Embedding Vector [0.1, -0.3, 0.8, 0.5, ..., -0.2]
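Concretely, an embedding layer is just a learned matrix with one row per vocabulary entry, and the conversion is a row lookup. A minimal sketch, assuming a randomly initialized table in place of trained weights and a made-up vocabulary size:

```python
import numpy as np

VOCAB_SIZE = 50_000   # hypothetical vocabulary size
EMBED_DIM = 1024      # hypothetical embedding width

# The embedding table is a learned matrix: one row of EMBED_DIM values per token ID.
rng = np.random.default_rng(0)
embedding_table = rng.standard_normal((VOCAB_SIZE, EMBED_DIM), dtype=np.float32)

token_id = 42361
embedding_vector = embedding_table[token_id]   # the lookup is a simple row selection
print(embedding_vector[:5])                    # first few of the 1024 dimensions
```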
Key properties: similar concepts cluster together in embedding space, and relationships between pairs of concepts are preserved as consistent vector offsets (for example, the offset from "man" to "woman" roughly matches the offset from "king" to "queen").
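A toy illustration of those two properties, using hand-picked 3-dimensional vectors rather than real learned embeddings (which have thousands of dimensions and are learned from data):

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hand-picked toy vectors chosen only to illustrate clustering and offsets.
vec = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.8, 0.9]),
    "man":   np.array([0.1, 0.2, 0.1]),
    "woman": np.array([0.1, 0.2, 0.9]),
}

print(cosine(vec["king"], vec["queen"]))            # high similarity: related concepts
analogy = vec["king"] - vec["man"] + vec["woman"]   # offset arithmetic
print(cosine(analogy, vec["queen"]))                # close to "queen"
```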
Context window of tokens → model predicts → probability distribution across the vocabulary
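The model's final layer produces one score (a logit) per vocabulary entry, and a softmax turns those scores into a probability distribution. A sketch with random logits standing in for real model output and an assumed vocabulary size:

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max()   # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

VOCAB_SIZE = 50_000                      # hypothetical vocabulary size
rng = np.random.default_rng(0)
logits = rng.standard_normal(VOCAB_SIZE) # stand-in for the model's output scores

probs = softmax(logits)                  # one probability per vocabulary token
print(probs.sum())                       # 1.0
print(int(probs.argmax()))               # ID of the most likely next token
```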
During training, input windows are extracted from the dataset, and the model learns to predict what comes next.
This next-token prediction is the fundamental task of language modeling.
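A minimal sketch of how such training pairs can be formed from a token stream; the token IDs and the helper make_training_pairs are hypothetical, but the shift-by-one relationship between inputs and targets is the core idea:

```python
def make_training_pairs(token_stream, window_size):
    """Slide a window over the corpus: the input is a window of tokens,
    the target is the same window shifted one position to the right."""
    pairs = []
    for i in range(len(token_stream) - window_size):
        x = token_stream[i : i + window_size]          # context window
        y = token_stream[i + 1 : i + window_size + 1]  # next-token targets
        pairs.append((x, y))
    return pairs

corpus = [5, 17, 42, 8, 99, 3, 71]   # made-up token IDs
for x, y in make_training_pairs(corpus, window_size=4):
    print(x, "->", y)
```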