Pre-training is the process of teaching an LLM to predict the next token by exposing it to vast amounts of text data. The training loop proceeds roughly as follows (a minimal code sketch appears after the list):
1. Start with random weights across all layers.
2. Process chunks of text from the corpus in parallel.
3. Generate a next-token prediction for every position in each chunk.
4. Measure the prediction error across the vocabulary (the cross-entropy loss).
5. Compute gradients of that loss with respect to the model weights.
6. Update the weights with an optimization algorithm such as AdamW.
7. Repeat until the loss converges or the compute budget is exhausted.
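The sketch below expresses these steps in PyTorch. It is a minimal illustration, not a production script: the `model` (a decoder-only network mapping token IDs to per-position vocabulary logits) and the `batches` iterable of token-ID tensors are assumed to exist, and the hyperparameters are placeholders.

```python
import torch
import torch.nn.functional as F

def pretrain(model, batches, lr=3e-4, max_steps=10_000):
    # Assumed: model(tokens) returns logits of shape (batch, seq_len, vocab_size),
    # and batches yields LongTensors of token IDs with shape (batch, seq_len).
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)  # optimization algorithm
    model.train()
    for step, tokens in enumerate(batches):
        if step >= max_steps:                       # stop at the compute/resource limit
            break
        logits = model(tokens[:, :-1])              # prediction for each token position
        targets = tokens[:, 1:]                     # the "next token" at each position
        loss = F.cross_entropy(                     # prediction error over the vocabulary
            logits.reshape(-1, logits.size(-1)),
            targets.reshape(-1),
        )
        optimizer.zero_grad()
        loss.backward()                             # compute gradients
        optimizer.step()                            # adjust the weights
    return model
```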
LLMs learn in stages: simple patterns first, then grammar, facts, and eventually more complex reasoning.
Complex capabilities like reasoning and problem-solving emerge only after sufficient scale in data and model size.
Performance improves predictably as compute, data, and model size increase together, a regularity captured by empirical scaling laws.
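One widely cited form of such a law, from Hoffmann et al. (2022), predicts loss from the parameter count N and the number of training tokens D. The sketch below uses coefficients close to those reported in that paper; treat them as illustrative of the curve's shape rather than exact values.

```python
def predicted_loss(n_params: float, n_tokens: float) -> float:
    """L(N, D) = E + A / N**alpha + B / D**beta, with coefficients roughly
    matching the fit in Hoffmann et al. (2022) (an approximation, not a spec)."""
    E, A, B = 1.69, 406.4, 410.7
    alpha, beta = 0.34, 0.28
    return E + A / n_params**alpha + B / n_tokens**beta

# Scaling parameters and data together lowers the predicted loss smoothly.
print(predicted_loss(70e9, 1.4e12))    # a Chinchilla-scale run
print(predicted_loss(140e9, 2.8e12))   # 2x parameters and 2x tokens
```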
Runs of this kind demand resources on roughly the following scale:

| Resource | Scale |
|---|---|
| GPUs/TPUs | Thousands of accelerators |
| Duration | Weeks to months |
| Power | Megawatts of electricity |
| Cost | $1-100+ million USD |
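One way to see where these figures come from is the common rule of thumb that training a dense transformer takes roughly 6 x N x D floating-point operations for N parameters and D tokens. The back-of-envelope sketch below uses round numbers for per-accelerator throughput and utilization; both are assumptions, not measurements.

```python
def training_gpu_hours(n_params: float, n_tokens: float,
                       peak_flops_per_gpu: float = 1e15,  # assumed ~1 PFLOP/s peak per accelerator
                       utilization: float = 0.4):         # assumed fraction of peak actually sustained
    """Rough total GPU-hours using the ~6*N*D training-FLOPs rule of thumb."""
    total_flops = 6 * n_params * n_tokens
    seconds = total_flops / (peak_flops_per_gpu * utilization)
    return seconds / 3600

# Example: a 70B-parameter model trained on 1.4T tokens.
hours = training_gpu_hours(70e9, 1.4e12)
print(f"{hours:,.0f} GPU-hours, about {hours / (1000 * 24):.0f} days on 1,000 accelerators")
```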
Making such runs feasible relies on a handful of engineering techniques (three of them are sketched in code after this list):

- Mixed precision: use lower-precision formats (FP16/BF16), with careful handling of numerical stability.
- Gradient accumulation: compute gradients on smaller micro-batches and apply the weight update only after several of them.
- Parameter sharding: split model parameters, gradients, and optimizer states across devices.
- Learning-rate scheduling: carefully control the learning rate over the course of training for stability.
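The fragment below shows how mixed precision, gradient accumulation, and learning-rate scheduling fit into the earlier training loop, using PyTorch's automatic mixed precision utilities. Parameter sharding is omitted because it depends on the multi-device setup. As before, `model` and `batches` are assumed to exist and the hyperparameters are placeholders.

```python
import torch
import torch.nn.functional as F

def pretrain_at_scale(model, batches, total_updates=10_000, accum_steps=8, peak_lr=3e-4):
    optimizer = torch.optim.AdamW(model.parameters(), lr=peak_lr)
    # Learning-rate schedule: cosine decay over the whole run for stability.
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=total_updates)
    scaler = torch.cuda.amp.GradScaler()            # guards FP16 gradients against underflow
    model.train()
    optimizer.zero_grad()

    for step, tokens in enumerate(batches):
        if step >= total_updates * accum_steps:
            break
        # Mixed precision: run forward/backward math in FP16 where it is safe to do so.
        with torch.autocast(device_type="cuda", dtype=torch.float16):
            logits = model(tokens[:, :-1])
            loss = F.cross_entropy(
                logits.reshape(-1, logits.size(-1)),
                tokens[:, 1:].reshape(-1),
            ) / accum_steps                         # average over the accumulation window

        scaler.scale(loss).backward()               # gradients add up across micro-batches

        # Gradient accumulation: update only every `accum_steps` micro-batches,
        # simulating a larger effective batch size within limited memory.
        if (step + 1) % accum_steps == 0:
            scaler.step(optimizer)
            scaler.update()
            optimizer.zero_grad()
            scheduler.step()
    return model
```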
"Many capabilities of large language models emerge naturally as we scale up training, without specific training objectives or architectural changes."
- Based on observations from GPT-3 and subsequent models