Pre-training is the process of teaching an LLM to predict the next token by exposing it to vast amounts of text data. The training loop proceeds roughly as follows (a minimal code sketch appears after the list):
1. Start with random weights across all layers.
2. Process chunks of text from the corpus in parallel.
3. Generate a next-token prediction for every position in each chunk.
4. Measure the prediction error across the vocabulary (the cross-entropy loss).
5. Compute gradients of that loss with respect to the model weights.
6. Update the weights with an optimization algorithm such as AdamW.
7. Repeat until the loss converges or the compute budget is exhausted.
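The sketch below expresses these steps in PyTorch. It is a minimal illustration, not a production script: the `model` (a decoder-only network mapping token IDs to per-position vocabulary logits) and the `batches` iterable of token-ID tensors are assumed to exist, and the hyperparameters are placeholders.

```python
import torch
import torch.nn.functional as F

def pretrain(model, batches, lr=3e-4, max_steps=10_000):
    # Assumed: model(tokens) returns logits of shape (batch, seq_len, vocab_size),
    # and batches yields LongTensors of token IDs with shape (batch, seq_len).
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)  # optimization algorithm
    model.train()
    for step, tokens in enumerate(batches):
        if step >= max_steps:                       # stop at the compute/resource limit
            break
        logits = model(tokens[:, :-1])              # prediction for each token position
        targets = tokens[:, 1:]                     # the "next token" at each position
        loss = F.cross_entropy(                     # prediction error over the vocabulary
            logits.reshape(-1, logits.size(-1)),
            targets.reshape(-1),
        )
        optimizer.zero_grad()
        loss.backward()                             # compute gradients
        optimizer.step()                            # adjust the weights
    return model
```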
LLMs learn in stages: simple patterns first, then grammar, facts, and eventually more complex reasoning.
Complex capabilities like reasoning and problem-solving emerge only after sufficient scale in data and model size.
Performance improves predictably as compute, data, and model size increase together, a regularity captured by empirical scaling laws.
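One widely cited form of such a law, from Hoffmann et al. (2022), predicts loss from the parameter count N and the number of training tokens D. The sketch below uses coefficients close to those reported in that paper; treat them as illustrative of the curve's shape rather than exact values.

```python
def predicted_loss(n_params: float, n_tokens: float) -> float:
    """L(N, D) = E + A / N**alpha + B / D**beta, with coefficients roughly
    matching the fit in Hoffmann et al. (2022) (an approximation, not a spec)."""
    E, A, B = 1.69, 406.4, 410.7
    alpha, beta = 0.34, 0.28
    return E + A / n_params**alpha + B / n_tokens**beta

# Scaling parameters and data together lowers the predicted loss smoothly.
print(predicted_loss(70e9, 1.4e12))    # a Chinchilla-scale run
print(predicted_loss(140e9, 2.8e12))   # 2x parameters and 2x tokens
```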
Runs of this kind demand resources on roughly the following scale:

| Resource | Scale |
|---|---|
| GPUs/TPUs | Thousands of accelerators |
| Duration | Weeks to months |
| Power | Megawatts of electricity |
| Cost | $1-100+ million USD |
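One way to see where these figures come from is the common rule of thumb that training a dense transformer takes roughly 6 x N x D floating-point operations for N parameters and D tokens. The back-of-envelope sketch below uses round numbers for per-accelerator throughput and utilization; both are assumptions, not measurements.

```python
def training_gpu_hours(n_params: float, n_tokens: float,
                       peak_flops_per_gpu: float = 1e15,  # assumed ~1 PFLOP/s peak per accelerator
                       utilization: float = 0.4):         # assumed fraction of peak actually sustained
    """Rough total GPU-hours using the ~6*N*D training-FLOPs rule of thumb."""
    total_flops = 6 * n_params * n_tokens
    seconds = total_flops / (peak_flops_per_gpu * utilization)
    return seconds / 3600

# Example: a 70B-parameter model trained on 1.4T tokens.
hours = training_gpu_hours(70e9, 1.4e12)
print(f"{hours:,.0f} GPU-hours, about {hours / (1000 * 24):.0f} days on 1,000 accelerators")
```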
Making such runs feasible relies on a handful of engineering techniques (three of them are sketched in code after this list):

- Mixed precision: use lower-precision formats (FP16/BF16), with careful handling of numerical stability.
- Gradient accumulation: compute gradients on smaller micro-batches and apply the weight update only after several of them.
- Parameter sharding: split model parameters, gradients, and optimizer states across devices.
- Learning-rate scheduling: carefully control the learning rate over the course of training for stability.
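The fragment below shows how mixed precision, gradient accumulation, and learning-rate scheduling fit into the earlier training loop, using PyTorch's automatic mixed precision utilities. Parameter sharding is omitted because it depends on the multi-device setup. As before, `model` and `batches` are assumed to exist and the hyperparameters are placeholders.

```python
import torch
import torch.nn.functional as F

def pretrain_at_scale(model, batches, total_updates=10_000, accum_steps=8, peak_lr=3e-4):
    optimizer = torch.optim.AdamW(model.parameters(), lr=peak_lr)
    # Learning-rate schedule: cosine decay over the whole run for stability.
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=total_updates)
    scaler = torch.cuda.amp.GradScaler()            # guards FP16 gradients against underflow
    model.train()
    optimizer.zero_grad()

    for step, tokens in enumerate(batches):
        if step >= total_updates * accum_steps:
            break
        # Mixed precision: run forward/backward math in FP16 where it is safe to do so.
        with torch.autocast(device_type="cuda", dtype=torch.float16):
            logits = model(tokens[:, :-1])
            loss = F.cross_entropy(
                logits.reshape(-1, logits.size(-1)),
                tokens[:, 1:].reshape(-1),
            ) / accum_steps                         # average over the accumulation window

        scaler.scale(loss).backward()               # gradients add up across micro-batches

        # Gradient accumulation: update only every `accum_steps` micro-batches,
        # simulating a larger effective batch size within limited memory.
        if (step + 1) % accum_steps == 0:
            scaler.step(optimizer)
            scaler.update()
            optimizer.zero_grad()
            scheduler.step()
    return model
```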
"Many capabilities of large language models emerge naturally as we scale up training, without specific training objectives or architectural changes."
- Based on observations from GPT-3 and subsequent models