Pre-training: Data Collection

Where Does the Data Come From?

  • Internet text serves as the primary training data
  • Common Crawl indexes billions of web pages
  • Digital books and articles
  • Wikipedia and encyclopedias
  • Code repositories
  • Public forums and discussions
  • Government documents

Example: Common Crawl

Common Crawl has indexed billions of web pages since 2007, creating one of the largest repositories of internet text.
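
For a hands-on look, here is a minimal Python sketch of reading records from a Common Crawl dump, assuming the third-party warcio package; the filename example.warc.gz is a placeholder for any segment downloaded from commoncrawl.org.

    from warcio.archiveiterator import ArchiveIterator

    # Iterate over records in a locally downloaded Common Crawl segment.
    # 'example.warc.gz' is a placeholder filename, not a real segment name.
    with open("example.warc.gz", "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type == "response":  # actual HTTP responses
                url = record.rec_headers.get_header("WARC-Target-URI")
                html = record.content_stream().read()
                print(url, len(html), "bytes")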

Data Scale & Diversity

Dataset                    Example Size
FineWeb (HuggingFace)      ~44 TB
Common Crawl (2023)        100+ TB
Books & Documents          10+ TB

Both the volume and the diversity of this data are crucial: models learn language patterns, world knowledge, and reasoning from the breadth of text they are trained on.
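
As a back-of-envelope check on what these sizes mean for training, assuming the common rule of thumb of roughly 4 bytes of text per token (actual ratios vary by tokenizer and language):

    # Rough token count for a FineWeb-scale corpus.
    corpus_bytes = 44 * 10**12        # ~44 TB of text
    bytes_per_token = 4               # rule-of-thumb assumption
    tokens = corpus_bytes / bytes_per_token
    print(f"~{tokens / 1e12:.0f} trillion tokens")  # ~11 trillion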

Data Collection Pipeline

Web Crawlers

Follow links to discover content
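
A minimal Python sketch of this step, assuming the requests and beautifulsoup4 packages; a real crawler would also respect robots.txt, rate limits, and politeness policies.

    import requests
    from urllib.parse import urljoin
    from bs4 import BeautifulSoup

    def crawl(seed_url, max_pages=100):
        """Breadth-first crawl: fetch pages, then follow discovered links."""
        seen, frontier, pages = set(), [seed_url], []
        while frontier and len(pages) < max_pages:
            url = frontier.pop(0)
            if url in seen:
                continue
            seen.add(url)
            try:
                resp = requests.get(url, timeout=10)
            except requests.RequestException:
                continue  # skip unreachable pages
            pages.append((url, resp.text))
            # Queue every link found on the page for later visits.
            soup = BeautifulSoup(resp.text, "html.parser")
            for link in soup.find_all("a", href=True):
                frontier.append(urljoin(url, link["href"]))
        return pages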

Filter & Extract

Remove unwanted content
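
One way to sketch extraction and filtering, again with beautifulsoup4; the word-count and alphabetic-ratio thresholds are illustrative assumptions, not values from any production pipeline.

    from bs4 import BeautifulSoup

    def extract_text(html):
        """Strip markup plus script/style/navigation boilerplate."""
        soup = BeautifulSoup(html, "html.parser")
        for tag in soup(["script", "style", "nav", "header", "footer"]):
            tag.decompose()
        return soup.get_text(separator=" ", strip=True)

    def looks_useful(text, min_words=50, min_alpha_ratio=0.7):
        """Crude quality gate; both thresholds are illustrative."""
        words = text.split()
        if len(words) < min_words:
            return False
        return sum(w.isalpha() for w in words) / len(words) >= min_alpha_ratio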

Language Processing

Identify and categorize language
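
A sketch of language identification using the langdetect package; production pipelines often use faster classifiers such as fastText's language-ID model instead.

    from langdetect import detect, LangDetectException

    def tag_language(text):
        """Return an ISO 639-1 code for the document's language."""
        try:
            return detect(text)      # e.g. 'en', 'de', 'ja'
        except LangDetectException:
            return "unknown"         # too little or ambiguous text

    print(tag_language("The quick brown fox jumps over the lazy dog."))  # en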

Deduplication

Remove redundant content
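
Exact deduplication can be sketched with content hashing, as below; real pipelines typically add near-duplicate detection (e.g. MinHash/LSH) on top, which this sketch omits.

    import hashlib

    def deduplicate(docs):
        """Exact deduplication: keep the first copy of each distinct text."""
        seen, unique = set(), []
        for doc in docs:
            digest = hashlib.sha256(doc.encode("utf-8")).hexdigest()
            if digest not in seen:
                seen.add(digest)
                unique.append(doc)
        return unique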

Storage

Preserve for training
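
A minimal storage sketch, writing cleaned documents as compressed JSON Lines shards; the shard filename and one-field schema are illustrative assumptions.

    import gzip
    import json

    def write_shard(docs, path="shard-00000.jsonl.gz"):
        """Write documents as compressed JSON Lines, one record per line."""
        with gzip.open(path, "wt", encoding="utf-8") as f:
            for doc in docs:
                f.write(json.dumps({"text": doc}, ensure_ascii=False) + "\n")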