Pre-training: Data Collection

Where Does the Data Come From?

  • Internet text serves as the primary training data
  • Common Crawl indexes billions of web pages
  • Digital books and articles
  • Wikipedia and encyclopedias
  • Code repositories
  • Public forums and discussions
  • Government documents

Example: Common Crawl

Common Crawl has indexed billions of web pages since 2007, creating one of the largest repositories of internet text.
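
For a hands-on look, here is a minimal Python sketch of reading records from a Common Crawl dump, assuming the third-party warcio package; the filename example.warc.gz is a placeholder for any segment downloaded from commoncrawl.org.

    from warcio.archiveiterator import ArchiveIterator

    # Iterate over records in a locally downloaded Common Crawl segment.
    # 'example.warc.gz' is a placeholder filename, not a real segment name.
    with open("example.warc.gz", "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type == "response":  # actual HTTP responses
                url = record.rec_headers.get_header("WARC-Target-URI")
                html = record.content_stream().read()
                print(url, len(html), "bytes")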

Data Scale & Diversity

Dataset                    Example Size
FineWeb (HuggingFace)      ~44 TB
Common Crawl (2023)        100+ TB
Books & Documents          10+ TB

Both the volume and the diversity of this data are crucial: models learn language patterns, world knowledge, and reasoning from the breadth of text they are trained on.
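
As a back-of-envelope check on what these sizes mean for training, assuming the common rule of thumb of roughly 4 bytes of text per token (actual ratios vary by tokenizer and language):

    # Rough token count for a FineWeb-scale corpus.
    corpus_bytes = 44 * 10**12        # ~44 TB of text
    bytes_per_token = 4               # rule-of-thumb assumption
    tokens = corpus_bytes / bytes_per_token
    print(f"~{tokens / 1e12:.0f} trillion tokens")  # ~11 trillion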

Data Collection Pipeline

Web Crawlers

Follow links to discover content
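
A minimal Python sketch of this step, assuming the requests and beautifulsoup4 packages; a real crawler would also respect robots.txt, rate limits, and politeness policies.

    import requests
    from urllib.parse import urljoin
    from bs4 import BeautifulSoup

    def crawl(seed_url, max_pages=100):
        """Breadth-first crawl: fetch pages, then follow discovered links."""
        seen, frontier, pages = set(), [seed_url], []
        while frontier and len(pages) < max_pages:
            url = frontier.pop(0)
            if url in seen:
                continue
            seen.add(url)
            try:
                resp = requests.get(url, timeout=10)
            except requests.RequestException:
                continue  # skip unreachable pages
            pages.append((url, resp.text))
            # Queue every link found on the page for later visits.
            soup = BeautifulSoup(resp.text, "html.parser")
            for link in soup.find_all("a", href=True):
                frontier.append(urljoin(url, link["href"]))
        return pages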

Filter & Extract

Remove unwanted content
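
One way to sketch extraction and filtering, again with beautifulsoup4; the word-count and alphabetic-ratio thresholds are illustrative assumptions, not values from any production pipeline.

    from bs4 import BeautifulSoup

    def extract_text(html):
        """Strip markup plus script/style/navigation boilerplate."""
        soup = BeautifulSoup(html, "html.parser")
        for tag in soup(["script", "style", "nav", "header", "footer"]):
            tag.decompose()
        return soup.get_text(separator=" ", strip=True)

    def looks_useful(text, min_words=50, min_alpha_ratio=0.7):
        """Crude quality gate; both thresholds are illustrative."""
        words = text.split()
        if len(words) < min_words:
            return False
        return sum(w.isalpha() for w in words) / len(words) >= min_alpha_ratio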

Language Processing

Identify and categorize language
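
A sketch of language identification using the langdetect package; production pipelines often use faster classifiers such as fastText's language-ID model instead.

    from langdetect import detect, LangDetectException

    def tag_language(text):
        """Return an ISO 639-1 code for the document's language."""
        try:
            return detect(text)      # e.g. 'en', 'de', 'ja'
        except LangDetectException:
            return "unknown"         # too little or ambiguous text

    print(tag_language("The quick brown fox jumps over the lazy dog."))  # en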

Deduplication

Remove redundant content
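
Exact deduplication can be sketched with content hashing, as below; real pipelines typically add near-duplicate detection (e.g. MinHash/LSH) on top, which this sketch omits.

    import hashlib

    def deduplicate(docs):
        """Exact deduplication: keep the first copy of each distinct text."""
        seen, unique = set(), []
        for doc in docs:
            digest = hashlib.sha256(doc.encode("utf-8")).hexdigest()
            if digest not in seen:
                seen.add(digest)
                unique.append(doc)
        return unique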

Storage

Preserve for training
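
A minimal storage sketch, writing cleaned documents as compressed JSON Lines shards; the shard filename and one-field schema are illustrative assumptions.

    import gzip
    import json

    def write_shard(docs, path="shard-00000.jsonl.gz"):
        """Write documents as compressed JSON Lines, one record per line."""
        with gzip.open(path, "wt", encoding="utf-8") as f:
            for doc in docs:
                f.write(json.dumps({"text": doc}, ensure_ascii=False) + "\n")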