LLM Learning Portal

Pre-training: Data Processing

Data Cleaning Pipeline

  • URL filtering - removes unwanted domains
  • Text extraction - separates content from HTML
  • Language filtering - detects each document's language and keeps only the target languages
  • Deduplication - removes redundant content
  • PII removal - Personally Identifiable Information redaction
  • Quality filtering - prioritizes high-value content

Raw internet data is messy and can contain harmful or low-quality content, so thorough cleaning is essential before training.
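
The stages above can be viewed as a chain of filters applied to every crawled page. Below is a minimal Python sketch of such a chain, assuming a hypothetical domain blocklist, naive regex-based PII patterns, and a stubbed language detector; production pipelines (such as FineWeb's) are far more elaborate.

```python
import re

# Illustrative blocklist and PII patterns (assumptions, not a real pipeline).
BLOCKED_DOMAINS = {"spam.example", "malware.example"}
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE_RE = re.compile(r"\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b")

def url_filter(url: str) -> bool:
    """Drop pages from unwanted domains."""
    domain = url.split("/")[2] if "://" in url else url
    return domain not in BLOCKED_DOMAINS

def extract_text(html: str) -> str:
    """Crudely strip tags; real pipelines use dedicated extractors."""
    return re.sub(r"<[^>]+>", " ", html)

def language_filter(text: str, target: str = "en") -> bool:
    """Stub for a language-ID model (e.g. a fastText classifier)."""
    return True  # assume the target language for this sketch

def redact_pii(text: str) -> str:
    """Replace obvious emails and phone numbers with placeholders."""
    text = EMAIL_RE.sub("<EMAIL>", text)
    return PHONE_RE.sub("<PHONE>", text)

def clean(url: str, html: str) -> str | None:
    """Run one document through the pipeline; None means 'discard'."""
    if not url_filter(url):
        return None
    text = extract_text(html)
    if not language_filter(text):
        return None
    return redact_pii(text)
```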

Processing Stages

Stage              | Purpose                          | Examples
URL Filtering      | Remove spam/unsafe domains       | Block lists, quality metrics
Text Extraction    | Pull useful content              | HTML parsing, boilerplate removal
Deduplication      | Prevent learning from repetition | Hash-based filtering, n-gram overlap
Quality Assessment | Prioritize valuable content      | Classifier-based filtering
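
As an illustration of the deduplication stage, the sketch below combines hash-based exact matching with word n-gram Jaccard overlap for near-duplicates. The 0.8 similarity threshold is an assumed value, and the pairwise comparison is quadratic; large-scale systems use techniques such as MinHash/LSH instead.

```python
import hashlib

def ngrams(text: str, n: int = 5) -> set[tuple[str, ...]]:
    """Word-level n-grams used to measure overlap between documents."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two n-gram sets."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)

def deduplicate(docs: list[str], threshold: float = 0.8) -> list[str]:
    """Keep a document unless it exactly or nearly duplicates one already kept."""
    seen_hashes: set[str] = set()
    kept: list[str] = []
    kept_ngrams: list[set] = []
    for doc in docs:
        h = hashlib.sha1(doc.encode("utf-8")).hexdigest()
        if h in seen_hashes:                       # exact duplicate
            continue
        if any(jaccard(ngrams(doc), g) >= threshold for g in kept_ngrams):
            continue                               # near duplicate
        seen_hashes.add(h)
        kept.append(doc)
        kept_ngrams.append(ngrams(doc))
    return kept
```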

Example: FineWeb Dataset

FineWeb is a high-quality web-text dataset created by HuggingFace and is representative of the curated corpora that commercial LLM providers build internally:

  • 100B+ webpages in the initial crawl
  • 99.9% of content filtered out by quality standards
  • 67M documents made the final cut
  • 44TB of high-quality text
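
For hands-on exploration, the processed dataset can be streamed from the Hugging Face Hub with the datasets library. The sample-10BT subset name below comes from the FineWeb dataset card and is assumed here; streaming avoids downloading the full 44TB.

```python
from datasets import load_dataset

# Stream a small FineWeb sample rather than downloading everything.
# "sample-10BT" is one of the published subsets of HuggingFaceFW/fineweb
# (assumed here; check the dataset card for current configurations).
fw = load_dataset("HuggingFaceFW/fineweb", name="sample-10BT",
                  split="train", streaming=True)

# Inspect a few cleaned documents and their source URLs.
for i, example in enumerate(fw):
    print(example["url"])
    print(example["text"][:200], "...")
    if i == 2:
        break
```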

The careful curation of training data is as important as the model architecture itself: quality matters more than raw quantity.