Stage | Purpose | Examples |
---|---|---|
URL Filtering | Remove spam/unsafe domains | Block lists, quality metrics |
Text Extraction | Pull useful content | HTML parsing, boilerplate removal |
Deduplication | Prevent learning from repetition | Hash-based filtering, n-gram overlap |
Quality Assessment | Prioritize valuable content | Classifier-based filtering |
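
The four stages in the table can be chained into a single pass over crawled pages. The following is a minimal sketch using only the Python standard library, not how any production pipeline is actually built. All names here (`BLOCKED_DOMAINS`, `url_allowed`, `extract_text`, `content_hash`, `looks_high_quality`, `run_pipeline`) are hypothetical, and the quality check is a crude word-diversity heuristic standing in for a trained classifier.

```python
import hashlib
from html.parser import HTMLParser
from urllib.parse import urlparse

# Hypothetical block list; real pipelines use large curated lists.
BLOCKED_DOMAINS = {"spam.example"}


def url_allowed(url: str) -> bool:
    """URL filtering: drop pages whose host is on the block list."""
    host = urlparse(url).hostname or ""
    return host not in BLOCKED_DOMAINS


class _TextExtractor(HTMLParser):
    """Text extraction: collect visible text, skipping common boilerplate tags."""

    SKIP = {"script", "style", "nav", "footer"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def extract_text(html: str) -> str:
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


def content_hash(text: str) -> str:
    """Deduplication key: hash of case- and whitespace-normalized text."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def looks_high_quality(text: str, min_words: int = 5) -> bool:
    """Assumed stand-in for a classifier: enough words, not too repetitive."""
    words = text.split()
    if len(words) < min_words:
        return False
    return len(set(words)) / len(words) > 0.5


def run_pipeline(pages):
    """Apply the four stages in order to (url, html) pairs."""
    seen = set()
    kept = []
    for url, html in pages:
        if not url_allowed(url):
            continue
        text = extract_text(html)
        digest = content_hash(text)
        if digest in seen:
            continue  # exact duplicate after normalization
        seen.add(digest)
        if looks_high_quality(text):
            kept.append(text)
    return kept
```

Calling `run_pipeline` with a list of `(url, html)` tuples returns the extracted text of the pages that survive every stage. Exact hashing only catches identical documents after normalization; the n-gram overlap approach mentioned in the table is what catches near-duplicates, and it is not shown here.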

FineWeb is a high-quality dataset created by Hugging Face and is representative of what commercial LLM providers use:
[Figure: FineWeb pipeline stages. Labels: Initial crawl; Quality standards; Made the final cut; High-quality text.]