Pipeline: URL filtering -> text extraction -> language identification -> repetition removal -> document-wise filtering -> line-wise correction -> fuzzy deduplication -> exact deduplication -> signal filter
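Sketch (illustrative only, not the authors' implementation): the pipeline above can be read as a chain of streaming stages, each consuming and yielding documents. Only three of the stages are mocked here. The names `BLOCKED_DOMAINS`, `url_filter`, `repetition_removal`, `exact_dedup`, `run_pipeline`, and the 0.3 threshold are assumptions made for the example, and the exact-deduplication stand-in hashes whole documents, which may differ from what the original pipeline does.

```python
import hashlib
from typing import Callable, Dict, Iterable, Iterator, List
from urllib.parse import urlparse

Doc = Dict[str, str]  # assumed shape: {"url": ..., "text": ...}
Stage = Callable[[Iterable[Doc]], Iterator[Doc]]

# Hypothetical blocklist; a real one would be far larger.
BLOCKED_DOMAINS = {"spam.example"}


def url_filter(docs: Iterable[Doc]) -> Iterator[Doc]:
    """Drop documents whose host is on the blocklist."""
    for doc in docs:
        host = urlparse(doc["url"]).hostname or ""
        if host not in BLOCKED_DOMAINS:
            yield doc


def repetition_removal(docs: Iterable[Doc], max_dup_line_frac: float = 0.3) -> Iterator[Doc]:
    """Drop documents where too many non-empty lines are duplicates (assumed heuristic)."""
    for doc in docs:
        lines = [line for line in doc["text"].splitlines() if line.strip()]
        if not lines:
            continue
        dup_frac = 1 - len(set(lines)) / len(lines)
        if dup_frac <= max_dup_line_frac:
            yield doc


def exact_dedup(docs: Iterable[Doc]) -> Iterator[Doc]:
    """Keep the first document for each distinct text (document-level stand-in)."""
    seen = set()
    for doc in docs:
        digest = hashlib.sha256(doc["text"].encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            yield doc


def run_pipeline(docs: Iterable[Doc], stages: List[Stage]) -> List[Doc]:
    """Apply the stages in order; each stage wraps the previous stream."""
    for stage in stages:
        docs = stage(docs)
    return list(docs)


if __name__ == "__main__":
    sample = [
        {"url": "https://spam.example/a", "text": "buy now"},
        {"url": "https://ok.example/a", "text": "hello\nworld"},
        {"url": "https://ok.example/b", "text": "hello\nworld"},
        {"url": "https://ok.example/c", "text": "x\nx\nx\nx"},
    ]
    kept = run_pipeline(sample, [url_filter, repetition_removal, exact_dedup])
    print([doc["url"] for doc in kept])
```

Ordering matters in such a chain: cheap filters (URL, repetition) run first so that the more expensive deduplication stages see fewer documents.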
Techniques:
- CCNet pipeline
- fastText linear classifier for language filtering
@@ -9,6 +11,7 @@ Techniques:
Software:
- Custom setup over Ray
Empirical observations:
- Small fractions of code and multilingual data (5-10%), in line with common recipes for large language models, do not broadly impact zero-shot performance on English tasks
@@ -35,6 +38,7 @@ necessary from two to one per layer