Vlad-Andrei BĂDOIU (78692) · 2e9fc741 · 896a020b · 4e7ebcaa
--- a/doc/overview.md

+ 4

− 0
+++ b/doc/overview.md

 @@ -2,6 +2,8 @@

 ## Dataset

+Pipeline: URL filtering -> text extraction -> language identification -> repetition removal -> document-wise filtering -> line-wise correction -> fuzzy deduplication -> exact deduplication -> signal filter
+
 Techniques:
 - CCNet pipeline
 - fastText linear classifier for language filtering
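
Review note: the pipeline line added above reads naturally as a chain of stages, each consuming and producing a stream of documents. A minimal sketch of that composition follows. The record shape (`url`, `text`), the blocklist entry, and the function names `url_filter`, `exact_dedup`, and `run_pipeline` are assumptions made for illustration, not part of the documented setup, and only two of the listed stages are stubbed in.

```python
import hashlib
from typing import Callable, Dict, Iterable, Iterator, List

Doc = Dict[str, str]  # hypothetical record shape: {"url": ..., "text": ...}
Stage = Callable[[Iterable[Doc]], Iterator[Doc]]


def url_filter(docs: Iterable[Doc], blocked=("spam.example",)) -> Iterator[Doc]:
    """Drop documents whose URL contains a blocked substring (toy blocklist)."""
    for doc in docs:
        if not any(b in doc["url"] for b in blocked):
            yield doc


def exact_dedup(docs: Iterable[Doc]) -> Iterator[Doc]:
    """Keep only the first document for each distinct text, compared by hash."""
    seen = set()
    for doc in docs:
        digest = hashlib.sha256(doc["text"].encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            yield doc


def run_pipeline(docs: Iterable[Doc], stages: List[Stage]) -> List[Doc]:
    """Apply the stages in order; each stage wraps the previous stream lazily."""
    for stage in stages:
        docs = stage(docs)
    return list(docs)


if __name__ == "__main__":
    sample = [
        {"url": "https://a.example/1", "text": "hello"},
        {"url": "https://spam.example/x", "text": "buy now"},
        {"url": "https://b.example/2", "text": "hello"},
    ]
    print(run_pipeline(sample, [url_filter, exact_dedup]))
```

Because every stage has the same signature, the remaining steps (text extraction, language identification, fuzzy deduplication, and so on) could be slotted into the list in the order the overview gives.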
 @@ -9,6 +11,7 @@ Techniques:
 Software:
 - Custom setup over Ray

+
 Empirical observations:
 - Small fractions of code and multilingual data (5-10%), in line with common recipes for large language models, do not broadly impact zero-shot performance on English tasks

 @@ -35,6 +38,7 @@ necessary from two to one per layer
 * gradient accumulation
 * z-loss - improves stability
 * weight decay (AdamW)
+* FlashAttention
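
Review note: for readers unfamiliar with the z-loss item in the list above, it is commonly formulated as an auxiliary penalty on the squared log of the softmax normalizer, which discourages the logits from drifting to large magnitudes. The sketch below shows that formulation in plain Python. The coefficient `z_coef=1e-4` and the function names are assumptions chosen for illustration; the overview does not state which value or implementation is used.

```python
import math
from typing import Sequence, Tuple


def log_sum_exp(logits: Sequence[float]) -> float:
    """Numerically stable log(sum(exp(logits)))."""
    m = max(logits)
    return m + math.log(sum(math.exp(x - m) for x in logits))


def cross_entropy_with_z_loss(
    logits: Sequence[float], target: int, z_coef: float = 1e-4
) -> Tuple[float, float]:
    """Return (total loss, z-loss term) for a single prediction.

    The cross-entropy is log Z - logits[target]; the auxiliary term is
    z_coef * (log Z)^2, which is zero when the normalizer Z equals 1.
    """
    log_z = log_sum_exp(logits)
    cross_entropy = log_z - logits[target]
    z_loss = z_coef * log_z ** 2
    return cross_entropy + z_loss, z_loss


if __name__ == "__main__":
    print(cross_entropy_with_z_loss([2.0, 0.5, -1.0], target=0))
```

Adding a constant to every logit leaves the cross-entropy unchanged but increases the penalty once log Z is positive, which is the drift the extra term is meant to suppress.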

 Optimizations:
 * Alternative training objectives