
Draft: Add overview of E2E training

Open Vlad-Andrei BĂDOIU (78692) requested to merge vladb/e2e_overview into main
@@ -2,6 +2,8 @@
## Dataset
Pipeline: URL filtering -> text extraction -> language identification -> repetition removal -> document-wise filtering -> line-wise correction -> fuzzy deduplication -> exact deduplication -> signal filter
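The pipeline above is a chain of stages, each consuming and producing a stream of documents. Below is a minimal sketch of that composition, assuming a hypothetical record shape (`{"url": ..., "text": ...}`) and toy implementations of only two stages (URL filtering and exact deduplication). The stage names, the blocklist, and the hashing choice are illustrative assumptions, not the setup described in this MR.

```python
import hashlib
from typing import Callable, Iterable, Iterator, List

# Hypothetical record shape: {"url": str, "text": str}
Doc = dict
Stage = Callable[[Iterable[Doc]], Iterator[Doc]]


def url_filter(docs: Iterable[Doc], blocked=("spam.example",)) -> Iterator[Doc]:
    """Drop documents whose URL contains a blocked substring (toy blocklist)."""
    for doc in docs:
        if not any(b in doc["url"] for b in blocked):
            yield doc


def exact_dedup(docs: Iterable[Doc]) -> Iterator[Doc]:
    """Keep only the first document for each distinct text, compared by SHA-256."""
    seen = set()
    for doc in docs:
        digest = hashlib.sha256(doc["text"].encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            yield doc


def run_pipeline(docs: Iterable[Doc], stages: List[Stage]) -> List[Doc]:
    """Apply the stages in order; each stage wraps the previous generator."""
    for stage in stages:
        docs = stage(docs)
    return list(docs)


if __name__ == "__main__":
    sample = [
        {"url": "http://a.example/1", "text": "hello"},
        {"url": "http://spam.example/x", "text": "buy now"},
        {"url": "http://b.example/2", "text": "hello"},
    ]
    for doc in run_pipeline(sample, [url_filter, exact_dedup]):
        print(doc["url"])
```

Because every stage is a generator, documents stream through the chain one at a time, which is the property that makes this kind of pipeline easy to distribute over a framework such as Ray.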
Techniques:
- CCNet pipeline
- fastText linear classifier for language filtering
@@ -9,6 +11,7 @@ Techniques:
Software:
- Custom setup over Ray
Empirical observations:
- Small fractions of code and multilingual data (5-10%), in line with common recipes for large language models, do not broadly impact zero-shot performance on English tasks
@@ -35,6 +38,7 @@ necessary from two to one per layer
* gradient accumulation
* z-loss - improves stability
* weight decay (AdamW)
* Flash attention
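
As an illustration of the z-loss item above: the auxiliary term penalizes the squared log of the softmax normalizer, which keeps the logits from drifting to large magnitudes. The following is a minimal pure-Python sketch for a single token. The function names and the coefficient `z_coeff=1e-4` are illustrative assumptions, not values taken from this MR.

```python
import math
from typing import Sequence, Tuple


def log_sum_exp(logits: Sequence[float]) -> float:
    """Numerically stable log(sum(exp(logits)))."""
    m = max(logits)
    return m + math.log(sum(math.exp(x - m) for x in logits))


def cross_entropy_with_z_loss(
    logits: Sequence[float], target: int, z_coeff: float = 1e-4
) -> Tuple[float, float, float]:
    """Return (total, cross_entropy, z_loss) for one token.

    z_loss = z_coeff * log(Z)^2, where Z is the softmax normalizer.
    z_coeff is an assumed, illustrative value.
    """
    log_z = log_sum_exp(logits)
    ce = log_z - logits[target]      # standard softmax cross-entropy
    z_loss = z_coeff * log_z ** 2    # pulls log(Z) toward zero
    return ce + z_loss, ce, z_loss


if __name__ == "__main__":
    total, ce, z = cross_entropy_with_z_loss([2.0, 0.5, -1.0], target=0)
    print(f"total={total:.6f} ce={ce:.6f} z_loss={z:.6f}")
```

Shifting all logits by a constant leaves the cross-entropy unchanged but changes the z-loss, so the extra term only constrains the overall scale of the normalizer.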
Optimizations:
* Alternative training objectives