Draft: Add overview of E2E training
Merge request reports
Activity
requested review from @agherghescu2411
assigned to @agherghescu2411
assigned to @vlad_andrei.badoiu1 and unassigned @agherghescu2411
doc/llm.md

1 1 # LLM's
2 2
  3 [Here](overview.md) is an overview of the SOTA model E2E training.

I would not include this information in doc/llm.md (I would like to keep that file strictly for relevant papers/blog posts). If there's information not already contained in a blog post, you could either:
- create the blog post yourself and link to it in doc/llm.md
- create a new separate document with the relevant information and link to it in README.md; suggested titles are doc/llm-training.md or doc/llm-end-to-end-training.md
- doc/overview.md 0 → 100644

1 # E2E model trainnig SOTA

Please don't use abbreviations; they create unnecessary extra mental burden. Also, there's no need to mention SOTA: it's implied that you wouldn't add information about something that isn't SOTA, and SOTA is relative anyway; if somebody reads the docs in a year, the SOTA will almost certainly have changed by then. Instead, mention the dates of papers / models etc.
Change to: "End-to-end model training".
- doc/overview.md 0 → 100644

2
3 ## Dataset
4
5 Techniques:
6 - CCNet pipeline
7 - fastText linear classifier for language filtering
8
9 Software:
10 - Custom setup over Ray
11
12 Empiric observations:
13 - Small fractions of code and multilingual data (5-10%), in line with common recipes for large language models, do not broadly impact zero-shot performance on English tasks
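For reference, a minimal sketch of the fastText language-filtering step mentioned above; the public lid.176.bin language-identification model and the 0.65 confidence threshold are assumptions, not necessarily what the draft has in mind:

```python
# Hypothetical sketch: keep only documents that fastText identifies as English
# with high confidence. Assumes the public lid.176.bin model is available locally.
import fasttext

lid_model = fasttext.load_model("lid.176.bin")  # language-identification model

def keep_document(text: str, lang: str = "en", threshold: float = 0.65) -> bool:
    """Return True if the top predicted language matches `lang` above `threshold`."""
    # fastText predicts one line at a time, so strip newlines first
    labels, probs = lid_model.predict(text.replace("\n", " "), k=1)
    return labels[0] == f"__label__{lang}" and probs[0] >= threshold

docs = ["This is an English sentence.", "Ceci est une phrase en français."]
english_docs = [d for d in docs if keep_document(d)]
```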
- doc/overview.md 0 → 100644

14
15 ## Model Architecture
16
17 - Pre-normalization - normalize the input of each transformer sub-layer
18 - Rotary Embeddings - replace absolute positional embeddings
19 - Replace the feed-forward blocks by Mixture-of-Expert layers (MoE)
20 - no biases - improves stability
21
22 Optimizations
23 - GLU based activation
24 - Parallel attention and MLP blocks - reduce the communication
25 costs associated with tensor parallelism: this simple modification cuts the number of all_reduce
26 necessary from two to one per layer
27 - grouped-query attention (GQA) - faster inference
28 - sliding window attention (SWA) - reduces computational complexity from quadratic to linear with respect to input sequence length
29
30
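For reference, a minimal PyTorch sketch of a pre-normalized, bias-free transformer block with a GLU-based (SwiGLU) feed-forward. RMSNorm, the dimensions, and the 4x hidden expansion are illustrative assumptions; rotary embeddings, GQA, SWA, MoE routing, and the causal mask are omitted for brevity:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    """Root-mean-square normalization (no mean subtraction, no bias)."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x):
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * x * rms

class SwiGLU(nn.Module):
    """GLU-based feed-forward: (silu(x W1) * x W3) W2, all projections without biases."""
    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.w1 = nn.Linear(dim, hidden, bias=False)
        self.w3 = nn.Linear(dim, hidden, bias=False)
        self.w2 = nn.Linear(hidden, dim, bias=False)

    def forward(self, x):
        return self.w2(F.silu(self.w1(x)) * self.w3(x))

class Block(nn.Module):
    """Pre-normalization: normalize the *input* of each sub-layer, then add the residual."""
    def __init__(self, dim: int = 512, n_heads: int = 8):
        super().__init__()
        self.attn_norm = RMSNorm(dim)
        self.attn = nn.MultiheadAttention(dim, n_heads, bias=False, batch_first=True)
        self.ffn_norm = RMSNorm(dim)
        self.ffn = SwiGLU(dim, 4 * dim)

    def forward(self, x):
        h = self.attn_norm(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        x = x + self.ffn(self.ffn_norm(x))
        return x

x = torch.randn(2, 16, 512)    # (batch, sequence, dim)
print(Block()(x).shape)        # torch.Size([2, 16, 512])
```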
- doc/overview.md 0 → 100644

31 ## Training
32
33
34 * gradient cliping
35 * gradient accumulation
36 * z-loss - improves stability
37 * weight decay (AdamW)
38 * Flash attention

Comment on lines +31 to +38
Please add links or detail how each one improves training. Moreover, I would not necessarily include these here at all, because:
- The list is incomplete (for example, mixed precision 16-bit training isn't mentioned)
- They are not really related to each other (e.g. FlashAttention is a hardware optimization, which is orthogonal to other software techniques used)
I would probably expand this list with more information, and add detail where it is unclear what each thing does (e.g. gradient clipping and gradient accumulation are quite old methods and don't need much explaining; FlashAttention and z-loss might need links or more explanation). Also, weight decay is not guaranteed to improve training; it's rather empirical, and it seems to me that some (most) models don't use either weight decay or dropout, as those are mainly needed for small datasets. Therefore, I would add info about when and how it helps.
Format lists as '-' instead of '*'.
Be consistent when starting/ending list items (start with capital letters, end without period).
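To make the discussion concrete, here is a minimal sketch of how gradient accumulation, gradient clipping, AdamW weight decay, and a z-loss term typically fit into a single training step; the learning rate, weight-decay value, and z-loss coefficient are illustrative assumptions, not recommendations:

```python
import torch
import torch.nn.functional as F

def train_steps(model, loader, accum_steps=8, clip_norm=1.0, z_coef=1e-4):
    # AdamW decouples weight decay from the gradient update
    optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.1)
    optimizer.zero_grad()

    for step, (tokens, targets) in enumerate(loader):
        logits = model(tokens)                                   # (batch, seq, vocab)
        loss = F.cross_entropy(logits.flatten(0, 1), targets.flatten())

        # z-loss: penalize log(Z)^2 so the softmax normalizer stays close to 1,
        # which helps the numerical stability of the output softmax
        log_z = torch.logsumexp(logits, dim=-1)
        loss = loss + z_coef * (log_z ** 2).mean()

        # Gradient accumulation: scale the loss so the accumulated gradient
        # matches a single large-batch step
        (loss / accum_steps).backward()

        if (step + 1) % accum_steps == 0:
            # Gradient clipping: bound the global gradient norm for stability
            torch.nn.utils.clip_grad_norm_(model.parameters(), clip_norm)
            optimizer.step()
            optimizer.zero_grad()
```

FlashAttention is orthogonal to this loop: it lives inside the model's attention implementation (on recent PyTorch, torch.nn.functional.scaled_dot_product_attention can dispatch to a fused Flash-style kernel).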
- doc/overview.md 0 → 100644

39
40 Optimizations:
41 * Alternative training objectives
42 * fill-in-the-middle (FIM) training
43 * Principled hyperparameters.
44 * gradient checkpointing - reduces memory usage

Comment on lines +40 to +44
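Fill-in-the-middle (FIM) training is probably the item here that most needs an example; below is a small sketch of the usual document transformation into prefix-suffix-middle (PSM) order. The sentinel strings are placeholders; real setups use dedicated special tokens in the tokenizer:

```python
import random

# Placeholder sentinel strings; in practice these are tokenizer special tokens
PRE, MID, SUF = "<|fim_prefix|>", "<|fim_middle|>", "<|fim_suffix|>"

def to_fim(document: str, rng: random.Random) -> str:
    """Rewrite a document as prefix + suffix + middle so the model learns infilling."""
    i, j = sorted(rng.sample(range(len(document)), 2))
    prefix, middle, suffix = document[:i], document[i:j], document[j:]
    # PSM ordering: the model sees the prefix and suffix, then predicts the middle
    return f"{PRE}{prefix}{SUF}{suffix}{MID}{middle}"

rng = random.Random(0)
print(to_fim("def add(a, b):\n    return a + b\n", rng))
```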
- doc/overview.md 0 → 100644

45
46 ```
47 > BEFORE
48 Time: 57.82
49 Samples/second: 8.86
50 GPU memory: 14949 MB
51
52 > AFTER
53 Time: 66.03
54 Samples/second: 7.75
55 GPU memory: 8681 MB
56 ```

Comment on lines +47 to +55
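The numbers above show the usual trade: roughly 40% less GPU memory for about 14% more time per step. Here is a minimal sketch of enabling gradient checkpointing with torch.utils.checkpoint, recomputing each block's activations during the backward pass; per-block granularity is an illustrative choice:

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class CheckpointedStack(nn.Module):
    """Run each block under checkpointing: activations inside the block are not
    stored during the forward pass and are recomputed during backward."""
    def __init__(self, blocks: nn.ModuleList):
        super().__init__()
        self.blocks = blocks

    def forward(self, x):
        for block in self.blocks:
            if self.training:
                x = checkpoint(block, x, use_reentrant=False)
            else:
                x = block(x)
        return x

blocks = nn.ModuleList([nn.Sequential(nn.Linear(512, 512), nn.GELU()) for _ in range(4)])
model = CheckpointedStack(blocks)
x = torch.randn(8, 512, requires_grad=True)
model(x).sum().backward()
```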
- doc/overview.md 0 → 100644

57
58 - checkpointing - save the activations that
59 are expensive to compute, such as the outputs of
60 linear layers (needs model and sequence parallelism and custom
61 backwards pass, not Pytorch autograd)
62 - overlap computation of activations and the communication between GPUs over the network
63
64
65 - 3D parallelism
66 - optimizer sharding
67
68 ## Post training
69
70 Supervised Fine-Tuning (SFT)
71 Reinforcement Learning with Human Feedback (RLHF)
72 - Proximal Policy Optimization (PPO)
73 - Rejection Sampling fine-tuning.
74
75 Reward Modeling
76
77 
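As one concrete detail for the post-training list above: reward modeling is commonly trained with a pairwise ranking loss over chosen/rejected responses, roughly as sketched below; the scalar head on the last token is an assumption about the setup:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    """A (pretrained) transformer backbone with a scalar head that scores a response."""
    def __init__(self, backbone: nn.Module, hidden_dim: int):
        super().__init__()
        self.backbone = backbone
        self.score = nn.Linear(hidden_dim, 1, bias=False)

    def forward(self, tokens):
        hidden = self.backbone(tokens)                 # (batch, seq, hidden_dim)
        return self.score(hidden[:, -1]).squeeze(-1)   # scalar score per sequence

def reward_loss(model, chosen_tokens, rejected_tokens):
    """Pairwise ranking loss: push r(chosen) above r(rejected)."""
    r_chosen = model(chosen_tokens)
    r_rejected = model(rejected_tokens)
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```

The trained reward model then supplies the scalar reward used by PPO, or ranks candidate generations for rejection-sampling fine-tuning.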
- doc/overview.md 0 → 100644

78
79
80 ## Evaluation

Left my review. Overall it's a good structure, but I feel more detail needs to be added. For example, there should most certainly be a section about dataset collection/filtering (i.e. dataset filtering and the CCNet pipeline should be a section in itself). Similarly, benchmarking needs to be expanded. Also, I would probably add models (or links to papers) where techniques / optimizations are concerned.