Draft: Add overview of E2E training
Merge request reports
Activity
requested review from @agherghescu2411
assigned to @agherghescu2411
assigned to @vlad_andrei.badoiu1 and unassigned @agherghescu2411
doc/llm.md

1 1 # LLM's
2 2
  3 [Here](overview.md) is an overview of the SOTA model E2E training.

I would not include this information in doc/llm.md (I would like to keep that file strictly for relevant papers/blog posts). If there's information not already contained in a blog post, you could either:
- create the blog post yourself and link to it in doc/llm.md
- create a new separate document with the relevant information and link to it in README.md; suggested titles are doc/llm-training.md or doc/llm-end-to-end-training.md
- doc/overview.md 0 → 100644

1 # E2E model trainnig SOTA

Please don't use abbreviations; they create unnecessary extra mental burden. Also, there's no need to mention SOTA: it's implied that you wouldn't add information about something that isn't SOTA, and SOTA is relative anyway; if somebody reads the docs in a year, the SOTA will almost certainly have changed by then. Instead, mention the dates of papers / models etc.
Change to: "End-to-end model training".
- doc/overview.md 0 → 100644

2
3 ## Dataset
4
5 Techniques:
6 - CCNet pipeline
7 - fastText linear classifier for language filtering
8
9 Software:
10 - Custom setup over Ray
11
12 Empiric observations:
13 - Small fractions of code and multilingual data (5-10%), in line with common recipes for large language models, do not broadly impact zero-shot performance on English tasks
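For reference, a minimal sketch of the fastText language-filtering step mentioned above; the public lid.176.bin language-identification model and the 0.65 confidence threshold are assumptions, not necessarily what the draft has in mind:

```python
# Hypothetical sketch: keep only documents that fastText identifies as English
# with high confidence. Assumes the public lid.176.bin model is available locally.
import fasttext

lid_model = fasttext.load_model("lid.176.bin")  # language-identification model

def keep_document(text: str, lang: str = "en", threshold: float = 0.65) -> bool:
    """Return True if the top predicted language matches `lang` above `threshold`."""
    # fastText predicts one line at a time, so strip newlines first
    labels, probs = lid_model.predict(text.replace("\n", " "), k=1)
    return labels[0] == f"__label__{lang}" and probs[0] >= threshold

docs = ["This is an English sentence.", "Ceci est une phrase en français."]
english_docs = [d for d in docs if keep_document(d)]
```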
- doc/overview.md 0 → 100644

14
15 ## Model Architecture
16
17 - Pre-normalization - normalize the input of each transformer sub-layer
18 - Rotary Embeddings - replace absolute positional embeddings
19 - Replace the feed-forward blocks by Mixture-of-Expert layers (MoE)
20 - no biases - improves stability
21
22 Optimizations
23 - GLU based activation
24 - Parallel attention and MLP blocks - reduce the communication
25 costs associated with tensor parallelism: this simple modification cuts the number of all_reduce
26 necessary from two to one per layer
27 - grouped-query attention (GQA) - faster inference
28 - sliding window attention (SWA) - reduces computational complexity from quadratic to linear with respect to input sequence length
29
30
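For reference, a minimal PyTorch sketch of a pre-normalized, bias-free transformer block with a GLU-based (SwiGLU) feed-forward. RMSNorm, the dimensions, and the 4x hidden expansion are illustrative assumptions; rotary embeddings, GQA, SWA, MoE routing, and the causal mask are omitted for brevity:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    """Root-mean-square normalization (no mean subtraction, no bias)."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x):
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * x * rms

class SwiGLU(nn.Module):
    """GLU-based feed-forward: (silu(x W1) * x W3) W2, all projections without biases."""
    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.w1 = nn.Linear(dim, hidden, bias=False)
        self.w3 = nn.Linear(dim, hidden, bias=False)
        self.w2 = nn.Linear(hidden, dim, bias=False)

    def forward(self, x):
        return self.w2(F.silu(self.w1(x)) * self.w3(x))

class Block(nn.Module):
    """Pre-normalization: normalize the *input* of each sub-layer, then add the residual."""
    def __init__(self, dim: int = 512, n_heads: int = 8):
        super().__init__()
        self.attn_norm = RMSNorm(dim)
        self.attn = nn.MultiheadAttention(dim, n_heads, bias=False, batch_first=True)
        self.ffn_norm = RMSNorm(dim)
        self.ffn = SwiGLU(dim, 4 * dim)

    def forward(self, x):
        h = self.attn_norm(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        x = x + self.ffn(self.ffn_norm(x))
        return x

x = torch.randn(2, 16, 512)    # (batch, sequence, dim)
print(Block()(x).shape)        # torch.Size([2, 16, 512])
```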
- doc/overview.md 0 → 100644

31 ## Training
32
33
34 * gradient cliping
35 * gradient accumulation
36 * z-loss - improves stability
37 * weight decay (AdamW)
38 * Flash attention

Comment on lines +31 to +38
Please add links or detail how each one improves training. Moreover, I would not necessarily include these here at all, because:
- The list is incomplete (for example, mixed precision 16-bit training isn't mentioned)
- They are not really related to each other (e.g. FlashAttention is a hardware optimization, which is orthogonal to other software techniques used)
I would probably expand this list with more information, and add detail where it is unclear what each thing does (e.g. gradient clipping and gradient accumulation are quite old methods and don't need much explaining; FlashAttention and z-loss might need links or more explanation). Also, weight decay is not guaranteed to improve training; it's rather empirical, and it seems to me that some (most) models don't use either weight decay or dropout, as those are mainly needed for small datasets. Therefore, I would add info about when and how it helps.
Format lists as '-' instead of '*'.
Be consistent when starting/ending list items (start with capital letters, end without period).
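To make the discussion concrete, here is a minimal sketch of how gradient accumulation, gradient clipping, AdamW weight decay, and a z-loss term typically fit into a single training step; the learning rate, weight-decay value, and z-loss coefficient are illustrative assumptions, not recommendations:

```python
import torch
import torch.nn.functional as F

def train_steps(model, loader, accum_steps=8, clip_norm=1.0, z_coef=1e-4):
    # AdamW decouples weight decay from the gradient update
    optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.1)
    optimizer.zero_grad()

    for step, (tokens, targets) in enumerate(loader):
        logits = model(tokens)                                   # (batch, seq, vocab)
        loss = F.cross_entropy(logits.flatten(0, 1), targets.flatten())

        # z-loss: penalize log(Z)^2 so the softmax normalizer stays close to 1,
        # which helps the numerical stability of the output softmax
        log_z = torch.logsumexp(logits, dim=-1)
        loss = loss + z_coef * (log_z ** 2).mean()

        # Gradient accumulation: scale the loss so the accumulated gradient
        # matches a single large-batch step
        (loss / accum_steps).backward()

        if (step + 1) % accum_steps == 0:
            # Gradient clipping: bound the global gradient norm for stability
            torch.nn.utils.clip_grad_norm_(model.parameters(), clip_norm)
            optimizer.step()
            optimizer.zero_grad()
```

FlashAttention is orthogonal to this loop: it lives inside the model's attention implementation (on recent PyTorch, torch.nn.functional.scaled_dot_product_attention can dispatch to a fused Flash-style kernel).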
- doc/overview.md 0 → 100644

39
40 Optimizations:
41 * Alternative training objectives
42 * fill-in-the-middle (FIM) training
43 * Principled hyperparameters.
44 * gradient checkpointing - reduces memory usage

Comment on lines +40 to +44
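Fill-in-the-middle (FIM) training is probably the item here that most needs an example; below is a small sketch of the usual document transformation into prefix-suffix-middle (PSM) order. The sentinel strings are placeholders; real setups use dedicated special tokens in the tokenizer:

```python
import random

# Placeholder sentinel strings; in practice these are tokenizer special tokens
PRE, MID, SUF = "<|fim_prefix|>", "<|fim_middle|>", "<|fim_suffix|>"

def to_fim(document: str, rng: random.Random) -> str:
    """Rewrite a document as prefix + suffix + middle so the model learns infilling."""
    i, j = sorted(rng.sample(range(len(document)), 2))
    prefix, middle, suffix = document[:i], document[i:j], document[j:]
    # PSM ordering: the model sees the prefix and suffix, then predicts the middle
    return f"{PRE}{prefix}{SUF}{suffix}{MID}{middle}"

rng = random.Random(0)
print(to_fim("def add(a, b):\n    return a + b\n", rng))
```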
- doc/overview.md 0 → 100644

45
46 ```
47 > BEFORE
48 Time: 57.82
49 Samples/second: 8.86
50 GPU memory: 14949 MB
51
52 > AFTER
53 Time: 66.03
54 Samples/second: 7.75
55 GPU memory: 8681 MB
56 ```

Comment on lines +47 to +55
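The numbers above show the usual trade: roughly 40% less GPU memory for about 14% more time per step. Here is a minimal sketch of enabling gradient checkpointing with torch.utils.checkpoint, recomputing each block's activations during the backward pass; per-block granularity is an illustrative choice:

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class CheckpointedStack(nn.Module):
    """Run each block under checkpointing: activations inside the block are not
    stored during the forward pass and are recomputed during backward."""
    def __init__(self, blocks: nn.ModuleList):
        super().__init__()
        self.blocks = blocks

    def forward(self, x):
        for block in self.blocks:
            if self.training:
                x = checkpoint(block, x, use_reentrant=False)
            else:
                x = block(x)
        return x

blocks = nn.ModuleList([nn.Sequential(nn.Linear(512, 512), nn.GELU()) for _ in range(4)])
model = CheckpointedStack(blocks)
x = torch.randn(8, 512, requires_grad=True)
model(x).sum().backward()
```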
- doc/overview.md 0 → 100644

57
58 - checkpointing - save the activations that
59 are expensive to compute, such as the outputs of
60 linear layers (needs model and sequence parallelism and custom
61 backwards pass, not Pytorch autograd)
62 - overlap computation of activations and the communication between GPUs over the network
63
64
65 - 3D parallelism
66 - optimizer sharding
67
68 ## Post training
69
70 Supervised Fine-Tuning (SFT)
71 Reinforcement Learning with Human Feedback (RLHF)
72 - Proximal Policy Optimization (PPO)
73 - Rejection Sampling fine-tuning.
74
75 Reward Modeling
76
77 
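As one concrete detail for the post-training list above: reward modeling is commonly trained with a pairwise ranking loss over chosen/rejected responses, roughly as sketched below; the scalar head on the last token is an assumption about the setup:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    """A (pretrained) transformer backbone with a scalar head that scores a response."""
    def __init__(self, backbone: nn.Module, hidden_dim: int):
        super().__init__()
        self.backbone = backbone
        self.score = nn.Linear(hidden_dim, 1, bias=False)

    def forward(self, tokens):
        hidden = self.backbone(tokens)                 # (batch, seq, hidden_dim)
        return self.score(hidden[:, -1]).squeeze(-1)   # scalar score per sequence

def reward_loss(model, chosen_tokens, rejected_tokens):
    """Pairwise ranking loss: push r(chosen) above r(rejected)."""
    r_chosen = model(chosen_tokens)
    r_rejected = model(rejected_tokens)
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```

The trained reward model then supplies the scalar reward used by PPO, or ranks candidate generations for rejection-sampling fine-tuning.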
- doc/overview.md 0 → 100644

78
79
80 ## Evaluation

Left my review. Overall it's a good structure, but I feel more detail needs to be added. For example, there should most certainly be a section about dataset collection/filtering (i.e. dataset filtering and the CCNet pipeline should be a section in itself). Similarly, benchmarking needs to be expanded. Also, I would probably add models (or links to papers) where techniques / optimizations are concerned.