
Draft: Add overview of E2E training

Open Vlad-Andrei BĂDOIU (78692) requested to merge vladb/e2e_overview into main
16 unresolved threads

Merge request reports


Activity

1 1 # LLM's
2 2
3 [Here](overview.md) is an overview of the SOTA model E2E training.
  • I would not include this information in doc/llm.md (would like to keep this strictly for relevant papers/blog posts). If there's information not already contained in a blog post, you could either:

    1. create the blog post yourself and link to it in doc/llm.md;
    2. create a new separate document with the relevant information, and link to it in README.md; suggestions for title are doc/llm-training.md or doc/llm-end-to-end-training.md
  • doc/overview.md 0 → 100644
    1 # E2E model trainnig SOTA
    • Please don't use abbreviations; they create unnecessary mental burden. There's also no need to mention SOTA: it's implied that you wouldn't add information about something that isn't state of the art, and SOTA is relative anyway; if somebody reads the docs in a year, the state of the art will almost certainly have changed by then. Instead, mention the dates of papers / models etc.

      Change to: "End-to-end model training".

  • doc/overview.md 0 → 100644
    1 # E2E model trainnig SOTA
    2
    3 ## Dataset
    4
    5 Techniques:
    6 - CCNet pipeline
    7 - fastText linear classifier for language filtering
    8
    9 Software:
    10 - Custom setup over Ray
    11
    12 Empiric observations:
    13 - Small fractions of code and multilingual data (5-10%), in line with common recipes for large language models, do not broadly impact zero-shot performance on English tasks
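
    The fastText language filter mentioned above is typically used along these lines. A minimal sketch, assuming fastText's pretrained lid.176.bin language-identification model has been downloaded; the 0.65 threshold is illustrative only:

    ```python
    import fasttext

    # Pretrained language-ID model, see
    # https://fasttext.cc/docs/en/language-identification.html
    lid_model = fasttext.load_model("lid.176.bin")

    def keep_english(text: str, threshold: float = 0.65) -> bool:
        """Keep a document only if the classifier is confident it is English."""
        labels, probs = lid_model.predict(text.replace("\n", " "), k=1)
        return labels[0] == "__label__en" and probs[0] >= threshold

    print(keep_english("The quick brown fox jumps over the lazy dog."))
    ```
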
  • doc/overview.md 0 → 100644
    13 - Small fractions of code and multilingual data (5-10%), in line with common recipes for large language models, do not broadly impact zero-shot performance on English tasks
    14
    15 ## Model Architecture
    16
    17 - Pre-normalization - normalize the input of each transformer sub-layer
    18 - Rotary Embeddings - replace absolute positional embeddings
    19 - Replace the feed-forward blocks by Mixture-of-Expert layers (MoE)
    20 - no biases - improves stability
    21
    22 Optimizations
    23 - GLU based activation
    24 - Parallel attention and MLP blocks - reduce the communication
    25 costs associated with tensor parallelism: this simple modification cuts the number of all_reduce
    26 necessary from two to one per layer
    27 - grouped-query attention (GQA) - faster inference
    28 - sliding window attention (SWA) - reduces computational complexity from quadratic to linear with respect to input sequence length
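
    As a quick illustration of the rotary embeddings listed above, a minimal sketch of the rotation applied to query/key vectors (half-split pairing as in GPT-NeoX/LLaMA; real implementations cache the cos/sin tables and fuse this into the attention kernel):

    ```python
    import torch

    def apply_rotary(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
        """Rotate a (batch, seq, heads, head_dim) tensor by position-dependent angles."""
        _, seq, _, dim = x.shape
        half = dim // 2
        # Frequencies theta_i = base^(-2i/dim), one per pair of dimensions
        freqs = base ** (-torch.arange(half, dtype=torch.float32) / half)
        angles = torch.arange(seq, dtype=torch.float32)[:, None] * freqs[None, :]
        cos = angles.cos()[None, :, None, :]  # broadcast over batch and heads
        sin = angles.sin()[None, :, None, :]
        x1, x2 = x[..., :half], x[..., half:]
        return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

    q = torch.randn(2, 16, 8, 64)  # (batch, seq, heads, head_dim)
    q_rot = apply_rotary(q)        # same shape, now position-encoded
    ```
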
  • doc/overview.md 0 → 100644
    23 - GLU based activation
    24 - Parallel attention and MLP blocks - reduce the communication
    25 costs associated with tensor parallelism: this simple modification cuts the number of all_reduce
    26 necessary from two to one per layer
    27 - grouped-query attention (GQA) - faster inference
    28 - sliding window attention (SWA) - reduces computational complexity from quadratic to linear with respect to input sequence length
    29
    30
    31 ## Training
    32
    33
    34 * gradient cliping
    35 * gradient accumulation
    36 * z-loss - improves stability
    37 * weight decay (AdamW)
    38 * Flash attention
    • Comment on lines +31 to +38

      Please add links or detail how each one improves training. Moreover, I would not necessarily include these here at all, because:

      1. The list is incomplete (for example, mixed-precision 16-bit training isn't mentioned)
      2. They are not really related to each other (e.g. FlashAttention is a hardware optimization, which is orthogonal to the other software techniques used)

      I would probably expand this list and add detail wherever it is unclear what a technique does (gradient clipping and gradient accumulation are old, well-established methods and don't need much explanation; FlashAttention and z-loss might need links or more explanation). Also, weight decay is not guaranteed to improve training; it's rather empirical, and most models seem to use neither weight decay nor dropout, since those mainly help on small datasets. Therefore, I would add info about when and how it helps.

      Format lists as '-' instead of '*'.

      Be consistent when starting/ending list items (start with capital letters, end without period).
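
      For instance, a minimal sketch (toy model, random data and hypothetical hyperparameters) of how gradient accumulation, gradient clipping, z-loss and AdamW weight decay typically fit together in a PyTorch training step:

      ```python
      import torch
      import torch.nn.functional as F
      from torch.nn.utils import clip_grad_norm_

      # Toy "LM head" and random batches, purely for illustration.
      vocab_size, hidden = 256, 128
      model = torch.nn.Linear(hidden, vocab_size)
      optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.1)
      batches = [(torch.randn(32, hidden), torch.randint(0, vocab_size, (32,)))
                 for _ in range(64)]

      accum_steps = 8       # gradient accumulation: simulate a larger effective batch
      max_grad_norm = 1.0   # gradient clipping threshold
      z_loss_coef = 1e-4    # z-loss coefficient (value used in PaLM)

      optimizer.zero_grad()
      for step, (x, y) in enumerate(batches):
          logits = model(x)
          ce = F.cross_entropy(logits, y)
          # z-loss penalizes the log of the softmax normalizer, keeping logits
          # small and the loss numerically stable.
          log_z = torch.logsumexp(logits, dim=-1)
          loss = ce + z_loss_coef * (log_z ** 2).mean()
          (loss / accum_steps).backward()      # accumulate scaled gradients
          if (step + 1) % accum_steps == 0:
              clip_grad_norm_(model.parameters(), max_grad_norm)  # clip global grad norm
              optimizer.step()                 # AdamW applies decoupled weight decay
              optimizer.zero_grad()
      ```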

  • doc/overview.md 0 → 100644
    29
    30
    31 ## Training
    32
    33
    34 * gradient cliping
    35 * gradient accumulation
    36 * z-loss - improves stability
    37 * weight decay (AdamW)
    38 * Flash attention
    39
    40 Optimizations:
    41 * Alternative training objectives
    42 * fill-in-the-middle (FIM) training
    43 * Principled hyperparameters.
    44 * gradient checkpointing - reduces memory usage
    • Comment on lines +40 to +44

      I would merge this list and the above. FlashAttention is also an optimization, if we're to be pedantic.

      Add links to relevant papers/documentation.

      Format lists as '-' instead of '*'.

      Be consistent when starting/ending list items (start with capital letters, end without period).
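
      For fill-in-the-middle specifically, the data transformation is simple enough to sketch. The sentinel token names below are made up; see "Efficient Training of Language Models to Fill in the Middle" (Bavarian et al., 2022) for the actual setup:

      ```python
      import random

      # Hypothetical sentinel tokens; real ones depend on the tokenizer/model.
      PRE, SUF, MID = "<|fim_prefix|>", "<|fim_suffix|>", "<|fim_middle|>"

      def to_fim(document: str, rng: random.Random) -> str:
          """Split a document at two random points and reorder it as
          prefix / suffix / middle, so the model learns to fill in the middle
          given the surrounding context."""
          i, j = sorted(rng.sample(range(len(document) + 1), 2))
          prefix, middle, suffix = document[:i], document[i:j], document[j:]
          return f"{PRE}{prefix}{SUF}{suffix}{MID}{middle}"

      print(to_fim("def add(a, b):\n    return a + b\n", random.Random(0)))
      ```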

  • doc/overview.md 0 → 100644
    40 Optimizations:
    41 * Alternative training objectives
    42 * fill-in-the-middle (FIM) training
    43 * Principled hyperparameters.
    44 * gradient checkpointing - reduces memory usage
    45
    46 ```
    47 > BEFORE
    48 Time: 57.82
    49 Samples/second: 8.86
    50 GPU memory: 14949 MB
    51
    52 > AFTER
    53 Time: 66.03
    54 Samples/second: 7.75
    55 GPU memory: 8681 MB
    • Comment on lines +47 to +55

      I would not necessarily add this here; a single test run on unspecified hardware doesn't really get the point across, as it could be an isolated case. Either add links to more such tests (e.g. in a blog post), or link to the relevant gradient checkpointing paper.
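
      For reference, gradient checkpointing in PyTorch boils down to something like the sketch below (hypothetical toy model). Activations inside each checkpointed block are recomputed during the backward pass instead of being stored, which is why peak memory drops while step time goes up:

      ```python
      import torch
      from torch.utils.checkpoint import checkpoint

      # Toy model: 24 MLP blocks, purely for illustration.
      blocks = torch.nn.ModuleList(
          torch.nn.Sequential(torch.nn.Linear(1024, 1024), torch.nn.GELU())
          for _ in range(24)
      )
      x = torch.randn(16, 1024, requires_grad=True)

      out = x
      for block in blocks:
          # Only the block's input is kept; its internal activations are
          # recomputed when backward() reaches this block.
          out = checkpoint(block, out, use_reentrant=False)
      out.sum().backward()
      ```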

  • doc/overview.md 0 → 100644
    47 > BEFORE
    48 Time: 57.82
    49 Samples/second: 8.86
    50 GPU memory: 14949 MB
    51
    52 > AFTER
    53 Time: 66.03
    54 Samples/second: 7.75
    55 GPU memory: 8681 MB
    56 ```
    57
    58 - checkpointing - save the activations that
    59 are expensive to compute, such as the outputs of
    60 linear layers (needs model and sequence parallelism and custom
    61 backwards pass, not Pytorch autograd)
    62 - overlap computation of activations and the communication between GPUs over the network
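
    The computation/communication overlap mentioned above is usually built on asynchronous collectives. A minimal sketch with torch.distributed (assumes a process group has already been initialized, e.g. under torchrun; shapes are arbitrary):

    ```python
    import torch
    import torch.distributed as dist

    # Assumes dist.init_process_group(...) has already been called.
    grad_bucket = torch.randn(1024, 1024)
    work = dist.all_reduce(grad_bucket, op=dist.ReduceOp.SUM, async_op=True)  # start comms

    # Independent computation proceeds while the all_reduce is in flight.
    activations = torch.relu(torch.randn(1024, 1024) @ torch.randn(1024, 1024))

    work.wait()  # block only when the reduced gradients are actually needed
    ```
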
  • doc/overview.md 0 → 100644
    65 - 3D parallelism
    66 - optimizer sharding
    67
    68 ## Post training
    69
    70 Supervised Fine-Tuning (SFT)
    71 Reinforcement Learning with Human Feedback (RLHF)
    72 - Proximal Policy Optimization (PPO)
    73 - Rejection Sampling fine-tuning.
    74
    75 Reward Modeling
    76
    77 ![GPT 3.5 Post Training](img/chat_diagram.svg)
    78
    79
    80 ## Evaluation
    • Perhaps evaluation / benchmarking should be expanded with more information. For example, it would be relevant to mention the tasks on which models are usually tested (e.g. natural language understanding, math and coding problems, quiz and trivia questions, reasoning and common sense etc.).

  • Left my review. Overall it's a good structure, but I feel more details need to be added. For example, there should most certainly be a section about dataset collection/filtering (i.e. dataset filtering and the CCNet pipeline should be a section in itself). Similarly, benchmarking needs to be expanded. Also, I would probably add models (or links to papers) wherever techniques / optimizations are mentioned.

  • Alexandru-Mihai GHERGHESCU marked this merge request as draft


  • added 1 commit

    • 4e7ebcaa - Add overview of E2E training

