# Compute and memory estimations for training LLMs
 
 
There are a few scripts here useful for predicting compute/memory requirements
for training large language models. They are quite basic and assume no
particular underlying hardware or software framework, but they give a good
baseline, which can then be iterated upon.
 
 
The estimations are based on the usual hyperparameters chosen when training a
Transformer-type LLM: number of layers, number of heads, embeddings dimension,
batch size, training context length and vocabulary size. These should give a
rough idea of the memory requirements for a classical GPT-style decoder-only
Transformer. Modifications on top of that architecture (such as SwiGLU
activations inside the feedforward layer, multi-query or grouped-query
attention, rotary positional embeddings etc.) will likely change the values a
little, but not enough to justify a totally different approach to the
estimations.
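
As a rough illustration of how these hyperparameters translate into model size,
here is a minimal sketch of a GPT-style parameter count; the values and the
exact formula below are assumptions for illustration, not taken from the
scripts.

```python
# Rough parameter count for a plain GPT-style decoder-only Transformer.
# Hyperparameter values below are illustrative assumptions (roughly a 7B model).
n_layers = 32        # number of layers
d_model = 4096       # embeddings dimension
vocab_size = 32_000  # vocabulary size

# Per layer: ~4 * d_model^2 for the attention projections (Q, K, V, output)
# and ~8 * d_model^2 for a standard 4x-wide feedforward block.
params_per_layer = 12 * d_model ** 2
embedding_params = vocab_size * d_model

total_params = n_layers * params_per_layer + embedding_params
print(f"~{total_params / 1e9:.1f}B parameters")  # ~6.6B parameters
```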
 
 
For other types of model architectures (such as mixture of experts, BERT-style
models etc.), do not rely too much on the estimations given by the scripts,
since those architectures use fundamentally different approaches.
 
 
## Memory requirements
 
 
Memory requirements are given by the script `memory_req.py`. Change the values
at the top (or use the predefined defaults), run it and get the output. The
estimations assume full 32-bit floating point training. Mixed precision will
slightly decrease the total memory, since some of the activations are computed
in 16-bit floating point; expect the activation memory to be somewhat lower, at
best halved if all activations are fp16.
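
For a sense of the quantities involved, below is a minimal sketch of the usual
fp32-training rule of thumb (weights, gradients and Adam optimizer state at 4,
4 and 8 bytes per parameter, plus a crude activation term). The constants and
variable names are assumptions; `memory_req.py` may use more detailed formulas.

```python
# Ballpark fp32 training memory: weights + gradients + Adam state + activations.
# The constants are rule-of-thumb assumptions, not the script's exact formula.
total_params = 6.6e9  # parameter count, e.g. from the estimate above
batch_size = 4
seq_len = 2048
n_layers = 32
d_model = 4096

bytes_weights = 4 * total_params  # fp32 weights
bytes_grads = 4 * total_params    # fp32 gradients
bytes_optim = 8 * total_params    # Adam first and second moments, fp32

# Very rough activation estimate (16 bytes per token, per hidden unit, per
# layer assumed); the real value depends heavily on checkpointing and framework.
bytes_activations = 16 * batch_size * seq_len * d_model * n_layers

total_gib = (bytes_weights + bytes_grads + bytes_optim + bytes_activations) / 2**30
print(f"~{total_gib:.0f} GiB")  # ~114 GiB for these values
```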
 
 
The memory value calculated by the script essentially covers the model, the
activations, the gradients and the optimizer state. If the model is big
(billions of parameters), it will most likely not fit into a single GPU. In
this case, tensor parallelism and pipeline parallelism are commonly used
methods, and these add a memory overhead, since the activations need to be
copied and passed around. Tensor parallelism is usually used intra-node (inside
a DGX node, for example), while pipeline parallelism is used inter-node (across
DGX nodes). The final memory output by the script is therefore only the memory
required for the tensor- and pipeline-parallel copy of the model. Data
parallelism is oftentimes further used on top, and each data-parallel replica
increases the number of needed GPUs. E.g. training a 65B model requires ~32
GPUs (4 DGX nodes, each with 8x A100s), which corresponds to 8-way tensor
parallelism intra-node and 4-way pipeline parallelism inter-node. Scaling up
then happens using data parallelism. For example, using 64-way data parallelism
would result in a total number of GPUs of `32 (the base number of GPUs needed
to hold the model, i.e. 4 DGX nodes) * 64 (data-parallel replicas, each adding
another copy of the model) = 2048 GPUs`.
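
Written out as a quick sanity check (the parallelism degrees are the ones from
the example above):

```python
# Worked example: a 65B model on DGX nodes with 8x A100 each.
tensor_parallel = 8     # intra-node (one DGX node)
pipeline_parallel = 4   # inter-node (4 DGX nodes)
data_parallel = 64      # full replicas of the model

model_parallel_gpus = tensor_parallel * pipeline_parallel  # 32 GPUs hold one copy
total_gpus = model_parallel_gpus * data_parallel           # 32 * 64 = 2048 GPUs
print(model_parallel_gpus, total_gpus)
```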
 
 
For a more detailed overview of the above, see [Nvidia's great blog post on
scaling models using
Megatron](https://developer.nvidia.com/blog/scaling-language-model-training-to-a-trillion-parameters-using-megatron/),
as well as [scaling experiments using Megatron and AMD on the LUMI
cluster](https://lumi-supercomputer.eu/scaling-the-pre-training-of-large-language-models-of-100b-parameters-to-thousands-of-amd-mi250x-gpus-on-lumi/).
 
 
## Compute requirements
 
 
Compute requirements for training models can be calculated using the script
`compute_req.py`. Change the values at the top (or use predefined defaults),
run it and get the output.
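
As an illustration of what such an estimate looks like, here is a minimal
sketch based on the common `FLOPs ≈ 6 * parameters * tokens` approximation,
combined with an assumed per-GPU peak throughput and utilization; the exact
method and constants in `compute_req.py` may differ.

```python
# Rough training compute estimate using the common 6 * N * D approximation.
# Hardware figures (A100 bf16 peak, 40% utilization) are illustrative assumptions.
total_params = 6.6e9         # model size N
dataset_tokens = 1.0e12      # training tokens D
n_gpus = 32                  # number of GPUs training the model
peak_flops_per_gpu = 312e12  # A100 dense bf16/fp16 tensor core peak
utilization = 0.4            # typical achieved fraction of peak

total_flops = 6 * total_params * dataset_tokens
seconds = total_flops / (n_gpus * peak_flops_per_gpu * utilization)
print(f"~{total_flops:.1e} FLOPs, ~{seconds / 86400:.0f} days on {n_gpus} GPUs")
```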
 
 
Notice that total compute is not affected by either batch size or context
length. Since the model needs to see the whole dataset anyway, it doesn't
really matter how the dataset is partitioned (whether into fewer, larger chunks
or more, smaller ones). Batch size and context length will, however, affect
memory usage. Context length will also indirectly affect dataset size: the
intuition is that a bigger context needs more dataset tokens to be fully
trained. Increasing the context length should therefore generally come with an
increase in dataset size, though the scaling is definitely not linear (it's a
best-guess scenario).
 
 
Be careful with the estimations when the numbers involved are small (a small
dataset, a model with few parameters etc.), as communication/software overheads
start to matter when the compute needed per update step is low. GPUs usually
work best when fed big matrices, which keep them more fully occupied.
 
 
## Running the scripts together
 
 
> You probably want to first run `memory_req.py`, which outputs the number of
> GPUs needed for baseline model parallelism (tensor + pipeline). Don't worry
> too much about adjusting the batch size, as gradient accumulation can be used
> to increase that value without memory overhead. The total number of GPUs
> should then be adapted in `compute_req.py`, multiplied by whatever
> data-parallel factor is used (2x, 3x, 4x etc.), as described above; a short
> sketch of this bookkeeping follows below.
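
A minimal sketch of that bookkeeping, with placeholder numbers (the scripts
themselves are configured by editing the values at their top, not imported):

```python
# Combining the two scripts' outputs by hand (illustrative numbers only).
base_gpus = 32            # from memory_req.py: tensor + pipeline parallel GPUs
data_parallel_factor = 4  # chosen scale-up factor
total_gpus = base_gpus * data_parallel_factor  # value to plug into compute_req.py

# Batch size can be raised without extra activation memory via gradient
# accumulation; only the micro-batch has to fit on the GPU.
micro_batch = 4         # per-GPU batch size that fits in memory
grad_accum_steps = 8    # gradient accumulation steps
global_batch = micro_batch * grad_accum_steps * data_parallel_factor
print(total_gpus, global_batch)
```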