
Compute/memory requirements scripts

Open Alexandru-Mihai GHERGHESCU requested to merge feature/scripts into main
# Compute and memory estimations for training LLMs

There are a few scripts here useful for predicting compute/memory requirements
for training large language models. They are quite basic, assuming no particular
underlying hardware or software framework; however, they give a good baseline,
which can then be iterated upon.

The estimations are based on the usual hyperparameters chosen when training a
Transformer-type LLM: number of layers, number of heads, embeddings dimension,
batch size, training context length and vocabulary size. These should give a
rough idea of the memory requirements for a classical GPT-style decoder-only
Transformer. Modifications on top of that architecture (such as SwiGLU
activations inside the feedforward layer, multi-query or grouped-query
attention, rotary positional embeddings etc.) will likely change the values
slightly, but not enough to justify a completely different approach to the
estimations.

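To make the mapping from hyperparameters to model size concrete, here is a
minimal sketch (not one of the scripts in this MR; the variable names and the
4x feedforward expansion are assumptions) that estimates the parameter count of
a plain GPT-style decoder-only model:

```python
# Rough parameter count for a plain GPT-style decoder-only Transformer.
# Variable names and the 4x feedforward expansion are illustrative assumptions.
n_layers = 32        # number of layers
d_model = 4096       # embeddings dimension
vocab_size = 32000   # vocabulary size

# Per layer: ~4 * d_model^2 for attention (Q, K, V and output projections)
# plus ~8 * d_model^2 for a feedforward block with a 4x hidden expansion.
params_per_layer = 12 * d_model ** 2
embedding_params = vocab_size * d_model  # token embedding table
# (positional embeddings and biases omitted for simplicity)

total_params = n_layers * params_per_layer + embedding_params
print(f"~{total_params / 1e9:.2f}B parameters")
```

With the values above this lands around 6.6B parameters, i.e. roughly the
ballpark of a 7B model; SwiGLU, untied embeddings and similar tweaks shift the
number a bit, as noted.
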
Other types of model architectures (such as mixture-of-experts or BERT-style
models) should not rely too much on the estimations given by the scripts, since
those architectures take fundamentally different approaches.

## Memory requirements
Memory requirements are given by the script `memory_req.py`. Change the values
at the top (or use the predefined defaults), then run it to get the output. The
estimations assume full 32-bit floating point training (mixed precision will
slightly decrease the total memory, since some of the activations will be
computed in 16-bit floating point; therefore, expect the activations to take
slightly less memory, ideally half if all activations are fp16).

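As a hedged sketch of what goes into such an estimate (this is not the actual
`memory_req.py` logic; it assumes fp32 training with an Adam-style optimizer,
i.e. roughly 16 bytes per parameter before activations):

```python
# Rough fp32 training memory estimate (illustrative, not memory_req.py itself).
# Assumes an Adam-style optimizer: weights + gradients + two optimizer moments.
total_params = 7e9        # e.g. a ~7B-parameter model
bytes_per_param = 4       # fp32

weights = total_params * bytes_per_param
gradients = total_params * bytes_per_param
optimizer = total_params * 2 * bytes_per_param  # first and second Adam moments

fixed_gib = (weights + gradients + optimizer) / 2**30
print(f"weights + gradients + optimizer state: ~{fixed_gib:.0f} GiB")
# Activations come on top of this and scale with batch size and context length.
```
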
The memory value calculated by the script essentially covers the model weights,
the activations, the gradients and the optimizer state. If the model is big
(billions of parameters), it will most likely not fit on a single GPU. In that
case, tensor parallelism and pipeline parallelism are the commonly used methods,
and these add a memory overhead, since the activations need to be copied and
passed around. Tensor parallelism is usually used intra-node (inside a DGX node,
for example), while pipeline parallelism is used inter-node (between DGX nodes).
The final memory output by the script is therefore only the memory required to
hold one model replica, split across GPUs through tensor and pipeline
parallelism. Data parallelism is often used on top of this, and each
data-parallel replica multiplies the number of GPUs needed. E.g. training a 65B
model requires ~32 GPUs to hold the model (4 DGX nodes, each with 8x A100s).
This results in 8-way tensor parallelism intra-node and 4-way pipeline
parallelism inter-node. Scaling up then happens through data parallelism. For
example, using 32-way data parallelism would result in a total of `32 (the base
number of GPUs needed to hold the model, i.e. 4 DGX nodes) * 32 (data-parallel,
each replica adds a full copy of the model on top) = 1024 GPUs`.

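The same arithmetic, written out (the parallelism degrees below simply mirror
the 65B example above; they are not something the script computes):

```python
# Total GPUs = tensor-parallel * pipeline-parallel * data-parallel degree.
# Values mirror the 65B example above and are illustrative only.
tensor_parallel = 8     # intra-node, e.g. one DGX node with 8x A100s
pipeline_parallel = 4   # inter-node, 4 DGX nodes hold one model replica
data_parallel = 32      # each replica adds a full copy of the model

gpus_per_replica = tensor_parallel * pipeline_parallel  # 32 GPUs for one replica
total_gpus = gpus_per_replica * data_parallel           # 1024 GPUs in total
print(gpus_per_replica, total_gpus)
```
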
For a more detailed overview of the above, see [Nvidia's great blog post on
scaling models using
Megatron](https://developer.nvidia.com/blog/scaling-language-model-training-to-a-trillion-parameters-using-megatron/),
as well as [scaling experiments using Megatron and AMD on the LUMI
cluster](https://lumi-supercomputer.eu/scaling-the-pre-training-of-large-language-models-of-100b-parameters-to-thousands-of-amd-mi250x-gpus-on-lumi/).