
Compute/memory requirements scripts

Open Alexandru-Mihai GHERGHESCU requested to merge feature/scripts into main
@@ -19,25 +19,35 @@ Other types of model architectures (such as mixture of experts, BERT-style
models, etc.) are not well covered by the scripts' estimations, since those
architectures use fundamentally different approaches.

The estimations, as mentioned above, do not take any particular underlying
hardware into consideration. In other words, *the estimations assume a perfect
world with a single GPU that has infinite memory and infinite compute*. Network
communication, model splitting, parallelism and other such concerns are not
considered by the scripts. As a rule of thumb, however, you can assume a 20-40%
overhead on memory, and possibly a bit more on compute (typical GPU utilization
when training on hundreds of GPUs is 30-40% per GPU, meaning that 60-70% of your
GPU-hours will most likely be spent on communication).
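As a purely illustrative sketch, this is how those rules of thumb could be
applied to the scripts' output (the numbers and variable names below are made
up for the example, they are not produced by the scripts):

```python
# Hypothetical adjustment of the ideal, single-GPU estimates for real-world
# overheads. The input numbers are made-up placeholders, not script output.
ideal_memory_gb = 1200.0    # e.g. total memory reported by memory_req.py
ideal_gpu_hours = 50_000.0  # e.g. compute reported by the compute script

memory_overhead = 0.30      # assumed 20-40% extra memory in practice
gpu_utilization = 0.35      # assumed 30-40% utilization at scale

practical_memory_gb = ideal_memory_gb * (1 + memory_overhead)
practical_gpu_hours = ideal_gpu_hours / gpu_utilization

print(f"Memory with overhead: ~{practical_memory_gb:.0f} GB")
print(f"GPU-hours at {gpu_utilization:.0%} utilization: ~{practical_gpu_hours:.0f}")
```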

The scripts also do not consider techniques such as mixed precision training or
gradient checkpointing, which will almost certainly reduce the memory and
compute used. These, however, depend very much on the particular setup and on
what the user expects from the memory / compute trade-off. As a rule of thumb,
mixed precision will decrease training time by 20-40%, while gradient
checkpointing can reduce memory usage by quite a large amount, trading memory
for compute (meaning more training time needed overall). The estimations should
therefore be taken as a worst-case scenario in this regard.
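A similarly hypothetical sketch of how the worst-case numbers shift once these
techniques are used (the percentages are the rules of thumb above, everything
else is an assumption for illustration):

```python
# Hypothetical adjustment of the worst-case estimates for mixed precision and
# gradient checkpointing. All input numbers are assumptions for illustration.
worst_case_gpu_hours = 50_000.0    # compute estimate, fp32, no checkpointing
worst_case_activations_gb = 600.0  # activation memory estimate, fp32

mixed_precision_speedup = 0.30     # rule of thumb: 20-40% less training time
checkpoint_memory_factor = 0.2     # assumed fraction of activations kept
checkpoint_compute_penalty = 0.33  # assumed cost of recomputing activations

amp_gpu_hours = worst_case_gpu_hours * (1 - mixed_precision_speedup)
checkpointed_activations_gb = worst_case_activations_gb * checkpoint_memory_factor
checkpointed_gpu_hours = worst_case_gpu_hours * (1 + checkpoint_compute_penalty)

print(f"With mixed precision: ~{amp_gpu_hours:.0f} GPU-hours")
print(f"With checkpointing: ~{checkpointed_activations_gb:.0f} GB of activations, "
      f"but ~{checkpointed_gpu_hours:.0f} GPU-hours")
```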

## Memory requirements

Memory requirements are given by the script `memory_req.py`. Change the values
at the top (or use the predefined defaults), run it and read the output. The
estimates assume full 32-bit floating point training (mixed precision will
slightly decrease the total memory, since some of the activations will be
computed in 16-bit floating point; expect the activation memory to be somewhat
lower, ideally halved if all activations are fp16).
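A minimal sketch of the kind of accounting involved, assuming full fp32
training and an Adam-style optimizer (the actual breakdown in `memory_req.py`
may differ, and the activation figure below is an arbitrary placeholder):

```python
# Rough fp32 training memory estimate: weights + gradients + Adam optimizer
# state, plus activations. A sketch of the accounting, not the actual script.
def training_memory_gb(n_params: float, activations_gb: float) -> float:
    bytes_per_param = 4  # fp32 weights
    bytes_per_grad = 4   # fp32 gradients
    bytes_per_optim = 8  # Adam keeps two fp32 moments per parameter

    model_gb = n_params * bytes_per_param / 1e9
    grads_gb = n_params * bytes_per_grad / 1e9
    optim_gb = n_params * bytes_per_optim / 1e9
    return model_gb + grads_gb + optim_gb + activations_gb

# Example: a 65B-parameter model with an assumed 400 GB of activations.
print(f"~{training_memory_gb(65e9, 400.0):.0f} GB total")
```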

The memory value calculated by the script essentially covers the model, the
activations, the gradients and the optimizer state. If the model is big (on the
order of billions of parameters), it will most likely not fit into a single
GPU. In that case, tensor parallelism and pipeline parallelism are commonly
used, and these add a memory overhead, since activations need to be copied and
passed around. Tensor parallelism is usually used intra-node (within a DGX
node, for example), while pipeline parallelism is used inter-node (between DGX
nodes). The final memory output by the script is therefore only the memory
required for tensor and pipeline parallelism. Data parallelism is often used on
top of that, and each data-parallel replica increases the number of GPUs
needed. For example, training a 65B model requires ~32 GPUs (4 DGX nodes, each
with 8x A100s); this corresponds to 8-way tensor parallelism intra-node and
4-way pipeline parallelism inter-node. Scaling up then happens through data
parallelism: 32-way data parallelism results in a total of `32 (the base number
of GPUs needed to hold the model, i.e. 4 DGX nodes) * 32 (data-parallel
replicas, each adding a full model copy) = 1024 GPUs`.
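The GPU-count arithmetic from the example above, written out explicitly:

```python
# One model replica is held by 8-way tensor parallelism x 4-way pipeline
# parallelism; data parallelism multiplies the number of replicas.
tensor_parallel = 8    # 8-way TP inside one DGX node (8x A100)
pipeline_parallel = 4  # 4-way PP across 4 DGX nodes
data_parallel = 32     # each replica adds a full model copy

gpus_per_replica = tensor_parallel * pipeline_parallel  # 32 GPUs per model copy
total_gpus = gpus_per_replica * data_parallel           # 32 * 32 = 1024 GPUs
print(total_gpus)  # 1024
```
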
For a more detailed overview of the above, see [Nvidia's great blog post on
scaling models using
Megatron](https://developer.nvidia.com/blog/scaling-language-model-training-to-a-trillion-parameters-using-megatron/),
as well as [scaling experiments using Megatron and AMD on the LUMI
cluster](https://lumi-supercomputer.eu/scaling-the-pre-training-of-large-language-models-of-100b-parameters-to-thousands-of-amd-mi250x-gpus-on-lumi/).