
Compute/memory requirements scripts

Open Alexandru-Mihai GHERGHESCU requested to merge feature/scripts into main
@@ -19,25 +19,35 @@ Other types of model architectures (such as mixture of experts, BERT-style
models, etc.) are not well covered by the scripts' estimations, since those
architectures use fundamentally different approaches.

The estimations, as mentioned above, do not take any particular underlying
hardware into consideration. In other words, *the estimations assume a perfect
world with a single GPU that has infinite memory and infinite compute*. Network
communication, model splitting, parallelism and other such concerns are not
considered by the scripts. As a rule of thumb, however, you can assume a 20-40%
overhead on memory, and possibly a bit more on compute (typical GPU utilization
when training on hundreds of GPUs is 30-40% per GPU, meaning that 60-70% of your
GPU-hours will most likely be spent on communication).
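As a purely illustrative sketch, this is how those rules of thumb could be
applied to the scripts' output (the numbers and variable names below are made
up for the example, they are not produced by the scripts):

```python
# Hypothetical adjustment of the ideal, single-GPU estimates for real-world
# overheads. The input numbers are made-up placeholders, not script output.
ideal_memory_gb = 1200.0    # e.g. total memory reported by memory_req.py
ideal_gpu_hours = 50_000.0  # e.g. compute reported by the compute script

memory_overhead = 0.30      # assumed 20-40% extra memory in practice
gpu_utilization = 0.35      # assumed 30-40% utilization at scale

practical_memory_gb = ideal_memory_gb * (1 + memory_overhead)
practical_gpu_hours = ideal_gpu_hours / gpu_utilization

print(f"Memory with overhead: ~{practical_memory_gb:.0f} GB")
print(f"GPU-hours at {gpu_utilization:.0%} utilization: ~{practical_gpu_hours:.0f}")
```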

The scripts also do not consider techniques such as mixed precision training or
gradient checkpointing, which will almost certainly reduce the memory and
compute used. These, however, depend very much on the particular setup and on
what the user expects from the memory / compute trade-off. As a rule of thumb,
mixed precision will decrease training time by 20-40%, while gradient
checkpointing can reduce memory usage by quite a large amount, trading memory
for compute (meaning more training time needed overall). The estimations should
therefore be taken as a worst-case scenario in this regard.
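A similarly hypothetical sketch of how the worst-case numbers shift once these
techniques are used (the percentages are the rules of thumb above, everything
else is an assumption for illustration):

```python
# Hypothetical adjustment of the worst-case estimates for mixed precision and
# gradient checkpointing. All input numbers are assumptions for illustration.
worst_case_gpu_hours = 50_000.0    # compute estimate, fp32, no checkpointing
worst_case_activations_gb = 600.0  # activation memory estimate, fp32

mixed_precision_speedup = 0.30     # rule of thumb: 20-40% less training time
checkpoint_memory_factor = 0.2     # assumed fraction of activations kept
checkpoint_compute_penalty = 0.33  # assumed cost of recomputing activations

amp_gpu_hours = worst_case_gpu_hours * (1 - mixed_precision_speedup)
checkpointed_activations_gb = worst_case_activations_gb * checkpoint_memory_factor
checkpointed_gpu_hours = worst_case_gpu_hours * (1 + checkpoint_compute_penalty)

print(f"With mixed precision: ~{amp_gpu_hours:.0f} GPU-hours")
print(f"With checkpointing: ~{checkpointed_activations_gb:.0f} GB of activations, "
      f"but ~{checkpointed_gpu_hours:.0f} GPU-hours")
```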

## Memory requirements

Memory requirements are given by the script `memory_req.py`. Change the values
at the top (or use the predefined defaults), run it and read the output. The
estimates assume full 32-bit floating point training (mixed precision will
slightly decrease the total memory, since some of the activations will be
computed in 16-bit floating point; expect the activation memory to be somewhat
lower, ideally halved if all activations are fp16).
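A minimal sketch of the kind of accounting involved, assuming full fp32
training and an Adam-style optimizer (the actual breakdown in `memory_req.py`
may differ, and the activation figure below is an arbitrary placeholder):

```python
# Rough fp32 training memory estimate: weights + gradients + Adam optimizer
# state, plus activations. A sketch of the accounting, not the actual script.
def training_memory_gb(n_params: float, activations_gb: float) -> float:
    bytes_per_param = 4  # fp32 weights
    bytes_per_grad = 4   # fp32 gradients
    bytes_per_optim = 8  # Adam keeps two fp32 moments per parameter

    model_gb = n_params * bytes_per_param / 1e9
    grads_gb = n_params * bytes_per_grad / 1e9
    optim_gb = n_params * bytes_per_optim / 1e9
    return model_gb + grads_gb + optim_gb + activations_gb

# Example: a 65B-parameter model with an assumed 400 GB of activations.
print(f"~{training_memory_gb(65e9, 400.0):.0f} GB total")
```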

The memory value calculated by the script essentially covers the model, the
activations, the gradients and the optimizer state. If the model is big (on the
order of billions of parameters), it will most likely not fit into a single
GPU. In that case, tensor parallelism and pipeline parallelism are commonly
used, and these add a memory overhead, since activations need to be copied and
passed around. Tensor parallelism is usually used intra-node (within a DGX
node, for example), while pipeline parallelism is used inter-node (between DGX
nodes). The final memory output by the script is therefore only the memory
required for tensor and pipeline parallelism. Data parallelism is often used on
top of that, and each data-parallel replica increases the number of GPUs
needed. For example, training a 65B model requires ~32 GPUs (4 DGX nodes, each
with 8x A100s); this corresponds to 8-way tensor parallelism intra-node and
4-way pipeline parallelism inter-node. Scaling up then happens through data
parallelism: 32-way data parallelism results in a total of `32 (the base number
of GPUs needed to hold the model, i.e. 4 DGX nodes) * 32 (data-parallel
replicas, each adding a full model copy) = 1024 GPUs`.
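The GPU-count arithmetic from the example above, written out explicitly:

```python
# One model replica is held by 8-way tensor parallelism x 4-way pipeline
# parallelism; data parallelism multiplies the number of replicas.
tensor_parallel = 8    # 8-way TP inside one DGX node (8x A100)
pipeline_parallel = 4  # 4-way PP across 4 DGX nodes
data_parallel = 32     # each replica adds a full model copy

gpus_per_replica = tensor_parallel * pipeline_parallel  # 32 GPUs per model copy
total_gpus = gpus_per_replica * data_parallel           # 32 * 32 = 1024 GPUs
print(total_gpus)  # 1024
```
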
For a more detailed overview of the above, see [Nvidia's great blog post on
scaling models using
Megatron](https://developer.nvidia.com/blog/scaling-language-model-training-to-a-trillion-parameters-using-megatron/),
as well as [scaling experiments using Megatron and AMD on the LUMI
cluster](https://lumi-supercomputer.eu/scaling-the-pre-training-of-large-language-models-of-100b-parameters-to-thousands-of-amd-mi250x-gpus-on-lumi/).