# Compute and memory estimations for training LLMs
 
 
There are a few scripts here useful for predicting compute/memory requirements
for training large language models. They are quite basic and assume no
particular underlying hardware or software framework, but they give a good
baseline, which can then be iterated upon.
 
 
The estimations are based on the usual hyperparameters chosen when training a
Transformer-type LLM: number of layers, number of heads, embeddings dimension,
batch size, training context length and vocabulary size. These should give a
rough idea of the memory requirements for a classical GPT-style decoder-only
Transformer. Modifications on top of that architecture (such as SwiGLU
activations inside the feedforward layer, multi-query or grouped-query
attention, rotary positional embeddings etc.) will likely change the values a
little, but not enough to justify a totally different approach to the
estimations.
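
As a rough illustration of how these hyperparameters translate into model size,
here is a minimal sketch of a GPT-style parameter count; the values and the
exact formula below are assumptions for illustration, not taken from the
scripts.

```python
# Rough parameter count for a plain GPT-style decoder-only Transformer.
# Hyperparameter values below are illustrative assumptions (roughly a 7B model).
n_layers = 32        # number of layers
d_model = 4096       # embeddings dimension
vocab_size = 32_000  # vocabulary size

# Per layer: ~4 * d_model^2 for the attention projections (Q, K, V, output)
# and ~8 * d_model^2 for a standard 4x-wide feedforward block.
params_per_layer = 12 * d_model ** 2
embedding_params = vocab_size * d_model

total_params = n_layers * params_per_layer + embedding_params
print(f"~{total_params / 1e9:.1f}B parameters")  # ~6.6B parameters
```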
 
 
For other types of model architectures (such as mixture of experts, BERT-style
models etc.), do not rely too much on the estimations given by the scripts,
since those architectures use fundamentally different approaches.
 
 
## Memory requirements
 
 
Memory requirements are given by the script `memory_req.py`. Change the values
at the top (or use the predefined defaults), run it and get the output. The
estimations assume full 32-bit floating point training. Mixed precision will
slightly decrease the total memory, since some of the activations are computed
in 16-bit floating point; expect the activation memory to be somewhat lower, at
best halved if all activations are fp16.
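
For a sense of the quantities involved, below is a minimal sketch of the usual
fp32-training rule of thumb (weights, gradients and Adam optimizer state at 4,
4 and 8 bytes per parameter, plus a crude activation term). The constants and
variable names are assumptions; `memory_req.py` may use more detailed formulas.

```python
# Ballpark fp32 training memory: weights + gradients + Adam state + activations.
# The constants are rule-of-thumb assumptions, not the script's exact formula.
total_params = 6.6e9  # parameter count, e.g. from the estimate above
batch_size = 4
seq_len = 2048
n_layers = 32
d_model = 4096

bytes_weights = 4 * total_params  # fp32 weights
bytes_grads = 4 * total_params    # fp32 gradients
bytes_optim = 8 * total_params    # Adam first and second moments, fp32

# Very rough activation estimate (16 bytes per token, per hidden unit, per
# layer assumed); the real value depends heavily on checkpointing and framework.
bytes_activations = 16 * batch_size * seq_len * d_model * n_layers

total_gib = (bytes_weights + bytes_grads + bytes_optim + bytes_activations) / 2**30
print(f"~{total_gib:.0f} GiB")  # ~114 GiB for these values
```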
 
 
The memory value calculated by the script essentially covers the model, the
activations, the gradients and the optimizer state. If the model is big
(billions of parameters), it will most likely not fit into a single GPU. In
this case, tensor parallelism and pipeline parallelism are commonly used
methods, and these add a memory overhead, since the activations need to be
copied and passed around. Tensor parallelism is usually used intra-node (inside
a DGX node, for example), while pipeline parallelism is used inter-node (across
DGX nodes). The final memory output by the script is therefore only the memory
required for the tensor- and pipeline-parallel copy of the model. Data
parallelism is oftentimes further used on top, and each data-parallel replica
increases the number of needed GPUs. E.g. training a 65B model requires ~32
GPUs (4 DGX nodes, each with 8x A100s), which corresponds to 8-way tensor
parallelism intra-node and 4-way pipeline parallelism inter-node. Scaling up
then happens using data parallelism. For example, using 64-way data parallelism
would result in a total number of GPUs of `32 (the base number of GPUs needed
to hold the model, i.e. 4 DGX nodes) * 64 (data-parallel replicas, each adding
another copy of the model) = 2048 GPUs`.
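
Written out as a quick sanity check (the parallelism degrees are the ones from
the example above):

```python
# Worked example: a 65B model on DGX nodes with 8x A100 each.
tensor_parallel = 8     # intra-node (one DGX node)
pipeline_parallel = 4   # inter-node (4 DGX nodes)
data_parallel = 64      # full replicas of the model

model_parallel_gpus = tensor_parallel * pipeline_parallel  # 32 GPUs hold one copy
total_gpus = model_parallel_gpus * data_parallel           # 32 * 64 = 2048 GPUs
print(model_parallel_gpus, total_gpus)
```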
 
 
For a more detailed overview of the above, see [Nvidia's great blog post on
scaling models using
Megatron](https://developer.nvidia.com/blog/scaling-language-model-training-to-a-trillion-parameters-using-megatron/),
as well as [scaling experiments using Megatron and AMD on the LUMI
cluster](https://lumi-supercomputer.eu/scaling-the-pre-training-of-large-language-models-of-100b-parameters-to-thousands-of-amd-mi250x-gpus-on-lumi/).
 
 
## Compute requirements
 
 
Compute requirements for training models can be calculated using the script
`compute_req.py`. Change the values at the top (or use predefined defaults),
run it and get the output.
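
As an illustration of what such an estimate looks like, here is a minimal
sketch based on the common `FLOPs ≈ 6 * parameters * tokens` approximation,
combined with an assumed per-GPU peak throughput and utilization; the exact
method and constants in `compute_req.py` may differ.

```python
# Rough training compute estimate using the common 6 * N * D approximation.
# Hardware figures (A100 bf16 peak, 40% utilization) are illustrative assumptions.
total_params = 6.6e9         # model size N
dataset_tokens = 1.0e12      # training tokens D
n_gpus = 32                  # number of GPUs training the model
peak_flops_per_gpu = 312e12  # A100 dense bf16/fp16 tensor core peak
utilization = 0.4            # typical achieved fraction of peak

total_flops = 6 * total_params * dataset_tokens
seconds = total_flops / (n_gpus * peak_flops_per_gpu * utilization)
print(f"~{total_flops:.1e} FLOPs, ~{seconds / 86400:.0f} days on {n_gpus} GPUs")
```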
 
 
Notice that total compute is not affected by either batch size or context
length. Since the model needs to see the whole dataset anyway, it doesn't
really matter how the dataset is partitioned (whether into fewer, larger chunks
or more, smaller ones). Batch size and context length will, however, affect
memory usage. Context length will also indirectly affect dataset size: the
intuition is that a bigger context needs more dataset tokens to be fully
trained. Increasing the context length should therefore generally come with an
increase in dataset size, though the scaling is definitely not linear (it's a
best-guess scenario).
 
 
Be careful with the estimations when the numbers involved are small (a small
dataset, a model with few parameters etc.), as communication/software overheads
start to matter when the compute needed per update step is low. GPUs usually
work best when fed big matrices, which keep them more fully occupied.
 
 
## Running the scripts together
 
 
> You probably want to first run `memory_req.py`, which outputs the number of
> GPUs needed for baseline model parallelism (tensor + pipeline). Don't worry
> too much about adjusting the batch size, as gradient accumulation can be used
> to increase that value without memory overhead. The total number of GPUs
> should then be adapted in `compute_req.py`, multiplied by whatever
> data-parallel factor is used (2x, 3x, 4x etc.), as described above; a short
> sketch of this bookkeeping follows below.
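
A minimal sketch of that bookkeeping, with placeholder numbers (the scripts
themselves are configured by editing the values at their top, not imported):

```python
# Combining the two scripts' outputs by hand (illustrative numbers only).
base_gpus = 32            # from memory_req.py: tensor + pipeline parallel GPUs
data_parallel_factor = 4  # chosen scale-up factor
total_gpus = base_gpus * data_parallel_factor  # value to plug into compute_req.py

# Batch size can be raised without extra activation memory via gradient
# accumulation; only the micro-batch has to fit on the GPU.
micro_batch = 4         # per-GPU batch size that fits in memory
grad_accum_steps = 8    # gradient accumulation steps
global_batch = micro_batch * grad_accum_steps * data_parallel_factor
print(total_gpus, global_batch)
```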