
Compute/memory requirements scripts

Open Alexandru-Mihai GHERGHESCU requested to merge feature/scripts into main
@@ -8,9 +8,9 @@ which can then be iterated upon.
The estimations are given based on the usual hyperparameters chosen when
training a Transformer-type LLM: number of layers, number of heads, embeddings
dimension, batch size, training context length, vocabulary size. These should
give a rough idea as to the requirements for a classical GPT-style decoder-only
Transformer. Other modifications on top of the architecture (such as SwiGLU
activations inside the feedforward layer, multi-query or grouped-query
attention, rotary positional embeddings etc.) will likely change the values by a
little bit, however not enough to justify a totally different approach to the
estimations.
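As a concrete illustration, the hyperparameters listed above could be collected into a configuration like the following (names and values are purely illustrative, not necessarily the actual variables used in the scripts):

```python
# Hypothetical hyperparameter set for a GPT-style decoder-only Transformer,
# matching the knobs the scripts estimate from (illustrative values):
config = {
    "n_layers": 32,        # number of layers
    "n_heads": 32,         # number of attention heads
    "d_model": 4096,       # embeddings dimension
    "batch_size": 4,       # sequences per step
    "context_len": 2048,   # training context length
    "vocab_size": 32000,   # vocabulary size
}
```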
@@ -46,8 +46,50 @@ total number of GPUs of `32 (the base number of GPUs needed to hold the model,
consisting of 4x DGX) * 64 (data-parallel, each unit adds a model on top) = 2048
GPUs`.
**Note:** Keep in mind that splitting a model across multiple GPUs/clusters
means assigning layers to each GPU/cluster. You can't assign a layer and a half
to one GPU and another layer and a half to another GPU. 3 layers would
(depending on model size etc.) most likely be split across 3 GPUs, leaving the
cards half-filled. Don't worry too much about the empty memory, as it can
easily be filled by increasing the batch size. The important takeaway is that
splitting a model isn't just a simple division of the total memory needed by
the model by the memory available on a single GPU (although that's what the
script does, for lack of a better approximation method). Expect, therefore,
more GPUs to be needed for a correct partitioning of the model layers.
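The layer-granularity point can be sketched numerically. The helper below is hypothetical (not part of `memory_req.py`); it compares the naive memory division against a split that only assigns whole layers to a GPU, under assumed per-layer and per-GPU memory figures:

```python
import math

def gpus_for_model(n_layers, mem_per_layer_gb, gpu_mem_gb):
    """Compare two GPU-count estimates (hypothetical helper).

    naive: total model memory divided by per-GPU memory, rounded up.
    layer_granular: layers cannot be split, so first compute how many
    whole layers fit on one GPU, then round up the layer count.
    """
    naive = math.ceil(n_layers * mem_per_layer_gb / gpu_mem_gb)
    layers_per_gpu = max(1, int(gpu_mem_gb // mem_per_layer_gb))
    layer_granular = math.ceil(n_layers / layers_per_gpu)
    return naive, layer_granular

# Assumed figures: 3 layers of 40 GB each, on 64 GB GPUs.
# Naive division says 2 GPUs (120 GB / 64 GB), but since only one whole
# 40 GB layer fits per 64 GB GPU, 3 GPUs are actually needed,
# leaving each card roughly half-filled.
naive, granular = gpus_for_model(3, 40, 64)  # (2, 3)
```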
For a more detailed overview of the above, see [Nvidia's great blog post on
scaling models using
Megatron](https://developer.nvidia.com/blog/scaling-language-model-training-to-a-trillion-parameters-using-megatron/),
as well as [scaling experiments using Megatron and AMD on the LUMI
cluster](https://lumi-supercomputer.eu/scaling-the-pre-training-of-large-language-models-of-100b-parameters-to-thousands-of-amd-mi250x-gpus-on-lumi/).
## Compute requirements
Compute requirements for training models can be calculated using the script
`compute_req.py`. Change the values at the top (or use the predefined
defaults), run it and read the output.
Notice that total compute is not affected by either batch size or context
length. Since the model needs to see the whole dataset anyway, it doesn't
really matter how the dataset is partitioned (whether into fewer big chunks or
more small chunks). Batch size and context length will, however, affect memory
usage. Context length will also indirectly affect dataset size: the intuition
is that a bigger context needs more dataset tokens to be fully trained.
Increasing context length should therefore generally come with an increase in
dataset size, though the scaling is definitely not linear (it's a best-guess
scenario).
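The independence of total compute from batch size and context length follows from the common rule of thumb of roughly 6 FLOPs per parameter per token (forward plus backward). This is an approximation, not necessarily the exact formula `compute_req.py` uses:

```python
def training_flops(n_params, n_tokens):
    """Approximate total training compute via the ~6 * N * D rule of
    thumb (N parameters, D dataset tokens). Note it depends only on the
    total token count, not on how tokens are grouped into batches or
    context windows."""
    return 6 * n_params * n_tokens

# A hypothetical 7B-parameter model trained on 1.4T tokens:
total = training_flops(7e9, 1.4e12)  # ~5.9e22 FLOPs
# Regrouping the same tokens into bigger or smaller batches leaves
# this number unchanged; only memory usage changes.
```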
Be careful with the estimations at low numbers (small dataset size, a model
with a low parameter count etc.), as communication/software overhead starts to
matter when the compute needed per batch update is low. GPUs usually work best
when fed big matrices and when network communication represents only a small
percentage of the batch update.
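A toy model of that effect (hypothetical timings, not output from the scripts): define efficiency as the fraction of each batch update spent on useful compute, and note how it collapses once fixed communication/software overhead dominates:

```python
def gpu_efficiency(compute_s, comm_s):
    """Fraction of a batch update spent doing useful math, given
    per-batch compute time and fixed communication/software overhead
    (both in seconds; illustrative model only)."""
    return compute_s / (compute_s + comm_s)

# Large per-batch compute amortizes the fixed overhead:
gpu_efficiency(10.0, 0.5)   # ~0.95, overhead is negligible
# Small per-batch compute (tiny model / tiny batches) is dominated by it:
gpu_efficiency(0.1, 0.5)    # ~0.17, most time is not useful compute
```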
# Running the scripts together
> You probably want to first run `memory_req.py`, which outputs the number of
> GPUs needed for baseline model parallelism (tensor + pipeline). Don't worry
> too much about adjusting the batch size, as gradient accumulation can be used
> to increase it without memory overhead. The total number of GPUs should then
> be plugged into `compute_req.py`, multiplied by whatever factor is used for
> data parallelism (2x, 3x, 4x etc.), as described above.
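The suggested workflow amounts to a one-line calculation; the numbers below reuse the 4x DGX example from earlier in the document:

```python
# Glue between the two scripts' outputs: memory_req.py gives the base
# (tensor + pipeline parallel) GPU count; scale it by the chosen
# data-parallel factor before plugging the total into compute_req.py.
base_gpus = 32       # e.g. from memory_req.py (4x DGX, as in the example)
dp_factor = 64       # data-parallel replicas (each adds a full model copy)
total_gpus = base_gpus * dp_factor  # 2048 GPUs, matching the example above
```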