
Compute/memory requirements scripts

Open Alexandru-Mihai GHERGHESCU requested to merge feature/scripts into main
@@ -8,9 +8,9 @@ which can then be iterated upon.
The estimations are given based on the usual hyperparameters chosen when
training a Transformer-type LLM: number of layers, number of heads, embeddings
dimension, batch size, training context length, vocabulary size. These should
give a rough idea as to the requirements for a classical GPT-style decoder-only
Transformer. Other modifications on top of the architecture (such as SwiGLU
activations inside the feedforward layer, multi-query or grouped-query
attention, rotary positional embeddings etc.) will likely change the values by a
little bit, however not enough to justify a totally different approach to the
estimations.
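As a concrete illustration, the hyperparameters listed above could be collected into a configuration like the following (names and values are purely illustrative, not necessarily the actual variables used in the scripts):

```python
# Hypothetical hyperparameter set for a GPT-style decoder-only Transformer,
# matching the knobs the scripts estimate from (illustrative values):
config = {
    "n_layers": 32,        # number of layers
    "n_heads": 32,         # number of attention heads
    "d_model": 4096,       # embeddings dimension
    "batch_size": 4,       # sequences per step
    "context_len": 2048,   # training context length
    "vocab_size": 32000,   # vocabulary size
}
```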
@@ -46,8 +46,50 @@ total number of GPUs of `32 (the base number of GPUs needed to hold the model,
consisting of 4x DGX) * 64 (data-parallel, each unit adds a model on top) = 2048
GPUs`.
**Note:** Keep in mind that splitting a model across multiple GPUs/clusters
means assigning layers to each GPU/cluster. You can't assign a layer and a half
to one GPU and another layer and a half to another GPU. 3 layers would
(depending on model size etc.) most likely be split across 3 GPUs, leaving the
cards half-filled. Don't worry too much about the empty memory, as it can
easily be filled by increasing the batch size. The important takeaway is that
splitting a model isn't just a simple division of the total memory needed by
the model by the memory available on a single GPU (although that's what the
script does, for lack of a better approximation method). Expect, therefore,
more GPUs to be needed for a correct partitioning of the model layers.
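The layer-granularity point can be sketched numerically. The helper below is hypothetical (not part of `memory_req.py`); it compares the naive memory division against a split that only assigns whole layers to a GPU, under assumed per-layer and per-GPU memory figures:

```python
import math

def gpus_for_model(n_layers, mem_per_layer_gb, gpu_mem_gb):
    """Compare two GPU-count estimates (hypothetical helper).

    naive: total model memory divided by per-GPU memory, rounded up.
    layer_granular: layers cannot be split, so first compute how many
    whole layers fit on one GPU, then round up the layer count.
    """
    naive = math.ceil(n_layers * mem_per_layer_gb / gpu_mem_gb)
    layers_per_gpu = max(1, int(gpu_mem_gb // mem_per_layer_gb))
    layer_granular = math.ceil(n_layers / layers_per_gpu)
    return naive, layer_granular

# Assumed figures: 3 layers of 40 GB each, on 64 GB GPUs.
# Naive division says 2 GPUs (120 GB / 64 GB), but since only one whole
# 40 GB layer fits per 64 GB GPU, 3 GPUs are actually needed,
# leaving each card roughly half-filled.
naive, granular = gpus_for_model(3, 40, 64)  # (2, 3)
```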
For a more detailed overview of the above, see [Nvidia's great blog post on
scaling models using
Megatron](https://developer.nvidia.com/blog/scaling-language-model-training-to-a-trillion-parameters-using-megatron/),
as well as [scaling experiments using Megatron and AMD on the LUMI
cluster](https://lumi-supercomputer.eu/scaling-the-pre-training-of-large-language-models-of-100b-parameters-to-thousands-of-amd-mi250x-gpus-on-lumi/).
## Compute requirements
Compute requirements for training models can be calculated using the script
`compute_req.py`. Change the values at the top (or use the predefined
defaults), run it and read the output.
Notice that total compute is not affected by either batch size or context
length. Since the model needs to see the whole dataset anyway, it doesn't
really matter how the dataset is partitioned (whether into fewer big chunks or
more small chunks). Batch size and context length will, however, affect memory
usage. Context length will also indirectly affect dataset size: the intuition
is that a bigger context needs more dataset tokens to be fully trained.
Increasing context length should therefore generally come with an increase in
dataset size, though the scaling is definitely not linear (it's a best-guess
scenario).
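The independence of total compute from batch size and context length follows from the common rule of thumb of roughly 6 FLOPs per parameter per token (forward plus backward). This is an approximation, not necessarily the exact formula `compute_req.py` uses:

```python
def training_flops(n_params, n_tokens):
    """Approximate total training compute via the ~6 * N * D rule of
    thumb (N parameters, D dataset tokens). Note it depends only on the
    total token count, not on how tokens are grouped into batches or
    context windows."""
    return 6 * n_params * n_tokens

# A hypothetical 7B-parameter model trained on 1.4T tokens:
total = training_flops(7e9, 1.4e12)  # ~5.9e22 FLOPs
# Regrouping the same tokens into bigger or smaller batches leaves
# this number unchanged; only memory usage changes.
```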
Be careful with the estimations at low numbers (small dataset size, a model
with a low parameter count etc.), as communication/software overhead starts to
matter when the compute needed per batch update is low. GPUs usually work best
when fed big matrices and when network communication represents only a small
percentage of the batch update.
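A toy model of that effect (hypothetical timings, not output from the scripts): define efficiency as the fraction of each batch update spent on useful compute, and note how it collapses once fixed communication/software overhead dominates:

```python
def gpu_efficiency(compute_s, comm_s):
    """Fraction of a batch update spent doing useful math, given
    per-batch compute time and fixed communication/software overhead
    (both in seconds; illustrative model only)."""
    return compute_s / (compute_s + comm_s)

# Large per-batch compute amortizes the fixed overhead:
gpu_efficiency(10.0, 0.5)   # ~0.95, overhead is negligible
# Small per-batch compute (tiny model / tiny batches) is dominated by it:
gpu_efficiency(0.1, 0.5)    # ~0.17, most time is not useful compute
```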
# Running the scripts together
> You probably want to first run `memory_req.py`, which outputs the number of
> GPUs needed for baseline model parallelism (tensor + pipeline). Don't worry
> too much about adjusting the batch size, as gradient accumulation can be used
> to increase it without memory overhead. The total number of GPUs should then
> be plugged into `compute_req.py`, multiplied by whatever factor is used for
> data parallelism (2x, 3x, 4x etc.), as described above.
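The suggested workflow amounts to a one-line calculation; the numbers below reuse the 4x DGX example from earlier in the document:

```python
# Glue between the two scripts' outputs: memory_req.py gives the base
# (tensor + pipeline parallel) GPU count; scale it by the chosen
# data-parallel factor before plugging the total into compute_req.py.
base_gpus = 32       # e.g. from memory_req.py (4x DGX, as in the example)
dp_factor = 64       # data-parallel replicas (each adds a full model copy)
total_gpus = base_gpus * dp_factor  # 2048 GPUs, matching the example above
```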