
Compute/memory requirements scripts

Open Alexandru-Mihai GHERGHESCU requested to merge feature/scripts into main
@@ -46,8 +46,49 @@ total number of GPUs of `32 (the base number of GPUs needed to hold the model,
consisting of 4x DGX) * 64 (data-parallel, each unit adds a model on top) = 2048
GPUs`.
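
For reference, the arithmetic above as a minimal Python sketch (the numbers are
just the example values from the text, not outputs of the scripts):

```python
# Example values from the text above, not outputs of the scripts.
base_gpus = 32          # GPUs needed to hold one copy of the model (4x DGX)
data_parallel = 64      # each data-parallel replica adds a full model on top

total_gpus = base_gpus * data_parallel
print(total_gpus)       # 2048
```
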
**Note:** Keep in mind that splitting a model across multiple GPUs/clusters
means assigning whole layers to each GPU/cluster. You can't assign a layer and
a half to one GPU, and another layer and a half to another GPU. 3 layers would
(depending on model size etc.) most likely be split across 3 GPUs, leaving the
cards half-filled. Don't worry too much about the empty memory, as it can
easily be filled by increasing the batch size. The important thing to take away
is that splitting a model isn't just a simple division of the total memory
needed by the model by the memory available on a GPU (although that's what the
script does, for lack of a better approximation method). Expect, therefore,
more GPUs to be needed for a correct partitioning of model layers.
For a more detailed overview of the above, see [Nvidia's great blog post on
scaling models using
Megatron](https://developer.nvidia.com/blog/scaling-language-model-training-to-a-trillion-parameters-using-megatron/),
as well as [scaling experiments using Megatron and AMD on the LUMI
cluster](https://lumi-supercomputer.eu/scaling-the-pre-training-of-large-language-models-of-100b-parameters-to-thousands-of-amd-mi250x-gpus-on-lumi/).
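
As a rough illustration of the whole-layer constraint described in the note
above (the numbers below are made up for the example, and the script itself
only does the naive division):

```python
import math

# Illustrative numbers only; they are not taken from the scripts.
num_layers = 3
layer_memory_gib = 50    # memory needed by one layer
gpu_memory_gib = 80      # memory available on one GPU

# Naive estimate: total memory / GPU memory (what memory_req.py approximates).
naive_gpus = math.ceil(num_layers * layer_memory_gib / gpu_memory_gib)   # 2

# Whole-layer assignment: a layer can't be split across GPUs, so count how
# many whole layers fit on one card and divide the layer count by that.
layers_per_gpu = gpu_memory_gib // layer_memory_gib                      # 1
whole_layer_gpus = math.ceil(num_layers / layers_per_gpu)                # 3

print(naive_gpus, whole_layer_gpus)  # the real partitioning needs more GPUs
```
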
## Compute requirements
Compute requirements for training models can be calculated using the script
`compute_req.py`. Change the values at the top (or use the predefined
defaults), run the script, and read the output.
Notice that total compute is not affected by either batch size or context
length. Since the model needs to see the whole dataset anyway, it doesn't
really matter how the dataset is partitioned (whether into fewer large chunks
or more small chunks). Batch size and context length will, however, affect
memory usage. Context length will also indirectly affect dataset size. The
intuition is that a bigger context needs more dataset tokens to be fully
trained. Increasing context length should therefore generally come with a
larger dataset, though the scaling is definitely not linear (the script makes a
best-guess estimate).
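
A minimal sketch of the usual rule-of-thumb compute estimate
(FLOPs ≈ 6 · parameters · tokens); this is an assumption about what such an
estimate looks like, not necessarily the exact formula used by
`compute_req.py`:

```python
# Illustrative numbers; assumed rule of thumb FLOPs ~= 6 * N * D.
params = 7e9              # model parameters (N)
dataset_tokens = 1.4e12   # training tokens (D)

total_flops = 6 * params * dataset_tokens

# Batch size and context length don't appear here: they only change how the
# dataset tokens are grouped per step, not how many tokens are processed.
print(f"{total_flops:.3e} FLOPs")
```
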
Be careful with the estimates at small scales (a small dataset, a model with
few parameters etc.), as communication and software overheads start to matter
when the compute needed per step update is low. GPUs usually work best when fed
big matrices, which keeps them more fully occupied.
## Running the scripts together
> You probably want to first run `memory_req.py`, which outputs the number of
> GPUs needed for baseline model parallelism (tensor + pipeline). Don't worry
> too much about adjusting the batch size, as gradient accumulation can be used
> to increase it without memory overhead. The total number of GPUs should then
> be plugged into `compute_req.py`, multiplied by whatever data-parallel factor
> you want (2x, 3x, 4x etc.), as described above.
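
A hypothetical sketch of that workflow (the function names below are
placeholders for the two scripts, which are actually configured by editing the
values at their top rather than called as functions):

```python
def estimate_min_gpus():
    """Placeholder for the GPU count reported by memory_req.py."""
    return 32  # made-up value

def estimate_training_time(total_gpus):
    """Placeholder for the training-time estimate from compute_req.py."""
    return 30.0  # made-up value, e.g. days

base_gpus = estimate_min_gpus()   # tensor + pipeline parallel baseline
data_parallel_factor = 4          # 2x, 3x, 4x ... extra model replicas
total_gpus = base_gpus * data_parallel_factor

print(estimate_training_time(total_gpus))
```
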