
Compute/memory requirements scripts

Open Alexandru-Mihai GHERGHESCU requested to merge feature/scripts into main
2 files  +111  −0
@@ -51,3 +51,33 @@ scaling models using
Megatron](https://developer.nvidia.com/blog/scaling-language-model-training-to-a-trillion-parameters-using-megatron/),
as well as [scaling experiments using Megatron and AMD on the LUMI
cluster](https://lumi-supercomputer.eu/scaling-the-pre-training-of-large-language-models-of-100b-parameters-to-thousands-of-amd-mi250x-gpus-on-lumi/).
## Compute requirements
Compute requirements for training models can be calculated using the script
`compute_req.py`. Change the values at the top (or use the predefined
defaults), then run the script to get the output.
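
As a rough sketch of the kind of estimate involved (assuming the common
6 × parameters × tokens FLOPs approximation for transformer training; the
variable names and default values below are illustrative, not the script's
actual internals):

```python
# Back-of-the-envelope training compute, using the standard
# ~6 * params * tokens FLOPs approximation (forward + backward).
# All values are illustrative placeholders.

n_params = 7e9       # model parameters
n_tokens = 1e12      # dataset size, in tokens
gpu_flops = 312e12   # peak BF16 throughput of a single A100, FLOPs/s
mfu = 0.4            # assumed model FLOPs utilization

total_flops = 6 * n_params * n_tokens
gpu_seconds = total_flops / (gpu_flops * mfu)
print(f"total compute: {total_flops:.2e} FLOPs")
print(f"single-GPU time: {gpu_seconds / 86400:.0f} GPU-days")
```
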
Notice that total compute is not affected by either batch size or context
length. Since the model needs to see the whole dataset anyway, it doesn't
really matter how that dataset is partitioned (fewer large chunks or more
small ones); see the quick illustration below. Batch size and context length
will, however, affect memory usage.
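
A quick illustration of the point, with made-up numbers (this is not the
script's code):

```python
# For a fixed dataset, total training FLOPs are independent of how the
# tokens are split into batches and sequences; only the number of
# optimizer steps changes.

n_params = 7e9
dataset_tokens = 1e12

for batch_size, context_len in [(2048, 1024), (512, 8192), (1024, 2048)]:
    steps = dataset_tokens / (batch_size * context_len)
    total_flops = 6 * n_params * dataset_tokens  # no dependence on the split
    print(f"batch={batch_size:4d} ctx={context_len:4d} "
          f"steps={steps:.2e} total={total_flops:.2e} FLOPs")
```
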
Context length also indirectly affects dataset size. The intuition is that a
bigger context window needs more dataset tokens to be fully trained, so
increasing context length should generally come with an increase in dataset
size, though the scaling is definitely not linear (it's a best-guess
scenario).
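
One way such a best guess could look (a purely illustrative heuristic, not an
established scaling law and not necessarily what the script does): scale the
token budget by the square root of the context-length ratio.

```python
# Purely illustrative sublinear heuristic: grow the token budget with the
# square root of the context-length ratio. Not an established scaling law;
# all numbers are placeholders.

base_tokens = 1e12    # token budget tuned for the base context length
base_context = 2048
new_context = 8192

scaled_tokens = base_tokens * (new_context / base_context) ** 0.5
print(f"suggested dataset size: {scaled_tokens:.2e} tokens")  # ~2.00e+12
```
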
Be careful about the estimates at low numbers (a small dataset, a model with
few parameters etc.), as communication and software overhead start to matter
once the compute needed per step update is low. GPUs usually work best when
fed big matrices, which keeps them more fully occupied.
## Running the scripts together
> You probably want to first run `memory_req.py`, which outputs the number of
> GPUs needed for baseline model parallelism (tensor + pipeline). Don't worry
> too much about adjusting batch size, as gradient accumulation can be used to
> increase that value without memory overhead. That total number of GPUs
> should then be plugged into `compute_req.py`, multiplied by whatever
> data-parallel factor is used (2x, 3x, 4x etc.), as described above; see the
> sketch below.
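
For instance, a back-of-the-envelope combination of the two outputs might look
like this (placeholder numbers, not real script output):

```python
# Combine the two scripts' outputs: memory_req.py fixes the base
# model-parallel GPU count, a data-parallel factor multiplies it, and
# compute_req.py's total then divides across all GPUs. Placeholders only.

mp_gpus = 8               # from memory_req.py: tensor * pipeline parallel size
dp_factor = 4             # chosen data-parallel replication (2x, 3x, 4x, ...)
total_gpus = mp_gpus * dp_factor

single_gpu_days = 3895.0  # from compute_req.py, at the assumed utilization
wall_clock_days = single_gpu_days / total_gpus
print(f"{total_gpus} GPUs -> ~{wall_clock_days:.0f} days wall-clock")
```
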