
Compute/memory requirements scripts

Open Alexandru-Mihai GHERGHESCU requested to merge feature/scripts into main
@@ -46,6 +46,17 @@ total number of GPUs of `32 (the base number of GPUs needed to hold the model,
consisting of 4x DGX) * 64 (data-parallel, each unit adds a model on top) = 2048
GPUs`.
**Note:** Keep in mind that splitting a model across multiple GPUs/clusters
means assigning whole layers to each GPU/cluster. You can't assign a layer and a
half to one GPU and another layer and a half to another. Depending on model size
etc., 3 layers would most likely be split across 3 GPUs, leaving the cards
half-filled. Don't worry too much about the empty memory, as it can easily be
filled by increasing the batch size. The important takeaway is that splitting a
model isn't just a simple division of the total memory needed by the model by
the memory available on a GPU (although that's what the script does, for lack of
a better approximation method). Expect, therefore, more GPUs to be needed for a
correct partitioning of model layers.
For a more detailed overview of the above, see [Nvidia's great blog post on
scaling models using
Megatron](https://developer.nvidia.com/blog/scaling-language-model-training-to-a-trillion-parameters-using-megatron/).
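The difference between the script's naive memory division and a whole-layer
partitioning can be sketched as follows. This is a minimal illustration with
made-up numbers (layer and GPU memory sizes are assumptions, not values from
the scripts):

```python
import math

# Hypothetical numbers for illustration only.
num_layers = 3
layer_mem_gb = 50   # assumed memory footprint of one layer
gpu_mem_gb = 80     # e.g. an 80 GB card

# Naive estimate (what the script does): divide total model memory
# by per-GPU memory and round up.
naive_gpus = math.ceil(num_layers * layer_mem_gb / gpu_mem_gb)

# Layer-aware estimate: each GPU can only hold an integer number of
# layers, so it fits floor(gpu_mem_gb / layer_mem_gb) layers.
layers_per_gpu = gpu_mem_gb // layer_mem_gb
layer_aware_gpus = math.ceil(num_layers / layers_per_gpu)

print(naive_gpus, layer_aware_gpus)  # 2 vs 3
```

With these numbers, the naive division says 2 GPUs suffice (150 GB / 80 GB,
rounded up), but since a GPU can hold only one whole 50 GB layer, a correct
partitioning actually needs 3 half-filled GPUs, as the note above describes.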