Unverified commit f4abadf8 authored by Alexandru-Mihai GHERGHESCU

Add note about splitting a model's layers on multiple GPUs

parent ec26b62f
Part of merge request !9 (Compute/memory requirements scripts).
...@@ -46,6 +46,17 @@ total number of GPUs of `32 (the base number of GPUs needed to hold the model,
consisting in 4x DGX) * 64 (data-parallel, each unit adds a model on top) = 2048
GPUs`.
**Note:** Keep in mind that splitting a model across multiple GPUs/clusters
means assigning whole layers to each GPU/cluster. You can't assign a layer and
a half to one GPU and another layer and a half to the next. For example, 3
layers would (depending on model size etc.) most likely end up spread across 3
GPUs, leaving each card only partially filled. Don't worry too much about the
unused memory, as it can easily be filled by increasing the batch size. The
important takeaway is that splitting a model isn't just a simple division of
the model's total memory by the memory available on a GPU (although that's
what the script does, for lack of a better approximation method). Expect,
therefore, more GPUs to be needed for a correct partitioning of the model's
layers than that estimate suggests.
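A minimal sketch of the difference, assuming a uniform per-layer memory
footprint; the layer count, memory sizes and function names below are
illustrative and not part of the script:

```python
import math

def naive_gpu_estimate(model_mem_gib: float, gpu_mem_gib: float) -> int:
    """Plain memory division, roughly what the script approximates."""
    return math.ceil(model_mem_gib / gpu_mem_gib)

def layerwise_gpu_estimate(num_layers: int, layer_mem_gib: float,
                           gpu_mem_gib: float) -> int:
    """Whole layers only: each GPU holds floor(gpu_mem / layer_mem) layers."""
    layers_per_gpu = int(gpu_mem_gib // layer_mem_gib)
    if layers_per_gpu == 0:
        raise ValueError("a single layer doesn't fit on one GPU")
    return math.ceil(num_layers / layers_per_gpu)

# Hypothetical example: 32 layers of 30 GiB each, on 80 GiB GPUs.
print(naive_gpu_estimate(32 * 30, 80))      # 12 GPUs by pure division
print(layerwise_gpu_estimate(32, 30, 80))   # 16 GPUs once layers stay whole
```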
For a more detailed overview of the above, see [Nvidia's great blog post on
scaling models using
Megatron](https://developer.nvidia.com/blog/scaling-language-model-training-to-a-trillion-parameters-using-megatron/),
...