
Compute/memory requirements scripts

Open Alexandru-Mihai GHERGHESCU requested to merge feature/scripts into main
Files: 2
@@ -41,9 +41,9 @@ parallel is oftentimes further used. Each data-parallel unit increases the
 number of needed GPU's. E.g. training a 65B model requires ~32 GPUs (which are 4
 DGX nodes, each with 8x A100's). This results in 8-way tensor parallel
 intra-node, and 4-way pipeline parallel inter-node. Scaling up then happens
-using data-parallel. For example, using 32-way data-parallel would result in a
+using data-parallel. For example, using 64-way data-parallel would result in a
 total number of GPUs of `32 (the base number of GPUs needed to hold the model,
-consisting in 4x DGX) * 32 (data-parallel, each unit adds a model on top) = 1024
+consisting in 4x DGX) * 64 (data-parallel, each unit adds a model on top) = 2048
 GPUs`.
 For a more detailed overview of the above, see [Nvidia's great blog post on
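To sanity-check the arithmetic in the hunk above, here is a minimal sketch, assuming the total GPU count is simply the product of the three parallelism degrees; the `total_gpus` function and its parameter names are illustrative and not part of this MR's scripts:

```python
# Sketch of the GPU-count arithmetic from the example above.
# Assumption: total GPUs = tensor-parallel degree * pipeline-parallel degree
#             * data-parallel degree, with degrees taken from the 65B example.

def total_gpus(tensor_parallel: int, pipeline_parallel: int,
               data_parallel: int) -> int:
    """Each data-parallel replica adds a full copy of the model."""
    return tensor_parallel * pipeline_parallel * data_parallel

# 8-way tensor parallel intra-node * 4-way pipeline parallel inter-node
# = 32 GPUs to hold one model copy (4x DGX nodes, each with 8x A100's).
base = total_gpus(tensor_parallel=8, pipeline_parallel=4, data_parallel=1)

# 64-way data-parallel replicates those 32 GPUs 64 times.
total = total_gpus(tensor_parallel=8, pipeline_parallel=4, data_parallel=64)

print(base, total)  # 32 2048
```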