
Compute/memory requirements scripts

Open Alexandru-Mihai GHERGHESCU requested to merge feature/scripts into main
Files: 2
@@ -41,9 +41,9 @@ parallel is oftentimes further used. Each data-parallel unit increases the
 number of needed GPU's. E.g. training a 65B model requires ~32 GPUs (which are 4
 DGX nodes, each with 8x A100's). This results in 8-way tensor parallel
 intra-node, and 4-way pipeline parallel inter-node. Scaling up then happens
-using data-parallel. For example, using 32-way data-parallel would result in a
+using data-parallel. For example, using 64-way data-parallel would result in a
 total number of GPUs of `32 (the base number of GPUs needed to hold the model,
-consisting in 4x DGX) * 32 (data-parallel, each unit adds a model on top) = 1024
+consisting in 4x DGX) * 64 (data-parallel, each unit adds a model on top) = 2048
 GPUs`.
 For a more detailed overview of the above, see [Nvidia's great blog post on
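To sanity-check the arithmetic in the hunk above, here is a minimal sketch, assuming the total GPU count is simply the product of the three parallelism degrees; the `total_gpus` function and its parameter names are illustrative and not part of this MR's scripts:

```python
# Sketch of the GPU-count arithmetic from the example above.
# Assumption: total GPUs = tensor-parallel degree * pipeline-parallel degree
#             * data-parallel degree, with degrees taken from the 65B example.

def total_gpus(tensor_parallel: int, pipeline_parallel: int,
               data_parallel: int) -> int:
    """Each data-parallel replica adds a full copy of the model."""
    return tensor_parallel * pipeline_parallel * data_parallel

# 8-way tensor parallel intra-node * 4-way pipeline parallel inter-node
# = 32 GPUs to hold one model copy (4x DGX nodes, each with 8x A100's).
base = total_gpus(tensor_parallel=8, pipeline_parallel=4, data_parallel=1)

# 64-way data-parallel replicates those 32 GPUs 64 times.
total = total_gpus(tensor_parallel=8, pipeline_parallel=4, data_parallel=64)

print(base, total)  # 32 2048
```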