Unverified commit f4abadf8 authored by Alexandru-Mihai GHERGHESCU

Add note about splitting a model's layers on multiple GPUs

parent ec26b62f
Part of merge request !9 (Compute/memory requirements scripts).
...@@ -46,6 +46,17 @@ total number of GPUs of `32 (the base number of GPUs needed to hold the model,
consisting in 4x DGX) * 64 (data-parallel, each unit adds a model on top) = 2048
GPUs`.
**Note:** Keep in mind that splitting a model across multiple GPUs/clusters
means assigning whole layers to each GPU/cluster. You can't assign a layer and
a half to one GPU and another layer and a half to the next. For example, 3
layers would (depending on model size etc.) most likely end up spread across 3
GPUs, leaving each card only partially filled. Don't worry too much about the
unused memory, as it can easily be filled by increasing the batch size. The
important takeaway is that splitting a model isn't just a simple division of
the model's total memory by the memory available on a GPU (although that's
what the script does, for lack of a better approximation method). Expect,
therefore, more GPUs to be needed for a correct partitioning of the model's
layers than that estimate suggests.
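A minimal sketch of the difference, assuming a uniform per-layer memory
footprint; the layer count, memory sizes and function names below are
illustrative and not part of the script:

```python
import math

def naive_gpu_estimate(model_mem_gib: float, gpu_mem_gib: float) -> int:
    """Plain memory division, roughly what the script approximates."""
    return math.ceil(model_mem_gib / gpu_mem_gib)

def layerwise_gpu_estimate(num_layers: int, layer_mem_gib: float,
                           gpu_mem_gib: float) -> int:
    """Whole layers only: each GPU holds floor(gpu_mem / layer_mem) layers."""
    layers_per_gpu = int(gpu_mem_gib // layer_mem_gib)
    if layers_per_gpu == 0:
        raise ValueError("a single layer doesn't fit on one GPU")
    return math.ceil(num_layers / layers_per_gpu)

# Hypothetical example: 32 layers of 30 GiB each, on 80 GiB GPUs.
print(naive_gpu_estimate(32 * 30, 80))      # 12 GPUs by pure division
print(layerwise_gpu_estimate(32, 30, 80))   # 16 GPUs once layers stay whole
```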
For a more detailed overview of the above, see [Nvidia's great blog post on
scaling models using
Megatron](https://developer.nvidia.com/blog/scaling-language-model-training-to-a-trillion-parameters-using-megatron/),
...