Pull Request Title
Draft: Add support for data parallelism on a single node
We introduce a library that decouples most of the distributed code from the training library. The API is described below:
```python
# Set up the model, dataloader and optimizer
...

distributon = Distributon(num_devices=8)

# Launches one process per device
distributon.launch()

# Wrappers used for parallel computation
model = distributon.setup_model(model)
optimizer = distributon.setup_optimizer(optimizer)
dataloader = distributon.setup_dataloader(dataloader)

for epoch in range(20):
    ...
    distributon.backward(loss)
```
To achieve this we break the distributed library into four parts: the API, the strategy, the launcher, and the environment. The strategy implements the parallelism logic for the model. In this pull request we support only one strategy: data parallelism. The strategy uses the launcher and the environment to set up the nodes and start the processes.
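As an illustration of the split described above, a data-parallel strategy keeps a full model replica per device and averages gradients across replicas. The class and method names below are hypothetical, not the actual library API:

```python
# Hypothetical sketch of the strategy interface; names are illustrative only.
class Strategy:
    """Implements the parallelism logic for the model."""

    def setup_model(self, model):
        raise NotImplementedError

    def reduce_gradients(self, per_device_grads):
        raise NotImplementedError


class DataParallelStrategy(Strategy):
    """Each device holds a full replica; gradients are averaged after backward."""

    def __init__(self, num_devices):
        self.num_devices = num_devices

    def setup_model(self, model):
        # A real implementation would wrap the model so that backward()
        # triggers an all-reduce of gradients across devices.
        return model

    def reduce_gradients(self, per_device_grads):
        # Average the gradient computed on each device (an all-reduce).
        n = len(per_device_grads)
        return [sum(g) / n for g in zip(*per_device_grads)]
```

The averaging step is what makes data parallelism equivalent to training with a larger batch on a single device.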
The launcher contains the code for starting a process per device. This includes setting the rank and copying the relevant seeds. We use Popen as the backend. The launcher takes the relevant information from the environment.
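A Popen-based launcher can be sketched as follows. The function name and the environment-variable names (`LOCAL_RANK`, `WORLD_SIZE`) are assumptions for illustration; the real launcher may use different ones:

```python
import os
import subprocess
import sys


def launch_workers(num_devices, cmd):
    """Start one worker process per device, passing rank info via the environment.

    `cmd` is the command to run for each worker, e.g. [sys.executable, "train.py"].
    """
    procs = []
    for rank in range(num_devices):
        env = os.environ.copy()
        env["LOCAL_RANK"] = str(rank)        # assumed variable names
        env["WORLD_SIZE"] = str(num_devices)
        procs.append(subprocess.Popen(cmd, env=env))
    return procs
```

Each child inherits the parent's environment plus its own rank, so worker code can read its rank at startup without any extra argument parsing.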
The environment is the glue between the library and the system it runs on. It handles extracting and setting environment-related options. Right now we support only a plain Linux machine as the environment; support for SLURM is next.
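A single-node environment mostly reads options set by the launcher. A minimal sketch, assuming the same hypothetical `LOCAL_RANK`/`WORLD_SIZE` variables a launcher might set:

```python
import os


class SingleNodeEnvironment:
    """Exposes environment-related options for a single Linux machine."""

    def __init__(self, environ=None):
        # Accept an explicit mapping to make the class easy to test.
        self._env = environ if environ is not None else os.environ

    def local_rank(self):
        # Rank of this process on the node, set by the launcher.
        return int(self._env.get("LOCAL_RANK", 0))

    def world_size(self):
        return int(self._env.get("WORLD_SIZE", 1))

    def main_address(self):
        # On a single node, all processes rendezvous on localhost.
        return "127.0.0.1"
```

A SLURM environment would implement the same interface but read the scheduler's own variables (e.g. `SLURM_PROCID`) instead.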
Description
Wants to merge: vladb/ddp into main
Type of change
- Bug fix
- New feature
- Enhancement
- Documentation update
- Other (specify right below)
Merge request commits
- Add support for data parallelism on a single node
Related Issues
Screenshots or GIFs
Checklist
- I have tested the code with the changes manually.
- My code follows the project's style guidelines.
- I have documented my code for others to understand.
- I have updated documentation as needed (including README.md, code comments and doc strings).