We introduce a library that decouples most of the distributed code from the training library. The API is described below:
```python
# Set up the model, dataloader and optimizer
...

distributon = Distributon(num_devices=8)

# Launches one process per device
distributon.launch()

# Wrappers used for parallel computation
model = distributon.setup_model(model)
optimizer = distributon.setup_optimizer(optimizer)
dataloader = distributon.setup_dataloader(dataloader)

for epoch in range(20):
    ...
    distributon.backward(loss)
```
To achieve this, we break the distributed library into four parts: the API, the strategy, the launcher, and the environment. The strategy implements the parallelism logic for the model. In this pull request we support only one strategy: Data Parallelism. The strategy uses the launcher and the environment to set up the nodes and start the processes.
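To make the split concrete, here is a minimal sketch of what the strategy could look like, assuming a PyTorch backend; the class names, the constructor arguments, and the `local_rank()` accessor on the environment are illustrative, not the final API:

```python
import torch
from torch.nn.parallel import DistributedDataParallel


class Strategy:
    """Base class: owns the parallelism logic for the model and the backward pass."""

    def setup_model(self, model):
        raise NotImplementedError

    def backward(self, loss):
        loss.backward()


class DDPStrategy(Strategy):
    """Data-parallel strategy: one model replica per device, gradients all-reduced."""

    def __init__(self, launcher, environment):
        # The strategy delegates process creation to the launcher and reads
        # cluster information (ranks, addresses) from the environment.
        self.launcher = launcher
        self.environment = environment

    def setup_model(self, model):
        # Wrap the user's model so gradient synchronization happens automatically.
        device = torch.device("cuda", self.environment.local_rank())
        return DistributedDataParallel(model.to(device), device_ids=[device.index])
```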
The launcher contains the code for starting a process per device. This includes setting the rank and copying the relevant seeds. We use Popen as the backend. The launcher reads the relevant information from the environment.
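A rough sketch of a Popen-based launcher, assuming the rank and rendezvous details are handed to each child process through environment variables; the variable names (`LOCAL_RANK`, `MASTER_ADDR`, ...) and the `main_address`/`main_port` attributes on the environment are placeholders:

```python
import os
import subprocess
import sys


class PopenLauncher:
    """Starts one worker process per device with subprocess.Popen."""

    def __init__(self, environment, num_devices):
        self.environment = environment
        self.num_devices = num_devices

    def launch(self):
        # The process that called launch() becomes rank 0; the remaining ranks
        # re-execute the same script as child processes.
        if os.environ.get("LOCAL_RANK") is not None:
            return  # already inside a spawned worker
        for local_rank in range(1, self.num_devices):
            env = os.environ.copy()
            # Rank, world size and rendezvous info travel through the env;
            # the relevant seeds would be copied over the same way.
            env["LOCAL_RANK"] = str(local_rank)
            env["WORLD_SIZE"] = str(self.num_devices)
            env["MASTER_ADDR"] = self.environment.main_address
            env["MASTER_PORT"] = str(self.environment.main_port)
            subprocess.Popen([sys.executable] + sys.argv, env=env)
        os.environ["LOCAL_RANK"] = "0"
        os.environ["WORLD_SIZE"] = str(self.num_devices)
```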
The environment is the glue between the library and the system it runs on. It deals with extracting and setting environment-related options. Right now we support a simple Linux machine as the environment. Support for SLURM is next.
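On a single Linux machine the environment can be little more than reading, or defaulting, a handful of variables; a SLURM environment would derive the same fields from its own variables instead. The names below match the sketches above and are equally illustrative:

```python
import os


class SingleMachineEnvironment:
    """Environment for a single Linux machine: everything runs on localhost."""

    @property
    def main_address(self):
        # Rendezvous address; localhost is enough when all devices share one box.
        return os.environ.get("MASTER_ADDR", "127.0.0.1")

    @property
    def main_port(self):
        return int(os.environ.get("MASTER_PORT", "29500"))

    def local_rank(self):
        return int(os.environ.get("LOCAL_RANK", "0"))

    def world_size(self):
        return int(os.environ.get("WORLD_SIZE", "1"))
```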
Documentation is provided in the README.md, code comments and doc strings.