Draft: Add support for data parallelism on a single node

Vlad-Andrei BĂDOIU (78692) requested to merge vladb/ddp into main

Pull Request Title

We introduce a library that decouples most of the distributed code from the training library. The API is described below:


# Set up the model, dataloader, and optimizer
...

distributon = Distributon(num_devices=8)

# Launches one process per device
distributon.launch()

# Wrappers used for parallel computation
model = distributon.setup_model(model)
optimizer = distributon.setup_optimizer(optimizer)
dataloader = distributon.setup_dataloader(dataloader)

for epoch in range(20):
    ...
    distributon.backward(loss)

To achieve this, we split the distributed library into four parts: the API, the strategy, the launcher, and the environment. The strategy implements the parallelism logic for the model. This pull request supports only one strategy: data parallelism. The strategy uses the launcher and the environment to set up the nodes and start the processes.
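For reviewers less familiar with data parallelism, the core idea a strategy of this kind has to implement can be sketched in plain Python: each rank works on its own shard of the data, and the per-rank gradients are averaged before the optimizer step. This is an illustration only, not the code in this merge request; `shard_indices` and `average_gradients` are hypothetical names, and a real implementation would use a collective all-reduce rather than a Python loop.

```python
from typing import List, Sequence


def shard_indices(num_samples: int, rank: int, world_size: int) -> List[int]:
    """Return the sample indices handled by `rank` (strided sharding)."""
    if not 0 <= rank < world_size:
        raise ValueError("rank must be in [0, world_size)")
    return list(range(rank, num_samples, world_size))


def average_gradients(per_rank_grads: Sequence[Sequence[float]]) -> List[float]:
    """Average gradients element-wise across ranks.

    Stands in for the all-reduce that keeps model replicas in sync.
    """
    world_size = len(per_rank_grads)
    return [sum(values) / world_size for values in zip(*per_rank_grads)]
```

With 10 samples and 4 devices, rank 0 would process samples 0, 4, and 8, and every replica would apply the same averaged gradient.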

The launcher contains the code that starts one process per device. This includes setting the rank and copying the relevant seeds. We use Popen as the backend. The launcher reads the information it needs from the environment.
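A minimal sketch of what a Popen-based launcher could look like, assuming the rank, world size, and seed are passed to each child through environment variables. The function names and the variable names `RANK`, `LOCAL_RANK`, `WORLD_SIZE`, and `SEED` are assumptions made for illustration and may differ from what this merge request actually uses.

```python
import os
import subprocess
import sys
from typing import List, Sequence


def launch_workers(num_devices: int, command: Sequence[str], seed: int = 0) -> List[subprocess.Popen]:
    """Start one child process per device (hypothetical helper)."""
    procs = []
    for rank in range(num_devices):
        env = os.environ.copy()
        # Assumed variable names; each child learns its identity from them.
        env["RANK"] = str(rank)
        env["LOCAL_RANK"] = str(rank)  # single node: local rank == global rank
        env["WORLD_SIZE"] = str(num_devices)
        env["SEED"] = str(seed)  # same seed copied to every process
        procs.append(
            subprocess.Popen(list(command), env=env, stdout=subprocess.PIPE, text=True)
        )
    return procs


def wait_all(procs: Sequence[subprocess.Popen]) -> List[str]:
    """Wait for every child and return its stdout, ordered by rank."""
    outputs = []
    for rank, proc in enumerate(procs):
        out, _ = proc.communicate()
        if proc.returncode != 0:
            raise RuntimeError(f"worker {rank} exited with code {proc.returncode}")
        outputs.append(out.strip())
    return outputs


if __name__ == "__main__":
    cmd = [sys.executable, "-c", "import os; print(os.environ['RANK'])"]
    print(wait_all(launch_workers(2, cmd)))
```

Passing the configuration through the child's environment keeps the training script itself unchanged: it only needs to read its rank at startup.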

The environment is the glue between the library and the system it runs on. It deals with extracting and setting environment-related options. Right now we support a simple Linux machine as the environment. Support for SLURM is next.
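One way to picture the environment abstraction is as a small object that reads process settings from environment variables and can write them back for child processes. The sketch below is hypothetical: the class name `LocalEnvironment`, the variable names, and the default address and port are assumptions, not the interface introduced by this merge request.

```python
import os
from dataclasses import dataclass
from typing import Dict, Mapping, Optional


@dataclass(frozen=True)
class LocalEnvironment:
    """Settings for a single machine, read from environment variables (illustrative)."""

    rank: int
    world_size: int
    master_addr: str
    master_port: int

    @classmethod
    def from_env(cls, env: Optional[Mapping[str, str]] = None) -> "LocalEnvironment":
        """Extract options, falling back to single-process defaults."""
        env = os.environ if env is None else env
        return cls(
            rank=int(env.get("RANK", "0")),
            world_size=int(env.get("WORLD_SIZE", "1")),
            master_addr=env.get("MASTER_ADDR", "127.0.0.1"),
            master_port=int(env.get("MASTER_PORT", "29500")),
        )

    def to_env(self) -> Dict[str, str]:
        """Serialize the options so a launcher can pass them to a child process."""
        return {
            "RANK": str(self.rank),
            "WORLD_SIZE": str(self.world_size),
            "MASTER_ADDR": self.master_addr,
            "MASTER_PORT": str(self.master_port),
        }
```

A SLURM variant could expose the same two methods while reading scheduler-provided variables instead, which would leave the launcher and the strategy untouched.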

Description

Wants to merge: vladb/ddp into main

Type of change

  • Bug fix
  • New feature
  • Enhancement
  • Documentation update
  • Other (specify right below)

Merge request commits

  • Add support for data parallelism on a single node

Related Issues

Screenshots or GIFs

Checklist

  • I have tested the code with the changes manually.
  • My code follows the project's style guidelines.
  • I have documented my code for others to understand.
  • I have updated documentation as needed (including README.md, code comments and doc strings).

Reviewer Guidelines

Additional Notes

@mentions

Edited by Vlad-Andrei BĂDOIU (78692)
