We introduce a library that decouples most of the distributed code from the training library. The API is described below:
```python
# Set up the model, dataloader and optimizer
...

distributon = Distributon(num_devices=8)

# Launches one process per device
distributon.launch()

# Wrappers used for parallel computation
model = distributon.setup_model(model)
optimizer = distributon.setup_optimizer(optimizer)
dataloader = distributon.setup_dataloader(dataloader)

for epoch in range(20):
    ...
    distributon.backward(loss)
```
To achieve this, we break the distributed library into four parts: the API, the strategy, the launcher, and the environment. The strategy implements the parallelism logic for the model. In this pull request we support only one strategy: Data Parallelism. The strategy uses the launcher and the environment to set up the nodes and start the processes.
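To make the split concrete, here is a minimal sketch of what the strategy could look like, assuming a PyTorch backend; the class names, the constructor arguments, and the `local_rank()` accessor on the environment are illustrative, not the final API:

```python
import torch
from torch.nn.parallel import DistributedDataParallel


class Strategy:
    """Base class: owns the parallelism logic for the model and the backward pass."""

    def setup_model(self, model):
        raise NotImplementedError

    def backward(self, loss):
        loss.backward()


class DDPStrategy(Strategy):
    """Data-parallel strategy: one model replica per device, gradients all-reduced."""

    def __init__(self, launcher, environment):
        # The strategy delegates process creation to the launcher and reads
        # cluster information (ranks, addresses) from the environment.
        self.launcher = launcher
        self.environment = environment

    def setup_model(self, model):
        # Wrap the user's model so gradient synchronization happens automatically.
        device = torch.device("cuda", self.environment.local_rank())
        return DistributedDataParallel(model.to(device), device_ids=[device.index])
```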
The launcher contains the code for starting a process per device. This includes setting the rank and copying the relevant seeds. We use Popen as the backend. The launcher reads the relevant information from the environment.
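A rough sketch of a Popen-based launcher, assuming the rank and rendezvous details are handed to each child process through environment variables; the variable names (`LOCAL_RANK`, `MASTER_ADDR`, ...) and the `main_address`/`main_port` attributes on the environment are placeholders:

```python
import os
import subprocess
import sys


class PopenLauncher:
    """Starts one worker process per device with subprocess.Popen."""

    def __init__(self, environment, num_devices):
        self.environment = environment
        self.num_devices = num_devices

    def launch(self):
        # The process that called launch() becomes rank 0; the remaining ranks
        # re-execute the same script as child processes.
        if os.environ.get("LOCAL_RANK") is not None:
            return  # already inside a spawned worker
        for local_rank in range(1, self.num_devices):
            env = os.environ.copy()
            # Rank, world size and rendezvous info travel through the env;
            # the relevant seeds would be copied over the same way.
            env["LOCAL_RANK"] = str(local_rank)
            env["WORLD_SIZE"] = str(self.num_devices)
            env["MASTER_ADDR"] = self.environment.main_address
            env["MASTER_PORT"] = str(self.environment.main_port)
            subprocess.Popen([sys.executable] + sys.argv, env=env)
        os.environ["LOCAL_RANK"] = "0"
        os.environ["WORLD_SIZE"] = str(self.num_devices)
```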
The environment is the glue between the library and the system it runs on. It deals with extracting and setting environment-related options. Right now we support a simple Linux machine as the environment. Support for SLURM is next.
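On a single Linux machine the environment can be little more than reading, or defaulting, a handful of variables; a SLURM environment would derive the same fields from its own variables instead. The names below match the sketches above and are equally illustrative:

```python
import os


class SingleMachineEnvironment:
    """Environment for a single Linux machine: everything runs on localhost."""

    @property
    def main_address(self):
        # Rendezvous address; localhost is enough when all devices share one box.
        return os.environ.get("MASTER_ADDR", "127.0.0.1")

    @property
    def main_port(self):
        return int(os.environ.get("MASTER_PORT", "29500"))

    def local_rank(self):
        return int(os.environ.get("LOCAL_RANK", "0"))

    def world_size(self):
        return int(os.environ.get("WORLD_SIZE", "1"))
```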
Documentation is provided in the README.md, code comments and doc strings.