Pull Request Title
Draft: Add support for data parallelism on a single node
We introduce a library that decouples most of the distributed code from the training library. The API is described below:
```python
# Set up the model, dataloader and optimizer
...

distributon = Distributon(num_devices=8)

# Launches one process per device
distributon.launch()

# Wrappers used for parallel computation
model = distributon.setup_model(model)
optimizer = distributon.setup_optimizer(optimizer)
dataloader = distributon.setup_dataloader(dataloader)

for epoch in range(20):
    ...
    distributon.backward(loss)
```
To achieve this we break the distributed library into four parts: the API, the strategy, the launcher, and the environment. The strategy implements the parallelism logic for the model. In this pull request we support only one strategy: data parallelism. The strategy uses the launcher and the environment to set up the nodes and start the processes.
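As an illustration of the split described above, a data-parallel strategy keeps a full model replica per device and averages gradients across replicas. The class and method names below are hypothetical, not the actual library API:

```python
# Hypothetical sketch of the strategy interface; names are illustrative only.
class Strategy:
    """Implements the parallelism logic for the model."""

    def setup_model(self, model):
        raise NotImplementedError

    def reduce_gradients(self, per_device_grads):
        raise NotImplementedError


class DataParallelStrategy(Strategy):
    """Each device holds a full replica; gradients are averaged after backward."""

    def __init__(self, num_devices):
        self.num_devices = num_devices

    def setup_model(self, model):
        # A real implementation would wrap the model so that backward()
        # triggers an all-reduce of gradients across devices.
        return model

    def reduce_gradients(self, per_device_grads):
        # Average the gradient computed on each device (an all-reduce).
        n = len(per_device_grads)
        return [sum(g) / n for g in zip(*per_device_grads)]
```

The averaging step is what makes data parallelism equivalent to training with a larger batch on a single device.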
The launcher contains the code for starting a process per device. This includes setting the rank and copying the relevant seeds. We use Popen as the backend. The launcher takes the relevant information from the environment.
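A Popen-based launcher can be sketched as follows. The function name and the environment-variable names (`LOCAL_RANK`, `WORLD_SIZE`) are assumptions for illustration; the real launcher may use different ones:

```python
import os
import subprocess
import sys


def launch_workers(num_devices, cmd):
    """Start one worker process per device, passing rank info via the environment.

    `cmd` is the command to run for each worker, e.g. [sys.executable, "train.py"].
    """
    procs = []
    for rank in range(num_devices):
        env = os.environ.copy()
        env["LOCAL_RANK"] = str(rank)        # assumed variable names
        env["WORLD_SIZE"] = str(num_devices)
        procs.append(subprocess.Popen(cmd, env=env))
    return procs
```

Each child inherits the parent's environment plus its own rank, so worker code can read its rank at startup without any extra argument parsing.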
The environment is the glue between the library and the system it runs on. It handles extracting and setting environment-related options. Right now we support only a plain Linux machine as the environment; support for SLURM is next.
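A single-node environment mostly reads options set by the launcher. A minimal sketch, assuming the same hypothetical `LOCAL_RANK`/`WORLD_SIZE` variables a launcher might set:

```python
import os


class SingleNodeEnvironment:
    """Exposes environment-related options for a single Linux machine."""

    def __init__(self, environ=None):
        # Accept an explicit mapping to make the class easy to test.
        self._env = environ if environ is not None else os.environ

    def local_rank(self):
        # Rank of this process on the node, set by the launcher.
        return int(self._env.get("LOCAL_RANK", 0))

    def world_size(self):
        return int(self._env.get("WORLD_SIZE", 1))

    def main_address(self):
        # On a single node, all processes rendezvous on localhost.
        return "127.0.0.1"
```

A SLURM environment would implement the same interface but read the scheduler's own variables (e.g. `SLURM_PROCID`) instead.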
Description
Wants to merge: vladb/ddp into main
Type of change
- Bug fix
- New feature
- Enhancement
- Documentation update
- Other (specify right below)
Merge request commits
- Add support for data parallelism on a single node
Related Issues
Screenshots or GIFs
Checklist
- I have tested the code with the changes manually.
- My code follows the project's style guidelines.
- I have documented my code for others to understand.
- I have updated documentation as needed (including README.md, code comments and doc strings).