!25

Re-factor optimus-prime code (optimus-prime v2)

feat/overhaul into main

Merged Alexandru-Mihai GHERGHESCU requested to merge feat/overhaul into main 10 months ago

Pull Request Title

Complete training code overhaul

Description

This PR implements a number of changes, large and small. Most of the ideas are modeled after HuggingFace, which has a very good design pattern for training code. The features, fixes, and improvements are tracked below:

model improvements:
- separate configuration class for the Optimus model; all model switches and knobs are easily reachable in one place, and a model can be initialized from a config.json file
- activation / gradient checkpointing
- FlashAttention-2 through PyTorch's built-in scaled dot-product attention (SDPA)
- RoPE / ALiBi embeddings
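
As an illustration of the "initialize a model from a config.json file" item above, here is a minimal sketch of a configuration class that round-trips through JSON. The class name `OptimusConfig`, its field names, and its defaults are assumptions made for this example; they are not taken from the PR's code.

```python
import json
from dataclasses import asdict, dataclass, fields
from pathlib import Path


@dataclass
class OptimusConfig:
    """Hypothetical model config; field names and defaults are illustrative."""

    vocab_size: int = 32000
    dim: int = 4096
    n_layers: int = 32
    n_heads: int = 32
    gradient_checkpointing: bool = False
    attn_implementation: str = "sdpa"

    @classmethod
    def from_json_file(cls, path):
        """Build a config from a JSON file, rejecting keys the class does not know."""
        raw = json.loads(Path(path).read_text(encoding="utf-8"))
        known = {f.name for f in fields(cls)}
        unknown = set(raw) - known
        if unknown:
            raise ValueError(f"Unknown config keys: {sorted(unknown)}")
        return cls(**raw)

    def to_json_file(self, path):
        """Write every field (including defaults) to a JSON file."""
        Path(path).write_text(json.dumps(asdict(self), indent=2), encoding="utf-8")
```

Rejecting unknown keys is a design choice for the sketch: a typo in config.json fails loudly instead of silently falling back to a default.
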
training loop:
- separate configuration class for the Trainer; all training loop switches and knobs, easily reachable in one place
- pytorch's dataloader (pinned memory, multi-process workers)
- collation
- model checkpointing every x number of steps (configurable)
- switchable dirs for data, training checkpoints, saved model, logs (exposed as program arguments)
- seed/RNG mechanism set up for reproducibility
- visualization through either wandb or tensorboard (loss, lr plots, memory used etc.)
- model evaluation
- model + logs saved at the end of training (using safetensors)
- non-blocking / async CPU->GPU transfer (data.to(device, non_blocking=True))
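
The "switchable dirs" and "checkpointing every x number of steps" items above could be exposed roughly as follows. This is a sketch only: the flag names (`--data_dir`, `--checkpoints_dir`, `--save_dir`, `--log_dir`, `--save_steps`, `--seed`), their defaults, and the helper `should_checkpoint` are hypothetical and may not match the PR.

```python
import argparse
from pathlib import Path


def build_parser():
    """Argument parser with hypothetical flag names for the training script."""
    parser = argparse.ArgumentParser(description="Training entry point (illustrative)")
    parser.add_argument("--data_dir", type=Path, default=Path("data"))
    parser.add_argument("--checkpoints_dir", type=Path, default=Path("checkpoints"))
    parser.add_argument("--save_dir", type=Path, default=Path("saved_model"))
    parser.add_argument("--log_dir", type=Path, default=Path("logs"))
    parser.add_argument("--save_steps", type=int, default=1000,
                        help="Checkpoint every N optimizer steps; 0 disables it.")
    parser.add_argument("--seed", type=int, default=42)
    return parser


def should_checkpoint(step, save_steps):
    """Return True when the (1-based) step count is a multiple of save_steps."""
    return save_steps > 0 and step > 0 and step % save_steps == 0


# Passing an explicit list keeps the example independent of the real command line.
args = build_parser().parse_args(["--save_steps", "500", "--log_dir", "runs/logs"])
```
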
distributed training:
- pytorch distributed sampler
- DistributedDataParallel
- FullyShardedDataParallel (this includes distributed checkpointing and loading, mixed precision, and a few other things required by FSDP under the hood)
- TensorParallel
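
To make the role of the distributed sampler concrete, the sketch below shows, in plain Python, the kind of index partitioning such a sampler performs: the index list is padded so it divides evenly across ranks, then each rank takes a strided slice. It is a simplified illustration (no shuffling, no epoch handling), not PyTorch's implementation, and the function name `shard_indices` is made up for this example.

```python
import math


def shard_indices(dataset_len, num_replicas, rank):
    """Return the dataset indices that a given rank would iterate over."""
    if not 0 <= rank < num_replicas:
        raise ValueError(f"rank must be in [0, {num_replicas}), got {rank}")

    indices = list(range(dataset_len))
    per_rank = math.ceil(dataset_len / num_replicas)
    total = per_rank * num_replicas

    # Pad by repeating indices from the start so every rank gets the same count.
    padding = total - len(indices)
    if padding:
        repeats = math.ceil(padding / len(indices))
        indices += (indices * repeats)[:padding]

    # Strided slice: rank r takes positions r, r + num_replicas, r + 2 * num_replicas, ...
    return indices[rank:total:num_replicas]
```

With 10 samples and 4 ranks, every rank receives 3 indices, and two samples are seen twice because of the padding.
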
tokenizers:
- HuggingFace tokenizers; these should be much faster than the SentencePiece implementation
datasets:
- HuggingFace datasets; these are easier to use, without affecting performance
others:
- logging utilities; this should make it easier to adjust log levels, which processes log, and where and how data is logged
- distributed utilities; things such as main_proc_first(), main_proc_only() etc.
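
The names `main_proc_first()` and `main_proc_only()` come from the description above; their implementation is not shown there, so the following is only one plausible sketch. It assumes the process rank is read from the `RANK` environment variable (which `torchrun` sets) and that the caller passes in a barrier function such as `torch.distributed.barrier`; the `barrier` parameter and the `get_rank` helper are assumptions of this example.

```python
import functools
import os
from contextlib import contextmanager


def get_rank():
    """Global rank of this process; defaults to 0 outside a distributed launch."""
    return int(os.environ.get("RANK", "0"))


@contextmanager
def main_proc_first(barrier=lambda: None):
    """Run the body on rank 0 first, then on every other rank.

    `barrier` is an assumed parameter: pass e.g. torch.distributed.barrier.
    """
    is_main = get_rank() == 0
    if not is_main:
        barrier()  # non-main ranks wait here until rank 0 has finished
    try:
        yield
    finally:
        if is_main:
            barrier()  # rank 0 releases the waiting ranks


def main_proc_only(fn):
    """Decorator: call `fn` on rank 0 only; other ranks get None."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        if get_rank() == 0:
            return fn(*args, **kwargs)
        return None
    return wrapper
```

A typical use of the first helper is dataset preprocessing: rank 0 builds the cache inside the `with` block, and the remaining ranks then load it instead of recomputing it.
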

Wants to merge: feat/overhaul into main

Type of change

Bug fix
New feature
Enhancement
Documentation update
Other (specify right below)

Checklist

I have tested the code with the changes manually.
My code follows the project's style guidelines.
I have documented my code for others to understand.
I have updated documentation as needed (including README.md, code comments and doc strings).

Reviewer Guidelines

All of the code should be tested in multiple scenarios, including single-GPU, single-node multi-GPU, and multi-node setups, as well as with different combinations of the switches toggled.

@mentions

@vlad_andrei.badoiu1 @alexandru.agache

Edited 10 months ago by Alexandru-Mihai GHERGHESCU
