Re-factor optimus-prime code (optimus-prime v2)
Pull Request Title
Complete training code overhaul
Description
Implements a number of changes, large and small. Most of the ideas are modeled after HuggingFace, whose training code follows a clean design pattern. This PR tracks the following features, fixes and improvements:
- model improvements:
  - separate configuration class for the Optimus model; all model switches and knobs are reachable in one place; a model can be initialized from a `config.json` file
  - activation / gradient checkpointing
  - FlashAttention-2 through PyTorch's built-in SDPA (`scaled_dot_product_attention`)
  - RoPE / ALiBi embeddings
- training loop:
  - separate configuration class for the Trainer; all training loop switches and knobs are reachable in one place
  - PyTorch's DataLoader (pinned memory, multi-process workers)
  - collation
  - model checkpointing every x steps (configurable)
  - switchable directories for data, training checkpoints, saved model and logs (exposed as program arguments)
  - seed/RNG mechanism set up for reproducibility
  - visualization through either wandb or TensorBoard (loss, lr plots, memory used etc.)
  - model evaluation
  - model + logs saved at the end of training (using safetensors)
  - non-blocking / async CPU->GPU transfer (`data.to(device, non_blocking=True)`)
- distributed training:
  - PyTorch DistributedSampler
  - DistributedDataParallel
  - FullyShardedDataParallel (this includes distributed checkpointing + loading, mixed precision, and a few other things required by FSDP under the hood)
  - TensorParallel
- tokenizers:
  - HuggingFace tokenizers; these should be much faster than the SentencePiece implementation
- datasets:
  - HuggingFace datasets; these are easier to use, without affecting performance
- others:
  - logging utilities; these should make it easier to adjust log levels, which processes log, where the output goes, and how data is logged
  - distributed utilities, such as `main_proc_first()`, `main_proc_only()` etc.
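As an illustration of the "configuration class initialized from a `config.json` file" item above, here is a minimal sketch in plain Python. The class name `OptimusConfig`, its fields, defaults and method names are all hypothetical and chosen only for the example; the classes in this PR may look different.

```python
import json
from dataclasses import asdict, dataclass
from pathlib import Path


@dataclass
class OptimusConfig:
    # Hypothetical fields and defaults, for illustration only.
    vocab_size: int = 32000
    hidden_size: int = 512
    num_layers: int = 8
    num_heads: int = 8
    pos_embedding: str = "rope"  # assumed switch: "rope" or "alibi"
    gradient_checkpointing: bool = False

    @classmethod
    def from_json_file(cls, path):
        """Build a config from a JSON file; unknown keys raise TypeError."""
        data = json.loads(Path(path).read_text())
        return cls(**data)

    def to_json_file(self, path):
        """Write every field to a JSON file so a run can be reproduced."""
        Path(path).write_text(json.dumps(asdict(self), indent=2))
```

Keeping every knob in one dataclass means the same file can be saved next to a checkpoint and later used to rebuild an identically shaped model.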
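The distributed utilities named above, `main_proc_first()` and `main_proc_only()`, could be sketched roughly as follows. This is not the code from this PR: it assumes the process rank is read from the `RANK` environment variable (as set by `torchrun`), and it takes the barrier as a parameter so the sketch runs without a process group. In a real run one would presumably pass `torch.distributed.barrier`.

```python
import functools
import os
from contextlib import contextmanager
from typing import Callable


def is_main_proc() -> bool:
    # Assumption: rank 0 is the main process; a missing RANK means a single process.
    return int(os.environ.get("RANK", "0")) == 0


def main_proc_only(fn):
    """Decorator: run fn on the main process only; other ranks get None."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        if is_main_proc():
            return fn(*args, **kwargs)
        return None
    return wrapper


@contextmanager
def main_proc_first(barrier: Callable[[], None] = lambda: None):
    """Context manager: the main process runs the body before the other ranks."""
    if not is_main_proc():
        barrier()  # non-main ranks wait here until the main process is done
    try:
        yield
    finally:
        if is_main_proc():
            barrier()  # main process releases the waiting ranks
```

A typical use would be wrapping dataset preprocessing in `with main_proc_first(...)` so that only one process builds the cache, and decorating logging or checkpoint-saving helpers with `@main_proc_only`.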
Wants to merge: feat/overhaul into main
Type of change
- Bug fix
- New feature
- Enhancement
- Documentation update
- Other (specify right below)
Checklist
- I have tested the code with the changes manually.
- My code follows the project's style guidelines.
- I have documented my code for others to understand.
- I have updated documentation as needed (including `README.md`, code comments and doc strings).
Reviewer Guidelines
All of the code should be tested in multiple scenarios, including single-GPU, single-node multi-GPU, and multi-node, as well as with different combinations of switches.
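When testing across these scenarios, it can help for a test script to report which one it is running under. The helper below is a hypothetical sketch, not part of this PR. It assumes a `torchrun`-style launch with one process per GPU, where `WORLD_SIZE` and `LOCAL_WORLD_SIZE` are set in the environment.

```python
import os


def describe_launch(env=None) -> str:
    """Classify a run as single-GPU, single-node multi-GPU, or multi-node."""
    env = os.environ if env is None else env
    # Assumption: one process per GPU; unset variables mean a plain single-process run.
    world_size = int(env.get("WORLD_SIZE", "1"))
    local_world_size = int(env.get("LOCAL_WORLD_SIZE", str(world_size)))
    if world_size == 1:
        return "single-GPU"
    if world_size == local_world_size:
        return "single-node multi-GPU"
    return "multi-node"
```

Logging this string at startup makes it easier to match a set of results to the scenario that produced them.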
@mentions
Edited by Alexandru-Mihai GHERGHESCU