
Re-factor optimus-prime code (optimus-prime v2)

Alexandru-Mihai GHERGHESCU requested to merge feat/overhaul into main

Pull Request Title

Complete training code overhaul

Description

Implements a number of changes, both large and small. Most of the ideas are modeled after HuggingFace, whose training code follows a very good design pattern. The features/fixes/improvements in this PR are tracked below:

  • model improvements (see the model sketch after this list):
    • separate configuration class for the Optimus model; all model switches and knobs are easily reachable in one place; a model can be initialized from a config.json file
    • activation / gradient checkpointing
    • Flash Attention 2, using PyTorch's built-in SDPA (scaled dot-product attention)
    • RoPE / ALiBi embeddings
  • training loop (see the training-loop sketch after this list):
    • separate configuration class for the Trainer; all training loop switches and knobs are easily reachable in one place
    • PyTorch's DataLoader (pinned memory, multi-process workers)
    • collation
    • model checkpointing every N steps (configurable)
    • configurable directories for data, training checkpoints, the saved model, and logs (exposed as program arguments)
    • seed/RNG setup for reproducibility
    • visualization through either wandb or TensorBoard (loss and learning rate plots, memory usage, etc.)
    • model evaluation
    • model + logs saved at the end of training (using safetensors)
    • non-blocking / async CPU->GPU transfer (data.to(device, non_blocking=True))
  • distributed training (see the distributed sketch after this list):
    • PyTorch's DistributedSampler
    • DistributedDataParallel
    • FullyShardedDataParallel (including distributed checkpointing and loading, mixed precision, and a few other things FSDP requires under the hood)
    • TensorParallel
  • tokenizers (see the tokenizers/datasets sketch after this list):
    • HuggingFace tokenizers; these should be much faster than the SentencePiece implementation
  • datasets:
    • HuggingFace datasets; these are easier to use, without affecting performance
  • others:
    • logging utilities; these make it easier to control log levels, which processes log, and where and how data is logged
    • distributed utilities, such as main_proc_first(), main_proc_only() etc. (see the distributed sketch after this list)
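
For reviewers skimming the list above, here is a minimal, hypothetical sketch of the model-side pattern. The `OptimusConfig` class and its field names are illustrative, not the actual implementation; only the attention and checkpointing calls mirror the standard PyTorch APIs (`F.scaled_dot_product_attention`, `torch.utils.checkpoint.checkpoint`):

```python
# Sketch only: OptimusConfig and its fields are illustrative; the PyTorch calls
# (scaled_dot_product_attention, torch.utils.checkpoint) are the real APIs.
import json
from dataclasses import dataclass, asdict

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.checkpoint import checkpoint


@dataclass
class OptimusConfig:
    # all model switches and knobs in one place (hypothetical field names)
    vocab_size: int = 32000
    dim: int = 1024
    n_heads: int = 16
    n_layers: int = 12
    attn_dropout: float = 0.0
    gradient_checkpointing: bool = False

    @classmethod
    def from_json(cls, path: str) -> "OptimusConfig":
        with open(path) as f:
            return cls(**json.load(f))

    def to_json(self, path: str) -> None:
        with open(path, "w") as f:
            json.dump(asdict(self), f, indent=2)


class Attention(nn.Module):
    def __init__(self, config: OptimusConfig):
        super().__init__()
        self.n_heads = config.n_heads
        self.head_dim = config.dim // config.n_heads
        self.qkv = nn.Linear(config.dim, 3 * config.dim, bias=False)
        self.proj = nn.Linear(config.dim, config.dim, bias=False)
        self.dropout = config.attn_dropout

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        bsz, seqlen, dim = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # reshape to (bsz, n_heads, seqlen, head_dim)
        q, k, v = (
            t.view(bsz, seqlen, self.n_heads, self.head_dim).transpose(1, 2)
            for t in (q, k, v)
        )
        # SDPA dispatches to a flash-attention kernel when the backend supports it
        out = F.scaled_dot_product_attention(
            q, k, v,
            dropout_p=self.dropout if self.training else 0.0,
            is_causal=True,
        )
        out = out.transpose(1, 2).reshape(bsz, seqlen, dim)
        return self.proj(out)


class Block(nn.Module):
    def __init__(self, config: OptimusConfig):
        super().__init__()
        self.config = config
        self.attn = Attention(config)
        self.norm = nn.LayerNorm(config.dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if self.config.gradient_checkpointing and self.training:
            # activation checkpointing: recompute the forward pass during backward
            return x + checkpoint(self.attn, self.norm(x), use_reentrant=False)
        return x + self.attn(self.norm(x))
```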
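
A similarly hedged sketch of the training-loop pieces. The `TrainerConfig` fields, file names, and the assumption that the model returns a loss directly are illustrative; the DataLoader options, non-blocking transfers and safetensors saving are the real APIs:

```python
# Sketch only: TrainerConfig fields and file names are illustrative; DataLoader,
# non_blocking transfers and safetensors.torch.save_file are the real APIs.
import random
from dataclasses import dataclass

import numpy as np
import torch
from torch.utils.data import DataLoader
from safetensors.torch import save_file


@dataclass
class TrainerConfig:
    seed: int = 42
    batch_size: int = 32
    num_workers: int = 4
    checkpoint_every_n_steps: int = 1000
    save_dir: str = "out"


def set_seed(seed: int) -> None:
    # make runs reproducible across the python, numpy and torch RNGs
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)


def collate(batch):
    # stack pre-tokenized samples into a single (bsz, seqlen) tensor
    return torch.stack([torch.as_tensor(sample["input_ids"]) for sample in batch])


def train(model, dataset, config: TrainerConfig, device: str = "cuda"):
    set_seed(config.seed)
    loader = DataLoader(
        dataset,
        batch_size=config.batch_size,
        num_workers=config.num_workers,  # multi-process data loading
        pin_memory=True,                 # enables async host->device copies
        collate_fn=collate,
    )
    optimizer = torch.optim.AdamW(model.parameters())
    for step, batch in enumerate(loader):
        # non-blocking copy overlaps the transfer with compute
        batch = batch.to(device, non_blocking=True)
        loss = model(batch)  # assumes the model returns the loss directly
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        if (step + 1) % config.checkpoint_every_n_steps == 0:
            torch.save(model.state_dict(), f"{config.save_dir}/step_{step + 1}.pt")

    # final model saved in the safetensors format
    save_file(model.state_dict(), f"{config.save_dir}/model.safetensors")
```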
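
The distributed pieces and the main_proc_first() / main_proc_only() utilities follow the usual torch.distributed pattern. The sketch below reimplements the helpers from their description only and shows DDP wrapping; FSDP/TensorParallel wrapping is analogous but omitted for brevity:

```python
# Sketch only: main_proc_first()/is_main_proc() are reimplemented here from
# their description; DistributedSampler, DDP and torchrun's LOCAL_RANK are real.
import os
from contextlib import contextmanager

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler


@contextmanager
def main_proc_first():
    # let rank 0 run first (e.g. to download or pre-process a dataset),
    # then release the remaining ranks
    if dist.is_initialized() and dist.get_rank() != 0:
        dist.barrier()
    yield
    if dist.is_initialized() and dist.get_rank() == 0:
        dist.barrier()


def is_main_proc() -> bool:
    # e.g. used to restrict logging/visualization to a single process
    return not dist.is_initialized() or dist.get_rank() == 0


def setup_distributed(model: torch.nn.Module, dataset, batch_size: int):
    dist.init_process_group("nccl")
    local_rank = int(os.environ["LOCAL_RANK"])  # set by torchrun
    torch.cuda.set_device(local_rank)

    # each rank sees a disjoint shard of the dataset
    sampler = DistributedSampler(dataset)
    loader = DataLoader(dataset, batch_size=batch_size, sampler=sampler)

    # gradients are all-reduced across ranks during backward
    model = DDP(model.cuda(), device_ids=[local_rank])
    return model, loader
```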
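
Finally, a short sketch of how the HuggingFace tokenizers and datasets libraries are typically wired together; the tokenizer file and dataset name are placeholders:

```python
# Sketch only: the tokenizer file and dataset name are placeholders; the
# tokenizers/datasets APIs shown are real.
from datasets import load_dataset
from tokenizers import Tokenizer

# fast (Rust-backed) tokenizer loaded from disk
tokenizer = Tokenizer.from_file("tokenizer.json")

# memory-mapped dataset with on-disk caching
dataset = load_dataset("wikitext", "wikitext-103-raw-v1", split="train")


def tokenize(examples):
    return {"input_ids": [enc.ids for enc in tokenizer.encode_batch(examples["text"])]}


# map runs batched (and optionally multi-process) and caches the result
dataset = dataset.map(tokenize, batched=True, num_proc=4, remove_columns=["text"])
```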


Type of change

  • Bug fix
  • New feature
  • Enhancement
  • Documentation update
  • Other (specify right below)

Checklist

  • I have tested the code with the changes manually.
  • My code follows the project's style guidelines.
  • I have documented my code for others to understand.
  • I have updated documentation as needed (including README.md, code comments and doc strings).

Reviewer Guidelines

All of the code should be tested in multiple scenarios, including single-GPU, single-node multi-GPU, and multi-node setups, as well as with various combinations of the configuration switches.

@mentions

@vlad_andrei.badoiu1 @alexandru.agache

