Refactor optimus-prime code (optimus-prime v2)
Complete training code overhaul
This PR implements a number of large and small changes. Most of the ideas are modeled after HuggingFace, whose training code follows a well-structured design pattern. The PR tracks the features, fixes, and improvements accordingly:
config.json file
data.to(device, non_blocking=True)
main_proc_first(), main_proc_only(), etc.
Wants to merge: feat/overhaul into main
README.md, code comments, and docstrings.
All of the code should be tested in multiple scenarios, including single-GPU, single-node multi-GPU, and multi-node setups, as well as with different combinations of the switches toggled.
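Since the helpers named above have to behave correctly in both single-process and multi-process runs, here is a minimal sketch of how a main_proc_first() context manager could work. It is not the implementation in this PR: the signature, the is_main_proc() helper, and the injectable barrier argument are all assumptions made for illustration. Rank is read from the RANK environment variable, which launchers such as torchrun set; in a real distributed run one would likely pass torch.distributed.barrier as the barrier.

```python
import os
from contextlib import contextmanager
from typing import Callable, Iterator


def is_main_proc() -> bool:
    """Hypothetical helper: treat global rank 0 as the main process.

    Falls back to rank 0 when RANK is unset (plain single-process run).
    """
    return int(os.environ.get("RANK", "0")) == 0


def _no_barrier() -> None:
    """Single-process fallback: there is nothing to synchronize with."""


@contextmanager
def main_proc_first(barrier: Callable[[], None] = _no_barrier) -> Iterator[None]:
    """Run the wrapped block on the main process before all other processes.

    `barrier` is an assumed parameter: any callable that blocks until every
    process has reached it (for example torch.distributed.barrier).
    """
    main = is_main_proc()
    if not main:
        # Non-main processes wait here until the main process has finished.
        barrier()
    try:
        yield
    finally:
        if main:
            # The main process is done; release the waiting processes.
            # Done in `finally` so an exception on main does not leave
            # the other processes blocked at the barrier.
            barrier()
```

A typical use would be wrapping a step that only needs to happen once, such as building a dataset cache, so that the other processes read the cached result instead of repeating the work. Passing the barrier in as an argument keeps the single-GPU path free of any distributed setup and makes the ordering easy to test.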