Add gradient checkpointing option to Optimus
Gradient (or activation) checkpointing trades compute for memory: activations are discarded during the forward pass and recomputed during backward. Overall this should make it easier to train large models on not-so-large hardware. Add checkpointing to every layer (same as HuggingFace), as opposed to every 2–3 layers, since 1) this is the easiest to implement, and 2) it has the best balance between memory and compute. A minimal sketch is shown below.
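A rough sketch of what the per-layer checkpointing could look like in PyTorch, assuming the Optimus layers live in an `nn.ModuleList`. The class and flag names here (`CheckpointedEncoder`, `gradient_checkpointing`) are illustrative, not Optimus's actual API:

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint


class CheckpointedEncoder(nn.Module):
    """Transformer encoder stack with optional per-layer gradient checkpointing."""

    def __init__(self, layers: nn.ModuleList, gradient_checkpointing: bool = False):
        super().__init__()
        self.layers = layers
        self.gradient_checkpointing = gradient_checkpointing

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        for layer in self.layers:
            if self.gradient_checkpointing and self.training:
                # Activations inside `layer` are not kept for backward;
                # they are recomputed when gradients are needed,
                # trading extra compute for lower peak memory.
                hidden_states = checkpoint(layer, hidden_states, use_reentrant=False)
            else:
                hidden_states = layer(hidden_states)
        return hidden_states
```

Checkpointing is skipped in eval mode (`self.training` is false) since no activations need to be stored for backward there anyway.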