Draft: WIP Training checkpointing
@@ -92,10 +92,10 @@ class TrainerState():
@@ -249,14 +249,14 @@ class Trainer():
@@ -302,12 +302,12 @@ class Trainer():
@@ -317,6 +317,8 @@ class Trainer():
Training checkpointing
Wants to merge: training_checkpointing into main
Save / resume training from checkpoints, DDP / FSDP aware:
- DDP: save the checkpoint only on the main rank, so ranks don't write redundant or clobbered files.
- FSDP: will most likely save sharded checkpoints (one shard per rank).
- At the end of training, save the model + tokenizer in an easily loadable format (PyTorch or safetensors), regardless of DDP / FSDP.
See issue #21.
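A minimal sketch of the DDP-aware rule described above: only rank 0 writes the checkpoint, and resuming restores model, optimizer, and step. The function names `save_checkpoint` / `load_checkpoint` are hypothetical illustrations, not the PR's actual API, and the FSDP sharded-checkpoint path is omitted.

```python
import os
import torch
import torch.distributed as dist


def save_checkpoint(model, optimizer, step, ckpt_dir):
    # DDP-aware: when torch.distributed is initialized, only the main
    # rank (rank 0) writes the file; other ranks return immediately.
    if dist.is_initialized() and dist.get_rank() != 0:
        return None
    os.makedirs(ckpt_dir, exist_ok=True)
    state = {
        "step": step,
        "model": model.state_dict(),
        "optimizer": optimizer.state_dict(),
    }
    path = os.path.join(ckpt_dir, f"step_{step}.pt")
    torch.save(state, path)
    return path


def load_checkpoint(model, optimizer, path):
    # Resume: restore model + optimizer state and return the step to
    # continue training from.
    ckpt = torch.load(path, map_location="cpu", weights_only=True)
    model.load_state_dict(ckpt["model"])
    optimizer.load_state_dict(ckpt["optimizer"])
    return ckpt["step"]
```

In a single-process run `dist.is_initialized()` is false, so the save always happens; under DDP the rank check keeps non-main ranks from writing.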
Also update the docs (README.md, code comments and doc strings).