Draft: WIP Training checkpointing
Pull Request Title
Training checkpointing
Description
Wants to merge: training_checkpointing into main
Training checkpointing:
- Save and resume training from checkpoints in a DDP- and FSDP-aware way. Under DDP, only the main rank should save the model; under FSDP, checkpoints will most likely be saved sharded.
- Save the model and tokenizer at the end of training. This final save should use an easily loadable format (PyTorch or safetensors), regardless of whether training ran under DDP or FSDP.
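To make the rank-aware rule concrete, here is a minimal sketch of how each process could decide what to write. It is not the code in this PR: the helper names (`checkpoint_path`, `latest_checkpoint`), the `step_XXXXXXXX` directory layout, and the one-shard-file-per-rank convention for FSDP are all assumptions made for illustration, and it uses only the standard library so the decision logic can be read without a distributed setup.

```python
from pathlib import Path
from typing import Optional


def checkpoint_path(ckpt_dir: str, step: int, rank: int, strategy: str) -> Optional[Path]:
    """Return the file this rank should write, or None if it should skip saving.

    Hypothetical helper: the layout and file names are illustrative only.
    """
    # Zero-padded step numbers keep directory names in chronological order
    # when sorted lexicographically.
    step_dir = Path(ckpt_dir) / f"step_{step:08d}"
    if strategy == "ddp":
        # DDP replicas hold identical weights, so only the main rank writes.
        return step_dir / "model.pt" if rank == 0 else None
    if strategy == "fsdp":
        # Assumption: a sharded FSDP checkpoint is stored as one file per rank.
        return step_dir / f"shard_rank{rank:05d}.pt"
    raise ValueError(f"unknown strategy: {strategy!r}")


def latest_checkpoint(ckpt_dir: str) -> Optional[Path]:
    """Return the most recent step directory to resume from, or None."""
    root = Path(ckpt_dir)
    if not root.is_dir():
        return None
    steps = sorted(
        p for p in root.iterdir() if p.is_dir() and p.name.startswith("step_")
    )
    return steps[-1] if steps else None
```

In a real training loop the `rank` argument would typically come from `torch.distributed.get_rank()`, and a rank that receives `None` would simply skip the save and wait at the next barrier.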
Type of change
- Bug fix
- New feature
- Enhancement
- Documentation update
- Other (specify right below)
Related Issues
See issue #21.
Screenshots or GIFs
Checklist
- I have tested the code with the changes manually.
- My code follows the project's style guidelines.
- I have documented my code for others to understand.
- I have updated documentation as needed (including README.md, code comments and doc strings).