
Draft: WIP Training checkpointing

Alexandru-Mihai GHERGHESCU requested to merge training_checkpointing into main

Pull Request Title

Training checkpointing

Description

Wants to merge: training_checkpointing into main

Training checkpointing: save and resume training from checkpoints in a DDP / FSDP-aware way. With DDP, only the main rank should save the model; with FSDP, checkpoints will most likely be saved sharded. At the end of training, also save the model and tokenizer in an easily loadable format (either PyTorch or safetensors), regardless of whether DDP or FSDP was used.
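
The DDP half of the description (save only on the main rank, then resume from the latest checkpoint) could look roughly like the sketch below. It is not the code in this merge request. The helper names (`is_main_rank`, `save_checkpoint`, `load_latest_checkpoint`) and the `step_<n>.pkl` file layout are hypothetical, and `pickle` is used only as a stand-in for `torch.save` / `torch.load` so the example runs without PyTorch. The sketch assumes the launcher exports a `RANK` environment variable, as `torchrun` does. FSDP sharded saving is not covered.

```python
import os
import pickle
import tempfile
from pathlib import Path
from typing import Optional


def is_main_rank() -> bool:
    # Assumption: the launcher (e.g. torchrun) exports RANK for every worker.
    # Single-process runs have no RANK set and are treated as rank 0.
    return int(os.environ.get("RANK", "0")) == 0


def save_checkpoint(state: dict, ckpt_dir: str, step: int) -> Optional[Path]:
    """Write `state` to <ckpt_dir>/step_<step>.pkl, on the main rank only."""
    if not is_main_rank():
        return None  # non-main ranks skip the write entirely
    directory = Path(ckpt_dir)
    directory.mkdir(parents=True, exist_ok=True)
    # Zero-padded step numbers keep lexicographic order equal to numeric order.
    final_path = directory / f"step_{step:08d}.pkl"
    # Write to a temporary file and rename it afterwards, so an interrupted
    # save never leaves a truncated file under the final checkpoint name.
    fd, tmp_name = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)  # stand-in for torch.save(state, f)
    os.replace(tmp_name, final_path)
    return final_path


def load_latest_checkpoint(ckpt_dir: str) -> Optional[dict]:
    """Return the state of the highest-numbered checkpoint, or None if absent."""
    directory = Path(ckpt_dir)
    if not directory.is_dir():
        return None
    candidates = sorted(directory.glob("step_*.pkl"))
    if not candidates:
        return None
    with open(candidates[-1], "rb") as f:
        return pickle.load(f)  # stand-in for torch.load(f)
```

In a training loop, `state` would typically bundle the model and optimizer state dicts together with the current step, and every rank would call `load_latest_checkpoint` at startup to resume. In a real DDP run the non-main ranks would usually also wait on a barrier such as `torch.distributed.barrier()` after the save, so that no rank moves on while the file is still being written.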

Type of change

  • Bug fix
  • New feature
  • Enhancement
  • Documentation update
  • Other (specify right below)

Related Issues

See issue #21.

Screenshots or GIFs

Checklist

  • I have tested the code with the changes manually.
  • My code follows the project's style guidelines.
  • I have documented my code for others to understand.
  • I have updated documentation as needed (including README.md, code comments and doc strings).

Reviewer Guidelines

Additional Notes

@mentions
