Draft: WIP Training checkpointing

Open Alexandru-Mihai GHERGHESCU requested to merge training_checkpointing into main

Pull Request Title

Training checkpointing

Description

Wants to merge: training_checkpointing into main

Training checkpointing: save and resume training from checkpoints in a DDP / FSDP-aware way. Under DDP, only the main rank should save the model, while FSDP will most likely save sharded checkpoints. At the end of training, also save the model and tokenizer in an easily loadable format (either PyTorch or safetensors), regardless of whether DDP or FSDP was used.
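As a rough illustration of the DDP half of this description, here is a minimal sketch of main-rank-only saving and resuming. It is not the code in this merge request: the helper names (`is_main_rank`, `unwrap`, `save_checkpoint`, `load_checkpoint`) and the checkpoint layout (`model`, `optimizer`, `step`) are assumptions made for the example, and FSDP sharded checkpoints are not covered.

```python
import os

import torch
import torch.distributed as dist
from torch import nn


def is_main_rank() -> bool:
    """True on rank 0, or when no process group is initialized (single process)."""
    if dist.is_available() and dist.is_initialized():
        return dist.get_rank() == 0
    return True


def unwrap(model: nn.Module) -> nn.Module:
    """Return the underlying module if the model is wrapped in DDP."""
    if isinstance(model, nn.parallel.DistributedDataParallel):
        return model.module
    return model


def save_checkpoint(path, model, optimizer, step):
    """Write model/optimizer state and the step counter; only the main rank writes."""
    if is_main_rank():
        state = {
            "model": unwrap(model).state_dict(),
            "optimizer": optimizer.state_dict(),
            "step": step,
        }
        # Write to a temporary file first so an interrupted save
        # does not leave a truncated checkpoint at `path`.
        tmp_path = f"{path}.tmp"
        torch.save(state, tmp_path)
        os.replace(tmp_path, path)
    # Keep the other ranks from racing ahead of the write.
    if dist.is_available() and dist.is_initialized():
        dist.barrier()


def load_checkpoint(path, model, optimizer, map_location="cpu"):
    """Restore model/optimizer state on every rank and return the saved step."""
    state = torch.load(path, map_location=map_location)
    unwrap(model).load_state_dict(state["model"])
    optimizer.load_state_dict(state["optimizer"])
    return state["step"]
```

Saving the unwrapped module's `state_dict` keeps the parameter names free of the `module.` prefix that DDP adds, so the same file can be loaded with or without DDP. A real implementation would likely also persist the LR scheduler, RNG state, and data loader position.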

Type of change

  • Bug fix
  • New feature
  • Enhancement
  • Documentation update
  • Other (specify right below)

Related Issues

See issue #21.

Screenshots or GIFs

Checklist

  • I have tested the code with the changes manually.
  • My code follows the project's style guidelines.
  • I have documented my code for others to understand.
  • I have updated documentation as needed (including README.md, code comments and doc strings).

Reviewer Guidelines

Additional Notes

@mentions
