Open Alexandru-Mihai GHERGHESCU requested to merge training_checkpointing into main 8 months ago

Pull Request Title

Training checkpointing

Description

Wants to merge: training_checkpointing into main

Training checkpointing: save and resume training from checkpoints in a DDP / FSDP-aware way. With DDP, only the main rank should save the model; with FSDP, checkpoints will most likely be saved sharded. At the end of training, save the model and tokenizer in an easily loadable format (PyTorch or safetensors), regardless of whether DDP or FSDP was used.
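The DDP part of this (only the main rank writes, and training resumes from the newest checkpoint) could look roughly like the sketch below. It is not the code in this branch: the names `is_main_rank`, `save_checkpoint` and `latest_checkpoint`, the `checkpoint_<step>.pt` naming scheme, and the `pickle` writer are all assumptions made for illustration. In a real run the `save_fn` argument would presumably be something like `torch.save`. The rank is read from the `RANK` environment variable that `torchrun` sets for each worker.

```python
import os
import pickle
from pathlib import Path
from typing import Any, Callable, Optional


def is_main_rank() -> bool:
    """Return True on rank 0, or when no distributed launcher is in use."""
    # torchrun exports RANK per worker; a plain single-process run has no RANK.
    return int(os.environ.get("RANK", "0")) == 0


def _pickle_save(state: Any, path: Path) -> None:
    # Stand-in writer (assumption); swap in torch.save for real tensors.
    with open(path, "wb") as f:
        pickle.dump(state, f)


def save_checkpoint(
    state: Any,
    ckpt_dir: str,
    step: int,
    save_fn: Callable[[Any, Path], None] = _pickle_save,
) -> Optional[Path]:
    """Write a checkpoint from the main rank only; other ranks return None."""
    if not is_main_rank():
        return None
    directory = Path(ckpt_dir)
    directory.mkdir(parents=True, exist_ok=True)
    # Zero-padded step so that lexicographic order matches training order.
    final_path = directory / f"checkpoint_{step:08d}.pt"
    tmp_path = final_path.with_suffix(".tmp")
    # Write to a temporary file first, then rename, so an interrupted
    # save never leaves a truncated file under the final name.
    save_fn(state, tmp_path)
    os.replace(tmp_path, final_path)
    return final_path


def latest_checkpoint(ckpt_dir: str) -> Optional[Path]:
    """Return the newest checkpoint in ckpt_dir, or None if there is none."""
    directory = Path(ckpt_dir)
    if not directory.is_dir():
        return None
    candidates = sorted(directory.glob("checkpoint_*.pt"))
    return candidates[-1] if candidates else None
```

The rank guard only fits the DDP case, where every rank holds a full copy of the model. With sharded FSDP checkpoints each rank would write its own shard, so the guard would not apply as written.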

Type of change

Related Issues

See issue #21.

Screenshots or GIFs

Checklist

I have tested the code with the changes manually.
My code follows the project's style guidelines.
I have documented my code for others to understand.
I have updated documentation as needed (including README.md, code comments and doc strings).

Reviewer Guidelines

Additional Notes

@mentions

Activity

Alexandru-Mihai GHERGHESCU assigned to @agherghescu2411 8 months ago

Alexandru-Mihai GHERGHESCU changed title from Draft: Training checkpointing to Draft: WIP Training checkpointing 8 months ago

Alexandru-Mihai GHERGHESCU added 1 commit 8 months ago
2c9cac8e - WIP training checkpointing

Alexandru-Mihai GHERGHESCU added 1 commit 8 months ago
58213d63 - WIP training checkpointing

Alexandru-Mihai GHERGHESCU added 1 commit 8 months ago
72921f6d - WIP training checkpointing

Alexandru-Mihai GHERGHESCU added 1 commit 8 months ago
32ac537c - WIP training checkpointing

Alexandru-Mihai GHERGHESCU added 1 commit 8 months ago
2cdc022c - WIP training checkpointing

Alexandru-Mihai GHERGHESCU added 1 commit 8 months ago
e74d1721 - WIP training checkpointing
