Pull Request Title
Add fp16 mixed precision training

Wants to merge: feature/fp16 into main

Description
Automatic mixed precision training using PyTorch's AMP module; it yields a noticeable speed-up. However, memory usage is only about 5-10% lower. This appears to be because the AMP module decides not to convert most of the layers to fp16. I'm still investigating why, but I suspect it comes down to how OptimusTransformer is implemented. With specialized layers (e.g. PyTorch's built-in MultiheadAttention), memory usage could probably get lower. This would be good detective work for whoever feels like investigating memory usage.
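For reference, this is roughly the AMP pattern involved; it's a minimal sketch with a toy model and random data standing in for OptimusTransformer and the project's actual dataloader, not the code from this PR:

```python
import torch
import torch.nn as nn

# Toy stand-ins; in this PR the model would be OptimusTransformer and the
# batches would come from the project's dataloader.
model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10)).cuda()
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
scaler = torch.cuda.amp.GradScaler()

for _ in range(10):  # dummy training loop with random data
    inputs = torch.randn(32, 512, device="cuda")
    targets = torch.randint(0, 10, (32,), device="cuda")

    optimizer.zero_grad()
    # autocast runs eligible ops (mostly matmuls) in fp16 and keeps
    # precision-sensitive ops in fp32, which is why not every layer shrinks.
    with torch.cuda.amp.autocast():
        outputs = model(inputs)
        loss = criterion(outputs, targets)

    # Scale the loss to avoid fp16 gradient underflow; gradients are
    # unscaled inside scaler.step() before the optimizer update.
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
```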
Additionally, the model was initially saved to disk with fp16 weights. After testing, it looks like saving the weights as fp16 instead of fp32 yields slightly lower performance (presumably because not all the layers were converted during training), even though saving them that way is supposed to be the correct approach (since training used an fp16 model for the forward pass). The model is now saved to disk with fp32 weights. I'm not exactly sure how the conversion to float16 weights should work: whether a simple model.to(float16) is enough, or whether something more complex (like quantization) is required. For now, since the models we try are small, the weights stay on disk as fp32.
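To illustrate the two options discussed above (the file names and the fp16 cast are hypothetical, not part of this PR; only the fp32 path reflects what the PR does):

```python
import torch
import torch.nn as nn

model = nn.Linear(512, 512)  # stand-in for the trained OptimusTransformer

# What this PR ends up doing: save the fp32 state dict unchanged.
torch.save(model.state_dict(), "checkpoint_fp32.pt")  # illustrative file name

# The fp16 alternative: cast floating-point tensors before saving.
# tensor.half() is the same as tensor.to(torch.float16); whether this is
# sufficient, or whether quantization is needed, is the open question above.
fp16_state = {
    k: v.half() if v.is_floating_point() else v
    for k, v in model.state_dict().items()
}
torch.save(fp16_state, "checkpoint_fp16.pt")  # illustrative file name
```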
Type of change
- Bug fix
- New feature
- Enhancement
- Documentation update
- Other (specify right below)
Merge request commits
(Edited out)
Related Issues
Screenshots or GIFs
Checklist
- I have tested the code with the changes manually.
- My code follows the project's style guidelines.
- I have documented my code for others to understand.
- I have updated documentation as needed (including README.md, code comments and doc strings).