Fix a number of issues with the infrastructure, no major rework
- Jan 24, 2024
- Alexandru-Mihai GHERGHESCU authored
Visual change: correctly display the final training loss. The final training loss didn't account for gradient accumulation, and was therefore reported as much smaller than it was in reality. Also fix the estimation interval, which was likewise not properly calculated due to gradient accumulation.
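A minimal sketch of the corrected accounting, assuming a standard accumulation loop (all names below are illustrative, not the repository's actual identifiers): the per-batch loss is scaled down before `backward()` so gradients average over the accumulated micro-batches, but the value tracked for display must be the unscaled one.

```python
import torch
from torch import nn

# Hypothetical sketch (not the repository's actual trainer code):
# report the training loss correctly under gradient accumulation.
model = nn.Linear(8, 1)
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
batches = [(torch.randn(4, 8), torch.randn(4, 1)) for _ in range(8)]

grad_acc_steps = 4
running_loss = 0.0

for i, (x, y) in enumerate(batches):
    loss = criterion(model(x), y)
    # Scale down so gradients average across accumulated micro-batches...
    (loss / grad_acc_steps).backward()
    # ...but accumulate the *unscaled* value for display purposes.
    running_loss += loss.item()

    # The optimizer only steps once every `grad_acc_steps` batches.
    if (i + 1) % grad_acc_steps == 0:
        optimizer.step()
        optimizer.zero_grad()

print(f"final training loss: {running_loss / len(batches):.4f}")
```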
- Alexandru-Mihai GHERGHESCU authored
There was a corner case in which the shape of the dataset's predictions `y` would be incorrect, because the number of batches was miscalculated. This happened when `batch_len` was exactly divisible by `seq_len`: the predictions, which are simply the text shifted once to the right, would not have that extra column at the end. Fix the issue by decrementing the number of available batches by 1 when `batch_len` divides exactly by `seq_len`.
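A small sketch of the corner case, reusing the `batch_len` and `seq_len` names from the message above (the helper itself is hypothetical): since targets are the inputs shifted one position to the right, the very last token of the data has no target, so when `seq_len` divides `batch_len` exactly, the final batch would be missing its last target column.

```python
# Hypothetical helper illustrating the fix; `batch_len` is the number of
# tokens per batch stream, `seq_len` the sequence length per batch.
def num_batches(batch_len: int, seq_len: int) -> int:
    n = batch_len // seq_len
    # Targets are the text shifted right by one token, so the last token
    # has no target; if seq_len divides batch_len exactly, the final
    # batch would be missing its last target column. Drop that batch.
    if batch_len % seq_len == 0:
        n -= 1
    return n

print(num_batches(1024, 128))  # 7 usable batches, not 8
print(num_batches(1000, 128))  # 7 as well; the remainder covers the shift
```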
- Alexandru-Mihai GHERGHESCU authored
Visual change. This only changes what the trainer reports as the final training loss. Not quite sure if the previous value was accurate anyway, since gradient accumulation would not let the optimizer step on every batch. For a big enough dataset, this should have no impact at all. The final loss value is now reported based on the last calculation of the loss, correctly taking gradient accumulation into consideration as well.
- Jan 22, 2024
- Alexandru-Mihai GHERGHESCU authored
Fix problems with some types. This enables Python's static type checkers to correctly identify some issues before runtime.
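As a hedged illustration of what such a fix buys (hypothetical code, not from the repository): with precise annotations, a checker like mypy can reject a bad call before the program ever runs.

```python
# Hypothetical example of an error a static type checker catches early.
def scale_loss(loss: float, steps: int) -> float:
    return loss / steps

# mypy flags the call below without running the program:
# scale_loss("0.5", 4)  # error: incompatible type "str"; expected "float"
```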
- Alexandru-Mihai GHERGHESCU authored
- Alexandru-Mihai GHERGHESCU authored
Add the model constructor arguments (n_layers, n_heads, dim, etc.) as PyTorch buffers. This packs them together with the model weights when calling `torch.save()`, and loads them back in when calling `torch.load()`. Eventually, these should be saved separately; for now, however, this will do.
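A minimal sketch of the mechanism, assuming the usual `register_buffer` approach (the class and argument names are illustrative): buffers land in the module's `state_dict`, so the hyperparameters travel inside the same file as the weights.

```python
import torch
from torch import nn

# Hypothetical model skeleton; only the buffer trick matters here.
class Transformer(nn.Module):
    def __init__(self, n_layers: int, n_heads: int, dim: int):
        super().__init__()
        # Buffers are saved/restored with the weights, unlike plain attributes.
        self.register_buffer("n_layers", torch.tensor(n_layers))
        self.register_buffer("n_heads", torch.tensor(n_heads))
        self.register_buffer("dim", torch.tensor(dim))

model = Transformer(n_layers=12, n_heads=8, dim=512)
torch.save(model.state_dict(), "model.pt")

state = torch.load("model.pt")
print(int(state["n_layers"]))  # 12, recovered straight from the checkpoint
```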
- Alexandru-Mihai GHERGHESCU authored
Visual change.
- Alexandru-Mihai GHERGHESCU authored
- Alexandru-Mihai GHERGHESCU authored
Add a note in the README file about using the environment variable `CUDA_VISIBLE_DEVICES`, which lets the user choose which GPU to run training on.
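The variable is typically set on the command line (e.g. `CUDA_VISIBLE_DEVICES=0 python train.py`); the equivalent from inside Python, sketched below, only works if done before CUDA is first initialized.

```python
import os

# Must happen before the first CUDA call; "1" (the second GPU) is
# just an example index.
os.environ["CUDA_VISIBLE_DEVICES"] = "1"

import torch
print(torch.cuda.device_count())  # reports only the GPU selected above
```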
- Alexandru-Mihai GHERGHESCU authored
The normalization layer returned float32 tensors instead of the fp16 tensors expected when training with mixed precision, which raised a runtime error about incompatible types. Rearrange the operations to compute the norm in float32 but return the value in fp16.
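A sketch of the rearranged computation, following the common RMSNorm-style pattern (the class is illustrative, not necessarily the repository's exact layer): the reduction runs in float32, and the result is cast back to the input's dtype.

```python
import torch
from torch import nn

# Hypothetical norm layer showing the dtype fix.
class RMSNorm(nn.Module):
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Compute the norm in float32 for numerical stability...
        norm = x.float() * torch.rsqrt(x.float().pow(2).mean(-1, keepdim=True) + self.eps)
        # ...then cast back, so fp16 inputs yield fp16 outputs.
        return (self.weight * norm).type_as(x)
```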
- Alexandru-Mihai GHERGHESCU authored
Total loss wasn't properly initialized, leading to a runtime error.