Fix a number of issues with the infrastructure, no major rework
- Jan 24, 2024
- Alexandru-Mihai GHERGHESCU authored
Visual change: correctly display the final training loss. The final training loss didn't account for gradient accumulation, and was therefore reported as much smaller than it was in reality. Also fix the estimation interval, which was likewise not properly calculated due to gradient accumulation.
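A minimal sketch of the corrected accounting, assuming a standard accumulation loop (all names below are illustrative, not the repository's actual identifiers): the per-batch loss is scaled down before `backward()` so gradients average over the accumulated micro-batches, but the value tracked for display must be the unscaled one.

```python
import torch
from torch import nn

# Hypothetical sketch (not the repository's actual trainer code):
# report the training loss correctly under gradient accumulation.
model = nn.Linear(8, 1)
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
batches = [(torch.randn(4, 8), torch.randn(4, 1)) for _ in range(8)]

grad_acc_steps = 4
running_loss = 0.0

for i, (x, y) in enumerate(batches):
    loss = criterion(model(x), y)
    # Scale down so gradients average across accumulated micro-batches...
    (loss / grad_acc_steps).backward()
    # ...but accumulate the *unscaled* value for display purposes.
    running_loss += loss.item()

    # The optimizer only steps once every `grad_acc_steps` batches.
    if (i + 1) % grad_acc_steps == 0:
        optimizer.step()
        optimizer.zero_grad()

print(f"final training loss: {running_loss / len(batches):.4f}")
```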
- Alexandru-Mihai GHERGHESCU authored
There was a corner case in which the shape of the dataset's predictions `y` would be incorrect, because the number of batches was miscalculated. This happened when `batch_len` was exactly divisible by `seq_len`: the predictions, which are simply the text shifted once to the right, would not have that extra column at the end. Fix the issue by decrementing the number of available batches by 1 when `batch_len` divides exactly by `seq_len`.
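A small sketch of the corner case, reusing the `batch_len` and `seq_len` names from the message above (the helper itself is hypothetical): since targets are the inputs shifted one position to the right, the very last token of the data has no target, so when `seq_len` divides `batch_len` exactly, the final batch would be missing its last target column.

```python
# Hypothetical helper illustrating the fix; `batch_len` is the number of
# tokens per batch stream, `seq_len` the sequence length per batch.
def num_batches(batch_len: int, seq_len: int) -> int:
    n = batch_len // seq_len
    # Targets are the text shifted right by one token, so the last token
    # has no target; if seq_len divides batch_len exactly, the final
    # batch would be missing its last target column. Drop that batch.
    if batch_len % seq_len == 0:
        n -= 1
    return n

print(num_batches(1024, 128))  # 7 usable batches, not 8
print(num_batches(1000, 128))  # 7 as well; the remainder covers the shift
```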
- Alexandru-Mihai GHERGHESCU authored
Visual change. This only changes what the trainer reports as the final training loss. Not quite sure if the previous value was accurate anyway, since gradient accumulation would not let the optimizer step on every batch. For a big enough dataset, this should have no impact at all. The final loss value is now reported based on the last calculation of the loss, correctly taking gradient accumulation into consideration as well.
- Jan 22, 2024
- Alexandru-Mihai GHERGHESCU authored
Fix problems with some types. This enables Python's static type checkers to correctly identify some issues before runtime.
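As a hedged illustration of what such a fix buys (hypothetical code, not from the repository): with precise annotations, a checker like mypy can reject a bad call before the program ever runs.

```python
# Hypothetical example of an error a static type checker catches early.
def scale_loss(loss: float, steps: int) -> float:
    return loss / steps

# mypy flags the call below without running the program:
# scale_loss("0.5", 4)  # error: incompatible type "str"; expected "float"
```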
- Alexandru-Mihai GHERGHESCU authored
- Alexandru-Mihai GHERGHESCU authored
Add the model constructor arguments (n_layers, n_heads, dim, etc.) as PyTorch buffers. This packs them together with the model weights when calling `torch.save()`, and loads them back in when calling `torch.load()`. Eventually, these should be saved separately; for now, however, this will do.
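A minimal sketch of the mechanism, assuming the usual `register_buffer` approach (the class and argument names are illustrative): buffers land in the module's `state_dict`, so the hyperparameters travel inside the same file as the weights.

```python
import torch
from torch import nn

# Hypothetical model skeleton; only the buffer trick matters here.
class Transformer(nn.Module):
    def __init__(self, n_layers: int, n_heads: int, dim: int):
        super().__init__()
        # Buffers are saved/restored with the weights, unlike plain attributes.
        self.register_buffer("n_layers", torch.tensor(n_layers))
        self.register_buffer("n_heads", torch.tensor(n_heads))
        self.register_buffer("dim", torch.tensor(dim))

model = Transformer(n_layers=12, n_heads=8, dim=512)
torch.save(model.state_dict(), "model.pt")

state = torch.load("model.pt")
print(int(state["n_layers"]))  # 12, recovered straight from the checkpoint
```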
- Alexandru-Mihai GHERGHESCU authored
Visual change.
- Alexandru-Mihai GHERGHESCU authored
- Alexandru-Mihai GHERGHESCU authored
Add a note in the README file about using the environment variable `CUDA_VISIBLE_DEVICES`, which lets the user choose which GPU to run training on.
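The variable is typically set on the command line (e.g. `CUDA_VISIBLE_DEVICES=0 python train.py`); the equivalent from inside Python, sketched below, only works if done before CUDA is first initialized.

```python
import os

# Must happen before the first CUDA call; "1" (the second GPU) is
# just an example index.
os.environ["CUDA_VISIBLE_DEVICES"] = "1"

import torch
print(torch.cuda.device_count())  # reports only the GPU selected above
```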
- Alexandru-Mihai GHERGHESCU authored
The normalization layer returned float32 tensors instead of the fp16 tensors expected when training with mixed precision, which raised a runtime error about incompatible types. Rearrange the operations to compute the norm in float32 but return the value in fp16.
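A sketch of the rearranged computation, following the common RMSNorm-style pattern (the class is illustrative, not necessarily the repository's exact layer): the reduction runs in float32, and the result is cast back to the input's dtype.

```python
import torch
from torch import nn

# Hypothetical norm layer showing the dtype fix.
class RMSNorm(nn.Module):
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Compute the norm in float32 for numerical stability...
        norm = x.float() * torch.rsqrt(x.float().pow(2).mean(-1, keepdim=True) + self.eps)
        # ...then cast back, so fp16 inputs yield fp16 outputs.
        return (self.weight * norm).type_as(x)
```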
- Alexandru-Mihai GHERGHESCU authored
Total loss wasn't properly initialized, leading to a runtime error.