- Jan 24, 2024
-
-
Alexandru-Mihai GHERGHESCU authored
There was a corner case where the shape of the dataset's predictions y would not be correct, because the number of batches was miscalculated. This happened when `batch_len` was exactly divisible by `seq_len`, since the predictions, which are simply the text shifted once to the right, would not have that extra column at the end. Fix the issue by decrementing the number of available batches by 1 when `batch_len` is exactly divisible by `seq_len`.
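For illustration, a minimal sketch of the corrected batch count; the function name and layout are assumptions, not the project's actual code:

```python
def num_batches(batch_len: int, seq_len: int) -> int:
    """Batches available from a stream of batch_len tokens per column, where
    the targets y are the inputs x shifted one position to the right."""
    n = batch_len // seq_len
    if batch_len % seq_len == 0:
        # The last chunk has no extra token left for the shifted targets,
        # so one batch is lost (e.g. batch_len=512, seq_len=64 -> 7, not 8).
        n -= 1
    return n
```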
-
Alexandru-Mihai GHERGHESCU authored
Visual change. This only changes what the trainer reports as the final training loss. It's not clear the previously reported value was accurate anyway, since gradient accumulation does not let the optimizer step on every batch. For a big enough dataset, this should have no impact at all. The final loss value is now reported based on the last computation of the loss, correctly taking gradient accumulation into account.
-
- Jan 22, 2024
-
-
Alexandru-Mihai GHERGHESCU authored
Fix problems with some types. This enables Python's static type checkers to correctly identify some issues before runtime.
-
Alexandru-Mihai GHERGHESCU authored
-
Alexandru-Mihai GHERGHESCU authored
Add model constructor arguments (n_layers, n_heads, dim etc.) as PyTorch buffers. This packs them together with the model weights when calling `torch.save()`, and loads them back in when calling `torch.load()`. Eventually these should be saved separately, but this will do for now.
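A minimal sketch of the buffer approach, with a hypothetical model class name; buffers end up in the state dict, so they round-trip through `torch.save()` / `torch.load()`:

```python
import torch
import torch.nn as nn

class Optimus(nn.Module):  # hypothetical model class name
    def __init__(self, n_layers: int = 6, n_heads: int = 8, dim: int = 512):
        super().__init__()
        # Buffers are saved alongside the weights in the state dict.
        self.register_buffer("n_layers", torch.tensor(n_layers))
        self.register_buffer("n_heads", torch.tensor(n_heads))
        self.register_buffer("dim", torch.tensor(dim))
        # ... the actual layers would be built here ...

model = Optimus()
torch.save(model.state_dict(), "model.pth")
state = torch.load("model.pth")
print(int(state["n_layers"]))  # 6
```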
-
Alexandru-Mihai GHERGHESCU authored
Visual change.
-
Alexandru-Mihai GHERGHESCU authored
-
Alexandru-Mihai GHERGHESCU authored
Add a note in the README file about using the CUDA_VISIBLE_DEVICES environment variable, which lets the user choose which GPU to run training on.
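For reference, a small example of how the variable is typically used; the `train.py` entry point is an assumption:

```python
# Shell invocation restricting training to GPU 1 (train.py is hypothetical):
#   CUDA_VISIBLE_DEVICES=1 python train.py
import torch

# Inside the process, the selected GPU then appears as cuda:0.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(torch.cuda.device_count(), device)
```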
-
Alexandru-Mihai GHERGHESCU authored
The normalization layer returned float32 tensors instead of fp16 tensors when training with mixed precision, which raised a runtime error about incompatible types. Rearrange the operations to properly compute the norm in float32, but return the result in fp16.
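A minimal sketch of the pattern described, assuming an RMSNorm-style layer (the repo's history mentions a root-mean-square norm): compute in float32, return in the input's dtype.

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):  # illustrative layer, not the project's exact code
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Compute the norm in float32 for numerical stability...
        xf = x.float()
        out = xf * torch.rsqrt(xf.pow(2).mean(-1, keepdim=True) + self.eps)
        out = out * self.weight
        # ...but return the result in the input's dtype (fp16 under mixed precision).
        return out.type_as(x)
```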
-
Alexandru-Mihai GHERGHESCU authored
Total loss wasn't properly initialized, leading to a runtime error.
-
- Jan 18, 2024
-
-
Vlad-Andrei BĂDOIU (78692) authored
Fix tokenizer typos, add newlines See merge request !8
-
Vlad-Andrei BĂDOIU (78692) authored
Gradient accumulation See merge request !6
-
- Jan 12, 2024
-
-
Alexandru-Mihai GHERGHESCU authored
-
Alexandru-Mihai GHERGHESCU authored
32K vocab -> 16K vocab
-
Alexandru-Mihai GHERGHESCU authored
Fix some typos in the tokenizer file. Add newlines and whitespace to the tokenizer model. Previously, all whitespace was stripped and joined into a single blank. This allows better tokenization of datasets like wikitext103, whose articles contain newlines that carry meaning.
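A hedged sketch of tokenizer training settings that keep whitespace and newlines; the flags, file names and vocab size here are assumptions, not necessarily the project's actual configuration:

```python
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="wikitext103_train.txt",       # hypothetical corpus file
    model_prefix="optimus_16k",          # hypothetical output name
    vocab_size=16000,
    model_type="unigram",
    remove_extra_whitespaces=False,      # keep runs of whitespace
    normalization_rule_name="identity",  # don't normalize whitespace away
    user_defined_symbols=["\n"],         # keep newlines as explicit tokens
)
```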
-
- Jan 11, 2024
-
-
Vlad-Andrei BĂDOIU (78692) authored
Fix small typos in the model architecture See merge request !7
-
- Jan 09, 2024
-
-
Alexandru-Mihai GHERGHESCU authored
-
Alexandru-Mihai GHERGHESCU authored
Add gradient accumulation to the training loop. The number of gradient accumulation steps is exposed by the trainer.
-
Alexandru-Mihai GHERGHESCU authored
Make the gradient clipping norm value a parameter fed to the trainer.
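A minimal self-contained sketch of the two commits above (gradient accumulation plus the configurable clipping norm); the model, data and parameter names are illustrative, not the trainer's actual API:

```python
import torch
import torch.nn as nn

model = nn.Linear(16, 4)                                 # placeholder model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

grad_acc_steps = 4   # assumed trainer parameter
clip_norm = 1.0      # assumed trainer parameter

data = [(torch.randn(8, 16), torch.randint(0, 4, (8,))) for _ in range(12)]

optimizer.zero_grad()
for i, (x, y) in enumerate(data):
    loss = criterion(model(x), y)
    (loss / grad_acc_steps).backward()   # scale so accumulated grads average
    if (i + 1) % grad_acc_steps == 0:
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=clip_norm)
        optimizer.step()
        optimizer.zero_grad()
```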
-
- Jan 06, 2024
-
-
Vlad-Andrei BĂDOIU (78692) authored
Fix some issues with the wikitext103 dataset See merge request !4
-
Vlad-Andrei BĂDOIU (78692) authored
Add tinystories dataset See merge request !3
-
Vlad-Andrei BĂDOIU (78692) authored
Add progress bar display for training See merge request !2
-
- Jan 05, 2024
-
-
Alexandru-Mihai GHERGHESCU authored
Couple of things:
- rewrite the code to better check whether the dataset is already downloaded
- better cleanup after download + unzip
- more aggressive exit on checksum mismatch
- rewrite __main__
-
Alexandru-Mihai GHERGHESCU authored
Couple of things, mostly for code consistency and clarity:
- reorganize imports
- reorganize initial global variables (URL, MD5 etc.)
- rename the class to contain "Dataset"
- fix comments

There are also a few things added / replaced / removed, upon reconsideration of how datasets should work:
- add an additional "tinystories" folder where the .txt files are downloaded
- remove the pandas DataFrame
- rewrite the __main__ example
- be more aggressive when checksums for downloaded files don't match
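A hedged sketch of the checksum handling described above; the helper name is an assumption and the expected checksum would come from the dataset's own constants:

```python
import hashlib
import sys
from pathlib import Path

def verify_md5(path: Path, expected_md5: str) -> None:
    """Abort (and clean up) if the downloaded file's MD5 doesn't match."""
    md5 = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            md5.update(chunk)
    if md5.hexdigest() != expected_md5:
        path.unlink(missing_ok=True)  # aggressive: delete the corrupt file
        sys.exit(f"Checksum mismatch for {path}, aborting.")
```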
-
- Jan 03, 2024
-
-
Vlad-Andrei BĂDOIU (78692) authored
-
- Jan 02, 2024
-
-
Vlad-Andrei BĂDOIU (78692) authored
Rewrite training loop in PyTorch See merge request !1
-
- Dec 28, 2023
-
-
Alexandru-Mihai GHERGHESCU authored
Use fastai's fastprogress package to display a progress bar while training, with useful information such as the loss, estimated training time, current learning rate, and estimated ms/batch. Print end-of-epoch stats when an epoch finishes. Add a relevant trainer parameter to enable/disable the progress bar display.
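A minimal sketch of the fastprogress usage; the model, data and displayed stats are illustrative, not the trainer's actual code:

```python
import torch
import torch.nn as nn
from fastprogress.fastprogress import master_bar, progress_bar

model = nn.Linear(16, 4)                                 # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = nn.CrossEntropyLoss()
data = [(torch.randn(8, 16), torch.randint(0, 4, (8,))) for _ in range(50)]

mb = master_bar(range(2))                        # outer bar over epochs
for epoch in mb:
    for x, y in progress_bar(data, parent=mb):   # inner bar over batches
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()
        mb.child.comment = f"loss {loss.item():.4f}"
    mb.write(f"epoch {epoch}: loss {loss.item():.4f}")
```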
-
- Dec 27, 2023
-
-
Alexandru-Mihai GHERGHESCU authored
-
Alexandru-Mihai GHERGHESCU authored
Add an example of what training using the current code would look like. Most of this script can be copied and adapted for other datasets, or for evaluating/testing different Transformer models etc.
-
Alexandru-Mihai GHERGHESCU authored
The model is mostly modeled after the Llama 2 transformer, though it is missing a few things (grouped-query attention, a KV cache for inference, and rotary positional encodings). These will eventually make it into the Optimus code. At that point, the model might as well be called LLoptimus.
-
Alexandru-Mihai GHERGHESCU authored
Add Llama's 32K vocab tokenizer, as well as 2 Optimus variants trained on WikiText103 data: a 32K vocab tokenizer, and a 60K vocab tokenizer. Both Optimus tokenizers are unigram models.
-
Alexandru-Mihai GHERGHESCU authored
Add WikiText103 as an example of what a Dataset needs to look like for us to be able to use it in the training loop. Other Datasets can probably copy most of the code directly and modify small parts of it as needed.
-
Alexandru-Mihai GHERGHESCU authored
Add a few common functions that can be used by whatever dataset we need.
-
Alexandru-Mihai GHERGHESCU authored
Add a training loop, written from scratch. Currently, it is quite bare-bones (trains in FP32, no gradient accumulation, no parallel training etc.), but eventually this will be improved with other must-have things.
-
Alexandru-Mihai GHERGHESCU authored
This is a custom dataloader class, similar to PyTorch's DataLoader, but specialized for NLP tasks. Right now it is written pretty much from scratch, but eventually we want to use the built-in DataLoader, since it has some nice goodies attached to it (like data prefetching/preprocessing, serving data for parallel training etc.).
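A minimal sketch of such a column-major text dataloader; the class and attribute names are assumptions, not the project's actual code:

```python
import torch

class TextLoader:
    """Reshape a 1-D token stream into batch_size columns and serve
    (x, y) pairs where y is x shifted one position forward."""

    def __init__(self, tokens: torch.Tensor, batch_size: int, seq_len: int):
        batch_len = len(tokens) // batch_size
        self.data = tokens[: batch_len * batch_size].view(batch_size, batch_len).t()
        self.seq_len = seq_len
        # The shifted targets need one extra token, hence the -1.
        self.n_batches = (batch_len - 1) // seq_len

    def __len__(self) -> int:
        return self.n_batches

    def __iter__(self):
        for i in range(self.n_batches):
            start = i * self.seq_len
            x = self.data[start:start + self.seq_len]
            y = self.data[start + 1:start + self.seq_len + 1]
            yield x, y
```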
-
Alexandru-Mihai GHERGHESCU authored
-
Alexandru-Mihai GHERGHESCU authored
-
Alexandru-Mihai GHERGHESCU authored
-
- Nov 24, 2023
-
-
Alexandru-Mihai GHERGHESCU authored
Add parallel training on multiple GPUs through PyTorch's DistributedDataParallel (PyTorch DDP).
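A minimal sketch of the DDP setup, assuming a launch through `torchrun` (which sets RANK, LOCAL_RANK and WORLD_SIZE); the model is a placeholder:

```python
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = nn.Linear(16, 4).to(local_rank)      # placeholder model
    model = DDP(model, device_ids=[local_rank])
    # ... the usual training loop runs here; DDP synchronizes gradients
    # across processes during backward() ...

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```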
-
- Nov 08, 2023
-
-
Alexandru-Mihai GHERGHESCU authored
Some changes:
- label smoothing
- root-mean-square norm (RMSNorm) instead of layer norm
- move the norm before the layer instead of after, and add a final norm layer
- remove the attention for-loop and do one big matrix multiplication instead
- remove bias terms from linear layers
- add dropout
- remove and rename model parameters (easier to use in code)
- add weight tying (see the sketch below)
- add gradient accumulation (change to a lower batch size and a higher sequence length)
- add model checkpointing
- add gradient clipping
- move warmup steps to 15% of total steps, and change the learning rate accordingly
- move to 16-bit floating point (fp16); faster training on NVIDIA GPUs
- plot the final loss and the learning rate schedule
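As referenced in the list above, a hedged sketch of the weight tying and label smoothing items; the module and attribute names are assumptions:

```python
import torch
import torch.nn as nn

class TinyLM(nn.Module):                                  # placeholder model
    def __init__(self, vocab_size: int = 16000, dim: int = 512):
        super().__init__()
        self.tok_embeddings = nn.Embedding(vocab_size, dim)
        self.output = nn.Linear(dim, vocab_size, bias=False)  # no bias terms
        self.output.weight = self.tok_embeddings.weight       # weight tying

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.output(self.tok_embeddings(x))

criterion = nn.CrossEntropyLoss(label_smoothing=0.1)      # label smoothing
```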
-