- Feb 18, 2024
-
-
Vlad-Andrei BĂDOIU (78692) authored
This commit adapts the existing code to use the distributed library via a config option. To achieve this, we switch to using PyTorch's DataLoader.
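A minimal sketch of what such a switch typically looks like (the dataset, option name and batch size below are placeholders, not the repository's actual code):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

# Placeholder dataset; the framework's own dataset class would be used instead.
dataset = TensorDataset(torch.arange(1024).view(-1, 8))

use_distributed = False  # would come from the config option mentioned above
sampler = DistributedSampler(dataset) if use_distributed else None
loader = DataLoader(
    dataset,
    batch_size=32,
    shuffle=(sampler is None),  # the sampler handles shuffling when distributed
    sampler=sampler,
)
```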
-
Vlad-Andrei BĂDOIU (78692) authored
-
- Feb 15, 2024
-
-
Alexandru-Mihai GHERGHESCU authored
Pick a better default epsilon value. In mixed precision training this value should never touch the fp16 gradients (the optimizer only ever works on the master fp32 copy of the model), so it didn't strictly need to change. However, in pure fp16 training, any epsilon value lower than 1e-7 would simply underflow to 0 and become useless. Although the framework doesn't directly support that second case, an epsilon value of 1e-7 seems like a better default for both AMP and normal training.
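A quick standalone illustration of the underflow behaviour described above (a sketch, not the framework's code):

```python
import torch

# Epsilon values below ~1e-7 are not representable in fp16 and collapse to 0,
# so they no longer protect the optimizer's update denominator.
print(torch.tensor(1e-8, dtype=torch.float16))  # tensor(0., dtype=torch.float16)
print(torch.tensor(1e-7, dtype=torch.float16))  # small but non-zero subnormal

# Hypothetical optimizer setup using the new default:
# optimizer = torch.optim.AdamW(model.parameters(), eps=1e-7)
```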
-
Alexandru-Mihai GHERGHESCU authored
This should give training a theoretical 2x speedup (though in practice that's rarely achieved in full), with close to no loss in model quality. The interface lets the user choose between mixed precision training and normal float32 precision. CPU training support has been dropped: with or without mixed precision it takes far longer than training on GPUs, it's not an alternative anyone seriously considers, and supporting both CPU and GPU alongside mixed precision would complicate things too much.
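A minimal sketch of a mixed precision training step, assuming the framework follows the standard PyTorch autocast + GradScaler recipe (the model, loss and data names are placeholders):

```python
import torch

use_amp = True  # the user-facing option described above (name assumed)
scaler = torch.cuda.amp.GradScaler(enabled=use_amp)

def train_step(model, optimizer, loss_fn, x, y):
    optimizer.zero_grad(set_to_none=True)
    # Forward pass runs in fp16 where safe, fp32 elsewhere.
    with torch.cuda.amp.autocast(enabled=use_amp):
        logits = model(x)
        loss = loss_fn(logits, y)
    # Scale the loss to avoid fp16 gradient underflow, then unscale and step.
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
    return loss.detach()
```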
-
- Jan 30, 2024
-
-
Vlad-Andrei BĂDOIU (78692) authored
Fix estimation interval. See merge request !16
-
- Jan 29, 2024
-
-
Vlad-Andrei BĂDOIU (78692) authored
-
Alexandru-Mihai GHERGHESCU authored
Fix a bug where the estimation interval would be 0. This only happened for (very) small datasets, with a gradient accumulation step count different from 1.
-
- Jan 28, 2024
-
-
Vlad-Andrei BĂDOIU (78692) authored
Add inference code. See merge request !15
-
- Jan 26, 2024
-
-
Alexandru-Mihai GHERGHESCU authored
Output model tokens per second at the end of inference.
-
Alexandru-Mihai GHERGHESCU authored
This allows the inference code to start with a prompt, instead of waiting for user input from stdin. It makes scripting easier, which is useful for batch generation, benchmarking, etc.
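A rough sketch of that kind of interface (the flag name is hypothetical; the repository's actual script may differ):

```python
import argparse

parser = argparse.ArgumentParser(description="Generate text from a trained model")
parser.add_argument("--prompt", type=str, default=None,
                    help="Starting prompt; if omitted, read it from stdin")
args = parser.parse_args()

# Fall back to interactive input only when no prompt was given on the command line.
prompt = args.prompt if args.prompt is not None else input("> ")
```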
-
- Jan 25, 2024
-
-
Alexandru-Mihai GHERGHESCU authored
-
Alexandru-Mihai GHERGHESCU authored
Inference example code. At the moment, the code simply loads a model state file and generates text from it. Parameters such as the maximum sequence length, whether training used fp16, which tokenizer was used for training, etc. need to be passed manually by the user (there's a lot of room for error here). To be improved. Merges changes from !14. Closes !14.
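Roughly, the manual setup the message warns about looks like this (every name below is a placeholder, not the repository's actual API):

```python
import torch

# The user has to know these values from the training run; nothing is stored
# alongside the weights yet, so a mismatch fails or silently misbehaves.
max_seq_len = 512
trained_in_fp16 = True
tokenizer_path = "tokenizer.model"

model = build_model(max_seq_len=max_seq_len)  # hypothetical constructor
model.load_state_dict(torch.load("model.pth", map_location="cuda"))
if trained_in_fp16:
    model.half()
model.eval()
```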
-
Vlad-Andrei BĂDOIU (78692) authored
This reverts commit cb893907, reversing changes made to 83f7b518.
-
Vlad-Andrei BĂDOIU (78692) authored
Restructure project. See merge request !13
-
Vlad-Andrei BĂDOIU (78692) authored
Add inference code. See merge request !10
-
-
-
Reorganize the folder structure so the project looks like an actual library. Move the training example out of the framework code.
-
Vlad-Andrei BĂDOIU (78692) authored
Add merge request template. See merge request !12
-
Vlad-Andrei BĂDOIU (78692) authored
Fix datasets memory issues. See merge request !9
-
Vlad-Andrei BĂDOIU (78692) authored
Fix a number of issues with the infrastructure, no major rework. See merge request !11
-
- Jan 24, 2024
-
-
Alexandru-Mihai GHERGHESCU authored
Add a merge request template to aid contributions to the codebase. Also see https://docs.gitlab.com/ee/user/project/description_templates.html.
-
Alexandru-Mihai GHERGHESCU authored
Visual change: correctly display the final training loss. The final training loss didn't account for gradient accumulation, and was therefore much smaller than it should've been in reality. Also fix the estimation interval, which wasn't properly calculated due to gradient accumulation either.
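For context, a sketch of why the reported value comes out too small under gradient accumulation (variable names and values are assumptions, not the framework's code):

```python
import torch

grad_accum_steps = 4
logits = torch.randn(8, 10, requires_grad=True)
targets = torch.randint(0, 10, (8,))

# Each micro-batch loss is divided by the accumulation step count before
# backward(), so the raw value is grad_accum_steps times smaller than the
# real per-batch loss; it has to be scaled back up before being reported.
loss = torch.nn.functional.cross_entropy(logits, targets) / grad_accum_steps
loss.backward()
reported_loss = loss.item() * grad_accum_steps
```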
-
Vlad-Andrei BĂDOIU (78692) authored
-
Alexandru-Mihai GHERGHESCU authored
There was a corner case where the shape of the dataset's predictions y would be wrong because the number of batches was miscalculated. This happened when `batch_len` was exactly divisible by `seq_len`: the predictions, which are simply the text shifted once to the right, would not have that extra column at the end. Fix the issue by decrementing the number of available batches by 1 when `batch_len` is exactly divisible by `seq_len`.
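A sketch of the corner case and the fix (variable names follow the message; the values and exact implementation are assumptions):

```python
seq_len = 128
batch_len = 1024  # example value: exactly divisible by seq_len

# The targets y are the inputs x shifted right by one token, so the last
# window needs one extra token after it.
n_batches = batch_len // seq_len
if batch_len % seq_len == 0:
    # No spare token left for the shifted targets of the last window: drop it.
    n_batches -= 1
```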
-
Alexandru-Mihai GHERGHESCU authored
Visual change. This only affects what the trainer reports as the final training loss. It's not clear the previous value was accurate anyway, since with gradient accumulation the optimizer doesn't step on every batch. For a big enough dataset this should have no impact at all. The final loss is now reported from the last computed loss value, correctly taking gradient accumulation into account.
-
- Jan 22, 2024
-
-
Alexandru-Mihai GHERGHESCU authored
This takes newlines into account too. Just a visual accuracy change.
-
Alexandru-Mihai GHERGHESCU authored
Fix problems with some types. This enables Python's static type checkers to correctly identify some issues before runtime.
-
Alexandru-Mihai GHERGHESCU authored
-
Alexandru-Mihai GHERGHESCU authored
Add the model constructor arguments (n_layers, n_heads, dim, etc.) as PyTorch buffers. This packs them together with the model weights when calling `torch.save()`, and loads them back in when calling `torch.load()`. Eventually these should be saved separately, but this will do for now.
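A minimal sketch of the buffer trick (class and argument names are illustrative, not the repository's exact model):

```python
import torch
import torch.nn as nn

class TinyTransformer(nn.Module):
    def __init__(self, dim: int = 512, n_layers: int = 8, n_heads: int = 8):
        super().__init__()
        # Stored as buffers so they end up in state_dict() next to the weights.
        self.register_buffer("dim", torch.tensor(dim))
        self.register_buffer("n_layers", torch.tensor(n_layers))
        self.register_buffer("n_heads", torch.tensor(n_heads))
        # ... the actual layers would follow here ...

model = TinyTransformer()
torch.save(model.state_dict(), "model.pth")  # hyperparameters travel along
restored = torch.load("model.pth")
print(int(restored["n_layers"]))             # 8
```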
-
Alexandru-Mihai GHERGHESCU authored
Visual change.
-
Alexandru-Mihai GHERGHESCU authored
-
Alexandru-Mihai GHERGHESCU authored
Add a note in the README file about the environment variable CUDA_VISIBLE_DEVICES, which lets the user choose which GPU to run training on.
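The equivalent of that note, sketched in Python (normally the variable is set from the shell, e.g. `CUDA_VISIBLE_DEVICES=1 python train.py`; the GPU index here is just an example):

```python
import os

# Must be set before CUDA is initialized (i.e. before the first CUDA call),
# otherwise it has no effect. Here only GPU 1 is visible to the process.
os.environ["CUDA_VISIBLE_DEVICES"] = "1"

import torch
print(torch.cuda.device_count())  # 1, assuming the machine has a GPU with index 1
```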
-
Alexandru-Mihai GHERGHESCU authored
The normalization layer returned float32 tensors instead of the fp16 tensors expected when training with mixed precision, which raised a runtime error about incompatible types. Rearrange the operations to compute the norm in float32, but return the value in fp16.
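A sketch of the pattern described, assuming an RMSNorm-style layer (the repository's actual layer may differ): do the reduction in float32 for stability, then cast the result back to the input's dtype so autocast-produced fp16 activations stay fp16.

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Compute the norm in float32 for numerical stability...
        norm = x.float() * torch.rsqrt(x.float().pow(2).mean(-1, keepdim=True) + self.eps)
        # ...then scale and cast back so the output matches the caller's dtype
        # (fp16 under autocast).
        return (self.weight * norm).type_as(x)
```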
-
Alexandru-Mihai GHERGHESCU authored
Total loss wasn't properly initialized, leading to a runtime error.
-
Alexandru-Mihai GHERGHESCU authored
Fix an issue where whole files were read into memory at once. For example, reading the TinyStories train dataset (a 2.2GB file) would fill up 20GB of RAM due to variable allocation inside Python. The fix uses I/O buffering and reads lines one by one, processing them as they are read. This leads to much lower RAM usage (around the size of the file) and also increases processing speed.
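A sketch of the buffered, line-by-line approach (file path and per-line processing are placeholders):

```python
def iter_lines(path: str):
    # open() already buffers I/O; iterating over the file object yields one
    # line at a time instead of materializing the whole file like f.read().
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            yield line.rstrip("\n")

# Hypothetical usage: process each line as it is read.
# for line in iter_lines("TinyStories-train.txt"):
#     tokens = tokenizer.encode(line)
```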
-
- Jan 18, 2024
-
-
Vlad-Andrei BĂDOIU (78692) authored
Fix tokenizer typos, add newlines. See merge request !8
-
Vlad-Andrei BĂDOIU (78692) authored
Gradient accumulation. See merge request !6
-
- Jan 12, 2024
-
-
Alexandru-Mihai GHERGHESCU authored
-
Alexandru-Mihai GHERGHESCU authored
32K vocab -> 16K vocab
-