- Jan 24, 2024
-
-
Alexandru-Mihai GHERGHESCU authored
There was a corner case where the shape of the dataset's predictions y would not be correct, because the number of batches was miscalculated. This happened when `batch_len` was exactly divisible by `seq_len`, since the predictions, which are simply the text shifted once to the right, would not have that extra column at the end. Fix the issue by decrementing the number of available batches by 1 when `batch_len` is exactly divisible by `seq_len`.
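For illustration, a minimal sketch of the corrected batch count; the function name and layout are assumptions, not the project's actual code:

```python
def num_batches(batch_len: int, seq_len: int) -> int:
    """Batches available from a stream of batch_len tokens per column, where
    the targets y are the inputs x shifted one position to the right."""
    n = batch_len // seq_len
    if batch_len % seq_len == 0:
        # The last chunk has no extra token left for the shifted targets,
        # so one batch is lost (e.g. batch_len=512, seq_len=64 -> 7, not 8).
        n -= 1
    return n
```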
-
Alexandru-Mihai GHERGHESCU authored
Visual change. This only changes what the trainer reports as the final training loss. It's not clear the previously reported value was accurate anyway, since gradient accumulation does not let the optimizer step on every batch. For a big enough dataset, this should have no impact at all. The final loss value is now reported based on the last computation of the loss, correctly taking gradient accumulation into account.
-
- Jan 22, 2024
-
-
Alexandru-Mihai GHERGHESCU authored
Fix problems with some types. This enables Python's static type checkers to correctly identify some issues before runtime.
-
Alexandru-Mihai GHERGHESCU authored
-
Alexandru-Mihai GHERGHESCU authored
Add model constructor arguments (n_layers, n_heads, dim etc.) as PyTorch buffers. This packs them together with the model weights when calling `torch.save()`, and loads them back in when calling `torch.load()`. Eventually these should be saved separately, but this will do for now.
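A minimal sketch of the buffer approach, with a hypothetical model class name; buffers end up in the state dict, so they round-trip through `torch.save()` / `torch.load()`:

```python
import torch
import torch.nn as nn

class Optimus(nn.Module):  # hypothetical model class name
    def __init__(self, n_layers: int = 6, n_heads: int = 8, dim: int = 512):
        super().__init__()
        # Buffers are saved alongside the weights in the state dict.
        self.register_buffer("n_layers", torch.tensor(n_layers))
        self.register_buffer("n_heads", torch.tensor(n_heads))
        self.register_buffer("dim", torch.tensor(dim))
        # ... the actual layers would be built here ...

model = Optimus()
torch.save(model.state_dict(), "model.pth")
state = torch.load("model.pth")
print(int(state["n_layers"]))  # 6
```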
-
Alexandru-Mihai GHERGHESCU authored
Visual change.
-
Alexandru-Mihai GHERGHESCU authored
-
Alexandru-Mihai GHERGHESCU authored
Add a note in the README file about using the CUDA_VISIBLE_DEVICES environment variable, which lets the user choose which GPU to run training on.
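For reference, a small example of how the variable is typically used; the `train.py` entry point is an assumption:

```python
# Shell invocation restricting training to GPU 1 (train.py is hypothetical):
#   CUDA_VISIBLE_DEVICES=1 python train.py
import torch

# Inside the process, the selected GPU then appears as cuda:0.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(torch.cuda.device_count(), device)
```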
-
Alexandru-Mihai GHERGHESCU authored
The normalization layer returned float32 tensors instead of fp16 tensors when training with mixed precision, which raised a runtime error about incompatible types. Rearrange the operations to properly compute the norm in float32, but return the result in fp16.
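A minimal sketch of the pattern described, assuming an RMSNorm-style layer (the repo's history mentions a root-mean-square norm): compute in float32, return in the input's dtype.

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):  # illustrative layer, not the project's exact code
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Compute the norm in float32 for numerical stability...
        xf = x.float()
        out = xf * torch.rsqrt(xf.pow(2).mean(-1, keepdim=True) + self.eps)
        out = out * self.weight
        # ...but return the result in the input's dtype (fp16 under mixed precision).
        return out.type_as(x)
```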
-
Alexandru-Mihai GHERGHESCU authored
Total loss wasn't properly initialized, leading to a runtime error.
-
- Jan 18, 2024
-
-
Vlad-Andrei BĂDOIU (78692) authored
Fix tokenizer typos, add newlines See merge request !8
-
Vlad-Andrei BĂDOIU (78692) authored
Gradient accumulation See merge request !6
-
- Jan 12, 2024
-
-
Alexandru-Mihai GHERGHESCU authored
-
Alexandru-Mihai GHERGHESCU authored
32K vocab -> 16K vocab
-
Alexandru-Mihai GHERGHESCU authored
Fix some typos in the tokenizer file. Add newlines and whitespace to the tokenizer model. Previously, all whitespace was stripped and joined into a single blank. This allows better tokenization of datasets like wikitext103, whose articles contain newlines that carry meaning.
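A hedged sketch of tokenizer training settings that keep whitespace and newlines; the flags, file names and vocab size here are assumptions, not necessarily the project's actual configuration:

```python
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="wikitext103_train.txt",       # hypothetical corpus file
    model_prefix="optimus_16k",          # hypothetical output name
    vocab_size=16000,
    model_type="unigram",
    remove_extra_whitespaces=False,      # keep runs of whitespace
    normalization_rule_name="identity",  # don't normalize whitespace away
    user_defined_symbols=["\n"],         # keep newlines as explicit tokens
)
```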
-
- Jan 11, 2024
-
-
Vlad-Andrei BĂDOIU (78692) authored
Fix small typos in the model architecture See merge request !7
-
- Jan 09, 2024
-
-
Alexandru-Mihai GHERGHESCU authored
-
Alexandru-Mihai GHERGHESCU authored
Add gradient accumulation to the training loop. The number of gradient accumulation steps is exposed by the trainer.
-
Alexandru-Mihai GHERGHESCU authored
Make the gradient clipping norm value a parameter fed to the trainer.
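A minimal self-contained sketch of the two commits above (gradient accumulation plus the configurable clipping norm); the model, data and parameter names are illustrative, not the trainer's actual API:

```python
import torch
import torch.nn as nn

model = nn.Linear(16, 4)                                 # placeholder model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

grad_acc_steps = 4   # assumed trainer parameter
clip_norm = 1.0      # assumed trainer parameter

data = [(torch.randn(8, 16), torch.randint(0, 4, (8,))) for _ in range(12)]

optimizer.zero_grad()
for i, (x, y) in enumerate(data):
    loss = criterion(model(x), y)
    (loss / grad_acc_steps).backward()   # scale so accumulated grads average
    if (i + 1) % grad_acc_steps == 0:
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=clip_norm)
        optimizer.step()
        optimizer.zero_grad()
```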
-
- Jan 06, 2024
-
-
Vlad-Andrei BĂDOIU (78692) authored
Fix some issues with the wikitext103 dataset See merge request !4
-
Vlad-Andrei BĂDOIU (78692) authored
Add tinystories dataset See merge request !3
-
Vlad-Andrei BĂDOIU (78692) authored
Add progress bar display for training See merge request !2
-
- Jan 05, 2024
-
-
Alexandru-Mihai GHERGHESCU authored
Couple of things:
- rewrite the code to better check whether the dataset is already downloaded
- better cleanup after download + unzip
- more aggressive exit on checksum mismatch
- rewrite __main__
-
Alexandru-Mihai GHERGHESCU authored
Couple of things, mostly for code consistency and clarity:
- reorganize imports
- reorganize initial global variables (URL, MD5 etc.)
- rename the class to contain "Dataset"
- fix comments

There are also a few things added / replaced / removed, upon reconsideration of how datasets should work:
- add an additional "tinystories" folder where the .txt files are downloaded
- remove the pandas DataFrame
- rewrite the __main__ example
- be more aggressive when checksums for downloaded files don't match
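A hedged sketch of the checksum handling described above; the helper name is an assumption and the expected checksum would come from the dataset's own constants:

```python
import hashlib
import sys
from pathlib import Path

def verify_md5(path: Path, expected_md5: str) -> None:
    """Abort (and clean up) if the downloaded file's MD5 doesn't match."""
    md5 = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            md5.update(chunk)
    if md5.hexdigest() != expected_md5:
        path.unlink(missing_ok=True)  # aggressive: delete the corrupt file
        sys.exit(f"Checksum mismatch for {path}, aborting.")
```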
-
- Jan 03, 2024
-
-
Vlad-Andrei BĂDOIU (78692) authored
-
- Jan 02, 2024
-
-
Vlad-Andrei BĂDOIU (78692) authored
Rewrite training loop in PyTorch See merge request !1
-
- Dec 28, 2023
-
-
Alexandru-Mihai GHERGHESCU authored
Use fastai's fastprogress package to display a progress bar while training, with useful information such as the loss, estimated training time, current learning rate, and estimated ms/batch. Print end-of-epoch stats when an epoch finishes. Add a relevant trainer parameter to enable/disable the progress bar display.
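A minimal sketch of the fastprogress usage; the model, data and displayed stats are illustrative, not the trainer's actual code:

```python
import torch
import torch.nn as nn
from fastprogress.fastprogress import master_bar, progress_bar

model = nn.Linear(16, 4)                                 # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = nn.CrossEntropyLoss()
data = [(torch.randn(8, 16), torch.randint(0, 4, (8,))) for _ in range(50)]

mb = master_bar(range(2))                        # outer bar over epochs
for epoch in mb:
    for x, y in progress_bar(data, parent=mb):   # inner bar over batches
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()
        mb.child.comment = f"loss {loss.item():.4f}"
    mb.write(f"epoch {epoch}: loss {loss.item():.4f}")
```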
-
- Dec 27, 2023
-
-
Alexandru-Mihai GHERGHESCU authored
-
Alexandru-Mihai GHERGHESCU authored
Add an example of what training using the current code would look like. Most of this script can be copied and adapted for other datasets, or for evaluating/testing different Transformer models etc.
-
Alexandru-Mihai GHERGHESCU authored
The model is mostly modeled after the Llama 2 transformer, though it is missing a few things (grouped-query attention, a KV cache for inference, and rotary positional encodings). These will eventually make it into the Optimus code. At that point, the model might as well be called LLoptimus.
-
Alexandru-Mihai GHERGHESCU authored
Add Llama's 32K vocab tokenizer, as well as 2 Optimus variants trained on WikiText103 data: a 32K vocab tokenizer, and a 60K vocab tokenizer. Both Optimus tokenizers are unigram models.
-
Alexandru-Mihai GHERGHESCU authored
Add WikiText103 as an example of what a Dataset needs to look like for us to be able to use it in the training loop. Other Datasets can probably copy most of the code directly and modify small parts of it as needed.
-
Alexandru-Mihai GHERGHESCU authored
Add a few common functions that can be used by whatever dataset we need.
-
Alexandru-Mihai GHERGHESCU authored
Add a training loop, written from scratch. Currently, it is quite bare-bones (trains in FP32, no gradient accumulation, no parallel training etc.), but eventually this will be improved with other must-have things.
-
Alexandru-Mihai GHERGHESCU authored
This is a custom dataloader class, similar to PyTorch's DataLoader, but specialized for NLP tasks. Right now it is written pretty much from scratch, but eventually we want to use the built-in DataLoader, since it has some nice goodies attached to it (like data prefetching/preprocessing, serving data for parallel training etc.).
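A minimal sketch of such a column-major text dataloader; the class and attribute names are assumptions, not the project's actual code:

```python
import torch

class TextLoader:
    """Reshape a 1-D token stream into batch_size columns and serve
    (x, y) pairs where y is x shifted one position forward."""

    def __init__(self, tokens: torch.Tensor, batch_size: int, seq_len: int):
        batch_len = len(tokens) // batch_size
        self.data = tokens[: batch_len * batch_size].view(batch_size, batch_len).t()
        self.seq_len = seq_len
        # The shifted targets need one extra token, hence the -1.
        self.n_batches = (batch_len - 1) // seq_len

    def __len__(self) -> int:
        return self.n_batches

    def __iter__(self):
        for i in range(self.n_batches):
            start = i * self.seq_len
            x = self.data[start:start + self.seq_len]
            y = self.data[start + 1:start + self.seq_len + 1]
            yield x, y
```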
-
Alexandru-Mihai GHERGHESCU authored
-
Alexandru-Mihai GHERGHESCU authored
-
Alexandru-Mihai GHERGHESCU authored
-
- Nov 24, 2023
-
-
Alexandru-Mihai GHERGHESCU authored
Add parallel training on multiple GPUs through PyTorch's DistributedDataParallel (PyTorch DDP).
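A minimal sketch of the DDP setup, assuming a launch through `torchrun` (which sets RANK, LOCAL_RANK and WORLD_SIZE); the model is a placeholder:

```python
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = nn.Linear(16, 4).to(local_rank)      # placeholder model
    model = DDP(model, device_ids=[local_rank])
    # ... the usual training loop runs here; DDP synchronizes gradients
    # across processes during backward() ...

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```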
-
- Nov 08, 2023
-
-
Alexandru-Mihai GHERGHESCU authored
Some changes:
- label smoothing
- root-mean-square norm (RMSNorm) instead of layer norm
- move the norm before the layer instead of after, and add a final norm layer
- remove the attention for-loop and do one big matrix multiplication instead
- remove bias terms from linear layers
- add dropout
- remove and rename model parameters (easier to use in code)
- add weight tying (see the sketch below)
- add gradient accumulation (change to a lower batch size and a higher sequence length)
- add model checkpointing
- add gradient clipping
- move warmup steps to 15% of total steps, and change the learning rate accordingly
- move to 16-bit floating point (fp16); faster training on NVIDIA GPUs
- plot the final loss and the learning rate schedule
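As referenced in the list above, a hedged sketch of the weight tying and label smoothing items; the module and attribute names are assumptions:

```python
import torch
import torch.nn as nn

class TinyLM(nn.Module):                                  # placeholder model
    def __init__(self, vocab_size: int = 16000, dim: int = 512):
        super().__init__()
        self.tok_embeddings = nn.Embedding(vocab_size, dim)
        self.output = nn.Linear(dim, vocab_size, bias=False)  # no bias terms
        self.output.weight = self.tok_embeddings.weight       # weight tying

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.output(self.tok_embeddings(x))

criterion = nn.CrossEntropyLoss(label_smoothing=0.1)      # label smoothing
```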
-