- Jun 10, 2024
-
-
Alexandru-Mihai GHERGHESCU authored
This should now work with any PyTorch model (Optimus is the example given in the source code), as well as any HuggingFace model (the code was adjusted to be independent of the model source).
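For illustration, a minimal sketch of the idea (hypothetical names, not the framework's actual API): the trainer only assumes an `nn.Module`, so a local PyTorch model and a HuggingFace model fit the same interface.

```python
import torch.nn as nn

# Hypothetical sketch: the trainer only relies on the nn.Module interface,
# so the model source (local code or HuggingFace) doesn't matter.
def build_model(source: str) -> nn.Module:
    if source == "optimus":
        from optimus.models import OptimusTransformer  # placeholder import path
        return OptimusTransformer()
    from transformers import AutoModelForCausalLM      # any HuggingFace causal LM
    return AutoModelForCausalLM.from_pretrained(source)

model = build_model("gpt2")  # trainer.fit(model, ...) works either way
```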
-
- Jun 04, 2024
-
-
Alexandru-Mihai GHERGHESCU authored
-
Alexandru-Mihai GHERGHESCU authored
Add the most basic form of parallelism to the framework, through PyTorch's DDP (DistributedDataParallel). Adjust the dataloaders to also use distributed samplers. Add other goodies for distributed logging and distributed processing.
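A minimal sketch of the usual DDP + DistributedSampler setup (the model and dataset below are stand-ins, not the framework's own code):

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

# Run with: torchrun --nproc_per_node=N train.py
dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = torch.nn.Linear(128, 128).cuda(local_rank)   # stand-in for the real model
model = DDP(model, device_ids=[local_rank])

dataset = TensorDataset(torch.randn(1024, 128))      # stand-in for the real dataset
sampler = DistributedSampler(dataset)                # shards the data across ranks
loader = DataLoader(dataset, batch_size=32, sampler=sampler)

for epoch in range(3):
    sampler.set_epoch(epoch)   # so every epoch gets a different shuffle
    for (x,) in loader:
        out = model(x.cuda(local_rank))
```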
-
Alexandru-Mihai GHERGHESCU authored
This should make it much easier to control logging levels, as well as when and where to log.
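A rough sketch of what such a helper might look like (names are illustrative, not the framework's actual logging API); in distributed runs, non-zero ranks are kept quiet by default:

```python
import logging
import os

# Hypothetical sketch: one logger per module, with a configurable level and
# quiet non-zero ranks when running distributed.
def get_logger(name: str, level: str = "INFO") -> logging.Logger:
    logger = logging.getLogger(name)
    if int(os.environ.get("RANK", "0")) != 0:
        level = "WARNING"  # only rank 0 logs INFO and below
    logger.setLevel(level)
    if not logger.handlers:
        handler = logging.StreamHandler()
        handler.setFormatter(logging.Formatter("%(asctime)s %(name)s %(levelname)s: %(message)s"))
        logger.addHandler(handler)
    return logger

log = get_logger(__name__)
log.info("training started")
```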
-
Alexandru-Mihai GHERGHESCU authored
-
Alexandru-Mihai GHERGHESCU authored
Add the Trainer configuration as a separate class. This holds all the training options as a separate dataclass, which can also easily be passed in as a JSON file. Also organize the code in the main training loop a bit more.
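A minimal sketch of the pattern (field names are illustrative only, not the actual options):

```python
import json
from dataclasses import dataclass

# Hypothetical trainer config; the real option names may differ.
@dataclass
class TrainerConfig:
    lr: float = 3e-4
    batch_size: int = 32
    grad_accumulation_steps: int = 1
    epochs: int = 1

    @classmethod
    def from_json(cls, path: str) -> "TrainerConfig":
        with open(path) as f:
            return cls(**json.load(f))

config = TrainerConfig(lr=1e-4)                               # built directly...
# config = TrainerConfig.from_json("trainer_config.json")     # ...or from a JSON file
```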
-
Alexandru-Mihai GHERGHESCU authored
Use PyTorch's built-in DataLoader. This should be much easier to work with, and has better support for pre-fetching, memory pinning and other goodies which improve training time.
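For reference, a small sketch of the DataLoader features mentioned above (the dataset is a stand-in):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Illustrative only: background loading, prefetching and memory pinning
# come from a few keyword arguments.
dataset = TensorDataset(torch.randint(0, 32000, (10_000, 512)))  # stand-in data
loader = DataLoader(
    dataset,
    batch_size=32,
    shuffle=True,
    num_workers=4,       # load batches in background worker processes
    pin_memory=True,     # faster host-to-GPU copies
    prefetch_factor=2,   # batches prefetched per worker
    drop_last=True,
)
```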
-
Alexandru-Mihai GHERGHESCU authored
Drop the SentencePiece tokenizers, as HuggingFace's tokenizers library has a much nicer interface to work with; it's also written in Rust, parallelizable, and better integrated with the rest of the ecosystem. Switching to HuggingFace tokenizers should not affect performance at all.
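A small sketch of training a BPE tokenizer with the HuggingFace tokenizers library (the corpus path and vocabulary size are placeholders):

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

# Illustrative sketch: train a small BPE tokenizer, save it, round-trip text.
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()
trainer = BpeTrainer(vocab_size=32000, special_tokens=["[UNK]", "[PAD]", "[BOS]", "[EOS]"])
tokenizer.train(files=["corpus.txt"], trainer=trainer)   # any plain-text corpus
tokenizer.save("tokenizer.json")

ids = tokenizer.encode("Hello world").ids
print(tokenizer.decode(ids))
```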
-
- Jun 03, 2024
-
-
Alexandru-Mihai GHERGHESCU authored
This should be much easier to work with, as we don't have to write a separate dataset implementation each time. HuggingFace's datasets library also has nice functionality which we can use, without loss of performance.
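A hedged sketch of the pattern (the dataset name is just an example, and `tokenizer` is assumed to be any tokenizer, e.g. the one from the sketch above):

```python
from datasets import load_dataset

# Illustrative only: load a standard text dataset and tokenize it with .map(),
# instead of writing a custom dataset class.
dataset = load_dataset("wikitext", "wikitext-103-raw-v1", split="train")
dataset = dataset.map(
    lambda batch: {"ids": [tokenizer.encode(t).ids for t in batch["text"]]},
    batched=True,
    remove_columns=["text"],
)
print(dataset[0]["ids"][:10])
```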
-
Alexandru-Mihai GHERGHESCU authored
Use double quotes for docstrings and for f-strings which contain single quotes inside; use single quotes everywhere else.
-
Alexandru-Mihai GHERGHESCU authored
Gradient (or activation) checkpointing trades extra compute for saved memory. Overall, this should make it easier to train large models on not-so-large hardware. Add checkpointing to every layer (same as HuggingFace), as opposed to every 2-3 layers, since 1) it is the easiest to implement, and 2) it has the best balance between memory and compute.
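A minimal sketch of per-layer checkpointing with PyTorch's utility (the layer stack below is a stand-in, not the Optimus architecture):

```python
import torch
from torch.utils.checkpoint import checkpoint

# Illustrative sketch: checkpoint every layer, so activations are recomputed
# during the backward pass instead of being kept in memory.
class CheckpointedStack(torch.nn.Module):
    def __init__(self, n_layers: int = 12, dim: int = 768):
        super().__init__()
        self.layers = torch.nn.ModuleList(
            torch.nn.Sequential(torch.nn.Linear(dim, dim), torch.nn.GELU())
            for _ in range(n_layers)
        )

    def forward(self, x):
        for layer in self.layers:
            x = checkpoint(layer, x, use_reentrant=False)  # one checkpoint per layer
        return x

x = torch.randn(8, 128, 768, requires_grad=True)
CheckpointedStack()(x).sum().backward()
```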
-
Alexandru-Mihai GHERGHESCU authored
Add PyTorch's core scaled dot-product attention (SDPA) to Optimus. This automatically uses FlashAttention-2 or memory-efficient attention if the hardware supports it, and falls back to the manual implementation if it doesn't. Training should be much faster with this; memory usage should also be around half of what it was before.
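For reference, the core call looks like this (shapes and dtype are illustrative; the fast kernels are only picked when the device and dtype allow it):

```python
import torch
import torch.nn.functional as F

# Illustrative sketch: q, k, v have shape (batch, n_heads, seq_len, head_dim).
# SDPA dispatches to FlashAttention-2 or the memory-efficient kernel when
# possible, and otherwise falls back to the math implementation.
q = k = v = torch.randn(1, 8, 128, 64, device="cuda", dtype=torch.float16)
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)  # (1, 8, 128, 64)
```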
-
Alexandru-Mihai GHERGHESCU authored
-
Alexandru-Mihai GHERGHESCU authored
This should be much nicer to work with, since every option/setting of the model can be controlled through a dataclass; the config can also easily be created from a JSON file. Set a naming scheme for the Optimus model, similar to HuggingFace models.
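A hedged sketch of what such a model config and naming scheme might look like (field names, sizes and model names below are made up for illustration):

```python
from dataclasses import dataclass

# Hypothetical model config; the real fields may differ.
@dataclass
class OptimusConfig:
    vocab_size: int = 32000
    dim: int = 768
    n_layers: int = 12
    n_heads: int = 12
    max_seq_len: int = 2048

# Naming scheme similar to HuggingFace models, e.g. sized variants:
CONFIGS = {
    "optimus-110m": OptimusConfig(dim=768, n_layers=12, n_heads=12),
    "optimus-350m": OptimusConfig(dim=1024, n_layers=24, n_heads=16),
}
```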
-
- Feb 16, 2024
-
-
Vlad-Andrei BĂDOIU (78692) authored
-
- Feb 15, 2024
-
-
Vlad-Andrei BĂDOIU (78692) authored
Add fp16 mixed precision training. See merge request !17
-
Alexandru-Mihai GHERGHESCU authored
Pick a better default epsilon value. In mixed precision training this value should never touch the fp16 gradients (the optimizer only ever works on the master fp32 copy of the model), so strictly speaking it wouldn't need to change. However, in pure fp16 training, an epsilon as small as the usual 1e-8 default simply underflows to 0, making it useless. Although the framework doesn't directly support the latter case, an epsilon value of 1e-7 seems like a better default for both AMP and normal training.
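The underflow is easy to check directly (illustrative only):

```python
import torch

# 1e-8 is below the smallest fp16 subnormal and rounds to zero; 1e-7 survives.
print(torch.tensor(1e-8, dtype=torch.float16))  # tensor(0., dtype=torch.float16)
print(torch.tensor(1e-7, dtype=torch.float16))  # tensor(1.1921e-07, dtype=torch.float16)
```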
-
Alexandru-Mihai GHERGHESCU authored
This should give training a theoretical 2x speedup (though in practice the gain is usually smaller), with close to no loss in model quality. The interface allows the user to choose between mixed precision training and normal float32 precision. CPU training support has been dropped: with or without mixed precision, training on CPUs takes much, much longer than on GPUs, so it's not an alternative anyone really considers, and with the addition of mixed precision, supporting both CPU and GPU would complicate things too much.
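A minimal sketch of the standard fp16 mixed precision training step with PyTorch AMP (the model, optimizer and data are stand-ins, not the framework's own code):

```python
import torch
from torch.cuda.amp import GradScaler, autocast

model = torch.nn.Linear(128, 1).cuda()                          # stand-in model
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, eps=1e-7)
scaler = GradScaler()

for _ in range(10):
    x, y = torch.randn(32, 128).cuda(), torch.randn(32, 1).cuda()
    optimizer.zero_grad(set_to_none=True)
    with autocast(dtype=torch.float16):            # forward + loss in fp16 where safe
        loss = torch.nn.functional.mse_loss(model(x), y)
    scaler.scale(loss).backward()                  # scale to avoid gradient underflow
    scaler.step(optimizer)                         # unscales, skips the step on inf/nan
    scaler.update()
```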
-
- Jan 30, 2024
-
-
Vlad-Andrei BĂDOIU (78692) authored
Fix estimation interval. See merge request !16
-
- Jan 29, 2024
-
-
Vlad-Andrei BĂDOIU (78692) authored
-
Alexandru-Mihai GHERGHESCU authored
Fix a bug where the estimation interval would be 0. This only happened for (very) small datasets, with gradient accumulation steps different from 1.
-
- Jan 28, 2024
-
-
Vlad-Andrei BĂDOIU (78692) authored
Add inference code. See merge request !15
-
- Jan 26, 2024
-
-
Alexandru-Mihai GHERGHESCU authored
Output model tokens per second at the end of inference.
-
Alexandru-Mihai GHERGHESCU authored
This allows the inference code to start up with a prompt, instead of waiting for user input from stdin. It makes scripting easier, which is useful for batch generation, benchmarking, etc.
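A small sketch of the pattern (the flag name is illustrative, not necessarily the one used here):

```python
import argparse

# Hypothetical: accept the prompt as a CLI argument, falling back to stdin.
parser = argparse.ArgumentParser()
parser.add_argument("--prompt", type=str, default=None,
                    help="starting prompt; if omitted, read from stdin")
args = parser.parse_args()

prompt = args.prompt if args.prompt is not None else input("> ")
```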
-
- Jan 25, 2024
-
-
Alexandru-Mihai GHERGHESCU authored
-
Alexandru-Mihai GHERGHESCU authored
Inference example code. At the moment, the code simply loads a model state file and generates text using it. Parameters like the maximum sequence length, whether training used fp16, which tokenizer was used for training etc. need to be passed in manually by the user (there's a lot of room for error here). To be improved. Merges changes from !14. Closes !14.
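A rough sketch of the flow described above; `OptimusTransformer`, `config` and `generate()` are placeholder names, and the paths, sequence length and dtype are exactly the things the user currently has to get right by hand:

```python
import torch
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_file("tokenizer.json")          # must be the training tokenizer

model = OptimusTransformer(config)                         # must match the trained model
model.load_state_dict(torch.load("model_state.pt", map_location="cpu"))
model.eval()

ids = tokenizer.encode("Once upon a time").ids
with torch.no_grad():
    out_ids = generate(model, ids, max_seq_len=512)        # placeholder helper
print(tokenizer.decode(out_ids))
```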
-
Vlad-Andrei BĂDOIU (78692) authored
This reverts commit cb893907, reversing changes made to 83f7b518.
-
Vlad-Andrei BĂDOIU (78692) authored
Restructure project. See merge request !13
-
Vlad-Andrei BĂDOIU (78692) authored
Add inference code. See merge request !10
-
-
-
Reorganize the folder structure to make the project look like an actual library. Move the training example outside of the framework code.
-
Vlad-Andrei BĂDOIU (78692) authored
Add merge request template. See merge request !12
-
Vlad-Andrei BĂDOIU (78692) authored
Fix datasets memory issues. See merge request !9
-
Vlad-Andrei BĂDOIU (78692) authored
Fix a number of issues with the infrastructure, no major rework. See merge request !11
-
- Jan 24, 2024
-
-
Alexandru-Mihai GHERGHESCU authored
Add a merge request template which aids in contributing to the codebase. Also see https://docs.gitlab.com/ee/user/project/description_templates.html.
-
Alexandru-Mihai GHERGHESCU authored
Visual change: correctly display the final training loss. The final training loss didn't account for gradient accumulation, and was therefore much smaller than it should have been. Also fix the estimation interval, which was likewise not properly calculated due to gradient accumulation.
-
Vlad-Andrei BĂDOIU (78692) authored
-
Alexandru-Mihai GHERGHESCU authored
There was a corner case where the shape of the dataset's predictions y would not be correct, because the number of batches was miscalculated. This happened when `batch_len` was exactly divisible by `seq_len`, since the predictions, which are simply the text shifted one position to the right, would not have that extra column at the end. Fix the issue by decrementing the number of available batches by 1 when `batch_len` is exactly divisible by `seq_len`.
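A sketch of the arithmetic behind the fix (variable names follow the commit message; this is illustrative, not the exact framework code):

```python
# Targets are the inputs shifted right by one token, so when batch_len divides
# evenly by seq_len the last batch has no target for its final position and
# must be dropped.
n_batches = batch_len // seq_len
if batch_len % seq_len == 0:
    n_batches -= 1
```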
-
Alexandru-Mihai GHERGHESCU authored
Visual change. This only changes what the trainer reports as the final training loss. The previous value was probably not accurate anyway, since gradient accumulation means the optimizer doesn't step on every batch. For a big enough dataset, this should not have any impact at all. The final loss value is now reported based on the last calculation of the loss, correctly taking gradient accumulation into account.
-