Feature: Separate configurations

The framework's configuration should be kept completely separate from the code itself. In particular:

- everything related to training (batch size, learning rate, etc.),
- everything related to the model (layers, etc.),
- everything related to datasets and tokenizers (see llama-recipes, which gets this right), and
- everything related to initializing distributed training (see Hugging Face Accelerate's config.yaml; the idea is to keep FSDP/DDP parameters out of the code, since it's very hard to follow otherwise). A minimal sketch of this layout follows the list.
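To make the idea concrete, here is a minimal sketch of what a separated config could look like, assuming a YAML layout in the spirit of Accelerate's config.yaml. All section names, fields, and the dataclass schema below are hypothetical illustrations, not a proposed final schema; the YAML is inlined as a string only so the example is self-contained.

```python
# Hypothetical separate-config sketch; every name/field here is illustrative.
from dataclasses import dataclass

import yaml  # pip install pyyaml

# In practice this would live in its own file (e.g. a configs/ directory),
# mirroring how Accelerate keeps FSDP/DDP settings in config.yaml.
EXAMPLE_YAML = """
training:
  batch_size: 32
  lr: 3.0e-4
  epochs: 10
model:
  n_layers: 12
  hidden_size: 768
dataset:
  name: wikitext
  tokenizer: gpt2
distributed:
  strategy: fsdp        # or "ddp"
  mixed_precision: bf16
"""

@dataclass
class TrainingConfig:
    batch_size: int
    lr: float
    epochs: int

@dataclass
class ModelConfig:
    n_layers: int
    hidden_size: int

@dataclass
class DatasetConfig:
    name: str
    tokenizer: str

@dataclass
class DistributedConfig:
    strategy: str
    mixed_precision: str

@dataclass
class Config:
    training: TrainingConfig
    model: ModelConfig
    dataset: DatasetConfig
    distributed: DistributedConfig

def load_config(text: str) -> Config:
    # Parse the YAML and validate it against the typed sections; unknown
    # keys raise a TypeError immediately instead of failing mid-training.
    raw = yaml.safe_load(text)
    return Config(
        training=TrainingConfig(**raw["training"]),
        model=ModelConfig(**raw["model"]),
        dataset=DatasetConfig(**raw["dataset"]),
        distributed=DistributedConfig(**raw["distributed"]),
    )

if __name__ == "__main__":
    cfg = load_config(EXAMPLE_YAML)
    print(cfg.training.lr, cfg.distributed.strategy)  # 0.0003 fsdp
```

The training code then only receives a typed `Config` object; swapping DDP for FSDP, or one dataset for another, becomes a YAML edit rather than a code change, which is the whole point of pulling these parameters out of the code.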