Unverified Commit 2d3f8cc8 authored by Alexandru-Mihai GHERGHESCU

Update README (fix typos, formatting)

parent f4abadf8
1 merge request: !9 Compute/memory requirements scripts
@@ -8,9 +8,9 @@ which can then be iterated upon.
The estimations are based on the usual hyperparameters chosen when
training a Transformer-type LLM: number of layers, number of heads, embedding
dimension, batch size, training context length, vocabulary size. These should
-give a rough idea as to the memory requirements for a classical GPT-style
-decoder-only Transformer. Other modifications on top of the architecture (such
-as SwiGLU activations inside the feedforward layer, multi-query or grouped-query
+give a rough idea as to the requirements for a classical GPT-style decoder-only
+Transformer. Other modifications on top of the architecture (such as SwiGLU
+activations inside the feedforward layer, multi-query or grouped-query
attention, rotary positional embeddings etc.) will likely change the values by a
little bit, though not enough to justify a totally different approach to the
estimations.
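
As a quick illustration of how these hyperparameters translate into a parameter
count and a baseline memory footprint, here is a minimal sketch in Python. The
12·d_model² per-layer figure and the ~16 bytes per parameter for mixed-precision
Adam training are the usual back-of-the-envelope approximations, not the exact
formulas used by the scripts in this repository, and activations (which depend
on batch size and context length) are left out.

```python
# Back-of-the-envelope parameter count and optimizer-state memory for a
# GPT-style decoder-only Transformer. Illustrative sketch only; the function
# names and constants are assumptions, not the repository's scripts.

def estimate_params(n_layers: int, d_model: int, vocab_size: int, context_len: int) -> int:
    embeddings = vocab_size * d_model + context_len * d_model  # token + learned positional
    per_layer = 12 * d_model ** 2                              # ~4*d^2 attention + ~8*d^2 feedforward
    return embeddings + n_layers * per_layer

def state_memory_gib(n_params: int, bytes_per_param: float = 16) -> float:
    # Mixed-precision Adam: fp16 weights + fp16 grads + fp32 master copy and
    # moments ~= 16 bytes/param. Activations are excluded here.
    return n_params * bytes_per_param / 1024 ** 3

n = estimate_params(n_layers=32, d_model=4096, vocab_size=32_000, context_len=2048)
print(f"~{n / 1e9:.1f}B parameters, ~{state_memory_gib(n):.0f} GiB of weights/optimizer state")
```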
@@ -72,7 +72,7 @@ it and get the output.
Notice that total compute is not affected by either batch size or context
length. Since the model needs to see the whole dataset anyway, it doesn't really
matter how it is partitioned (whether there are fewer big
-chunks, or more large chunks). Batch size and context length will, however,
+chunks, or more small chunks). Batch size and context length will, however,
affect memory usage. Context length will also indirectly affect dataset size.
The intuition is that bigger context would need more dataset tokens to be
fully trained. Increasing context length should generally result in increasing
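
To make this concrete, a common approximation (which may differ in detail from
what the scripts actually compute) is that total training compute is
C ≈ 6 · N · D FLOPs, where N is the parameter count and D the number of dataset
tokens; batch size and context length do not appear in it. A hedged sketch,
with assumed A100-class peak throughput and utilization:

```python
# Total training compute via the common C ~= 6 * N * D approximation
# (N = parameters, D = dataset tokens). Batch size and context length do not
# appear: they shape memory per step, not total FLOPs. Hardware figures below
# are assumptions for illustration.

def training_flops(n_params: float, n_tokens: float) -> float:
    return 6.0 * n_params * n_tokens

def training_days(n_params: float, n_tokens: float, n_gpus: int,
                  peak_flops: float = 312e12, mfu: float = 0.4) -> float:
    # peak_flops ~ A100 bf16 peak; mfu = assumed model FLOPs utilization
    seconds = training_flops(n_params, n_tokens) / (n_gpus * peak_flops * mfu)
    return seconds / 86_400

# e.g. a 7B-parameter model on 1T tokens across 64 GPUs -> roughly two months
print(f"{training_days(7e9, 1e12, n_gpus=64):.0f} days")
```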
@@ -81,8 +81,9 @@ scenario).
Be careful with the estimations given low numbers (a small dataset, a model
with a low number of parameters etc.), as communication/software times will
-start to matter when the compute needed per step update is low. The GPU's
-usually work best when fed big matrices, which keeps them occupied more fully.
+start to matter when the compute needed per batch update is low. GPUs
+usually work best when fed big matrices and when network communication only
+represents a small fraction of the batch update time.
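
As a rough illustration of this point (all hardware numbers below are
assumptions for the sketch, not measurements from these scripts), the gradient
all-reduce cost per step is roughly constant, so it dominates only when the
compute per batch is small:

```python
# Why low per-step compute hurts: the gradient all-reduce time is roughly
# fixed per step, so it dominates when the matmul work per batch is small.
# All hardware numbers are illustrative assumptions.

def step_times(n_params: float, tokens_per_batch: int, n_gpus: int,
               peak_flops: float = 312e12, mfu: float = 0.4,
               bus_bytes_per_s: float = 300e9) -> tuple[float, float]:
    compute = 6 * n_params * tokens_per_batch / (n_gpus * peak_flops * mfu)
    comm = 2 * (n_params * 2) / bus_bytes_per_s  # ring all-reduce of fp16 grads (~2x payload)
    return compute, comm

for tokens in (4_096, 1_048_576):
    c, m = step_times(1e9, tokens, n_gpus=8)
    print(f"{tokens:>9} tokens/batch: compute {c*1e3:7.1f} ms, comm {m*1e3:5.1f} ms "
          f"({m / (c + m):.1%} of the step)")
```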
# Running the scripts together