
Compute/memory requirements scripts

Open Alexandru-Mihai GHERGHESCU requested to merge feature/scripts into main
4 files changed, +142 −74
  • Move all the model setup into a separate script. Add per-architecture
    variables (for example, feed-forward matrix sizes), since most
    architectures today vary in one way or another.

    This makes it easier to change values around and get more meaningful
    results, and also enables users to more easily add new models.
@@ -22,7 +22,7 @@ since those use fundamentally different approaches.
## Memory requirements
Memory requirements are given by the script `memory_req.py`. Change the values
-at the top (or use the predefined defaults), run it and get the output. These
+at the top (predefined models in `setups.py`), run it and get the output. These
assume full 32-bit floating point training (mixed precision will slightly
decrease the total memory, since some of the activations will be calculated
using 16-bit floating point; therefore, expect the activations to be slightly
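The FP32 accounting the hunk above describes can be sketched roughly as below. This is an illustrative approximation only, not the actual code or variable names from `memory_req.py`; it assumes Adam-style optimizer states and ignores activations, which depend on batch size and context length.

```python
def training_memory_gb(n_params: float) -> float:
    """Rough full-FP32 training memory (weights + grads + Adam states), in GiB.

    Illustrative sketch; the real `memory_req.py` also accounts for
    activations, which this deliberately leaves out.
    """
    bytes_per_param = 4                            # full 32-bit floats
    weights = n_params * bytes_per_param
    grads = n_params * bytes_per_param             # one gradient per weight
    adam_states = 2 * n_params * bytes_per_param   # momentum + variance
    return (weights + grads + adam_states) / 1024**3

# A 7B-parameter model already needs on the order of ~104 GiB
# before any activation memory is counted.
print(f"{training_memory_gb(7e9):.0f} GiB")
```

This is why mixed precision helps mostly on the activation side: the 16 bytes per parameter above (weights, gradients, optimizer states) largely remain in 32-bit.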
@@ -66,8 +66,8 @@ cluster](https://lumi-supercomputer.eu/scaling-the-pre-training-of-large-languag
## Compute requirements
Compute requirements for training models can be calculated using the script
-`compute_req.py`. Change the values at the top (or use predefined defaults), run
-it and get the output.
+`compute_req.py`. Change the values at the top (see `setups.py`), run it and get
+the output.
Notice that total compute is not affected by either batch size or context
length. Since the model needs to see the whole dataset anyway, it doesn't really
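The claim that total compute is independent of batch size and context length matches the common C ≈ 6·N·D approximation (N parameters, D training tokens). A minimal sketch, assuming that approximation rather than whatever formula `compute_req.py` actually uses:

```python
def training_flops(n_params: float, n_tokens: float) -> float:
    """Total training compute via the common C ~= 6 * N * D rule of thumb.

    Illustrative only. Note the result depends solely on model size and
    dataset size -- batch size and context length just change how the
    same total work is sliced into steps, not how much work there is.
    """
    return 6 * n_params * n_tokens

# e.g. a 7B-parameter model trained on 1T tokens
print(f"{training_flops(7e9, 1e12):.1e} FLOPs")
```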
@@ -92,4 +92,5 @@ represents a small percent of the batch update.
> too much about adjusting batch size, as gradient accumulation can be used to
> increase that value without memory overhead. The total number of GPU's should
> then be adapted in `compute_req.py`, and multiplied by whatever factor for
-> using data-parallel (2x, 3x, 4x etc.), as described above.
+> using data-parallel (2x, 3x, 4x etc.), as described above. If your model is
+> not present in `setups.py`, add it (and also open a pull request :) !).
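The interplay of gradient accumulation and data-parallel scaling in the note above boils down to one multiplication. A sketch with made-up names (these are not the variables in `compute_req.py`):

```python
def effective_batch(micro_batch: int, accum_steps: int, dp_gpus: int) -> int:
    """Effective (global) batch size per optimizer step.

    Gradient accumulation multiplies the batch without extra memory cost,
    and data-parallel replicas multiply it again across GPUs.
    """
    return micro_batch * accum_steps * dp_gpus

# 4 sequences per GPU, 8 accumulation steps, 2x data-parallel -> 64
print(effective_batch(4, 8, 2))
```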