Llama 2 UPB (gAIna)
Minimum hardware requirements to run the model
The 7B Llama 2 model (the smallest one) runs on roughly 16GB of vRAM and about as much RAM. If the RAM is too small, use a larger swap (this should only be needed while the weights are transferred onto the GPU; no actual computation is done on the CPU).
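If extra swap is needed only for loading the weights, a temporary swap file is usually enough. A minimal sketch for Ubuntu; the 32G size is an arbitrary example, not a measured requirement:

```sh
# Create and enable a temporary swap file (the 32G size is only an example)
sudo fallocate -l 32G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile

# Once the model is loaded, the swap file can be removed again
sudo swapoff /swapfile
sudo rm /swapfile
```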
How to use
Running the model requires a few things: the model code itself, the weights, the Python script that opens a dialog, and a few Python packages.
A Dockerfile is provided to build an image from scratch using the above. A Docker image is already built (see here), so you can use that instead (you need to be logged in).
Other than that, the Nvidia Container Toolkit is necessary to run Nvidia code on the GPU inside a Docker container.
Locally
Steps:
- Install the Nvidia Container Toolkit (steps for Ubuntu); it is necessary to let Docker containers use the GPU. You can verify the installation with the check after this list.
- Download the Docker container image (`docker image pull gitlab.cs.pub.ro:5050/netsys/llama-test:latest`).
- Run the Docker image with `docker run -it --gpus all gitlab.cs.pub.ro:5050/netsys/llama-test:latest`. This should take a while to load, but then a prompt to interact with Llama is displayed.
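Before pulling the (large) Llama image, you can sanity-check that Docker can see the GPU by running `nvidia-smi` inside a small CUDA container. The CUDA image tag below is just an example and is not part of this project:

```sh
# The GPU should show up in nvidia-smi from inside the container;
# the CUDA base image tag is an arbitrary example.
docker run --rm --gpus all nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi
```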
On the UPB cluster (fep)
Steps:
- Log in to fep (`ssh <username>@fep.grid.pub.ro`).
- Get a bash shell on a partition with a GPU (`srun -p xl --gres gpu:tesla_p100:1 --mem=40G --pty bash`).
- Pull and build the Docker image into an Apptainer image on the grid (`apptainer pull docker://gitlab.cs.pub.ro:5050/netsys/llama-test:latest`). This will take a while (around 40 minutes). If it fails because of a timeout, simply run the same command again.
- Run the Apptainer image with `apptainer run --nv docker://gitlab.cs.pub.ro:5050/netsys/llama-test:latest`. The first run takes about 3 minutes to start; subsequent runs (in the same session, without logging out) take only a few seconds.
- ???
- Profit
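Put together, a typical fep session looks roughly like this (the commands are the same ones listed above; `<username>` is a placeholder):

```sh
# Log in to fep
ssh <username>@fep.grid.pub.ro

# Get a shell on a GPU node
srun -p xl --gres gpu:tesla_p100:1 --mem=40G --pty bash

# Convert the Docker image to an Apptainer image (slow the first time;
# rerun the same command if it fails with a timeout)
apptainer pull docker://gitlab.cs.pub.ro:5050/netsys/llama-test:latest

# Start the dialog prompt (--nv exposes the GPU inside the container)
apptainer run --nv docker://gitlab.cs.pub.ro:5050/netsys/llama-test:latest
```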
Note: The script may still exit with an Out-of-Memory error, or because the maximum context length was reached. If that happens, reissue the run command to start a new dialog.
Limitations
Currently only tested with the 7B Llama 2 model on a 16GB vRAM GPU (Nvidia P100). The conversation context length (the `--max_seq_len` parameter of the script) is limited to 512 tokens (about 2-3 back-and-forth exchanges with the AI). Increasing this will (almost surely) result in an Out-of-Memory CUDA error.
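If you want to experiment with the limit anyway, and assuming the image's entrypoint forwards extra arguments to the dialog script (an assumption; check the Dockerfile's entrypoint before relying on it), the parameter could be passed like this:

```sh
# Assumes extra arguments are forwarded to the dialog script; on a 16GB GPU,
# values above 512 will most likely trigger a CUDA Out-of-Memory error.
docker run -it --gpus all gitlab.cs.pub.ro:5050/netsys/llama-test:latest \
    --max_seq_len 512
```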
TODOs
- Pin the Python package versions used inside the Dockerfile, rather than leaving them unpinned, to prevent compatibility problems (see the sketch after this list).
- Look into quantization (the current model is 8-bit quantized already).
- Improve the dialog script file.
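A sketch of what pinning could look like in the Dockerfile's install step. The package names and versions below are hypothetical examples, not the ones the image actually uses:

```sh
# Hypothetical pinned install step (e.g. inside a RUN instruction); replace the
# package list and versions with the ones the image actually depends on.
pip install \
    "torch==2.0.1" \
    "fairscale==0.4.13" \
    "sentencepiece==0.1.99" \
    "fire==0.5.0"
```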