Llama 2 UPB (gAIna)
Minimum hardware requirements to run the model
The 7B Llama 2 model (the smallest one) runs on roughly 16GB of vRAM and about as much RAM. If the RAM is too small, use a larger swap (this should only be needed while the weights are transferred onto the GPU; no actual computation is done on the CPU).
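If extra swap is needed only for loading the weights, a temporary swap file is usually enough. A minimal sketch for Ubuntu; the 32G size is an arbitrary example, not a measured requirement:

```sh
# Create and enable a temporary swap file (the 32G size is only an example)
sudo fallocate -l 32G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile

# Once the model is loaded, the swap file can be removed again
sudo swapoff /swapfile
sudo rm /swapfile
```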
How to use
Running the model requires a few things: the model code itself, the weights, the Python script that opens a dialog, and a few Python packages.
A Dockerfile is provided to build an image from scratch using the above. A Docker image is already built (see here), so you can use that instead (you need to be logged in).
Other than that, the Nvidia Container Toolkit is necessary to run Nvidia code on the GPU inside a Docker container.
Locally
Steps:
- Install the Nvidia Container Toolkit (steps for Ubuntu); it is necessary to let Docker containers use the GPU. You can verify the installation with the check after this list.
- Download the Docker container image (`docker image pull gitlab.cs.pub.ro:5050/netsys/llama-test:latest`).
- Run the Docker image with `docker run -it --gpus all gitlab.cs.pub.ro:5050/netsys/llama-test:latest`. This should take a while to load, but then a prompt to interact with Llama is displayed.
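Before pulling the (large) Llama image, you can sanity-check that Docker can see the GPU by running `nvidia-smi` inside a small CUDA container. The CUDA image tag below is just an example and is not part of this project:

```sh
# The GPU should show up in nvidia-smi from inside the container;
# the CUDA base image tag is an arbitrary example.
docker run --rm --gpus all nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi
```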
On the UPB cluster (fep)
Steps:
- Log in to fep (`ssh <username>@fep.grid.pub.ro`).
- Get a bash shell on a partition with a GPU (`srun -p xl --gres gpu:tesla_p100:1 --mem=40G --pty bash`).
- Pull and build the Docker image into an Apptainer image on the grid (`apptainer pull docker://gitlab.cs.pub.ro:5050/netsys/llama-test:latest`). This will take a while (around 40 minutes). If it fails because of a timeout, simply run the same command again.
- Run the Apptainer image with `apptainer run --nv docker://gitlab.cs.pub.ro:5050/netsys/llama-test:latest`. The first run takes about 3 minutes to start; subsequent runs (in the same session, without logging out) take only a few seconds.
- ???
- Profit
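Put together, a typical fep session looks roughly like this (the commands are the same ones listed above; `<username>` is a placeholder):

```sh
# Log in to fep
ssh <username>@fep.grid.pub.ro

# Get a shell on a GPU node
srun -p xl --gres gpu:tesla_p100:1 --mem=40G --pty bash

# Convert the Docker image to an Apptainer image (slow the first time;
# rerun the same command if it fails with a timeout)
apptainer pull docker://gitlab.cs.pub.ro:5050/netsys/llama-test:latest

# Start the dialog prompt (--nv exposes the GPU inside the container)
apptainer run --nv docker://gitlab.cs.pub.ro:5050/netsys/llama-test:latest
```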
Note: The script may still exit with an Out-of-Memory error, or because the maximum context length was reached. If that happens, reissue the run command to start a new dialog.
Limitations
Currently only tested with the 7B Llama 2 model on a 16GB vRAM GPU (Nvidia P100). The conversation context length (the `--max_seq_len` parameter of the script) is limited to 512 tokens (about 2-3 back-and-forth exchanges with the AI). Increasing this will (almost surely) result in an Out-of-Memory CUDA error.
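If you want to experiment with the limit anyway, and assuming the image's entrypoint forwards extra arguments to the dialog script (an assumption; check the Dockerfile's entrypoint before relying on it), the parameter could be passed like this:

```sh
# Assumes extra arguments are forwarded to the dialog script; on a 16GB GPU,
# values above 512 will most likely trigger a CUDA Out-of-Memory error.
docker run -it --gpus all gitlab.cs.pub.ro:5050/netsys/llama-test:latest \
    --max_seq_len 512
```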
TODOs
- Pin the Python package versions used inside the Dockerfile, rather than leaving them unpinned, to prevent compatibility problems (see the sketch after this list).
- Look into quantization (the current model is 8-bit quantized already).
- Improve the dialog script file.
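A sketch of what pinning could look like in the Dockerfile's install step. The package names and versions below are hypothetical examples, not the ones the image actually uses:

```sh
# Hypothetical pinned install step (e.g. inside a RUN instruction); replace the
# package list and versions with the ones the image actually depends on.
pip install \
    "torch==2.0.1" \
    "fairscale==0.4.13" \
    "sentencepiece==0.1.99" \
    "fire==0.5.0"
```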