Llama 2 UPB (gAIna)

Minimum hardware requirements to run the model

The 7B Llama 2 model (the smallest one) needs roughly 16 GB of VRAM and 16 GB of RAM. If RAM is too small, use a bigger swap; swap is only needed while the weights are transferred onto the GPU, since no actual computation is done on the CPU.
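
If your machine is short on RAM, a minimal sketch for adding swap on Linux (the 16 GB size and the /swapfile path are illustrative, not requirements):

```
# Add a 16 GB swap file; pick a size so that RAM + swap covers the weights.
sudo fallocate -l 16G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile
swapon --show   # verify the new swap is active
```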

How to use

There are a few requirements to get the model to run: the model code itself, the weights, the Python dialog script (dialog.py), and a handful of Python packages.

A Dockerfile is provided to build an image from scratch using the above. A Docker image is already built in the project's container registry (gitlab.cs.pub.ro:5050/netsys/llama-test), so you can use that instead (you need to be logged in to the registry).

Other than that, the Nvidia Container Toolkit is necessary so that code running inside a Docker container can use the GPU.

Locally

Steps:

  1. Install the Nvidia Container Toolkit (steps for Ubuntu). This is what lets Docker containers use the GPU.
  2. Pull the Docker image: docker image pull gitlab.cs.pub.ro:5050/netsys/llama-test:latest.
  3. Run it with docker run -it --gpus all gitlab.cs.pub.ro:5050/netsys/llama-test:latest. Loading takes a while, after which a prompt to interact with Llama is displayed. The full sequence is consolidated in the sketch after this list.
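
The whole local flow, consolidated (the ubuntu image in the sanity check is just an example; the Nvidia runtime injects nvidia-smi into the container):

```
# Log in to the registry and pull the prebuilt image.
docker login gitlab.cs.pub.ro:5050
docker image pull gitlab.cs.pub.ro:5050/netsys/llama-test:latest

# Sanity check that containers can see the GPU.
docker run --rm --gpus all ubuntu nvidia-smi

# Start the interactive Llama prompt.
docker run -it --gpus all gitlab.cs.pub.ro:5050/netsys/llama-test:latest
```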

On the UPB cluster (fep)

Steps:

  1. Log in to fep (ssh <username>@fep.grid.pub.ro).
  2. Get a bash shell into a partition with a GPU (srun -p xl --gres gpu:tesla_p100:1 --mem=40G --pty bash).
  3. Pull the Docker image and convert it into an Apptainer image on the grid (apptainer pull docker://gitlab.cs.pub.ro:5050/netsys/llama-test:latest). This takes a while (around 40 minutes); if it fails with a timeout, simply run the same command again.
  4. Run the Apptainer image with apptainer run --nv docker://gitlab.cs.pub.ro:5050/netsys/llama-test:latest. The first start takes about 3 minutes; subsequent runs in the same session (i.e., without logging out) start in a few seconds. The full sequence is shown after this list.
  5. ???
  6. Profit
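
All the commands from the list above, in order:

```
# On your machine: log in to the cluster.
ssh <username>@fep.grid.pub.ro

# On fep: get a shell on a node with a P100 GPU and 40 GB of RAM.
srun -p xl --gres gpu:tesla_p100:1 --mem=40G --pty bash

# On the compute node: convert the Docker image to an Apptainer image
# (~40 minutes the first time; re-run on timeout), then start it.
apptainer pull docker://gitlab.cs.pub.ro:5050/netsys/llama-test:latest
apptainer run --nv docker://gitlab.cs.pub.ro:5050/netsys/llama-test:latest
```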

Note: The script may still fail with a CUDA out-of-memory error or when the conversation exceeds the context length. If that happens, reissue the run command to start a new dialog.

Limitations

Currently tested only with the 7B Llama 2 model, on a GPU with 16 GB of VRAM (Nvidia P100). The conversation context length (the --max_seq_len parameter of the script) is limited to 512 tokens, which amounts to roughly 2-3 back-and-forth exchanges with the AI. Increasing it will almost certainly trigger a CUDA out-of-memory error.
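
The dialog script's exact interface is not documented here; purely as a hypothetical example, assuming dialog.py mirrors Meta's Llama 2 reference scripts, changing the context length could look like this (checkpoint and tokenizer paths are placeholders):

```
# Hypothetical invocation; flags mirror Meta's llama reference code,
# and the paths are placeholders, not this repo's actual layout.
torchrun --nproc_per_node 1 dialog.py \
    --ckpt_dir llama-2-7b-chat/ \
    --tokenizer_path tokenizer.model \
    --max_seq_len 512 --max_batch_size 1
```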

TODOs

  • Pin the Python package versions used inside the Dockerfile, rather than leaving them unpinned, to prevent compatibility problems (see the sketch after this list).
  • Look into quantization (the current model is 8-bit quantized already).
  • Improve the dialog script.
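
For the first TODO, a sketch of what the Dockerfile's install step could run; the package names are those of Meta's llama reference code and the version numbers are illustrative, not this project's verified dependency set:

```
# Illustrative only: pin exact versions instead of whatever is latest.
pip install torch==2.0.1 fairscale==0.4.13 fire==0.5.0 sentencepiece==0.1.99
```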