Serving Llama3 over a multi-node GPU Kubernetes cluster with auto-scaling
vLLM Serving
With the rising demand for large-scale AI models like Llama3, serving these models efficiently over distributed GPU resources is critical. In this post, we’ll explore how to deploy and manage Llama3 on a multi-node GPU Kubernetes cluster, leveraging auto-scaling to handle dynamic workloads.
What should you have before starting?
Working knowledge of Kubernetes, LLMs, GPUs, Prometheus, and Hugging Face before reading further.
A Kubernetes cluster whose worker nodes have NVIDIA GPUs. You can quickly create one using AKS or EKS, which installs the required NVIDIA drivers and device plugins for you. Also set up Prometheus to monitor the GPUs. My setup consisted of AKS with a node pool of 2 nodes of size Standard_NC96ads_A100_v4, which have NVIDIA A100 GPUs.
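If you haven't wired up GPU monitoring yet, one common route (an assumption on my part, not the only way) is NVIDIA's DCGM exporter Helm chart, which exposes metrics such as DCGM_FI_DEV_GPU_UTIL that the autoscaler will rely on later:

```shell
# Install NVIDIA's DCGM exporter so Prometheus can scrape GPU metrics.
# Repo URL is from the dcgm-exporter project; namespace is an assumption.
helm repo add gpu-helm-charts https://nvidia.github.io/dcgm-exporter/helm-charts
helm repo update
helm install dcgm-exporter gpu-helm-charts/dcgm-exporter \
  --namespace monitoring --create-namespace
```

Make sure your Prometheus instance is configured to scrape the exporter's pods.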
Create your account on https://huggingface.co/ and create an access token. Next, search for Llama and apply for access to the model. I selected meta-llama/Meta-Llama-3-70B-Instruct.
I used vLLM to serve the model. What is vLLM?
vLLM is a specialized serving system designed to efficiently deploy and run large language models (LLMs). It focuses on optimizing performance, memory usage, and scalability when serving models like GPT-3, LLaMA, and others in production environments. You can read more in the vLLM documentation at https://docs.vllm.ai.
Let's start by creating the YAMLs.
Secret: Create a secret so that the vLLM pods can access the model. Use the access token you created on Hugging Face. Note that Kubernetes expects values under data: to be base64-encoded; using stringData: instead lets you paste the token as-is.
apiVersion: v1
kind: Secret
metadata:
  name: vllm-huggingface-token
  namespace: vllm-ns
stringData:
  HUGGINGFACE_TOKEN: <Enter your access token>

Deployment: The vLLM deployment below takes arguments such as:
--model: the same model name for which you were granted access on Hugging Face.
--gpu-memory-utilization: the fraction of GPU memory vLLM is allowed to use, between 0 and 1.
--tensor-parallel-size: the number of tensor-parallel replicas. Read more about tensor parallelism in the vLLM docs.
There are more options available; see the vLLM engine arguments documentation to learn more.
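A quick back-of-the-envelope calculation (a sketch, assuming fp16 weights and 80 GB A100s) shows why tensor-parallel-size is set to 2 for the 70B model: the weights alone don't fit on a single GPU.

```python
# Rough memory estimate for serving Llama-3-70B in fp16 (2 bytes/parameter).
params_billion = 70
bytes_per_param = 2  # fp16/bf16
weights_gb = params_billion * bytes_per_param  # ~140 GB of weights alone

gpu_mem_gb = 80          # one A100 80GB
tensor_parallel = 2      # --tensor-parallel-size
total_gpu_gb = gpu_mem_gb * tensor_parallel  # 160 GB pooled across the shards

headroom_gb = total_gpu_gb - weights_gb  # left over for KV cache, activations
print(f"weights: ~{weights_gb} GB, pooled GPU memory: {total_gpu_gb} GB, "
      f"headroom: ~{headroom_gb} GB")
```

With only ~20 GB of headroom for the KV cache, this is a tight fit, which is part of why the deployment below pushes --gpu-memory-utilization up to 1.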
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm
  namespace: vllm-ns
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm
  template:
    metadata:
      labels:
        app: vllm
    spec:
      containers:
      - name: vllm
        image: vllm/vllm-openai:latest
        args:
        - --model
        - meta-llama/Meta-Llama-3-70B-Instruct
        - --gpu-memory-utilization
        - "1"
        - --tensor-parallel-size
        - "2"
        - --enforce-eager
        env:
        - name: HUGGING_FACE_HUB_TOKEN
          valueFrom:
            secretKeyRef:
              name: vllm-huggingface-token
              key: HUGGINGFACE_TOKEN
        ports:
        - containerPort: 8000
          protocol: TCP
        resources:
          limits:
            nvidia.com/gpu: "2"
        volumeMounts:
        - mountPath: /root/.cache/huggingface
          name: vllm
      nodeSelector:
        accelerator: nvidia
      tolerations:
      - effect: NoSchedule
        key: nvidia.com/gpu
        operator: Exists
      hostIPC: true
      volumes:
      - name: vllm
        persistentVolumeClaim:
          claimName: vllm

PVC: Create a storage volume to cache the downloaded model weights.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: vllm
  namespace: vllm-ns
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 200Gi
  storageClassName: default

Service: Create a service to access the inference endpoint. It will also load-balance requests across multiple vLLM pod replicas.
apiVersion: v1
kind: Service
metadata:
  name: vllm-service
  namespace: vllm-ns
spec:
  type: ClusterIP
  sessionAffinity: None
  selector:
    app: vllm
  ports:
  - port: 80
    protocol: TCP
    targetPort: 8000

HPA (horizontal pod autoscaler): This will monitor the metric and, once usage goes beyond the defined value, start creating more pods (up to maxReplicas).
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: vllm
  namespace: vllm-ns
spec:
  minReplicas: 1
  maxReplicas: 6 # Update this according to your desired number of replicas
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: vllm
  metrics:
  - type: Pods
    pods:
      metric:
        name: DCGM_FI_DEV_GPU_UTIL
      target:
        type: AverageValue
        averageValue: 30

Now the final step is to create a namespace called vllm-ns and apply all the YAMLs above to the cluster. Check whether all resources are running. Initially there will be only one pod running, as defined in the deployment.
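One caveat: the HPA consumes DCGM_FI_DEV_GPU_UTIL as a Pods metric, which means something must expose that Prometheus series through the custom.metrics.k8s.io API; the Prometheus Adapter is the usual choice. Below is a sketch of an adapter rule; the label names are assumptions and must match how your DCGM exporter's series are actually labeled in Prometheus.

```yaml
# prometheus-adapter values.yaml fragment (a sketch, not a drop-in config):
# maps the DCGM GPU-utilization series onto per-pod custom metrics.
rules:
  custom:
  - seriesQuery: 'DCGM_FI_DEV_GPU_UTIL{exported_namespace!="",exported_pod!=""}'
    resources:
      overrides:
        exported_namespace: {resource: "namespace"}
        exported_pod: {resource: "pod"}
    name:
      matches: "DCGM_FI_DEV_GPU_UTIL"
    metricsQuery: 'avg(DCGM_FI_DEV_GPU_UTIL{<<.LabelMatchers>>}) by (<<.GroupBy>>)'
```

You can verify the metric is visible to the HPA with kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1" before expecting any scaling to happen.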
Next, you can quickly create an NGINX pod, exec into it, and run a curl command as below:
curl vllm-service.vllm-ns/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{ "model": "meta-llama/Meta-Llama-3-70B-Instruct",
"messages": [
{"role": "user", "content": " What is llama3 and how does it help?"}
]
}'

You should receive a response. If not, check whether the pods are in an error state and inspect their logs (kubectl logs -n vllm-ns <pod-name>).
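If you'd rather call the endpoint from Python than curl, here is a minimal sketch. The URL uses the in-cluster service DNS name, so it only resolves from inside the cluster; swap in a port-forwarded localhost address otherwise.

```python
import json
from urllib import request

def build_chat_request(model: str, prompt: str) -> dict:
    # Request body for vLLM's OpenAI-compatible /v1/chat/completions endpoint.
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }

payload = build_chat_request("meta-llama/Meta-Llama-3-70B-Instruct",
                             "What is llama3 and how does it help?")

# In-cluster service DNS; reachable only from inside the cluster.
url = "http://vllm-service.vllm-ns/v1/chat/completions"
req = request.Request(url, data=json.dumps(payload).encode(),
                      headers={"Content-Type": "application/json"})
# resp = request.urlopen(req)  # uncomment when the service is reachable
# print(json.load(resp)["choices"][0]["message"]["content"])
```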
To trigger the HPA, you will have to increase the load. For that, you can run the same curl command multiple times, for example with the for loop below. You should see the number of pods start increasing once GPU utilization crosses the defined threshold (watch with kubectl get hpa -n vllm-ns -w).
for i in {1..20};do curl vllm-service.vllm-ns/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{ "model": "meta-llama/Meta-Llama-3-70B-Instruct",
"messages": [
{"role": "user", "content": " What is llama3 and how does it help?"}
]
}'; done
