Serving Llama3 over a multi-node GPU Kubernetes cluster with auto-scaling
vLLM Serving
With the rising demand for large-scale AI models like Llama3, serving these models efficiently over distributed GPU resources is critical. In this post, we’ll explore how to deploy and manage Llama3 on a multi-node GPU Kubernetes cluster, leveraging auto-scaling to handle dynamic workloads.
What should you have before starting?
Working knowledge of Kubernetes, LLMs, GPUs, Prometheus, and Hugging Face before reading further.
A Kubernetes cluster whose worker nodes have NVIDIA GPUs. You can quickly create one using AKS or EKS, which installs the required NVIDIA drivers and device plugins for you. Also set up Prometheus to monitor the GPUs. My setup consisted of AKS with a node pool of 2 nodes of size Standard_NC96ads_A100_v4, which have NVIDIA A100 GPUs.
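If you haven't wired up GPU monitoring yet, one common route (an assumption on my part, not the only way) is NVIDIA's DCGM exporter Helm chart, which exposes metrics such as DCGM_FI_DEV_GPU_UTIL that the autoscaler will rely on later:

```shell
# Install NVIDIA's DCGM exporter so Prometheus can scrape GPU metrics.
# Repo URL is from the dcgm-exporter project; namespace is an assumption.
helm repo add gpu-helm-charts https://nvidia.github.io/dcgm-exporter/helm-charts
helm repo update
helm install dcgm-exporter gpu-helm-charts/dcgm-exporter \
  --namespace monitoring --create-namespace
```

Make sure your Prometheus instance is configured to scrape the exporter's pods.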
Create your account on https://huggingface.co/ and create an access token. Next, search for Llama and apply for access to the model. I selected meta-llama/Meta-Llama-3-70B-Instruct.
I used vLLM to serve the model. What is vLLM?
vLLM is a specialized serving system designed to efficiently deploy and run large language models (LLMs). It focuses on optimizing performance, memory usage, and scalability when serving models like GPT-3, LLaMA, and others in production environments. You can read more in the vLLM documentation at https://docs.vllm.ai.
Let's start by creating the YAMLs.
Secret: Create a secret so that the vLLM pods can access the model. Use the access token you created on Hugging Face. Note that Kubernetes expects values under data: to be base64-encoded; using stringData: instead lets you paste the token as-is.
apiVersion: v1
kind: Secret
metadata:
  name: vllm-huggingface-token
  namespace: vllm-ns
stringData:
  HUGGINGFACE_TOKEN: <Enter your access token>

Deployment: The vLLM deployment below takes arguments such as:
--model: the same model name for which you were granted access on Hugging Face.
--gpu-memory-utilization: the fraction of GPU memory vLLM is allowed to use, between 0 and 1.
--tensor-parallel-size: the number of tensor-parallel replicas. Read more about tensor parallelism in the vLLM docs.
There are more options available; see the vLLM engine arguments documentation to learn more.
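A quick back-of-the-envelope calculation (a sketch, assuming fp16 weights and 80 GB A100s) shows why tensor-parallel-size is set to 2 for the 70B model: the weights alone don't fit on a single GPU.

```python
# Rough memory estimate for serving Llama-3-70B in fp16 (2 bytes/parameter).
params_billion = 70
bytes_per_param = 2  # fp16/bf16
weights_gb = params_billion * bytes_per_param  # ~140 GB of weights alone

gpu_mem_gb = 80          # one A100 80GB
tensor_parallel = 2      # --tensor-parallel-size
total_gpu_gb = gpu_mem_gb * tensor_parallel  # 160 GB pooled across the shards

headroom_gb = total_gpu_gb - weights_gb  # left over for KV cache, activations
print(f"weights: ~{weights_gb} GB, pooled GPU memory: {total_gpu_gb} GB, "
      f"headroom: ~{headroom_gb} GB")
```

With only ~20 GB of headroom for the KV cache, this is a tight fit, which is part of why the deployment below pushes --gpu-memory-utilization up to 1.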
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm
  namespace: vllm-ns
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm
  template:
    metadata:
      labels:
        app: vllm
    spec:
      containers:
      - name: vllm
        image: vllm/vllm-openai:latest
        args:
        - --model
        - meta-llama/Meta-Llama-3-70B-Instruct
        - --gpu-memory-utilization
        - "1"
        - --tensor-parallel-size
        - "2"
        - --enforce-eager
        env:
        - name: HUGGING_FACE_HUB_TOKEN
          valueFrom:
            secretKeyRef:
              name: vllm-huggingface-token
              key: HUGGINGFACE_TOKEN
        ports:
        - containerPort: 8000
          protocol: TCP
        resources:
          limits:
            nvidia.com/gpu: "2"
        volumeMounts:
        - mountPath: /root/.cache/huggingface
          name: vllm
      nodeSelector:
        accelerator: nvidia
      tolerations:
      - effect: NoSchedule
        key: nvidia.com/gpu
        operator: Exists
      hostIPC: true
      volumes:
      - name: vllm
        persistentVolumeClaim:
          claimName: vllm

PVC: Create a storage volume to cache the downloaded model weights.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: vllm
  namespace: vllm-ns
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 200Gi
  storageClassName: default

Service: Create a service to access the inference endpoint. It will also load-balance requests across multiple vLLM pod replicas.
apiVersion: v1
kind: Service
metadata:
  name: vllm-service
  namespace: vllm-ns
spec:
  type: ClusterIP
  sessionAffinity: None
  selector:
    app: vllm
  ports:
  - port: 80
    protocol: TCP
    targetPort: 8000

HPA (horizontal pod autoscaler): This will monitor the metric and, once usage goes beyond the defined value, start creating more pods (up to maxReplicas).
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: vllm
  namespace: vllm-ns
spec:
  minReplicas: 1
  maxReplicas: 6 # Update this according to your desired number of replicas
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: vllm
  metrics:
  - type: Pods
    pods:
      metric:
        name: DCGM_FI_DEV_GPU_UTIL
      target:
        type: AverageValue
        averageValue: 30

Now the final step is to create a namespace called vllm-ns and apply all the YAMLs above to the cluster. Check whether all resources are running. Initially there will be only one pod running, as defined in the deployment.
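One caveat: the HPA consumes DCGM_FI_DEV_GPU_UTIL as a Pods metric, which means something must expose that Prometheus series through the custom.metrics.k8s.io API; the Prometheus Adapter is the usual choice. Below is a sketch of an adapter rule; the label names are assumptions and must match how your DCGM exporter's series are actually labeled in Prometheus.

```yaml
# prometheus-adapter values.yaml fragment (a sketch, not a drop-in config):
# maps the DCGM GPU-utilization series onto per-pod custom metrics.
rules:
  custom:
  - seriesQuery: 'DCGM_FI_DEV_GPU_UTIL{exported_namespace!="",exported_pod!=""}'
    resources:
      overrides:
        exported_namespace: {resource: "namespace"}
        exported_pod: {resource: "pod"}
    name:
      matches: "DCGM_FI_DEV_GPU_UTIL"
    metricsQuery: 'avg(DCGM_FI_DEV_GPU_UTIL{<<.LabelMatchers>>}) by (<<.GroupBy>>)'
```

You can verify the metric is visible to the HPA with kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1" before expecting any scaling to happen.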
Next, you can quickly create an NGINX pod, exec into it, and run a curl command as below:
curl vllm-service.vllm-ns/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{ "model": "meta-llama/Meta-Llama-3-70B-Instruct",
"messages": [
{"role": "user", "content": " What is llama3 and how does it help?"}
]
}'

You should receive a response. If not, check whether the pods are in an error state and inspect their logs (kubectl logs -n vllm-ns <pod-name>).
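If you'd rather call the endpoint from Python than curl, here is a minimal sketch. The URL uses the in-cluster service DNS name, so it only resolves from inside the cluster; swap in a port-forwarded localhost address otherwise.

```python
import json
from urllib import request

def build_chat_request(model: str, prompt: str) -> dict:
    # Request body for vLLM's OpenAI-compatible /v1/chat/completions endpoint.
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }

payload = build_chat_request("meta-llama/Meta-Llama-3-70B-Instruct",
                             "What is llama3 and how does it help?")

# In-cluster service DNS; reachable only from inside the cluster.
url = "http://vllm-service.vllm-ns/v1/chat/completions"
req = request.Request(url, data=json.dumps(payload).encode(),
                      headers={"Content-Type": "application/json"})
# resp = request.urlopen(req)  # uncomment when the service is reachable
# print(json.load(resp)["choices"][0]["message"]["content"])
```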
To trigger the HPA, you will have to increase the load. For that, you can run the same curl command multiple times, for example with the for loop below. You should see the number of pods start increasing once GPU utilization crosses the defined threshold (watch with kubectl get hpa -n vllm-ns -w).
for i in {1..20};do curl vllm-service.vllm-ns/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{ "model": "meta-llama/Meta-Llama-3-70B-Instruct",
"messages": [
{"role": "user", "content": " What is llama3 and how does it help?"}
]
}'; done
