This blog post guides you through the creation of a Kubernetes cluster with NVIDIA GPU resources. We will use the kubeadm deployment tool to set up the Kubernetes cluster. For the discovery and configuration of nodes with GPU cards, we will integrate the GPU Operator. Finally, we will deploy the Ollama application on the cluster and verify that it correctly uses the GPU resources.
This experiment was validated by Frédéric Alleron, an intern and student in the Networks and Telecommunications department at the IUT de Châtellerault, who completed his internship at the LIAS laboratory from June to July 2024.
Prerequisites
The hardware prerequisites to reproduce this experiment are:
two Linux machines (my setup: Ubuntu 22.04 LTS), one of which has at least one NVIDIA GPU (my setup: NVIDIA Quadro P400 with 2 GB of memory). Disk size and memory are not critical,
a client machine (my setup: macOS Sonoma) for accessing the Kubernetes cluster.
Since GPU Operator, the component that provides NVIDIA GPU support on Kubernetes, does not support Ubuntu 24.04 LTS (see platform-support.html), we will limit ourselves to Ubuntu 22.04 LTS.
The three machines are identified in the network as follows:
client machine: 192.140.161.102,
master/worker node (the one with the GPU): 192.140.161.103 named k8s-gpu-node1,
worker node: 192.140.161.104 named k8s-gpu-node2.
Versions of the components/tools used:
containerd: 1.7.19,
runC: 1.1.13,
kubeadm, kubelet, kubectl: 1.30.3,
helm: 3.15.3,
Cilium: 1.15.6,
Kubernetes: 1.30.3.
To learn more about Kubernetes, you can consult my training courses.
This section details the configuration of both nodes prior to setting up the Kubernetes cluster (installation of components and operating system configuration). All operations must be performed identically on all nodes.
Update the repositories and install the latest versions of packages already present on the operating system.
$ sudo apt-get update
$ sudo apt-get upgrade -y
A Kubernetes cluster requires a container runtime manager on each cluster node. The container runtime manager is responsible for managing the entire lifecycle of containers, including image management, container startup and shutdown. We will use Containerd, which appears to be the most widely used. An incomplete list of various container runtimes is available at https://kubernetes.io/docs/setup/production-environment/container-runtimes.
Download the current version of containerd (1.7.19 at the time of writing) from GitHub and extract the archive into the /usr/local/ directory.
$ wget https://github.com/containerd/containerd/releases/download/v1.7.19/containerd-1.7.19-linux-amd64.tar.gz
$ sudo tar -C /usr/local -xzvf containerd-1.7.19-linux-amd64.tar.gz
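The archive only contains the binaries. For reference, containerd's getting-started guide also describes registering containerd as a systemd service and generating its default configuration; a minimal sketch of these additional steps (file locations follow the upstream documentation) is shown below.
$ sudo mkdir -p /usr/local/lib/systemd/system
$ sudo wget -P /usr/local/lib/systemd/system https://raw.githubusercontent.com/containerd/containerd/main/containerd.service
$ sudo systemctl daemon-reload
$ sudo systemctl enable --now containerd
$ sudo mkdir -p /etc/containerd
$ containerd config default | sudo tee /etc/containerd/config.toml
In /etc/containerd/config.toml, it is also recommended to set SystemdCgroup = true in the runc options, since kubelet uses the systemd cgroup driver by default with kubeadm.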
Containerd is associated with a container runtime that interacts directly with the Linux kernel to configure and run containers. We will use runC, which also appears to be widely used.
Download the current version of runC (1.1.13 at the time of writing) from GitHub and install it in the /usr/local/sbin directory.
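As a sketch, using the version listed above (the release asset name follows the runC GitHub releases page):
$ wget https://github.com/opencontainers/runc/releases/download/v1.1.13/runc.amd64
$ sudo install -m 755 runc.amd64 /usr/local/sbin/runc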
Four tools will be installed: kubelet, kubeadm, kubectl and helm. The first tool kubelet is responsible for the runtime state on each node, ensuring all containers run within a Pod. The second tool kubeadm handles cluster creation. The third kubectl is a command-line utility for administering the Kubernetes cluster. Finally, helm is a tool used to define, install, and upgrade applications using charts for Kubernetes.
Note that kubectl and helm are client tools and are not necessarily required on cluster nodes. However, they are required on the client machine.
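As a reference, kubelet, kubeadm and kubectl can be installed from the official pkgs.k8s.io repository for Kubernetes v1.30; the following sketch follows the official installation documentation.
$ sudo apt-get install -y apt-transport-https ca-certificates curl gpg
$ curl -fsSL https://pkgs.k8s.io/core:/stable:/v1.30/deb/Release.key | sudo gpg --dearmor -o /etc/apt/keyrings/kubernetes-apt-keyring.gpg
$ echo 'deb [signed-by=/etc/apt/keyrings/kubernetes-apt-keyring.gpg] https://pkgs.k8s.io/core:/stable:/v1.30/deb/ /' | sudo tee /etc/apt/sources.list.d/kubernetes.list
$ sudo apt-get update
$ sudo apt-get install -y kubelet kubeadm kubectl
$ sudo apt-mark hold kubelet kubeadm kubectl
On the client machine, helm can be installed with its official installation script:
$ curl -fsSL https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 | bash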
This section shows how to create a Kubernetes cluster using the kubeadm tool. kubeadm is a command-line tool that bootstraps a Kubernetes cluster by installing its various components. Only kubelet needs to be installed beforehand, as described in the previous section.
From the master node (k8s-gpu-node1), initialize the Kubernetes cluster.
$ sudo kubeadm init --node-name node-master --cri-socket /run/containerd/containerd.sock
...
Your Kubernetes control-plane has initialized successfully!
To start using your cluster, you need to run the following as a regular user:
mkdir -p $HOME/.kube
sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config
Alternatively, if you are the root user, you can run:
export KUBECONFIG=/etc/kubernetes/admin.conf
You should now deploy a pod network to the cluster.
Run "kubectl apply -f [podnetwork].yaml" with one of the options listed at:
https://kubernetes.io/docs/concepts/cluster-administration/addons/
Then you can join any number of worker nodes by running the following on each as root:
kubeadm join 192.140.161.103:6443 --token YOUR_TOKEN \
--discovery-token-ca-cert-hash sha256:YOUR_CA_CERT_HASH
At the end of the installation, you will be prompted to perform some operations to access the Kubernetes cluster. The first one involves storing the Kubernetes cluster access information in $HOME/.kube/config. This file can be used by kubectl to interact with the cluster. The second operation is to add a node to the Kubernetes cluster.
From the client machine, onto which the cluster access configuration described above must have been copied as $HOME/.kube/config, test the communication with the Kubernetes cluster.
$ kubectl get nodes
NAME STATUS ROLES AGE VERSION
node-master NotReady control-plane 25h v1.30.3
The master node is currently the only node in the Kubernetes cluster. In addition, kubeadm taints the control-plane node by default so that regular Pods are not scheduled on it, as a safety measure. Since our cluster has only two nodes, this restriction will be removed.
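Assuming the default control-plane taint applied by kubeadm, a minimal way to remove this restriction from the master node is:
$ kubectl taint nodes node-master node-role.kubernetes.io/control-plane:NoSchedule-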
Let’s add a second node to the Kubernetes cluster.
Connect to the worker node k8s-gpu-node2 and execute the following command-line instructions.
$ sudo kubeadm join 192.140.161.103:6443 --token gf8ui6.ulo4gcme68k7j1zv \
--discovery-token-ca-cert-hash sha256:2563ef8edc1fb9e4bdfdde6c0e723b9812647405be819eff95596eeae0ac254e
[preflight] Running pre-flight checks
[preflight] Reading configuration from the cluster...
[preflight] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -o yaml'
[kubelet-start] Writing kubelet configuration to file "/var/lib/kubelet/config.yaml"
[kubelet-start] Writing kubelet environment file with flags to file "/var/lib/kubelet/kubeadm-flags.env"
[kubelet-start] Starting the kubelet
[kubelet-check] Waiting for a healthy kubelet. This can take up to 4m0s
[kubelet-check] The kubelet is healthy after 507.199197ms
[kubelet-start] Waiting for the kubelet to perform the TLS Bootstrap
This node has joined the cluster:
* Certificate signing request was sent to apiserver and a response was received.
* The Kubelet was informed of the new secure connection details.
Run 'kubectl get nodes' on the control-plane to see this node join the cluster.
The values for the --token and --discovery-token-ca-cert-hash parameters are provided during the creation of the Kubernetes cluster. However, the token value is valid for only 24 hours and you may not have had time to save this information from the console. Don’t worry, both pieces of information can be retrieved from the master node.
To retrieve the token value (--token), if the 24-hour validity period has not yet expired:
$ kubeadm token list
TOKEN TTL EXPIRES USAGES DESCRIPTION EXTRA GROUPS
gf8ui6.ulo4gcme68k7j1zv 23h 2024-07-17T16:58:57Z authentication,signing <none> system:bootstrappers:kubeadm:default-node-token
To generate a new token value:
$ kubeadm token create
zuku5f.gjtnq2bcupmg0902
$ kubeadm token list
TOKEN TTL EXPIRES USAGES DESCRIPTION EXTRA GROUPS
gf8ui6.ulo4gcme68k7j1zv 23h 2024-07-17T16:58:57Z authentication,signing <none> system:bootstrappers:kubeadm:default-node-token
zuku5f.gjtnq2bcupmg0902 23h 2024-07-17T17:24:39Z authentication,signing <none> system:bootstrappers:kubeadm:default-node-token
To retrieve the value of the certificate authority hash (--discovery-token-ca-cert-hash).
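As documented for kubeadm, the hash can be recomputed on the master node from the cluster CA certificate:
$ openssl x509 -pubkey -in /etc/kubernetes/pki/ca.crt | openssl rsa -pubin -outform der 2>/dev/null | openssl dgst -sha256 -hex | sed 's/^.* //'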
In the previous outputs, both nodes are in NotReady status and the Pods coredns-7db6d8ff4d-ht8wc and coredns-7db6d8ff4d-rlzh5 are not deployed. To resolve this issue, a network plugin compliant with the CNI project must be installed. This will enable Pods to communicate within the Kubernetes cluster.
There are numerous network plugins available, and the choice has been made to use Cilium. Cilium offers the significant advantage of leveraging eBPF (extended Berkeley Packet Filter) technology, which has recently been integrated into Linux kernels. With eBPF, there is no need to load modules into the Linux kernel as was necessary with IPTables.
From the master node, download the latest version of Cilium.
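As a sketch, the cilium command-line tool can be retrieved as described in the Cilium documentation:
$ CILIUM_CLI_VERSION=$(curl -s https://raw.githubusercontent.com/cilium/cilium-cli/main/stable.txt)
$ curl -L --fail --remote-name-all https://github.com/cilium/cilium-cli/releases/download/${CILIUM_CLI_VERSION}/cilium-linux-amd64.tar.gz
$ sudo tar xzvf cilium-linux-amd64.tar.gz -C /usr/local/bin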
To configure the installation of Cilium, you can use a configuration file. For instance, in the values.yaml file shown below, the CIDR (Classless Inter-Domain Routing) used to assign IPs to the Pods is modified.
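The original file is not reproduced here; a minimal sketch of such a values.yaml (the Pod CIDR value below is only an example and must be adapted to your network) could look like this:
ipam:
  operator:
    clusterPoolIPv4PodCIDRList:
      - "10.0.0.0/8"
The installation can then be launched from the master node, for instance with the cilium CLI:
$ cilium install --version 1.15.6 --values values.yaml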
From the client machine, verify that both nodes are available and operational.
$ kubectl get nodes
NAME STATUS ROLES AGE VERSION
k8s-gpu-node2 Ready <none> 20h v1.30.3
node-master Ready control-plane 45h v1.30.3
The status of both nodes must be Ready.
Add GPU support to the Kubernetes cluster
At this stage, a Kubernetes cluster with two nodes is configured. One of the nodes has a GPU card, but the Kubernetes cluster does not yet know it. The goal of this section is to declare the GPU in the cluster and identify it as a resource, similar to a CPU or memory resource. However, configuring a GPU in a Kubernetes cluster is not trivial: it requires installing the GPU driver, making it known to the container runtime containerd, detecting and labeling the nodes with GPUs, and installing specific libraries (such as CUDA). NVIDIA provides an operator called GPU Operator that simplifies all these tasks. This section details the installation of this operator.
Create a namespace called gpu-operator. The GPU Operator will be deployed in this namespace.
$ kubectl create ns gpu-operator
namespace/gpu-operator created
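The operator itself is distributed as a Helm chart in the NVIDIA repository. A typical installation (the release name gpu-operator below is only an example) looks like this:
$ helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
$ helm repo update
$ helm install gpu-operator nvidia/gpu-operator --namespace gpu-operator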
The operator will perform several tasks to discover that an NVIDIA GPU is available on the master node. New labels have been added to the description of the master node.
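For example, the GPU-related labels can be filtered from the node description:
$ kubectl describe node node-master | grep nvidia.com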
The node description shows that a GPU is available (nvidia.com/gpu.count=1) and that the detected card is a Quadro-P400 with 2048 MB of memory.
A Pod called cuda-vectoradd based on the image nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda12.5.0-ubuntu22.04 is deployed to verify that the GPU can be used by a program for GPU computations. Once the computations are completed, the Pod stops.
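The manifest applied is essentially the verification example from the GPU Operator documentation; a sketch, requesting one GPU through the nvidia.com/gpu resource:
apiVersion: v1
kind: Pod
metadata:
  name: cuda-vectoradd
spec:
  restartPolicy: OnFailure
  containers:
  - name: cuda-vectoradd
    image: "nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda12.5.0-ubuntu22.04"
    resources:
      limits:
        nvidia.com/gpu: 1
Once the manifest has been applied with kubectl apply -f and the Pod has completed, its logs can be consulted.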
$ kubectl logs pod/cuda-vectoradd
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done
The GPU usage by the cuda-vectoradd Pod works perfectly. Let’s now focus on an application that continuously uses the GPU and will demonstrate the utilization of a constrained GPU resource.
Deploying an Application Requiring a GPU
The experimental application used will be Ollama, an application that exposes generative AI models, such as LLMs, via a REST API. It can download LLM models and run them either using only the CPU or by combining the CPU with a GPU to reduce execution time. The outcome of this experiment should demonstrate that once Ollama claims the GPU resource, that resource is allocated to it and unavailable to other applications until Ollama releases it. The Ollama application is available as a Helm chart.
All the following operations will be performed from the client machine.
Before deploying the Ollama application, check that the GPU resource is available by querying the description of the master node.
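One way to do this (the grep filter is just a convenience) is to look for the nvidia.com/gpu resource in the node description; it should appear under Capacity and Allocatable with a value of 1:
$ kubectl describe node node-master | grep nvidia.com/gpu
Next, add the Helm repository hosting the Ollama chart.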
$ helm repo add ollama-helm https://otwld.github.io/ollama-helm/
"ollama-helm" has been added to your repositories
Create an ollama namespace to group all resources related to the Ollama application.
$ kubectl create ns ollama
namespace/ollama created
Deploy the Ollama application into the Kubernetes cluster.
$ helm install appli-ollama ollama-helm/ollama --namespace ollama --set ollama.gpu.enabled=true --set ollama.gpu.number=1 --set ollama.gpu.type=nvidia
NAME: appli-ollama
LAST DEPLOYED: Thu Jul 18 09:43:56 2024
NAMESPACE: ollama
STATUS: deployed
REVISION: 1
NOTES:
1. Get the application URL by running these commands:
export POD_NAME=$(kubectl get pods --namespace ollama -l "app.kubernetes.io/name=ollama,app.kubernetes.io/instance=appli-ollama" -o jsonpath="{.items[0].metadata.name}")
export CONTAINER_PORT=$(kubectl get pod --namespace ollama $POD_NAME -o jsonpath="{.spec.containers[0].ports[0].containerPort}")
echo "Visit http://127.0.0.1:8080 to use your application"
kubectl --namespace ollama port-forward $POD_NAME 8080:$CONTAINER_PORT
The command specifies that GPU support must be enabled (--set ollama.gpu.enabled=true), that one GPU is required (--set ollama.gpu.number=1) and that the GPU type is NVIDIA (--set ollama.gpu.type=nvidia).
Check that the Ollama application has been deployed (packaged into a Pod) on the master node that has the GPU.
$ kubectl get pods -n ollama -o wide
NAME READY STATUS RESTARTS AGE IP NODE
appli-ollama-8665457c88-gz8ch 1/1 Running 0 2m59s 10.0.0.242 node-master
The Pod (related to the Ollama application) is correctly located on the master node.
The deployment output for the Ollama application also explains how to access the deployed application via a port-forward. However, this is not the access method we will use; instead, we will expose the application through a classic NodePort service.
Apply the following service configuration to expose Ollama at the addresses 192.140.161.103:30001 and 192.140.161.104:30001.
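A sketch of such a NodePort service is shown below; the selector labels follow those used in the chart notes above, and 11434 is Ollama's default HTTP port (the service name is arbitrary, adjust the values to your chart configuration):
apiVersion: v1
kind: Service
metadata:
  name: ollama-nodeport
  namespace: ollama
spec:
  type: NodePort
  selector:
    app.kubernetes.io/name: ollama
    app.kubernetes.io/instance: appli-ollama
  ports:
  - port: 11434
    targetPort: 11434
    nodePort: 30001
Note that the model used below (gemma:2b) must already be available in Ollama; if needed, it can be pulled beforehand, for example through Ollama's /api/pull endpoint.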
Execute the HTTP request to generate a response to the given question.
$ curl http://192.140.161.103:30001/api/generate -d '{
"model": "gemma:2b",
"prompt": "Why is the sky blue?",
"stream": false
}'
{"model":"gemma:2b","created_at":"2024-07-18T08:00:02.461703025Z","response":"The sky appears blue due to Rayleigh scattering. Rayleigh scattering is the scattering of light by small particles, such as molecules in the atmosphere. Blue light has a shorter wavelength than other colors of light, so it is scattered more strongly. This is why the sky appears blue.","done":true,"done_reason":"stop","context":[968,2997,235298,559,235298,15508,235313,1645,108,4385,603,573,8203,3868,181537,615,235298,559,235298,15508,235313,108,235322,2997,235298,559,235298,15508,235313,2516,108,651,8203,8149,3868,3402,577,153902,38497,235265,153902,38497,603,573,38497,576,2611,731,2301,16071,235269,1582,685,24582,575,573,13795,235265,7640,2611,919,476,25270,35571,1178,1156,9276,576,2611,235269,712,665,603,30390,978,16066,235265,1417,603,3165,573,8203,8149,3868,235265],"total_duration":4802224355,"load_duration":33326246,"prompt_eval_count":32,"prompt_eval_duration":324835000,"eval_count":55,"eval_duration":4400550000}
Everything is working correctly: Ollama answers the question and generates a response quickly.
Check the usage of the GPU to see if it has been preempted following the deployment of the Ollama application.
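One way to check this is to look again at the Allocated resources section of the master node's description (adjust the -A value to your output); the nvidia.com/gpu line should now show 1 request and 1 limit:
$ kubectl describe node node-master | grep -A 8 "Allocated resources"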
The GPU resources of the Kubernetes cluster are no longer available, as they have all been allocated. Thus, if a Pod requiring GPU resources needs to be deployed, the Kubernetes cluster will keep it on hold until the GPU resources are freed. To validate this scenario, we will deploy a new instance of Ollama.
Create an ollama2 namespace to group all resources related to this second Ollama instance.
$ kubectl create ns ollama2
namespace/ollama2 created
Deploy the Ollama application in the Kubernetes cluster within the ollama2 namespace.
$ helm install appli-ollama ollama-helm/ollama --namespace ollama2 --set ollama.gpu.enabled=true --set ollama.gpu.number=1 --set ollama.gpu.type=nvidia
NAME: appli-ollama
LAST DEPLOYED: Thu Jul 18 09:43:56 2024
NAMESPACE: ollama2
STATUS: deployed
REVISION: 1
NOTES:
1. Get the application URL by running these commands:
export POD_NAME=$(kubectl get pods --namespace ollama2 -l "app.kubernetes.io/name=ollama,app.kubernetes.io/instance=appli-ollama" -o jsonpath="{.items[0].metadata.name}")
export CONTAINER_PORT=$(kubectl get pod --namespace ollama2 $POD_NAME -o jsonpath="{.spec.containers[0].ports[0].containerPort}")
echo "Visit http://127.0.0.1:8080 to use your application"
kubectl --namespace ollama2 port-forward $POD_NAME 8080:$CONTAINER_PORT
Display the status of the Pods in the ollama2 namespace.
$ kubectl get pods -n ollama2
NAME READY STATUS RESTARTS AGE
appli-ollama-8665457c88-ngtpf 0/1 Pending 0 116s
As expected, the Pod is in the Pending state.
Check the Pod’s description to determine the reason for its Pending state.
$ kubectl describe pod appli-ollama -n ollama2
...
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 3m24s default-scheduler 0/2 nodes are available: 2 Insufficient nvidia.com/gpu. preemption: 0/2 nodes are available: 2 No preemption victims found for incoming pod.
As indicated in the message, no node in the cluster can accommodate this new Pod.
Conclusion
This experiment showed the setup of a Kubernetes cluster and the discovery of GPU nodes via the GPU Operator.
There are still many aspects to explore, particularly updating the components installed by the GPU Operator (drivers, libraries, etc.). It may also be worthwhile to examine how to manage NVIDIA cards with MIG technology, which aims to partition a GPU into multiple sub-GPUs.
Stay tuned, and give your feedback in the comments.