NVIDIA GPU Workloads
Mirantis Kubernetes Engine (MKE) supports running workloads on GPU nodes. Currently, GPU support is limited to NVIDIA GPUs.
To manage your GPU resources and enable GPU support, MKE installs the NVIDIA GPU Operator on your cluster. The GPU Operator installs and configures the required NVIDIA software components on each GPU node.
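If you want to confirm that the GPU Operator components have been deployed, you can list the NVIDIA-related Pods across all namespaces. This is a general Kubernetes check rather than an MKE-specific command, and the exact Pod names and namespace depend on your deployment:

kubectl get pods -A | grep -i nvidia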
Though it is not required, you can run the following command at any point to verify your GPU specifications:
sudo lspci | grep -i nvidia
Example output:
00:1e.0 3D controller: NVIDIA Corporation TU104GL [Tesla T4] (rev a1)
Configuration
NVIDIA GPU support is disabled in MKE 4 by default.
To enable NVIDIA GPU support:
Obtain the mke4.yaml configuration file:
mkectl init > mke4.yaml
Navigate to the devicePlugins.nvidiaGPU section of the mke4.yaml configuration file, and set the enabled parameter to true:

devicePlugins:
  nvidiaGPU:
    enabled: true
Apply the new configuration setting:
mkectl apply -f mke4.yaml
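Once the new configuration has been applied and the GPU Operator components are running, each GPU node should advertise an nvidia.com/gpu resource in its allocatable capacity. As a quick check, you can inspect one of your GPU nodes; <node-name> below is a placeholder for an actual node name:

kubectl describe node <node-name> | grep -i nvidia.com/gpu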
Verification
Once your NVIDIA GPU support configuration is complete, you can verify your setup using the tests detailed below:
Detect NVIDIA GPU Devices
Run a simple GPU workload that reports detected NVIDIA GPU devices:
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  restartPolicy: Never
  containers:
    - name: cuda-container
      image: nvcr.io/nvidia/cloud-native/gpu-operator-validator:v22.9.0
      resources:
        limits:
          nvidia.com/gpu: 1 # requesting 1 GPU
  tolerations:
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule
EOF
Verify the successful completion of the Pod:
kubectl get pods | grep "gpu-pod"
Example output:
NAME      READY   STATUS      RESTARTS   AGE
gpu-pod   0/1     Completed   0          7m56s
Run a GPU Workload
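This test checks the output of a CUDA vector-add sample Pod named cuda-vectoradd, which must exist before you read its logs. The manifest below is a minimal sketch for creating such a Pod; the nvcr.io/nvidia/k8s/cuda-sample image and tag are assumptions based on NVIDIA's publicly available CUDA sample images, so substitute the image available in your environment:

cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: cuda-vectoradd
spec:
  restartPolicy: OnFailure
  containers:
    - name: cuda-vectoradd
      # Assumed public NVIDIA CUDA vector-add sample image; adjust as needed
      image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1-ubuntu20.04
      resources:
        limits:
          nvidia.com/gpu: 1 # requesting 1 GPU
  tolerations:
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule
EOF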
Run the following command once the cuda-vectoradd Pod has reached Completed status:
kubectl logs pod/cuda-vectoradd
Example output:
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done
Count GPUs
Run the following command once you have enabled the NVIDIA GPU Device Plugin and the Pods have stabilized:
kubectl get nodes "-o=custom-columns=NAME:.metadata.name,GPUs:.metadata.labels.nvidia\.com/gpu\.count"
Example results, showing a cluster with 3 control-plane nodes and 3 worker nodes:
NAME                                           GPUs
ip-172-31-174-195.us-east-2.compute.internal   1
ip-172-31-228-160.us-east-2.compute.internal   <none>
ip-172-31-231-180.us-east-2.compute.internal   1
ip-172-31-26-15.us-east-2.compute.internal     <none>
ip-172-31-3-198.us-east-2.compute.internal     1
ip-172-31-99-105.us-east-2.compute.internal    <none>
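As an alternative to the node labels, you can also count GPUs through the resources that the device plugin advertises on each node. This is a generic Kubernetes check rather than an MKE-specific one; nodes without GPUs again report <none>:

kubectl get nodes "-o=custom-columns=NAME:.metadata.name,GPUs:.status.allocatable.nvidia\.com/gpu"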