Training setup and cluster configuration
In this lesson, you'll learn how to set up an OpenShift environment for running distributed AI workloads with PyTorch. This involves deploying several cluster operators, configuring the network for RDMA over Converged Ethernet (RoCE), and setting up GPU resources.
Prerequisites:
- A running OpenShift cluster.
- Basic familiarity with OpenShift operator deployment.
- Access to GPUs for training.
In this lesson, you will:
- Understand the workflow for setting up AI workloads on OpenShift.
- Deploy cluster operators necessary for running distributed training.
- Set up your network for RoCE.
Training flow and hardware diagram
The workflow diagram (Figure 1) highlights all the components we'll use and how they work together to execute the training task. The hardware (HW) diagram (Figure 2) breaks down the topology of the infrastructure, illustrating the final form of the cluster.
Cluster operators deployment
In OpenShift, operators can be deployed either through the OperatorHub UI or via YAML files using the CLI. For this setup, it's recommended to follow this order of deployment:
- Apply the tuned profile (if necessary).
- Create the Node Feature Discovery Operator and its NodeFeatureDiscovery and NetworkAttachmentDefinition resources.
- Deploy the OpenShift single root input/output virtualization (SR-IOV) Operator, including its SriovNetworkNodePolicy.
- Deploy the NVIDIA Network Operator and its corresponding NicClusterPolicy.
- Deploy the NVIDIA GPU Operator and its ClusterPolicy.
- Deploy Red Hat OpenShift Service Mesh.
- Deploy Red Hat OpenShift AI.
This deployment order is recommended but not mandatory. Steps 1 and 3 (the tuned profile and the SR-IOV Operator) involve machine configuration changes that require a node reboot. Completing these steps early ensures that node draining and reboots are handled at the start of the cluster's lifecycle, minimizing the need for large-scale workload migrations and speeding up the process.
OpenShift has simplified the operator deployment process. Follow the steps below to deploy the NVIDIA GPU Operator through the OperatorHub UI (a CLI-based sketch follows the list):
- Access your OpenShift dashboard.
- In the left-hand menu, navigate to the Operators tab.
- Select OperatorHub.
- On the OperatorHub page, search for the NVIDIA GPU Operator.
- Select the NVIDIA GPU Operator and click Install.
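If you prefer the CLI route, the same operator can be installed by applying Operator Lifecycle Manager manifests. Below is a minimal sketch for the NVIDIA GPU Operator; the channel value is an assumption, so list the available channels first with oc get packagemanifest gpu-operator-certified -n openshift-marketplace and substitute one of them before applying:
apiVersion: v1
kind: Namespace
metadata:
  name: nvidia-gpu-operator
---
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  name: nvidia-gpu-operator-group
  namespace: nvidia-gpu-operator
spec:
  targetNamespaces:
  - nvidia-gpu-operator
---
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: gpu-operator-certified
  namespace: nvidia-gpu-operator
spec:
  channel: v24.3          # assumption: use a channel reported by the packagemanifest
  installPlanApproval: Automatic
  name: gpu-operator-certified
  source: certified-operators
  sourceNamespace: openshift-marketplace
Applying these three objects with oc apply -f triggers the same installation you would get from the OperatorHub UI.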
Tuned profile
Depending on the hardware used, you might experience driver crashes within the NVIDIA Network Operator daemon pod due to its inability to unload the Intel RDMA (irdma) driver. To resolve this issue, I blacklisted the driver using a Tuned performance profile.
Below is the Tuned profile I used to blacklist the irdma driver as a workaround for the NVIDIA driver issues I encountered:
apiVersion: tuned.openshift.io/v1
kind: Tuned
metadata:
  name: openshift-node-custom
  namespace: openshift-cluster-node-tuning-operator
spec:
  profile:
  - data: |
      [main]
      summary=Custom OpenShift node profile with irdma module blacklist
      include=openshift-node
      [bootloader]
      cmdline_openshift_node_custom=module_blacklist=irdma
    name: openshift-node-custom
  recommend:
  - machineConfigLabels:
      machineconfiguration.openshift.io/role: "master"
    priority: 20
    profile: openshift-node-custom
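As a quick sanity check (the file name below is just an example), you can apply the profile and, once the affected nodes have rebooted, confirm that the blacklist argument made it onto the kernel command line:
# Apply the Tuned profile and confirm the operator accepted it
oc apply -f tuned-openshift-node-custom.yaml
oc get tuned -n openshift-cluster-node-tuning-operator

# After the reboot, verify the irdma blacklist is present on the kernel command line
oc debug node/<node-name> -- chroot /host cat /proc/cmdline | grep module_blacklist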
Node Feature Discovery Operator
This operator manages the detection of hardware features and configuration. It will detect the GPUs you are using as well as other types of hardware.
When using InfiniBand, we need the following device classes whitelisted (0207 for InfiniBand and 02 for network controllers):
pci:
  deviceClassWhitelist:
    - "0200"
    - "03"
    - "12"
    - "02"
    - "0207"
This can be done after deployment by editing the NodeFeatureDiscovery resource with the CLI:
oc edit NodeFeatureDiscovery -n openshift-nfd
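For reference, the whitelist lives inside the worker configuration of the NodeFeatureDiscovery resource. The sketch below shows roughly where the pci block from above sits; treat the surrounding fields as an approximation and keep whatever your existing resource already defines:
apiVersion: nfd.openshift.io/v1
kind: NodeFeatureDiscovery
metadata:
  name: nfd-instance
  namespace: openshift-nfd
spec:
  workerConfig:
    configData: |
      sources:
        pci:
          deviceClassWhitelist:
            - "0200"
            - "03"
            - "12"
            - "02"
            - "0207"
          deviceLabelFields:
            - "vendor"    # assumption: vendor-based labels such as pci-15b3.present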
It can also be done during/after deployment through the UI (steps described below). This method applies to any operator.
To do this through the UI:
- Access your OpenShift dashboard.
- In the left-hand menu, navigate to the Installed Operators tab.
- Search for the Node Feature Discovery Operator.
- Select the Node Feature Discovery Operator.
- In the menu along the top of the page, navigate to the NodeFeatureDiscovery tab.
- Click the Create NodeFeatureDiscovery button.
- In the YAML editor that opens, find deviceClassWhitelist and add an additional line, - "xxx", at the same indent level as the existing entries.
- Click Create.
Now we can verify that the IB resource was detected by running:
oc describe node | grep -E 'Roles|pci' | grep pci-15b3
Below is example output from the cluster:
feature.node.kubernetes.io/pci-15b3.present=true
feature.node.kubernetes.io/pci-15b3.present=true
feature.node.kubernetes.io/pci-15b3.present=true
OpenShift SR-IOV Operator
Upon deployment of the operator, a default SriovNetworkNodePolicy will be created. In this specific case, no other policy is required beyond the per-port policies described below.
Looking inside my nodes, I identified my InfiniBand ports, ens3f0np0 and ens3f1np1, and created the following SR-IOV network node policy for port ens3f0np0:
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: mlnx-port-1
  namespace: openshift-sriov-network-operator
spec:
  deviceType: netdevice
  isRdma: true
  nicSelector:
    pfNames:
    - ens3f0np0
    vendor: 15b3
  nodeSelector:
    feature.node.kubernetes.io/network-sriov.capable: 'true'
  numVfs: 1
  priority: 99
  resourceName: port1
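Before moving on, it is worth confirming that the policy was applied and that the virtual function is exposed as an allocatable resource. The interface and resource names below follow the policy above; adjust them to your own setup:
# Check the per-node SR-IOV state produced by the operator
oc get sriovnetworknodestates -n openshift-sriov-network-operator -o yaml | grep -A 5 ens3f0np0

# The VF should show up on the nodes as openshift.io/port1
oc describe node | grep openshift.io/port1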
Now that we have that defined, we can create our network attachment definition, which sets up IP address management (via the whereabouts CNI plugin) to assign addresses on the interface whenever a resource requests it (like the PyTorch pods that are about to utilize the RoCE port):
apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  annotations:
    k8s.v1.cni.cncf.io/resourceName: openshift.io/port1
  name: network-port-1
  namespace: default
spec:
  config: |
    {
      "cniVersion": "0.3.1",
      "name": "network-port-1",
      "type": "sriov",
      "vlan": 0,
      "vlanQoS": 0,
      "logLevel": "info",
      "ipam": {
        "type": "whereabouts",
        "range": "192.168.1.2/24",
        "exclude": [
          "192.168.1.1",
          "192.168.1.2",
          "192.168.1.254",
          "192.168.1.255"
        ],
        "routes": [
          {
            "dst": "192.168.1.0/24"
          }
        ]
      }
    }
I created this RoCE generator to produce the above config files when multiple ports are involved. Note that it will need to be edited to match your cluster before it can be used.
Now we run:
oc describe node | grep -E 'Roles|pci' | grep pci-15b3
We should see:
feature.node.kubernetes.io/pci-15b3.present=true
feature.node.kubernetes.io/pci-15b3.sriov.capable=true
feature.node.kubernetes.io/pci-15b3.present=true
feature.node.kubernetes.io/pci-15b3.sriov.capable=true
feature.node.kubernetes.io/pci-15b3.present=true
feature.node.kubernetes.io/pci-15b3.sriov.capable=true
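To see how a workload consumes this setup, here is a minimal sketch of a pod (the name and image are just examples) that attaches to network-port-1 and requests the openshift.io/port1 resource, which is the same pattern the PyTorch pods will use later:
apiVersion: v1
kind: Pod
metadata:
  name: roce-test                # example name
  namespace: default
  annotations:
    k8s.v1.cni.cncf.io/networks: network-port-1
spec:
  containers:
  - name: roce-test
    image: registry.access.redhat.com/ubi9/ubi   # example image
    command: ["sleep", "infinity"]
    resources:
      requests:
        openshift.io/port1: "1"
      limits:
        openshift.io/port1: "1"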
NVIDIA GPU Operator
Note that I have enabled the driver RDMA option in this cluster policy:
spec:
  driver:
    rdma:
      enabled: true
The full cluster policy below can be pasted into the operator through the UI or applied as a YAML file with the CLI. The deployment will take about 20 minutes to complete.
apiVersion: nvidia.com/v1
kind: ClusterPolicy
metadata:
  name: gpu-cluster-policy
spec:
  vgpuDeviceManager:
    config:
      default: default
    enabled: true
  migManager:
    config:
      default: all-disabled
      name: default-mig-parted-config
    enabled: true
  operator:
    defaultRuntime: crio
    initContainer: {}
    runtimeClass: nvidia
    use_ocp_driver_toolkit: true
  dcgm:
    enabled: true
  gfd:
    enabled: true
  dcgmExporter:
    config:
      name: ''
    enabled: true
    serviceMonitor:
      enabled: true
  cdi:
    default: false
    enabled: false
  driver:
    licensingConfig:
      nlsEnabled: true
    enabled: true
    certConfig:
      name: ''
    rdma:
      enabled: true
    repository: nvcr.io/nvidia
    useNvidiaDriverCRD: false
    kernelModuleConfig:
      name: ''
    upgradePolicy:
      autoUpgrade: true
      drain:
        deleteEmptyDir: false
        enable: false
        force: false
        timeoutSeconds: 300
      maxParallelUpgrades: 1
      maxUnavailable: 25%
      podDeletion:
        deleteEmptyDir: false
        force: false
        timeoutSeconds: 300
      waitForCompletion:
        timeoutSeconds: 0
    repoConfig: {}
    version: 550.54.14
    useOpenKernelModules: false
    virtualTopology:
      config: ''
    image: driver
  devicePlugin:
    config:
      default: ''
      name: ''
    enabled: true
  kataManager:
    config:
      artifactsDir: /opt/nvidia-gpu-operator/artifacts/runtimeclasses
  mig:
    strategy: single
  sandboxDevicePlugin:
    enabled: true
  validator:
    plugin:
      env:
      - name: WITH_WORKLOAD
        value: 'false'
  nodeStatusExporter:
    enabled: true
  daemonsets:
    rollingUpdate:
      maxUnavailable: '1'
    updateStrategy: RollingUpdate
  sandboxWorkloads:
    defaultWorkload: container
    enabled: false
  gds:
    enabled: false
  vgpuManager:
    enabled: false
  vfioManager:
    enabled: true
  toolkit:
    enabled: true
    installDir: /usr/local/nvidia
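Once the policy is applied, a quick way to confirm that everything converged is to wait for the operator pods to settle and then check that the GPUs are advertised as node resources:
# Watch the GPU operator pods until they are all Running or Completed
oc get pods -n nvidia-gpu-operator

# The GPUs should now be exposed as allocatable nvidia.com/gpu resources
oc describe node | grep -E 'Roles|nvidia.com/gpu'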
NVIDIA Network Operator
Once the operator is deployed, I use this NicClusterPolicy, which tells the NVIDIA Network Operator to automatically configure and manage the networking features, or to apply custom network settings. Note that the environment variables set in this YAML are there to handle compatibility issues I experienced; in this specific case, the driver compilation was failing because a storage module could not be unloaded. This issue might not occur on every setup. The deployment will take about 20 minutes to complete.
apiVersion: mellanox.com/v1alpha1
kind: NicClusterPolicy
metadata:
  name: nic-cluster-policy
spec:
  ofedDriver:
    env:
    - name: CREATE_IFNAMES_UDEV
      value: "true"
    - name: UNLOAD_STORAGE_MODULES
      value: "true"
    image: doca-driver
    livenessProbe:
      initialDelaySeconds: 30
      periodSeconds: 30
    readinessProbe:
      initialDelaySeconds: 10
      periodSeconds: 30
    repository: nvcr.io/nvidia/mellanox
    startupProbe:
      initialDelaySeconds: 10
      periodSeconds: 20
    terminationGracePeriodSeconds: 300
    upgradePolicy:
      autoUpgrade: true
      drain:
        deleteEmptyDir: true
        enable: true
        force: true
        podSelector: ''
        timeoutSeconds: 300
      maxParallelUpgrades: 1
    version: 24.01-0.3.3.1-10
  rdmaSharedDevicePlugin:
    config: |
      {
        "configList": [
          {
            "resourceName": "rdma_shared_device_a",
            "rdmaHcaMax": 63,
            "selectors": {
              "vendors": ["15b3"]
            }
          }
        ]
      }
    image: k8s-rdma-shared-dev-plugin
    repository: ghcr.io/mellanox
    version: sha-fe7f371c7e1b8315bf900f71cd25cfc1251dc775
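As with the GPU operator, a quick check confirms that the DOCA/OFED driver pods are up and the shared RDMA devices are advertised on the nodes. The namespace below is the default I would expect for the network operator; adjust it to your install:
# Watch the driver and device plugin pods in the network operator's namespace
oc get pods -n nvidia-network-operator

# The shared RDMA resource defined above should appear on the nodes
oc describe node | grep rdma_shared_device_a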
Now that we have successfully deployed the required operators, our next focus will be on setting up storage to ensure that the datasets and models can be efficiently accessed and managed during the training process.