RoCE multi-node AI training on Red Hat OpenShift

Learn how to run distributed AI training on Red Hat OpenShift using RoCE with this step-by-step guide from manual setup to fully automated training.

Start your AI journey on OpenShift

In this lesson, you'll learn how to set up an OpenShift environment for running distributed AI workloads with PyTorch. This involves deploying several cluster operators, configuring the network for RDMA over Converged Ethernet (RoCE), and setting up GPU resources.

Prerequisites:

  • A running OpenShift cluster.
  • Basic familiarity with OpenShift operator deployment.
  • Access to GPUs for training.

In this lesson, you will:

  • Understand the workflow for setting up AI workloads on OpenShift.
  • Deploy cluster operators necessary for running distributed training.
  • Set up your network for RoCE.

Training flow and hardware diagram

The workflow diagram (Figure 1) highlights all the components we’ll use and how they will work together to execute the training task. The hardware (HW) diagram below breaks down the topology of the infrastructure, illustrating the final form of the cluster (Figure 2).

Figure 1: AI training workflow on OpenShift, showing how the cluster operators, GPU resources, RoCE network, and PyTorch training pods interact.
Figure 2: AI training infrastructure topology, showing how the nodes, GPUs, RoCE network, and storage are connected in the distributed environment.

Cluster operators deployment 

In OpenShift, operators can be deployed either through the OperatorHub UI or via YAML files using the CLI. For this setup, it's recommended to follow this order of deployment:

  1. Apply the tuned profile (if necessary).
  2. Deploy the Node Feature Discovery Operator and its NodeFeatureDiscovery resource.
  3. Deploy the OpenShift single root input/output virtualization (SR-IOV) Operator, including its SriovNetworkNodePolicy and NetworkAttachmentDefinition resources.
  4. Deploy the NVIDIA Network Operator and its corresponding NicClusterPolicy.
  5. Deploy the NVIDIA GPU Operator and its ClusterPolicy.
  6. Deploy Red Hat OpenShift Service Mesh.
  7. Deploy Red Hat OpenShift AI.

This deployment order is recommended but not mandatory. Steps 1 and 3 involve machine configuration changes that will necessitate a node reboot. Completing these steps early ensures that node draining and other operations are handled at the start of the cluster’s lifecycle, minimizing the need for large-scale migrations and speeding up these processes initially.
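
Because steps 1 and 3 roll out MachineConfig changes, you can follow the resulting node drains and reboots through the machine config pools before moving on. These are standard oc commands; pool names depend on your cluster:

oc get mcp
oc get nodes -w

Wait until every pool reports UPDATED as True before deploying the next operator.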

OpenShift has simplified the operator deployment process. Follow the steps below to deploy the NVIDIA GPU Operator from the OperatorHub UI (a CLI alternative is sketched after the steps):

  1. Access your OpenShift dashboard.
  2. In the left-hand menu, navigate to the Operators tab.
  3. Select OperatorHub.
  4. On the OperatorHub page, search for the NVIDIA GPU Operator.
  5. Select the NVIDIA GPU Operator and click Install.
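
If you prefer the CLI, the same operator can be installed with an Operator Lifecycle Manager Subscription. The sketch below is a starting point only: take the namespace, OperatorGroup, channel, and catalog source from the operator's entry in OperatorHub for your cluster version.

apiVersion: v1
kind: Namespace
metadata:
  name: nvidia-gpu-operator
---
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  name: nvidia-gpu-operator-group
  namespace: nvidia-gpu-operator
spec:
  targetNamespaces:
  - nvidia-gpu-operator
---
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: gpu-operator-certified
  namespace: nvidia-gpu-operator
spec:
  channel: stable            # placeholder: use the channel listed in OperatorHub
  name: gpu-operator-certified
  source: certified-operators
  sourceNamespace: openshift-marketplace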

Tuned profile

Depending on the hardware used, you might experience driver crashes within the NVIDIA network operator daemon pod due to its inability to unload the Intel RDMA (irdma) driver. To resolve this issue, I blacklisted the module with a custom Tuned profile so it is never loaded.

Below is the Tuned profile I used to blacklist the irdma module as a workaround for the NVIDIA driver issues I encountered:

apiVersion: tuned.openshift.io/v1
kind: Tuned
metadata:
  name: openshift-node-custom
  namespace: openshift-cluster-node-tuning-operator
spec:
  profile:
  - data: |
      [main]
      summary=Custom OpenShift node profile with irdma module blacklist
      include=openshift-node
      [bootloader]
      cmdline_openshift_node_custom=module_blacklist=irdma
    name: openshift-node-custom

  recommend:
  - machineConfigLabels:
      machineconfiguration.openshift.io/role: "master"
    priority: 20
    profile: openshift-node-custom
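
After the profile is applied and the affected nodes have rebooted, you can confirm that the blacklist reached the kernel command line. Replace <node-name> with one of your nodes; both commands are standard oc usage:

oc get profiles.tuned.openshift.io -n openshift-cluster-node-tuning-operator

oc debug node/<node-name> -- chroot /host cat /proc/cmdline | grep module_blacklist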

Node Feature Discovery Operator

This operator manages the detection of hardware features and configuration, labeling each node with the GPUs and other types of hardware it detects.

When using InfiniBand, we need to have the following device classes whitelisted (0207 for the InfiniBand controller and 02 for the network controller):

pci:
  deviceClassWhitelist:
    - "0200"
    - "03"
    - "12"
    - "02"
    - "0207"

This can be done after deployment by editing the NodeFeatureDiscovery resource with the CLI:

oc edit NodeFeatureDiscovery -n openshift-nfd
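
For reference, the whitelist lives under the worker configuration of the NodeFeatureDiscovery resource. Below is a trimmed sketch of the relevant portion of the spec; the resource name is a typical default, and the exact field layout should be checked against the NFD Operator version installed in your cluster:

apiVersion: nfd.openshift.io/v1
kind: NodeFeatureDiscovery
metadata:
  name: nfd-instance
  namespace: openshift-nfd
spec:
  workerConfig:
    configData: |
      sources:
        pci:
          deviceClassWhitelist:
            - "0200"
            - "03"
            - "12"
            - "02"
            - "0207"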

It can also be done during/after deployment through the UI (steps described below). This method applies to any operator.

To do this through the UI:

  1. Access your OpenShift dashboard.
  2. In the left-hand menu, navigate to the Installed Operators tab.
  3. Search for the Node Feature Discovery Operator.
  4. Select the Node Feature Discovery Operator.
  5. In the menu along the top of the page, navigate to the NodeFeatureDiscovery tab.
  6. Click the Create NodeFeatureDiscovery button.
  7. In the YAML file that opens, go to deviceClassWhitelist and add an additional line - "xxx" (where xxx is the device class to whitelist) at the same indentation level as the lines above.
  8. Click Create.

Now we can verify that the IB resource was detected by running:

oc describe node | grep -E 'Roles|pci' | grep pci-15b3

Below is example output from the cluster:

feature.node.kubernetes.io/pci-15b3.present=true
feature.node.kubernetes.io/pci-15b3.present=true
feature.node.kubernetes.io/pci-15b3.present=true

OpenShift SR-IOV Operator

Upon deployment of the operator, a default SriovNetworkNodePolicy will be created; it needs no changes in this case. What we do need is a policy for each RoCE port we plan to use.

Looking inside my nodes, I identified my InfiniBand ports, which are ens3f0np0 and ens3f1np1, and I created the following SR-IOV network node policy for port ens3f0np0:

apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: mlnx-port-1
  namespace: openshift-sriov-network-operator
spec:
  deviceType: netdevice
  isRdma: true
  nicSelector:
    pfNames:
    - ens3f0np0
    vendor: 15b3
  nodeSelector:
    feature.node.kubernetes.io/network-sriov.capable: 'true'
  numVfs: 1
  priority: 99
  resourceName: port1
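
If you are not sure which interface names belong to the Mellanox ports, or you want to confirm that the virtual function was created after the policy was applied, the operator's per-node state object lists every candidate interface and its VFs, and the new resource should appear on the node. <node-name> is a placeholder:

oc get sriovnetworknodestates.sriovnetwork.openshift.io -n openshift-sriov-network-operator -o yaml

oc describe node <node-name> | grep openshift.io/port1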

Now that we have that defined, we can create our network attachment definition, which sets up IP address management (via the whereabouts IPAM plugin in the config below) to serve the interface whenever a resource creation requests it (like our PyTorch pods that are about to utilize the RoCE port):

apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  annotations:
    k8s.v1.cni.cncf.io/resourceName: openshift.io/port1
  name: network-port-1
  namespace: default
spec:
  config: |
    {
      "cniVersion": "0.3.1",
      "name": "network-port-1",
      "type": "sriov",
      "vlan": 0,
      "vlanQoS": 0,
      "logLevel": "info",
      "ipam": {
        "type": "whereabouts",
        "range": "192.168.1.2/24",
        "exclude": [
          "192.168.1.1",
          "192.168.1.2",
          "192.168.1.254",
          "192.168.1.255"
        ],
        "routes": [
          {
            "dst": "192.168.1.0/24"
          }
        ]
      }
    }

I created this RoCE generator to produce the above config files for setups with multiple ports. Note that it will need to be edited to match your cluster before it can be used.
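
For context, this is roughly how a workload will consume the attachment later: the pod references the NetworkAttachmentDefinition through an annotation and requests the matching SR-IOV resource. The pod name and image below are placeholders, not the training image used later in this learning path:

apiVersion: v1
kind: Pod
metadata:
  name: roce-attach-test
  namespace: default
  annotations:
    k8s.v1.cni.cncf.io/networks: network-port-1
spec:
  containers:
  - name: test
    # placeholder image; any image with basic networking tools will do
    image: registry.access.redhat.com/ubi9/ubi
    command: ["sleep", "infinity"]
    resources:
      requests:
        openshift.io/port1: "1"
      limits:
        openshift.io/port1: "1"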

Now we run:

oc describe node | grep -E 'Roles|pci' | grep pci-15b3

We should see:

feature.node.kubernetes.io/pci-15b3.present=true
feature.node.kubernetes.io/pci-15b3.sriov.capable=true
feature.node.kubernetes.io/pci-15b3.present=true
feature.node.kubernetes.io/pci-15b3.sriov.capable=true
feature.node.kubernetes.io/pci-15b3.present=true
feature.node.kubernetes.io/pci-15b3.sriov.capable=true

NVIDIA GPU Operator

Note that I have set the driver RDMA option to be enabled in this cluster policy:

spec:
  driver:
    rdma:
      enabled: true

This full cluster policy can be copy-pasted to the operator through the UI, or applied with this YAML file from the CLI. The deployment will take about 20 minutes to complete. See below:

apiVersion: nvidia.com/v1
kind: ClusterPolicy
metadata:
  name: gpu-cluster-policy
spec:
  vgpuDeviceManager:
    config:
      default: default
    enabled: true
  migManager:
    config:
      default: all-disabled
      name: default-mig-parted-config
    enabled: true
  operator:
    defaultRuntime: crio
    initContainer: {}
    runtimeClass: nvidia
    use_ocp_driver_toolkit: true
  dcgm:
    enabled: true
  gfd:
    enabled: true
  dcgmExporter:
    config:
      name: ''
    enabled: true
    serviceMonitor:
      enabled: true
  cdi:
    default: false
    enabled: false
  driver:
    licensingConfig:
      nlsEnabled: true
    enabled: true
    certConfig:
      name: ''
    rdma:
      enabled: true
    repository: nvcr.io/nvidia
    useNvidiaDriverCRD: false
    kernelModuleConfig:
      name: ''
    upgradePolicy:
      autoUpgrade: true
      drain:
        deleteEmptyDir: false
        enable: false
        force: false
        timeoutSeconds: 300
      maxParallelUpgrades: 1
      maxUnavailable: 25%
      podDeletion:
        deleteEmptyDir: false
        force: false
        timeoutSeconds: 300
      waitForCompletion:
        timeoutSeconds: 0
    repoConfig: {}
    version: 550.54.14
    useOpenKernelModules: false
    virtualTopology:
      config: ''
    image: driver
  devicePlugin:
    config:
      default: ''
      name: ''
    enabled: true
  kataManager:
    config:
      artifactsDir: /opt/nvidia-gpu-operator/artifacts/runtimeclasses
  mig:
    strategy: single
  sandboxDevicePlugin:
    enabled: true
  validator:
    plugin:
      env:
        - name: WITH_WORKLOAD
          value: 'false'
  nodeStatusExporter:
    enabled: true
  daemonsets:
    rollingUpdate:
      maxUnavailable: '1'
    updateStrategy: RollingUpdate
  sandboxWorkloads:
    defaultWorkload: container
    enabled: false
  gds:
    enabled: false
  vgpuManager:
    enabled: false
  vfioManager:
    enabled: true
  toolkit:
    enabled: true
    installDir: /usr/local/nvidia
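
Once the ClusterPolicy has finished reconciling, a quick health check is to make sure all the operator pods are running and that the GPUs are advertised as allocatable node resources:

oc get pods -n nvidia-gpu-operator

oc describe node | grep -E 'Roles|nvidia.com/gpu'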

NVIDIA Network Operator

Once the operator is deployed, I use this NicClusterPolicy, which tells the NVIDIA Network Operator to automatically configure and manage the networking features and apply custom network settings. Note that the environment variables set in this YAML are there to handle some compatibility issues I experienced: in this specific case, the driver compilation was failing because the storage modules could not be unloaded. This issue might not occur on every setup. The deployment will take about 20 minutes to complete. See below:

apiVersion: mellanox.com/v1alpha1
kind: NicClusterPolicy
metadata:
  name: nic-cluster-policy
spec:
  ofedDriver:
    env:
    - name: CREATE_IFNAMES_UDEV
      value: "true"
    - name: UNLOAD_STORAGE_MODULES
      value: "true"
    image: doca-driver
    livenessProbe:
      initialDelaySeconds: 30
      periodSeconds: 30
    readinessProbe:
      initialDelaySeconds: 10
      periodSeconds: 30
    repository: nvcr.io/nvidia/mellanox
    startupProbe:
      initialDelaySeconds: 10
      periodSeconds: 20
    terminationGracePeriodSeconds: 300
    upgradePolicy:
      autoUpgrade: true
      drain:
        deleteEmptyDir: true
        enable: true
        force: true
        podSelector: ''
        timeoutSeconds: 300
      maxParallelUpgrades: 1
    version: 24.01-0.3.3.1-10
  rdmaSharedDevicePlugin:
    config: |
      {
        "configList": [
          {
            "resourceName": "rdma_shared_device_a",
            "rdmaHcaMax": 63,
            "selectors": {
              "vendors": ["15b3"]
            }
          }
        ]
      }
    image: k8s-rdma-shared-dev-plugin
    repository: ghcr.io/mellanox
    version: sha-fe7f371c7e1b8315bf900f71cd25cfc1251dc775
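
As with the GPU operator, you can verify the rollout by watching the network operator pods and checking that the shared RDMA resource from the rdmaSharedDevicePlugin config above shows up on the nodes (the namespace may differ depending on how the operator was installed, and the resource is exported under the rdma/ prefix by default):

oc get pods -n nvidia-network-operator

oc describe node | grep rdma/rdma_shared_device_a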

Now that we have successfully deployed the required operators, our next focus will be on setting up storage to ensure that the datasets and models can be efficiently accessed and managed during the training process.
