RoCE multi-node AI training on Red Hat OpenShift

Learn how to run distributed AI training on Red Hat OpenShift using RoCE with this step-by-step guide from manual setup to fully automated training.

Start your AI journey on OpenShift

In this lesson, we’ll explore how to set up storage tailored to your specific OpenShift cluster configuration. Whether you are working with a single node, 2 nodes, or a multi-node environment, you’ll learn how to provision storage for your AI training workloads.

Prerequisites:

  • A running OpenShift cluster (single or multiple nodes).
  • Basic knowledge of OpenShift storage concepts, including PersistentVolumes (PVs) and PersistentVolumeClaims (PVCs).

In this lesson, you will:

  • Explore different storage solutions for distributed AI training on OpenShift.
  • Set up storage tailored to your cluster size (single node, 2 nodes, or 3+ nodes).
  • Deploy storage operators like logical volume manager (LVM), network file system (NFS), or Red Hat OpenShift Data Foundation.

Storage deployment

We need storage for our dataset/training data. Preferably, we should share the same PVC across all the training pods so we won't need to keep multiple copies of the same data. There are a few options when it comes to storage, depending on the number of nodes we have available (a pod spec sketch showing a shared PVC mount follows the list):

  • Single node: If we have a single node (SNO) with multiple GPUs, an LVM PVC will do, since an LVM PVC can be shared across all the pods on the same node.
  • 2 nodes: If we have 2 nodes (SNO + worker node), we can use the NFS Operator to share that PVC across multiple nodes. However, note that the NFS Operator is not officially supported.
  • 3 nodes: If we have 3 or more nodes, the best option is OpenShift Data Foundation, because it also provides data redundancy, and the last thing we want is to lose all our training data to a hardware failure.
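
As a minimal sketch of the shared-PVC idea (the pod name, image, and claim name here are hypothetical), each training pod mounts the same claim, so the dataset lives on shared storage exactly once:

apiVersion: v1
kind: Pod
metadata:
  name: trainer-0
spec:
  containers:
  - name: trainer
    image: quay.io/example/pytorch-trainer:latest   # placeholder image
    volumeMounts:
    - name: dataset
      mountPath: /workspace/dataset                 # where the training script reads data
  volumes:
  - name: dataset
    persistentVolumeClaim:
      claimName: resnet50-dataset                   # the shared PVC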

LVM Storage Operator (recommended for single node)

The LVM Storage Operator is the storage backend we use for the dataset. When deploying the LVM Operator, you have the option to create an LVM cluster that will automatically provision any unused disk on all nodes, but you can also create your own LVM config and target specific disks.
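
If you prefer the CLI to the OperatorHub UI, the deployment can be sketched with standard OLM resources (the channel value is an assumption; match it to your OpenShift version):

apiVersion: v1
kind: Namespace
metadata:
  name: openshift-lvms-operator
---
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  name: openshift-lvms-operator
  namespace: openshift-lvms-operator
spec:
  targetNamespaces:
  - openshift-lvms-operator
---
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: lvms-operator
  namespace: openshift-lvms-operator
spec:
  channel: stable            # assumption: verify the channel for your cluster version
  name: lvms-operator
  source: redhat-operators
  sourceNamespace: openshift-marketplace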

Note that the YAML file will require the full disk path. The code block below provides an example that shows how to find an unused disk on a node:

oc get node
NAME                                                STATUS   ROLES                          AGE   VERSION
bb37-h15-000-r750.rdu3.labs.perfscale.redhat.com    Ready    worker                         17d   v1.28.7+f1b5f6c
cc37-h13-000-r750.rdu3.labs.perfscale.redhat.com    Ready    control-plane,master,worker    83d   v1.28.7+f1b5f6c
cc37-h15-000-r750.rdu3.labs.perfscale.redhat.com    Ready    worker                         83d   v1.28.7+f1b5f6c 
oc debug node/bb37-h15-000-r750.rdu3.labs.perfscale.redhat.com 
Temporary namespace openshift-debug-ht426 is created for debugging node... 
Starting pod/bb37-h15-000-r750rdu3labsperfscaleredhatcom-debug-rm4mg...
To use host binaries, run `chroot /host`
Pod IP: 10.6.60.147
If you don't see a command prompt, try pressing enter.
chroot /host
lsblk
NAME      MAJ:MIN RM   SIZE RO TYPE MOUNTPOINTS
sda         8:0   0 447.1G  0 disk
├─sda1      8:1   0     1M  0 part
├─sda2      8:2   0   127M  0 part
├─sda3      8:3   0   384M  0 part /boot
└─sda4      8:4   0 446.6G  0 part /var/lib/kubelet/pods/be7751a2-913-49c7-94ea-3168bd9c910c/volume-subpaths/nvidia-mig-manager-entrypoint/nvidia-mig-manager/0
                                   /var/lib/kubelet/pods/9a43dd25-a0b9-45e7-afcf-ae7174455d5b/volume-subpaths/nvidia-device-plugin-entrypoint/nvidia-device-plugin/0
                                   /var/lib/kubelet/pods/47419dbd-8401-42eb-bf6a-55f4521324e6/volume-subpaths/init-config/init-pod-nvidia-node-status-exporter/1
                                   /var/lib/kubelet/pods/d23309e1-2148-4fd2-88c0-cebb86641561/volume-subpaths/nvidia-container-toolkit-entrypoint/nvidia-container-toolkit-ctr/0
                                   /var
                                   /run/nvidia/driver/etc/hosts
                                   /run/nvidia/driver/host-etc/os-release
                                   /run/nvidia/driver/var/log
                                   /run/nvidia/driver/mnt/shared-nvidia-driver-toolkit
                                   /run/nvidia/driver/dev/termination-log
                                   /sysroot/ostree/deploy/rhcos/var
                                   /sysroot
                                   /usr
                                   /etc
                                   /
sdb         8:16  1     0B  0 disk
sr0        11:0   1   1.1G  0 rom
nvme4n1   259:0   0   2.9T  0 disk
nvme5n1   259:1   0   1.5T  0 disk
nvme0n1   259:2   0   1.5T  0 disk
nvme2n1   259:3   0   1.5T  0 disk
nvme1n1   259:4   0   1.5T  0 disk
nvme3n1   259:5   0   1.5T  0 disk
ls -ltr /dev/disk/by-path
total 0
lrwxrwxrwx. 1 root root 13 Jun 30 08:07 pci-0000:e6:00.0-nvme-1 -> ../../nvme5n1
lrwxrwxrwx. 1 root root 13 Jun 30 08:07 pci-0000:b3:00.0-nvme-1 -> ../../nvme2n1
lrwxrwxrwx. 1 root root 13 Jun 30 08:07 pci-0000:b2:00.0-nvme-1 -> ../../nvme1n1
lrwxrwxrwx. 1 root root 13 Jun 30 08:07 pci-0000:b1:00.0-nvme-1 -> ../../nvme0n1
lrwxrwxrwx. 1 root root 13 Jun 30 08:07 pci-0000:e5:00.0-nvme-1 -> ../../nvme4n1
lrwxrwxrwx. 1 root root 13 Jun 30 08:07 pci-0000:b4:00.0-nvme-1 -> ../../nvme3n1
lrwxrwxrwx. 1 root root  9 Jun 30 08:07 pci-0000:05:00.0-ata-1.0 -> ../../sda
lrwxrwxrwx. 1 root root  9 Jun 30 08:07 pci-0000:05:00.0-ata-1 -> ../../sda
lrwxrwxrwx. 1 root root 10 Jun 30 08:07 pci-0000:05:00.0-ata-1.0-part1 -> ../../sda1
lrwxrwxrwx. 1 root root 10 Jun 30 08:07 pci-0000:05:00.0-ata-1-part1 -> ../../sda1
lrwxrwxrwx. 1 root root 10 Jun 30 08:07 pci-0000:05:00.0-ata-1.0-part2 -> ../../sda2
lrwxrwxrwx. 1 root root 10 Jun 30 08:07 pci-0000:05:00.0-ata-1-part2 -> ../../sda2
lrwxrwxrwx. 1 root root 10 Jun 30 08:07 pci-0000:05:00.0-ata-1.0-part4 -> ../../sda4
lrwxrwxrwx. 1 root root 10 Jun 30 08:07 pci-0000:05:00.0-ata-1-part4 -> ../../sda4
lrwxrwxrwx. 1 root root 10 Jun 30 08:07 pci-0000:05:00.0-ata-1.0-part3 -> ../../sda3
lrwxrwxrwx. 1 root root 10 Jun 30 08:07 pci-0000:05:00.0-ata-1-part3 -> ../../sda3
lrwxrwxrwx. 1 root root  9 Jul 12 22:21 pci-0000:00:14.0-usb-0:14.4.1:1.0-scsi-0:0:0:1 -> ../../sdb
lrwxrwxrwx. 1 root root  9 Jul 12 22:21 pci-0000:00:14.0-usb-0:14.4.1:1.0-scsi-0:0:0:0 -> ../../sr0

Command line used:

oc get node
oc debug node/bb37-h15-000-r750.rdu3.labs.perfscale.redhat.com
chroot /host
lsblk
ls -ltr /dev/disk/by-path

Now we can add 1 or more disks, as shown below:

        paths:
        - /dev/disk/by-path/pci-0000:e5:00.0-nvme-1

This is the full YAML structure:

apiVersion: lvm.topolvm.io/v1alpha1
kind: LVMCluster
metadata:
  name: resnet50-disk
  namespace: openshift-lvms-operator
spec:
  storage:
    deviceClasses:
    - name: lvms-vg1
      fstype: ext4
      default: false
      deviceSelector:
        paths:
        - /dev/disk/by-path/pci-0000:e5:00.0-nvme-1
        forceWipeDevicesAndDestroyAllData: true
      thinPoolConfig:
        name: thin-pool-1
        sizePercent: 90
        overprovisionRatio: 10
      nodeSelector:
        nodeSelectorTerms:
        - matchExpressions:
          - key: kubernetes.io/hostname
            operator: In
            values:
            - cc37-h13-000-r750.rdu3.labs.perfscale.redhat.com
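
Once the LVMCluster is ready, LVMS exposes a storage class for the device class, typically named lvms-<deviceClass name> (confirm with oc get sc). A hedged PVC sketch against it:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: resnet50-dataset
  namespace: default
spec:
  accessModes:
  - ReadWriteOnce                    # LVM volumes are node-local, so single-node access
  resources:
    requests:
      storage: 500Gi                 # assumption: size this to your dataset
  storageClassName: lvms-lvms-vg1    # assumption: derived from device class lvms-vg1; verify with `oc get sc`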

NFS Operator (2+ nodes)

Now that we have provisioned a PV on that disk, we can create an NFS volume on top of it. Note that you can also use Red Hat OpenShift Data Foundation if data redundancy is needed.

As previously mentioned, the NFS Operator is not officially supported, but we can run it on top of the OpenShift LVM Operator when we only have 2 nodes and can't spare the storage to hold 2 copies of the data. In that case, the NFS Operator lets us create a volume that can be mounted by all the PyTorch pods. If NFS is needed, OpenShift Data Foundation supports that as well. See the example PVC below:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: resnet50-dataset
spec:
  accessModes:
  - ReadWriteMany
  resources:
    requests:
      storage: 100Gi
  storageClassName: nfs-provisioner
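
Applying the claim and checking that it binds might look like this (the file name is hypothetical):

oc create -f nfs-pvc.yaml
oc get pvc resnet50-dataset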

OpenShift Data Foundation (3+ nodes)

In our case, we can use OpenShift Data Foundation, which requires a minimum of 3 hosts. OpenShift Data Foundation is software-defined storage for containers. It helps teams develop and deploy applications quickly and efficiently. The steps below describe how to deploy the storage system once the operator has been deployed (a CLI verification sketch follows the list):

  1. Access your OpenShift dashboard.
  2. In the left-hand menu, navigate to the Installed Operators tab.
  3. Select OpenShift Data Foundation.
  4. Click Create StorageSystem. This will initiate a 5-step sequence; proceed through it as described below.
  5. From the Select StorageClass dropdown menu, select local-disk.
  6. Click Next.
  7. Make sure that the checkbox next to Default (OVN) is selected and click Next again.
  8. Click Next.
  9. Click Create StorageSystem.
  10. Navigate to the Storage tab and view the installed operator.
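
Once the wizard finishes, a quick CLI sanity check (openshift-storage is the default ODF namespace) is to confirm that the StorageCluster reports Ready and its pods are running:

oc get storagecluster -n openshift-storage
oc get pods -n openshift-storage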

This example shows how to create a volume (PVC) using OpenShift Data Foundation.

Command line used:

oc get sc
vim odf_volume.yaml
oc create -f odf_volume.yaml
oc get pvc

Create odf_volume.yaml with the following PVC definition, which requests a 1 TiB ReadWriteMany volume from the ocs-storagecluster-cephfs storage class:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: workspace
  namespace: default
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 1024Gi
  storageClassName: ocs-storagecluster-cephfs

The terminal capture below shows the full sequence:

oc get sc
NAME                                  PROVISIONER                            RECLAIMPOLICY  VOLUMEBINDINGMODE     ALLOWVOLUMEEXPANSION  AGE
localblock-sc                         kubernetes.io/no-provisioner           Delete         WaitForFirstConsumer  false                 8h
ocs-storagecluster-ceph-rbd           openshift-storage.rbd.csi.ceph.com     Delete         Immediate             true                  8h
ocs-storagecluster-ceph-rgw           openshift-storage.ceph.rook.io/bucket  Delete         Immediate             false                 8h
ocs-storagecluster-cephfs (default)   openshift-storage.cephfs.csi.ceph.com  Delete         Immediate             true                  8h
openshift-storage.noobaa.io           openshift-storage.noobaa.io/obc        Delete         Immediate             false                 8h
vim odf_volume.yaml
oc create -f odf_volume.yaml
persistentvolumeclaim/workspace created
oc get pvc
NAME        STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS                AGE
workspace   Bound    pvc-dcf0a6ff-762c-4628-a38d-e31b31863645   1Ti        RWX            ocs-storagecluster-cephfs   3s

Red Hat OpenShift AI and OpenShift Service Mesh Operators

Now we just need to deploy the last 2 operators: Red Hat OpenShift AI and Red Hat OpenShift Service Mesh. For this purpose, we only require the Service Mesh Operator itself, without any additional cluster policy.
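
If you'd rather install the Service Mesh Operator from the CLI, a Subscription sketch along these lines should work (the channel is an assumption; check OperatorHub for the current one):

apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: servicemeshoperator
  namespace: openshift-operators       # cluster-wide operators install here
spec:
  channel: stable                      # assumption: verify against OperatorHub
  name: servicemeshoperator
  source: redhat-operators
  sourceNamespace: openshift-marketplace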

Once the Service Mesh Operator has been deployed, we can proceed with the Red Hat OpenShift AI Operator deployment, followed by the DataScienceCluster creation, so we gain access to the Training Operator, which provides the PyTorchJob and Kubeflow APIs. The steps below show how:

  1. Access your OpenShift dashboard.
  2. In the left-hand menu, navigate to the Installed Operators tab.
  3. Select Red Hat OpenShift AI.
  4. Click Create DataScienceCluster.
  5. Unfold the Components dropdown.
  6. Unfold the trainingoperator dropdown.
  7. Set the managementState to Managed.
  8. Click Create.

Or just use this DataScienceCluster YAML file:

kind: DataScienceCluster
apiVersion: datasciencecluster.opendatahub.io/v1
metadata:
  name: default-dsc
  labels:
    app.kubernetes.io/name: datasciencecluster
    app.kubernetes.io/instance: default-dsc
    app.kubernetes.io/part-of: rhods-operator
    app.kubernetes.io/managed-by: kustomize
    app.kubernetes.io/created-by: rhods-operator
spec:
  components:
    codeflare:
      managementState: Managed
    dashboard:
      managementState: Removed
    datasciencepipelines:
      managementState: Removed
    kserve:
      managementState: Removed
      serving:
        ingressGateway:
          certificate:
            type: SelfSigned
        managementState: Managed
        name: knative-serving
    kueue:
      managementState: Managed
    modelmeshserving:
      managementState: Managed
    ray:
      managementState: Managed
    trainingoperator:
      managementState: Managed
    workbenches:
      managementState: Removed
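
Applying the file and verifying that the training operator comes up might look like this (the file name is hypothetical; redhat-ods-applications is the RHOAI default namespace and may differ in your setup):

oc apply -f datasciencecluster.yaml
oc get datasciencecluster default-dsc
oc get pods -n redhat-ods-applications   # the Kubeflow training operator pod should appear here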

Now that we have successfully deployed the necessary operators and set up the storage environment, let’s move on to running and automating the distributed AI training process on OpenShift.
