RoCE multi-node AI training on Red Hat OpenShift

Learn how to run distributed AI training on Red Hat OpenShift using RoCE with this step-by-step guide from manual setup to fully automated training.

Start your AI journey on OpenShift

In this lesson, you will configure and execute a distributed AI training job on OpenShift using PyTorch. We'll walk through the necessary steps to set up storage permissions, download datasets, create PyTorch jobs, and finally automate the training process across multiple nodes.

To get the full benefit from this lesson, you will need:

  • A running OpenShift cluster with GPUs.
  • The required operators (OpenShift AI, Service Mesh) deployed.
  • Basic familiarity with Kubernetes, PyTorch, and YAML configurations.
  • Storage configurations in place from the previous lessons (LVM, NFS, or OpenShift Data Foundation).

In this lesson, you will:

  • Set up permissions for PyTorch pods to access storage.
  • Download and prepare the CIFAR-10 dataset for training.
  • Create and launch a distributed PyTorch training job.
  • Automate the distributed training process across multiple nodes.

Manual training

In this section, we’ll assign the necessary storage access privileges to our PyTorch pods. Below is the ClusterRole definition that grants access to the PersistentVolumes and PersistentVolumeClaims.

Give storage permissions

This is the ClusterRole, together with its ClusterRoleBinding, that I used to give the training operator's service account the privileges it needs to access storage:

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: training-operator-cluster-role
rules:
- apiGroups: [""]
  resources: ["persistentvolumes", "persistentvolumeclaims", "services", "endpoints", "events"]
  verbs: ["get", "list", "watch", "create", "delete", "update", "patch"]
- apiGroups: ["storage.k8s.io"]
  resources: ["storageclasses"]
  verbs: ["get", "list", "watch"]
- apiGroups: ["extensions"]
  resources: ["podsecuritypolicies"]
  verbs: ["use"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: training-operator-cluster-role-binding
subjects:
- kind: ServiceAccount
  name: kubeflow-training-operator
  namespace: redhat-ods-applications
roleRef:
  kind: ClusterRole
  name: training-operator-cluster-role
  apiGroup: rbac.authorization.k8s.io
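
To apply and sanity-check this RBAC, you can run something like the following (the file name training-operator-rbac.yaml is just an example):

oc apply -f training-operator-rbac.yaml
# Confirm the operator's service account can now work with PVCs
oc auth can-i list persistentvolumeclaims \
  --as=system:serviceaccount:redhat-ods-applications:kubeflow-training-operator \
  -n pytorch-workspace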

Create the dataset

Now that we have all the operators set up, we can download the dataset to the PVC we created earlier. I'm using a pod to download and extract the CIFAR-10 dataset from the University of Toronto to the res50-storage PVC. Once the download is complete, we can delete the pod.

Here is what the pod YAML looks like: 

apiVersion: v1
kind: Pod
metadata:
  name: dataset-downloader
  namespace: pytorch-workspace
spec:
  containers:
  - name: downloader
    image: alpine
    command: ["/bin/sh", "-c"]
    args:
      - |
        apk add --no-cache curl && \
        mkdir -p /mnt/data && \
        cd /mnt/data && \
        curl -L -O https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz && \
        tar -xvzf cifar-10-python.tar.gz && \
        mv cifar-10-batches-py dataset && \
        mkdir output
    volumeMounts:
    - mountPath: /mnt/data
      name: dataset-storage
  restartPolicy: OnFailure
  volumes:
  - name: dataset-storage
    persistentVolumeClaim:
      claimName: res50-storage
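
A possible workflow for this step, assuming the manifest above is saved as dataset-downloader.yaml (the file name is illustrative):

oc create -f dataset-downloader.yaml
# Follow the download and extraction
oc logs -f dataset-downloader -n pytorch-workspace
# Once the pod shows Completed, it can be removed
oc get pod dataset-downloader -n pytorch-workspace
oc delete pod dataset-downloader -n pytorch-workspace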

Create the PyTorch job

The basic PyTorchJob example below will launch three pods by pulling the premade image I created on Quay.io using this Dockerfile. The image contains PyTorch, the NVIDIA Collective Communication Library (NCCL), OpenFabrics Enterprise Distribution (OFED) drivers, and some network tools.

Note that in the YAML below I use annotations to attach the SR-IOV port to the pod:

 annotations: &annotations
            k8s.v1.cni.cncf.io/networks: |-
              [
                {
                  "name": "network-port-1",
                  "namespace": "default"
                }
              ]
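
Before creating the job, you can quickly confirm that the referenced SR-IOV network attachment actually exists (this assumes it was created in the default namespace, as referenced in the annotation above):

oc get network-attachment-definitions -n default
# Expect network-port-1 to be listed here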

I also point to the service account that is tied to the SCC I used earlier:

          serviceAccountName: pytorch-sa

Here is a full YAML example:

apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: pytorch-training
  namespace: pytorch-workspace
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: Never
      template:
        metadata:
          annotations: &annotations
            k8s.v1.cni.cncf.io/networks: |-
              [
                {
                  "name": "network-port-1",
                  "namespace": "default"
                }
              ]
        spec:
          serviceAccountName: pytorch-sa
          containers:
            - name: pytorch
              image: path.to.image:latest
              imagePullPolicy: Always
              env: &env
                - name: NCCL_DEBUG
                  value: "INFO"
                - name: NCCL_IGNORE_CPU_AFFINITY
                  value: "1"
                - name: NCCL_CROSS_NIC
                  value: "0"
                - name: NCCL_SOCKET_IFNAME
                  value: "net1"
                - name: NCCL_IB_QPS_PER_CONNECTION
                  value: "1"
              volumeMounts: &vol_mount
                - name: storage-volume
                  mountPath: /mnt/storage
                - name: shm-volume
                  mountPath: /dev/shm
              resources: &resource_req
                requests:
                  nvidia.com/gpu: "1"
                  memory: "10Gi"
                  cpu: "1"
                  openshift.io/port1: "1"
                limits:
                  nvidia.com/gpu: "1"
                  openshift.io/port1: "1"
              securityContext: &sec_profile
                runAsUser: 0
                runAsGroup: 0
                capabilities:
                  add: ["IPC_LOCK", "SYS_RESOURCE", "NET_RAW", "SYS_CHROOT", "AUDIT_WRITE"]
          volumes: &vols
            - name: storage-volume
              persistentVolumeClaim:
                claimName: res50-storage
            - name: shm-volume
              emptyDir:
                medium: Memory
                sizeLimit: 10Gi
    Worker:
      replicas: 2
      restartPolicy: Never
      template:
        metadata:
          annotations: *annotations
        spec:
          serviceAccountName: pytorch-sa
          containers:
            - name: pytorch
              image: path.to.image:latest
              imagePullPolicy: Always
              env: *env
              volumeMounts: *vol_mount
              resources: *resource_req
              securityContext: *sec_profile
          volumes: *vols

Now, using the above YAML, we just need to create the PyTorchJob:

oc create -f pytorchjob.yaml
pytorchjob.kubeflow.org/pytorch-training created 
oc get pod
NAME                         READY   STATUS     RESTARTS     AGE
pytorch-training-master-0    1/1     Running    0            3s
pytorch-training-worker-0    0/1     Init:0/1   0            3s
pytorch-training-worker-1    0/1     Init: 0/1  0            3s
oc get pod
NAME                         READY   STATUS     RESTARTS     AGE
pytorch-training-master-0    1/1     Running    0            7s
pytorch-training-worker-0    0/1     Init:0/1   0            7s
pytorch-training-worker-1    0/1     Init:0/1   0            7s
oc get pod
NAME                         READY   STATUS     RESTARTS     AGE
pytorch-training-master-0    1/1     Running    0            11s
pytorch-training-worker-0    1/1     Running    0            11s
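
Besides watching the pods, you can inspect the PyTorchJob resource itself (this assumes the Kubeflow Training Operator that ships with OpenShift AI is managing the job, as set up earlier):

oc get pytorchjobs -n pytorch-workspace
# Detailed status, conditions, and events for the job
oc describe pytorchjob pytorch-training -n pytorch-workspace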

Convert the dataset to images

First, we need to convert the CIFAR-10 dataset to images. I used this Python script to do so:

import os
import pickle
import numpy as np
from PIL import Image
from tqdm import tqdm

def load_cifar_batch(file):
    try:
        with open(file, 'rb') as fo:
            dict = pickle.load(fo, encoding='bytes')
        return dict
    except FileNotFoundError:
        print(f"File {file} not found.")
        return None
    except Exception as e:
        print(f"Error loading file {file}: {e}")
        return None

def convert_cifar10_to_imagefolder(root_dir, output_dir):
    # Create directories
    meta = load_cifar_batch(os.path.join(root_dir, 'batches.meta'))
    if meta is None:
        return
    classes = [label.decode('utf-8') for label in meta[b'label_names']]

    for cls in classes:
        os.makedirs(os.path.join(output_dir, 'train', cls), exist_ok=True)
        os.makedirs(os.path.join(output_dir, 'test', cls), exist_ok=True)

    train_batches = [f for f in os.listdir(root_dir) if f.startswith('data_batch_')]

    for batch_file in train_batches:
        data_dict = load_cifar_batch(os.path.join(root_dir, batch_file))
        if data_dict is None:
            continue
        data = data_dict[b'data']
        labels = data_dict[b'labels']

        batch_number = batch_file.split('_')[-1]
        for i in tqdm(range(len(data)), desc=f'Converting {batch_file}'):
            img = data[i]
            img = img.reshape(3, 32, 32).transpose(1, 2, 0)
            img = Image.fromarray(img)
            label = labels[i]
            img.save(os.path.join(output_dir, 'train', classes[label], f'batch{batch_number}_img{i}.png'))

    # Convert test batch
    test_dict = load_cifar_batch(os.path.join(root_dir, 'test_batch'))
    if test_dict is None:
        return
    data = test_dict[b'data']
    labels = test_dict[b'labels']

    for i in tqdm(range(len(data)), desc='Converting test batch'):
        img = data[i]
        img = img.reshape(3, 32, 32).transpose(1, 2, 0)
        img = Image.fromarray(img)
        label = labels[i]
        img.save(os.path.join(output_dir, 'test', classes[label], f'test_img{i}.png'))

if __name__ == "__main__":
    root_dir = '/mnt/storage/cifar-10-batches-py'
    output_dir = '/mnt/storage/dataset/cifar10_imagefolder'
    convert_cifar10_to_imagefolder(root_dir, output_dir)

Command line used:

ls -tlr
total 20
-rw-r--r--. 1 root root 8320 Jul 22 18:41 main.py
-rwxr-xr-x 1 root root 1352 Jul 22 19:14 entrypoint.sh
-rw-r--r-- 1 root root 2326 Jul 28 18:00 convert.py
python convert.py
Converting data_batch_4: 100%| | 10000/10000 [00:07<00:00, 1299.51it/s]
Converting data_batch_3: 100%| | 10000/10000 [00:07<00:00, 1295.86it/s]
Converting data_batch_2: 100%| | 10000/10000 [00:07<00:00, 1329.61it/s]
Converting data_batch_5: 100%| | 10000/10000 [00:07<00:00, 1316.59it/s]
Converting data_batch_1: 100%| | 10000/10000 [00:07<00:00, 1321.07it/s]
Converting test batch: 100%| | 10000/10000 [00:07<00:00, 1325.27it/s]

Note that the prompt in the above example has changed to root@pytorch-training-master-0:/workspace/training#, because these commands were run inside the master pod.
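
To sanity-check the conversion from inside the same pod, you can count the generated images; CIFAR-10 should produce 50,000 training images and 10,000 test images (the paths assume the script above ran unchanged):

find /mnt/storage/dataset/cifar10_imagefolder/train -name '*.png' | wc -l
find /mnt/storage/dataset/cifar10_imagefolder/test -name '*.png' | wc -l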

Launch the training

Now that we have the pods running and the dataset converted to images, we need to find the master pod's RoCE interface IP address, which was assigned through the SR-IOV network's DHCP CNI. Once we have that address, we can run the PyTorch training script in each of the pods.
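
One way to look up that address is to query the RoCE interface from the master pod (a quick sketch; it assumes the SR-IOV interface appears as net1 inside the pod, matching NCCL_SOCKET_IFNAME in the job YAML):

oc exec -n pytorch-workspace pytorch-training-master-0 -- ip -4 addr show net1
# Use the inet address printed here as --master_addr in the commands below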

Here is a short explanation of the torchrun arguments used below:

  • --nproc_per_node=1: Specifies the number of processes to run per node. In this case, I only have 1 GPU per node.
  • --nnodes=3: Indicates the number of nodes (machines) involved in the distributed training.
  • --node_rank=0: Specifies the rank (ID) of the current node. In a multi-node setup, this helps identify each node’s role. I used 0 to represent the master node.
  • --master_addr=192.168.1.5: The IP address of the master node, which all workers use to rendezvous and coordinate the distributed workload. Here it is the master pod's RoCE interface address.
  • --master_port=23456: The port number on the master node used for communication.

Now that the key parameters are explained, you can use the following commands to launch the distributed training job.

First, launch the master pod:

torchrun --nproc_per_node=1 --nnodes=3 --node_rank=0 --master_addr=192.168.1.5 --master_port=23456 main.py --backend=nccl --batch_size=128 --data_path=/mnt/storage/dataset/cifar10_imagefolder --num_train_epochs=1 --learning_rate=0.001 --num_workers=5  --print_interval=5 --output_dir /mnt/storage/

Then, launch worker pod 0:

torchrun --nproc_per_node=1 --nnodes=3 --node_rank=1 --master_addr=192.168.1.5 --master_port=23456 main.py --backend=nccl --batch_size=128 --data_path=/mnt/storage/dataset/cifar10_imagefolder --num_train_epochs=1 --learning_rate=0.001 --num_workers=5  --print_interval=5 --output_dir /mnt/storage/

Finally, launch worker pod 1:

torchrun --nproc_per_node=1 --nnodes=3 --node_rank=2 --master_addr=192.168.1.5 --master_port=23456 main.py --backend=nccl --batch_size=128 --data_path=/mnt/storage/dataset/cifar10_imagefolder --num_train_epochs=1 --learning_rate=0.001 --num_workers=5  --print_interval=5 --output_dir /mnt/storage/

This is what the result looks like (Figure 1). Notice the master IP address, the rank of each node, and the NCCL debug output that shows we are indeed using RoCE.

Terminal output showing ongoing distributed PyTorch training across three nodes, with NCCL debug information confirming the use of RoCE.
Figure 1: Ongoing distributed training.

Automate the training

Now that we have mastered the basic training and understand how all the components work together to execute the training task, we can take it a step further by automating the process.

To achieve this, we will use an entrypoint.sh shell script:

#!/bin/bash
# Master pod: publish its RoCE interface IP on shared storage; workers wait for it.
if [[ "$HOSTNAME" == *"master-0"* ]]; then
    # Remove a stale IP file left over from a previous run
    if [ -f $SHARED_PATH/master_ip ]; then
        rm $SHARED_PATH/master_ip
    fi

    MASTER_IP=$(ip -4 addr show ${NETWORK_INTERFACE} | grep -oP '(?<=inet\s)\d+(\.\d+){3}')
    echo "MASTER_IP set to $MASTER_IP (Master Node)"
    echo $MASTER_IP > $SHARED_PATH/master_ip
    export NODE_NUM=0
else
    # Workers wait until the master has written its IP, then derive their rank
    # from the trailing digits of the pod hostname (worker-0 -> rank 1, and so on).
    while [ ! -f $SHARED_PATH/master_ip ]; do
        echo "Waiting for master_ip file..."
        sleep 1
    done
    MASTER_IP=$(cat $SHARED_PATH/master_ip)
    export NODE_NUM=$(($(hostname | sed 's/[^0-9]*//g') + 1))
    echo "MASTER_IP set to $MASTER_IP (Worker Node)"
fi

torchrun --nproc_per_node=${NPROC_PER_NODE} \
        --nnodes=${NNODES} \
        --node_rank=${NODE_NUM} \
        --master_addr=${MASTER_IP} \
        --master_port=${MASTER_PORT} \
        main.py \
        --backend=${BACKEND} \
        --batch_size=${BATCH_SIZE} \
        --data_path=${DATA_PATH} \
        --num_train_epochs=${NUM_TRAIN_EPOCHS} \
        --learning_rate=${LEARNING_RATE} \
        --num_workers=${NUM_WORKERS} \
        --print_interval=${PRINT_INTERVAL} \
        --output_dir=${OUTPUT_DIR}

The script relies on environment variables assigned through the PyTorchJob YAML, so the same image and entrypoint can run unchanged on every replica.
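
For reference, here are the kinds of values the job needs to provide, shown as shell assignments for readability (illustrative values only, mirroring the manual torchrun command above; in practice they belong in the env section of the PyTorchJob YAML):

# Example values consumed by entrypoint.sh -- not a definitive list
export NPROC_PER_NODE=1
export NNODES=3
export MASTER_PORT=23456
export BACKEND=nccl
export BATCH_SIZE=128
export DATA_PATH=/mnt/storage/dataset/cifar10_imagefolder
export NUM_TRAIN_EPOCHS=1
export LEARNING_RATE=0.001
export NUM_WORKERS=5
export PRINT_INTERVAL=5
export OUTPUT_DIR=/mnt/storage/
export SHARED_PATH=/mnt/storage      # shared PVC used to pass the master IP (example path)
export NETWORK_INTERFACE=net1        # RoCE interface created by the SR-IOV attachment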

Note that we are using a simple file to share the master pod's RoCE interface IP address with the worker pods, which is why the file has to live on storage that all the pods mount.

This is what the final result looks like (below).

Command line used:

oc create -f pytorchjob_automated.yaml
pytorchjob.kubeflow.org/pytorch-training created 
oc get pod
NAME                       READY   STATUS    RESTARTS     AGE
pytorch-training-master-0  1/1     Running   0            6s
pytorch-training-worker-0  0/1     Init:0/1  0            6s
pytorch-training-worker-1  0/1     Init:0/1  0            6s
oc get pod
NAME                       READY   STATUS           RESTARTS     AGE
pytorch-training-master-0  1/1     Running          0            9s
pytorch-training-worker-0  0/1     PodInitializing  0            9s
pytorch-training-worker-1  1/1     Running          0            9s
oc get pod
NAME                       READY   STATUS    RESTARTS     AGE
pytorch-training-master-0  1/1     Running   0            12s
pytorch-training-worker-0  1/1     Running   0            12s
pytorch-training-worker-1  1/1     Running   0            12s
oc logs pytorch-training-master-0
MASTER_IP set to 192.168.1.4 (Master Node)

Notice that the prompt here has changed back to bbenshab@bbenshab1-mac pytourch_job %, meaning these commands are being run from the local machine again.

oc logs pytorch-training-master-0
MASTER_IP set to 192.168.1.4 (Master Node)
pytorch-training-master-0:100:100 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to net1
pytorch-training-master-0:100:100 [0] NCCL INFO Bootstrap : Using net1:192.168.1.4<0>
pytorch-training-master-0:100:100 [0] NCCL INFO cudaDriverVersion 12040
NCCL version 2.21.5+cuda12.4
pytorch-training-master-0:100:170 [0] NCCL INFO Plugin Path: /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
pytorch-training-master-0:100:170 [0] NCCL INFO P2P plugin IBext_v8
pytorch-training-master-0:100:170 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to net1
pytorch-training-master-0:100:170 [0] NCCL INFO NET/IB : Using [0]mlx5_6:1/RoCE [RO]; OOB net1:192.168.1.4<0>
pytorch-training-master-0:100:170 [0] NCCL INFO Using non-device net plugin version 0
pytorch-training-master-0:100:170 [0] NCCL INFO Using network IBext_v8
pytorch-training-master-0:100:170 [0] NCCL INFO ncclCommInitRank comm 0x5595fc9a5080 rank 0 nranks 3 cudaDev 0 nvmlDev 0 busId ca000 commId 0xbf91d1dadf65bc69 - Init START
pytorch-training-master-0:100:170 [0] NCCL INFO NCCL_IGNORE_CPU_AFFINITY set by environment to 1.
pytorch-training-master-0:100:170 [0] NCCL INFO Setting affinity for GPU 0 to aaaa,aaaaaaaa,aaaaaaaa,aaaaaaaa
pytorch-training-master-0:100:170 [0] NCCL INFO comm 0x5595fc9a5080 rank 0 nRanks 3 nNodes 3 localRanks 1 localRank 0 MNNVL 0
pytorch-training-master-0:100:170 [0] NCCL INFO Channel 00/02 : 0 1 2
pytorch-training-master-0:100:170 [0] NCCL INFO Channel 01/02 : 0 1 2
pytorch-training-master-0:100:170 [0] NCCL INFO Trees [0] 2/-1/-1->0->-1 [1] 2/-1/-1->0->1
pytorch-training-master-0:100:170 [0] NCCL INFO P2P Chunksize set to 131072
pytorch-training-master-0:100:170 [0] NCCL INFO Channel 00/0 : 2[0] -> 0[0] [receive] via NET/IBext_v8/0
pytorch-training-master-0:100:170 [0] NCCL INFO Channel 01/0 : 2[0] -> 0[0] [receive] via NET/IBext_v8/0
pytorch-training-master-0:100:170 [0] NCCL INFO Channel 00/0 : 0[0] -> 1[0] [send] via NET/IBext_v8/0
pytorch-training-master-0:100:170 [0] NCCL INFO Channel 01/0 : 0[0] -> 1[0] [send] via NET/IBext_v8/0
pytorch-training-master-0:100:172 [0] NCCL INFO NCCL_IB_QPS_PER_CONNECTION set by environment to 1.
pytorch-training-master-0:100:170 [0] NCCL INFO Connected all rings
pytorch-training-master-0:100:170 [0] NCCL INFO Channel 01/0 : 1[0] -> 0[0] [receive] via NET/IBext_v8/0
pytorch-training-master-0:100:170 [0] NCCL INFO Channel 00/0 : 0[0] -> 2[0] [send] via NET/IBext_v8/0
pytorch-training-master-0:100:170 [0] NCCL INFO Channel 01/0 : 0[0] -> 2[0] [send] via NET/IBext_v8/0
pytorch-training-master-0:100:170 [0] NCCL INFO Connected all trees
pytorch-training-master-0:100:170 [0] NCCL INFO threadThresholds 8/8/64 | 24/8/64 | 512 | 512
pytorch-training-master-0:100:170 [0] NCCL INFO 2 coll channels, 2 collnet channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
pytorch-training-master-0:100:170 [0] NCCL INFO NCCL_CROSS_NIC set by environment to 0.
pytorch-training-master-0:100:170 [0] NCCL INFO TUNER/Plugin: Failed to find ncclTunerPlugin_v2, using internal tuner instead.
pytorch-training-master-0:100:170 [0] NCCL INFO ncclCommInitRank comm 0x5595fc9a5080 rank 0 nranks 3 cudaDev 0 nvmlDev 0 busId ca000 commId 0xbf91d1dadf65bc69 - Init COMPLETE
epoch, batch_idx, num_samples, percentage_completed, loss, time_elapsed, learning_rate, accuracy, gpu_utilization, gpu_memory_usage, cpu_usage, mem_usage
0,0,128,0.26%,7.184689,1.24,0.001000,0.00,99.0,15613.0,29.27,2768.22
0,5,768,1.54%,2.276664,2.92,0.001000,17.19,90.0,15613.0,33.62,2832.89
0,10,1408,2.82%,2.416407,4.60,0.001000,20.67,100.0,15613.0,38.0,2763.32
bbenshab@bbenshab1-mac pytourch_job %

Training tuning

The chart below emphasizes the significance of NCCL tuning tailored to both the specific dataset and the hardware in use. These NCCL configurations are applied through environment variables within the training pods.

For this training, I applied the following NCCL settings:

  • NCCL_ALGO=Tree,NVLSTree: Works well for small-image workloads such as CIFAR-10.
  • NCCL_IGNORE_CPU_AFFINITY=1: Disables CPU affinity settings to improve performance.
  • NCCL_CROSS_NIC=0: Disables usage of multiple network interface controllers (NICs) for NCCL communication (relevant when more than one NIC is present).
  • NCCL_IB_QPS_PER_CONNECTION=1: Sets the number of InfiniBand queue pairs per connection to 1.

Additionally, I set the following torchrun arguments (a combined sketch of these settings follows the list):

  • backend=nccl: We have to make sure that NCCL (and not gloo) is used and the required NCCL libraries are installed; this will have a significant impact on performance.
  • batch_size=128: In my case, with the A30 I could go up to 192, but I found that with 3 nodes, 128 works best.
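
If you want to experiment with these values before baking them into the job definition, they can be exported in a pod shell ahead of torchrun (a sketch; the same variables can equally be added to the env block of the PyTorchJob YAML):

# NCCL reads these settings from the environment at startup
export NCCL_ALGO=Tree,NVLSTree
export NCCL_IGNORE_CPU_AFFINITY=1
export NCCL_CROSS_NIC=0
export NCCL_IB_QPS_PER_CONNECTION=1
# Relaunch the training with the tuned batch size (master shown; adjust --node_rank per pod)
torchrun --nproc_per_node=1 --nnodes=3 --node_rank=0 --master_addr=192.168.1.5 --master_port=23456 \
  main.py --backend=nccl --batch_size=128 --data_path=/mnt/storage/dataset/cifar10_imagefolder \
  --num_train_epochs=1 --learning_rate=0.001 --num_workers=5 --print_interval=5 --output_dir /mnt/storage/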

As shown in Figure 2, the difference between tuned and untuned settings is significant, reducing the training duration by 64 seconds per epoch.

A bar chart comparing the training times per epoch with and without NCCL tuning, showing a reduction of 64 seconds per epoch when tuned.
Figure 2: Tuned NCCL vs default settings.

Multi-node training 

Figure 3 shows how long the training took to complete on one, two, and three nodes: 126 seconds on a single node, 64 seconds on two nodes, and 44 seconds on three nodes. The speedup is therefore not perfectly linear; two nodes do not quite halve the single-node time, and three nodes do not quite cut it to a third. Because CIFAR-10 is such a tiny dataset, two things can occur:

  • Since we are not keeping the GPUs busy all the time, we utilize the GPU less efficiently, resulting in slower training. In addition, the overhead of synchronizing and managing the data across multiple GPUs can sometimes outweigh the benefits of parallelism when the dataset is too small. This can result in suboptimal GPU utilization and slower training times per epoch compared to using a larger dataset.
  •  For very small datasets, the advantages of distributing the dataset can be diminished because the time spent on communication and synchronization between nodes may not be justified by the small amount of data processed. Additionally, if the dataset is extremely small, there might not be enough data to distribute evenly, leading to inefficiencies.
A chart comparing training completion times on one, two, and three nodes. The chart shows faster training as more nodes are added, but also shows diminishing returns when using smaller datasets.
Figure 3: Distributed comparison by the number of nodes.

Figure 4 illustrates the above statement, showing that with a small dataset, increasing the number of GPUs results in less efficient GPU utilization.

A graph showing GPU utilization efficiency when training with a small dataset across multiple nodes. The graph demonstrates less efficient utilization as the number of GPUs increases, and the dataset is too small to be distributed efficiently across nodes.
Figure 4: GPU utilization comparison.

Conclusions

In this learning path, we’ve journeyed through the process of running PyTorch distributed training on OpenShift, and it’s clear that OpenShift is up to the task. Here’s a quick recap of what makes it such a powerful platform for AI training:

  • OpenShift shines with its flexibility. You can easily plug in various tools and frameworks, making it a go-to platform for training any model. Whether you’re working with images, large language models, or other neural networks, OpenShift has got you covered.
  • With OpenShift’s operators, managing and deploying AI training jobs becomes a breeze. Setting up your training pipeline is as simple as applying a few YAML files and running a couple of CLI commands. It takes the complexity out of managing training workflows.
  • When it comes to resource management, OpenShift really stands out. It helps you efficiently scale your training jobs across multiple nodes and GPUs. This means you can handle more intensive tasks without breaking a sweat, ensuring your training runs smoothly.

In a nutshell, OpenShift offers a robust, user-friendly environment for AI training. Its flexibility, ease of use, and resource management capabilities make it an excellent choice for both new and seasoned data scientists.
