Run distributed AI training on OpenShift
In this lesson, you will configure and execute a distributed AI training job on OpenShift using PyTorch. We'll walk through the necessary steps to set up storage permissions, download datasets, create PyTorch jobs, and finally automate the training process across multiple nodes.
To get the full benefit from this lesson, you will need:
- A running OpenShift cluster with GPUs.
- The required operators (OpenShift AI, Service Mesh) deployed.
- Basic familiarity with Kubernetes, PyTorch, and YAML configurations.
- Storage configurations in place from the previous lessons (LVM, NFS, or OpenShift Data Foundation).
In this lesson, you will:
- Set up permissions for PyTorch pods to access storage.
- Download and prepare the CIFAR-10 dataset for training.
- Create and launch a distributed PyTorch training job.
- Automate the distributed training process across multiple nodes.
Manual training
In this section, we’ll assign the necessary storage access privileges to our PyTorch pods.
Give storage permissions
Below is the ClusterRole (and its ClusterRoleBinding) I used to give the PyTorch pods the privileges they need to access PersistentVolumes and PersistentVolumeClaims:
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: training-operator-cluster-role
rules:
  - apiGroups: [""]
    resources: ["persistentvolumes", "persistentvolumeclaims", "services", "endpoints", "events"]
    verbs: ["get", "list", "watch", "create", "delete", "update", "patch"]
  - apiGroups: ["storage.k8s.io"]
    resources: ["storageclasses"]
    verbs: ["get", "list", "watch"]
  - apiGroups: ["extensions"]
    resources: ["podsecuritypolicies"]
    verbs: ["use"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: training-operator-cluster-role-binding
subjects:
  - kind: ServiceAccount
    name: kubeflow-training-operator
    namespace: redhat-ods-applications
roleRef:
  kind: ClusterRole
  name: training-operator-cluster-role
  apiGroup: rbac.authorization.k8s.io
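Assuming you saved both resources above to a single file (the file name below is just an example), you can apply them and spot-check the resulting permissions like this:
oc apply -f training-operator-rbac.yaml
# Verify that the training operator's service account can now manage PVCs
oc auth can-i list persistentvolumeclaims \
  --as=system:serviceaccount:redhat-ods-applications:kubeflow-training-operator \
  -n pytorch-workspace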
Create the dataset
Now that all the operators are set up, we can download the dataset to the PVC we created earlier. I’m using a pod to download and extract the CIFAR-10 dataset from the University of Toronto to the res50-storage PVC. Once the download is complete, we can delete the pod.
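If the res50-storage PVC does not already exist from a previous lesson, a minimal sketch of the claim looks like the following. The access mode, size, and (commented-out) storage class are assumptions; adjust them to your LVM, NFS, or OpenShift Data Foundation setup. The claim is mounted by several pods later on, which is why ReadWriteMany is used here.
oc apply -n pytorch-workspace -f - <<'EOF'
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: res50-storage
spec:
  accessModes:
    - ReadWriteMany        # shared by the downloader pod and all training pods
  resources:
    requests:
      storage: 50Gi        # assumption; CIFAR-10 plus training outputs fits easily
  # storageClassName: <your storage class from the previous lesson>
EOF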
Here is what the pod YAML looks like:
apiVersion: v1
kind: Pod
metadata:
  name: dataset-downloader
  namespace: pytorch-workspace
spec:
  containers:
    - name: downloader
      image: alpine
      command: ["/bin/sh", "-c"]
      args:
        - |
          apk add --no-cache curl && \
          mkdir -p /mnt/data && \
          cd /mnt/data && \
          curl -L -O https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz && \
          tar -xvzf cifar-10-python.tar.gz
          mv cifar-10-batches-py dataset
          mkdir output
      volumeMounts:
        - mountPath: /mnt/data
          name: dataset-storage
  restartPolicy: OnFailure
  volumes:
    - name: dataset-storage
      persistentVolumeClaim:
        claimName: res50-storage
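A minimal sketch of creating the downloader pod and cleaning it up afterwards (the file name is an assumption):
oc create -f dataset-downloader.yaml
# Follow the download and extraction, then remove the pod once it completes
oc logs -f dataset-downloader -n pytorch-workspace
oc get pod dataset-downloader -n pytorch-workspace
oc delete pod dataset-downloader -n pytorch-workspace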
Create the PyTorch job
The basic PyTorchJob example below launches 3 pods by pulling the premade image I created on Quay.io using this Dockerfile. The image contains PyTorch, the NVIDIA Collective Communications Library (NCCL), the OpenFabrics Enterprise Distribution (OFED) drivers, and some network tools.
Note that in the YAML below I use annotations to attach the SR-IOV port to the pod:
annotations: &annotations
  k8s.v1.cni.cncf.io/networks: |-
    [
      {
        "name": "network-port-1",
        "namespace": "default"
      }
    ]
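If you want to confirm that the SR-IOV network referenced in the annotation actually exists, you can list the network attachment definitions in the target namespace (network-port-1 comes from the earlier networking setup):
oc get network-attachment-definitions -n default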
I also point to the service account (pytorch-sa) that is bound to the SCC I used earlier:
serviceAccountName: pytorch-sa
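If the service account is not already in place, a sketch of creating it and binding it to an SCC follows; privileged is only an example here, so use whichever SCC you configured in the earlier lesson:
oc create serviceaccount pytorch-sa -n pytorch-workspace
oc adm policy add-scc-to-user privileged -z pytorch-sa -n pytorch-workspace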
Here is a full YAML example:
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: pytorch-training
  namespace: pytorch-workspace
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: Never
      template:
        metadata:
          annotations: &annotations
            k8s.v1.cni.cncf.io/networks: |-
              [
                {
                  "name": "network-port-1",
                  "namespace": "default"
                }
              ]
        spec:
          serviceAccountName: pytorch-sa
          containers:
            - name: pytorch
              image: path.to.image:latest
              imagePullPolicy: Always
              env: &env
                - name: NCCL_DEBUG
                  value: "INFO"
                - name: NCCL_IGNORE_CPU_AFFINITY
                  value: "1"
                - name: NCCL_CROSS_NIC
                  value: "0"
                - name: NCCL_SOCKET_IFNAME
                  value: "net1"
                - name: NCCL_IB_QPS_PER_CONNECTION
                  value: "1"
              volumeMounts: &vol_mount
                - name: storage-volume
                  mountPath: /mnt/storage
                - name: shm-volume
                  mountPath: /dev/shm
              resources: &resource_req
                requests:
                  nvidia.com/gpu: "1"
                  memory: "10Gi"
                  cpu: "1"
                  openshift.io/port1: "1"
                limits:
                  nvidia.com/gpu: "1"
                  openshift.io/port1: "1"
              securityContext: &sec_profile
                runAsUser: 0
                runAsGroup: 0
                capabilities:
                  add: ["IPC_LOCK", "SYS_RESOURCE", "NET_RAW", "SYS_CHROOT", "AUDIT_WRITE"]
          volumes: &vols
            - name: storage-volume
              persistentVolumeClaim:
                claimName: res50-storage
            - name: shm-volume
              emptyDir:
                medium: Memory
                sizeLimit: 10Gi
    Worker:
      replicas: 2
      restartPolicy: Never
      template:
        metadata:
          annotations: *annotations
        spec:
          serviceAccountName: pytorch-sa
          containers:
            - name: pytorch
              image: path.to.image:latest
              imagePullPolicy: Always
              env: *env
              volumeMounts: *vol_mount
              resources: *resource_req
              securityContext: *sec_profile
          volumes: *vols
Now, using the above YAML, we just need to create the PyTorchJob:
oc create -f pytorchjob.yaml
pytorchjob.kubeflow.org/pytorch-training created
oc get pod
NAME READY STATUS RESTARTS AGE
pytorch-training-master-0 1/1 Running 0 3s
pytorch-training-worker-0 0/1 Init:0/1 0 3s
pytorch-training-worker-1 0/1 Init:0/1 0 3s
oc get pod
NAME READY STATUS RESTARTS AGE
pytorch-training-master-0 1/1 Running 0 7s
pytorch-training-worker-0 0/1 Init:0/1 0 7s
pytorch-training-worker-1 0/1 Init:0/1 0 7s
oc get pod
NAME READY STATUS RESTARTS AGE
pytorch-training-master-0 1/1 Running 0 11s
pytorch-training-worker-0 1/1 Running 0 11s
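Besides watching the pods, you can also check the PyTorchJob object itself to see the overall state of the run:
oc get pytorchjobs -n pytorch-workspace
oc describe pytorchjob pytorch-training -n pytorch-workspace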
Convert the dataset to images
First, we need to convert the CIFAR-10 dataset to images. I used this Python script to do so:
import os
import pickle
import numpy as np
from PIL import Image
from tqdm import tqdm


def load_cifar_batch(file):
    try:
        with open(file, 'rb') as fo:
            batch = pickle.load(fo, encoding='bytes')
        return batch
    except FileNotFoundError:
        print(f"File {file} not found.")
        return None
    except Exception as e:
        print(f"Error loading file {file}: {e}")
        return None


def convert_cifar10_to_imagefolder(root_dir, output_dir):
    # Create one directory per class for the train and test splits
    meta = load_cifar_batch(os.path.join(root_dir, 'batches.meta'))
    if meta is None:
        return
    classes = [label.decode('utf-8') for label in meta[b'label_names']]
    for cls in classes:
        os.makedirs(os.path.join(output_dir, 'train', cls), exist_ok=True)
        os.makedirs(os.path.join(output_dir, 'test', cls), exist_ok=True)

    # Convert the training batches
    train_batches = [f for f in os.listdir(root_dir) if f.startswith('data_batch_')]
    for batch_file in train_batches:
        data_dict = load_cifar_batch(os.path.join(root_dir, batch_file))
        if data_dict is None:
            continue
        data = data_dict[b'data']
        labels = data_dict[b'labels']
        batch_number = batch_file.split('_')[-1]
        for i in tqdm(range(len(data)), desc=f'Converting {batch_file}'):
            img = data[i]
            img = img.reshape(3, 32, 32).transpose(1, 2, 0)
            img = Image.fromarray(img)
            label = labels[i]
            img.save(os.path.join(output_dir, 'train', classes[label], f'batch{batch_number}_img{i}.png'))

    # Convert the test batch
    test_dict = load_cifar_batch(os.path.join(root_dir, 'test_batch'))
    if test_dict is None:
        return
    data = test_dict[b'data']
    labels = test_dict[b'labels']
    for i in tqdm(range(len(data)), desc='Converting test batch'):
        img = data[i]
        img = img.reshape(3, 32, 32).transpose(1, 2, 0)
        img = Image.fromarray(img)
        label = labels[i]
        img.save(os.path.join(output_dir, 'test', classes[label], f'test_img{i}.png'))


if __name__ == "__main__":
    root_dir = '/mnt/storage/cifar-10-batches-py'
    output_dir = '/mnt/storage/dataset/cifar10_imagefolder'
    convert_cifar10_to_imagefolder(root_dir, output_dir)
Command line used:
python convert.py
ls -tlr
total 20
-rw-r--r--. 1 root root 8320 Jul 22 18:41 main.py
-rwxr-xr-x 1 root root 1352 Jul 22 19:14 entrypoint.sh
-rw-r--r-- 1 root root 2326 Jul 28 18:00 convert.py
python convert.py
Converting data_batch_4: 100%| | 10000/10000 [00:07<00:00, 1299.51it/s]
Converting data_batch_3: 100%| | 10000/10000 [00:07<00:00, 1295.86it/s]
Converting data_batch_2: 100%| | 10000/10000 [00:07<00:00, 1329.61it/s]
Converting data_batch_5: 100%| | 10000/10000 [00:07<00:00, 1316.59it/s]
Converting data_batch_1: 100%| | 10000/10000 [00:07<00:00, 1321.07it/s]
Converting test batch: 100%| | 10000/10000 [00:07<00:00, 1325.27it/s]
Note that the prompt has changed to root@pytorch-training-master-0:/workspace/training# in the above example.
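If you want to open the same shell yourself and sanity-check the converted dataset, something like the following should work:
oc rsh -n pytorch-workspace pytorch-training-master-0
# Inside the pod: one directory per CIFAR-10 class should exist under train/ and test/
ls /mnt/storage/dataset/cifar10_imagefolder/train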
Launch the training
Now that we have the pods running and the dataset converted to images, we need to find the master pod's InfiniBand (IB) IP address that was assigned through the SR-IOV DHCP CNI. Once we have that address, we can run the PyTorch training script on each of the pods.
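One way to look up that address is to query the SR-IOV interface (net1, per NCCL_SOCKET_IFNAME above) inside the master pod:
oc exec -n pytorch-workspace pytorch-training-master-0 -- ip -4 addr show net1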
Here is a short explanation about the torchrun arguments I used below:
- --nproc_per_node=1: Specifies the number of processes to run per node. In this case, I only have 1 GPU per node.
- --nnodes=3: Indicates the number of nodes (machines) involved in the distributed training.
- --node_rank=0: Specifies the rank (ID) of the current node. In a multi-node setup, this helps identify each node’s role. I used 0 to represent the master node.
- --master_addr=192.168.1.5: The IP address of the master node. The workers connect to this address to coordinate the distributed job.
- --master_port=23456: The port number on the master node used for communication.
Now that the key parameters are explained, you can use the following commands to launch the distributed training job.
First, launch the master pod:
torchrun --nproc_per_node=1 --nnodes=3 --node_rank=0 --master_addr=192.168.1.5 --master_port=23456 main.py --backend=nccl --batch_size=128 --data_path=/mnt/storage/dataset/cifar10_imagefolder --num_train_epochs=1 --learning_rate=0.001 --num_workers=5 --print_interval=5 --output_dir /mnt/storage/
Then, launch worker pod 0:
torchrun --nproc_per_node=1 --nnodes=3 --node_rank=1 --master_addr=192.168.1.5 --master_port=23456 main.py --backend=nccl --batch_size=128 --data_path=/mnt/storage/dataset/cifar10_imagefolder --num_train_epochs=1 --learning_rate=0.001 --num_workers=5 --print_interval=5 --output_dir /mnt/storage/
Finally, launch worker pod 1:
torchrun --nproc_per_node=1 --nnodes=3 --node_rank=2 --master_addr=192.168.1.5 --master_port=23456 main.py --backend=nccl --batch_size=128 --data_path=/mnt/storage/dataset/cifar10_imagefolder --num_train_epochs=1 --learning_rate=0.001 --num_workers=5 --print_interval=5 --output_dir /mnt/storage/
This is what the result looks like (Figure 1). Notice the master IP address, the rank of each node, and the NCCL debug output that shows we are indeed using RoCE.
Automate the training
Now that we have mastered the basic training and understand how all the components work together to execute the training task, we can take it a step further by automating the process.
To achieve this, we will use an entrypoint.sh shell script:
#!/bin/bash
# The master pod publishes its RoCE/IB address to shared storage;
# the workers wait for that file, then every pod launches torchrun with its own rank.
if [[ "$HOSTNAME" == *"master-0"* ]]; then
  if [ -f $SHARED_PATH/master_ip ]; then
    rm $SHARED_PATH/master_ip
  fi
  MASTER_IP=$(ip -4 addr show ${NETWORK_INTERFACE} | grep -oP '(?<=inet\s)\d+(\.\d+){3}')
  echo "MASTER_IP set to $MASTER_IP (Master Node)"
  echo $MASTER_IP > $SHARED_PATH/master_ip
  export NODE_NUM=0
else
  while [ ! -f $SHARED_PATH/master_ip ]; do
    echo "Waiting for master_ip file..."
    sleep 1
  done
  MASTER_IP=$(cat $SHARED_PATH/master_ip)
  # Derive the rank from the worker's hostname suffix (worker-0 -> rank 1, worker-1 -> rank 2)
  export NODE_NUM=$(($(hostname | sed 's/[^0-9]*//g') + 1))
  echo "MASTER_IP set to $MASTER_IP (Worker Node)"
fi

torchrun --nproc_per_node=${NPROC_PER_NODE} \
  --nnodes=${NNODES} \
  --node_rank=${NODE_NUM} \
  --master_addr=${MASTER_IP} \
  --master_port=${MASTER_PORT} \
  main.py \
  --backend=${BACKEND} \
  --batch_size=${BATCH_SIZE} \
  --data_path=${DATA_PATH} \
  --num_train_epochs=${NUM_TRAIN_EPOCHS} \
  --learning_rate=${LEARNING_RATE} \
  --num_workers=${NUM_WORKERS} \
  --print_interval=${PRINT_INTERVAL} \
  --output_dir=${OUTPUT_DIR}
The script reads its settings from environment variables assigned in the automated PyTorchJob YAML, so the same image and entrypoint run unchanged on every pod.
Note that we use a simple file to share the master's RoCE interface IP address with all the worker pods, which is why it must be written to shared storage.
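For reference, these are the environment variables the script expects; in the automated job they are set in the env: section of the PyTorchJob YAML. The names come from entrypoint.sh, and the values below simply mirror the manual torchrun invocation, so treat them as a starting point rather than a prescription:
NPROC_PER_NODE=1
NNODES=3
MASTER_PORT=23456
NETWORK_INTERFACE=net1                              # SR-IOV/RoCE interface inside the pod
SHARED_PATH=/mnt/storage                            # shared PVC used to publish the master IP
BACKEND=nccl
BATCH_SIZE=128
DATA_PATH=/mnt/storage/dataset/cifar10_imagefolder
NUM_TRAIN_EPOCHS=1
LEARNING_RATE=0.001
NUM_WORKERS=5
PRINT_INTERVAL=5
OUTPUT_DIR=/mnt/storage/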
This is what the final result looks like (below).
Command line used:
oc create -f pytorchjob_automated.yaml
pytorchjob.kubeflow.org/pytorch-training created
oc get pod
NAME READY STATUS RESTARTS AGE
pytorch-training-master-0 1/1 Running 0 6s
pytorch-training-worker-0 0/1 Init:0/1 0 6s
pytorch-training-worker-1 0/1 Init:0/1 0 6s
oc get pod
NAME READY STATUS RESTARTS AGE
pytorch-training-master-0 1/1 Running 0 9s
pytorch-training-worker-0 0/1 PodInitializing 0 9s
pytorch-training-worker-1 1/1 Running 0 9s
oc get pod
NAME READY STATUS RESTARTS AGE
pytorch-training-master-0 1/1 Running 0 12s
pytorch-training-worker-0 1/1 Running 0 12s
pytorch-training-worker-1 1/1 Running 0 12s
oc logs pytorch-training-master-0
MASTER_IP set to 192.168.1.4 (Master Node)
Take notice that the prompt here has changed back to bbenshab@bbenshab1-mac pytourch_job %.
oc get pod
NAME READY STATUS RESTARTS AGE
pytorch-training-master-0 1/1 Running 0 12s
pytorch-training-worker-0 1/1 Running 0 12s
pytorch-training-worker-1 1/1 Running 0 12s
oc logs pytorch-training-master-0
MASTER_IP set to 192.168.1.4 (Master Node)
pytorch-training-master-0:100:100 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to net1
pytorch-training-master-0:100:100 [0] NCCL INFO Bootstrap : Using net1:192.168.1.4<0>
pytorch-training-master-0:100:100 [0] NCCL INFO cudaDriverVersion 12040
NCCL version 2.21.5+cuda12.4
pytorch-training-master-0:100:170 [0] NCCL INFO Plugin Path: /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
pytorch-training-master-0:100:170 [0] NCCL INFO P2P plugin IBext_v8
pytorch-training-master-0:100:170 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to net1
pytorch-training-master-0:100:170 [0] NCCL INFO NET/IB : Using [0]mlx5_6:1/RoCE [RO]; OOB net1:192.168.1.4<0>
pytorch-training-master-0:100:170 [0] NCCL INFO Using non-device net plugin version 0
pytorch-training-master-0:100:170 [0] NCCL INFO Using network IBext_v8
pytorch-training-master-0:100:170 [0] NCCL INFO ncclCommInitRank comm 0x5595fc9a5080 rank 0 nranks 3 cudaDev 0 nvmlDev 0 busId ca000 commId 0xbf91d1dadf65bc69 - Init START
pytorch-training-master-0:100:170 [0] NCCL INFO NCCL_IGNORE_CPU_AFFINITY set by environment to 1.
pytorch-training-master-0:100:170 [0] NCCL INFO Setting affinity for GPU 0 to aaaa,aaaaaaaa,aaaaaaaa,aaaaaaaa
pytorch-training-master-0:100:170 [0] NCCL INFO comm 0x5595fc9a5080 rank 0 nRanks 3 nNodes 3 localRanks 1 localRank 0 MNNVL 0
pytorch-training-master-0:100:170 [0] NCCL INFO Channel 00/02 : 0 1 2
pytorch-training-master-0:100:170 [0] NCCL INFO Channel 01/02 : 0 1 2
pytorch-training-master-0:100:170 [0] NCCL INFO Trees [0] 2/-1/-1->0->-1 [1] 2/-1/-1->0->1
pytorch-training-master-0:100:170 [0] NCCL INFO P2P Chunksize set to 131072
pytorch-training-master-0:100:170 [0] NCCL INFO Channel 00/0 : 2[0] -> 0[0] [receive] via NET/IBext_v8/0
pytorch-training-master-0:100:170 [0] NCCL INFO Channel 01/0 : 2[0] -> 0[0] [receive] via NET/IBext_v8/0
pytorch-training-master-0:100:170 [0] NCCL INFO Channel 00/0 : 0[0] -> 1[0] [send] via NET/IBext_v8/0
pytorch-training-master-0:100:170 [0] NCCL INFO Channel 01/0 : 0[0] -> 1[0] [send] via NET/IBext_v8/0
pytorch-training-master-0:100:172 [0] NCCL INFO NCCL_IB_QPS_PER_CONNECTION set by environment to 1.
pytorch-training-master-0:100:170 [0] NCCL INFO Connected all rings
pytorch-training-master-0:100:170 [0] NCCL INFO Channel 01/0 : 1[0] -> 0[0] [receive] via NET/IBext_v8/0
pytorch-training-master-0:100:170 [0] NCCL INFO Channel 00/0 : 0[0] -> 2[0] [send] via NET/IBext_v8/0
pytorch-training-master-0:100:170 [0] NCCL INFO Channel 01/0 : 0[0] -> 2[0] [send] via NET/IBext_v8/0
pytorch-training-master-0:100:170 [0] NCCL INFO Connected all trees
pytorch-training-master-0:100:170 [0] NCCL INFO threadThresholds 8/8/64 | 24/8/64 | 512 | 512
pytorch-training-master-0:100:170 [0] NCCL INFO 2 coll channels, 2 collnet channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
pytorch-training-master-0:100:170 [0] NCCL INFO NCCL_CROSS_NIC set by environment to 0.
pytorch-training-master-0:100:170 [0] NCCL INFO TUNER/Plugin: Failed to find ncclTunerPlugin_v2, using internal tuner instead.
pytorch-training-master-0:100:170 [0] NCCL INFO ncclCommInitRank comm 0x5595fc9a5080 rank 0 nranks 3 cudaDev 0 nvmlDev 0 busId ca000 commId 0xbf91d1dadf65bc69 - Init COMPLETE
epoch, batch_idx, num_samples, percentage_completed, loss, time_elapsed, learning_rate, accuracy, gpu_utilization, gpu_memory_usage, cpu_usage, mem_usage
0,0,128,0.26%,7.184689,1.24,0.001000,0.00,99.0,15613.0,29.27,2768.22
0,5,768,1.54%,2.276664,2.92,0.001000,17.19,90.0,15613.0,33.62,2832.89
oc logs pytorch-training-master-0
MASTER_IP set to 192.168.1.4 (Master Node)
(NCCL initialization output identical to the previous call, omitted)
epoch, batch_idx, num_samples, percentage_completed, loss, time_elapsed, learning_rate, accuracy, gpu_utilization, gpu_memory_usage, cpu_usage, mem_usage
0,0,128,0.26%,7.184689,1.24,0.001000,0.00,99.0,15613.0,29.27,2768.22
0,5,768,1.54%,2.276664,2.92,0.001000,17.19,90.0,15613.0,33.62,2832.89
0,10,1408,2.82%,2.416407,4.60,0.001000,20.67,100.0,15613.0,38.0,2763.32
bbenshab@bbenshab1-mac pytourch_job %
Training tuning
The chart below emphasizes the significance of NCCL tuning tailored to both the specific dataset and the hardware in use. These NCCL configurations are applied through environment variables within the training pods.
For this training, I applied the following NCCL settings:
- NCCL_ALGO=Tree,NVLSTree: Great for small image processing.
- NCCL_IGNORE_CPU_AFFINITY=1: Disables CPU affinity settings to improve performance.
- NCCL_CROSS_NIC=0: Disables usage of multiple network interface controllers (NICs) for NCCL communication (when using more than 1 NIC).
- NCCL_IB_QPS_PER_CONNECTION=1: Sets the number of queue pairs per InfiniBand connection to 1.
Additionally, I have set the following torchrun arguments:
- backend=nccl: We have to make sure that NCCL (and not Gloo) is used and that the required NCCL libraries are installed; this has a significant impact on performance.
- batch_size=128: In my case, with the A30 I could go up to 192, but I found that with 3 nodes, 128 works best.
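Expressed as environment variables (the same way the other NCCL settings are injected into the training pods), the tuning above looks like this:
NCCL_ALGO=Tree,NVLSTree
NCCL_IGNORE_CPU_AFFINITY=1
NCCL_CROSS_NIC=0
NCCL_IB_QPS_PER_CONNECTION=1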
As shown in Figure 2, the difference between tuned and untuned settings is significant, reducing the training duration by 64 seconds per epoch.
Multi-node training
Figure 3 shows how long training took on 1 to 3 nodes: 126 seconds on a single node, 64 seconds on 2 nodes, and 44 seconds on 3 nodes. In other words, scaling is not perfectly linear: 2 nodes do not halve the single-node time, and 3 nodes do not cut it to a third. Because we are using CIFAR-10, which is a tiny dataset, two things can occur:
- Since we are not keeping the GPUs busy all the time, we utilize the GPU less efficiently, resulting in slower training. In addition, the overhead of synchronizing and managing the data across multiple GPUs can sometimes outweigh the benefits of parallelism when the dataset is too small. This can result in suboptimal GPU utilization and slower training times per epoch compared to using a larger dataset.
- For very small datasets, the advantages of distributing the dataset can be diminished because the time spent on communication and synchronization between nodes may not be justified by the small amount of data processed. Additionally, if the dataset is extremely small, there might not be enough data to distribute evenly, leading to inefficiencies.
Figure 4 illustrates the above statement, showing that with a small dataset, increasing the number of GPUs results in less efficient GPU utilization.
Conclusions
In this learning path, we’ve walked through the process of running PyTorch distributed training on OpenShift. Here’s a quick recap of what makes OpenShift such a powerful platform for AI training:
- OpenShift shines with its flexibility. You can easily plug in various tools and frameworks, making it a go-to platform for training any model. Whether you’re working with images, large language models, or other neural networks, OpenShift has got you covered.
- With OpenShift’s operators, managing and deploying AI training jobs becomes a breeze. Setting up your training pipeline is as simple as applying a few YAML files and running a couple of CLI commands. It takes the complexity out of managing training workflows.
- When it comes to resource management, OpenShift really stands out. It helps you efficiently scale your training jobs across multiple nodes and GPUs. This means you can handle more intensive tasks without breaking a sweat, ensuring your training runs smoothly.
In a nutshell, OpenShift offers a robust, user-friendly environment for AI training. Its flexibility, ease of use, and resource management capabilities make it an excellent choice for both new and seasoned data scientists.
Additional resources
- NVIDIA GPU Operator: https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/index.html
- NVIDIA Network Operator: https://docs.nvidia.com/networking/display/cokan10/network+operator