Set up storage for distributed AI training on OpenShift
In this lesson, we’ll explore how to set up storage tailored to your specific OpenShift cluster configuration. Whether you are working with a single node, 2 nodes, or a multi-node environment, you’ll learn how to provision storage for your AI training workloads.
Prerequisites:
- A running OpenShift cluster (single or multiple nodes).
- Basic knowledge of OpenShift storage concepts, including PersistentVolumes (PVs) and PersistentVolumeClaims (PVCs).
In this lesson, you will:
- Explore different storage solutions for distributed AI training on OpenShift.
- Set up storage tailored to your cluster size (single node, 2 nodes, or 3+ nodes).
- Deploy storage operators like logical volume manager (LVM), network file system (NFS), or Red Hat OpenShift Data Foundation.
Storage deployment
We need storage to store our dataset/training data. Preferably, we should share the same PVC across all the training pods so we won't need to have multiple copies of the same data. There are a few options when it comes to storage, depending on the number of nodes we have available:
- Single node: If we have a single node (SNO) with multiple GPUs, just using an LVM PVC will do, since with an LVM PVC we can share the dataset across all the pods on the same node.
- 2 nodes: If we have 2 nodes (SNO+Worker node), we can use the NFS Operator in order to share that PVC on multiple nodes. However, note that the NFS Operator is not officially supported.
- 3 nodes: If we have 3 or more nodes, the best option is using OpenShift Data Foundation because it also provides us with data redundancy, and the last thing we want to happen is to lose all our training data due to HW failure.
LVM Storage Operator (recommended for single node)
We use the LVM Storage Operator as the storage backend for the dataset. When deploying the LVM Operator, you can create an LVMCluster that automatically provisions every unused disk on all nodes, or you can write your own LVM configuration and target specific disks.
Note that the YAML file requires the full disk path. The example below shows how to find an unused disk on a node:
oc get node
NAME                                               STATUS   ROLES                         AGE   VERSION
bb37-h15-000-r750.rdu3.labs.perfscale.redhat.com   Ready    worker                        17d   v1.28.7+f1b5f6c
cc37-h13-000-r750.rdu3.labs.perfscale.redhat.com   Ready    control-plane,master,worker   83d   v1.28.7+f1b5f6c
cc37-h15-000-r750.rdu3.labs.perfscale.redhat.com   Ready    worker                        83d   v1.28.7+f1b5f6c

oc debug node/bb37-h15-000-r750.rdu3.labs.perfscale.redhat.com
Temporary namespace openshift-debug-ht426 is created for debugging node...
Starting pod/bb37-h15-000-r750rdu3labsperfscaleredhatcom-debug-rm4mg ...
To use host binaries, run `chroot /host`
Pod IP: 10.6.60.147
If you don't see a command prompt, try pressing enter.

chroot /host
lsblk
NAME      MAJ:MIN RM   SIZE RO TYPE MOUNTPOINTS
sda         8:0    0 447.1G  0 disk
├─sda1      8:1    0     1M  0 part
├─sda2      8:2    0   127M  0 part
├─sda3      8:3    0   384M  0 part /boot
└─sda4      8:4    0 446.6G  0 part /var/lib/kubelet/pods/be7751a2-913-49c7-94ea-3168bd9c910c/volume-subpaths/nvidia-mig-manager-entrypoint/nvidia-mig-manager/0
                                    /var/lib/kubelet/pods/9a43dd25-a0b9-45e7-afcf-ae7174455d5b/volume-subpaths/nvidia-device-plugin-entrypoint/nvidia-device-plugin/0
                                    /var/lib/kubelet/pods/47419dbd-8401-42eb-bf6a-55f4521324e6/volume-subpaths/init-config/init-pod-nvidia-node-status-exporter/1
                                    /var/lib/kubelet/pods/d23309e1-2148-4fd2-88c0-cebb86641561/volume-subpaths/nvidia-container-toolkit-entrypoint/nvidia-container-toolkit-ctr/0
                                    /var
                                    /run/nvidia/driver/etc/hosts
                                    /run/nvidia/driver/host-etc/os-release
                                    /run/nvidia/driver/var/log
                                    /run/nvidia/driver/mnt/shared-nvidia-driver-toolkit
                                    /run/nvidia/driver/dev/termination-log
                                    /sysroot/ostree/deploy/rhcos/var
                                    /sysroot
                                    /usr
                                    /etc
                                    /
sdb         8:16   1     0B  0 disk
sr0        11:0    1   1.1G  0 rom
nvme4n1   259:0    0   2.9T  0 disk
nvme5n1   259:1    0   1.5T  0 disk
nvme0n1   259:2    0   1.5T  0 disk
nvme2n1   259:3    0   1.5T  0 disk
nvme1n1   259:4    0   1.5T  0 disk
nvme3n1   259:5    0   1.5T  0 disk
ls -ltr /dev/disk/by-path
total 0
lrwxrwxrwx. 1 root root 13 Jun 30 08:07 pci-0000:e6:00.0-nvme-1 -> ../../nvme5n1
lrwxrwxrwx. 1 root root 13 Jun 30 08:07 pci-0000:b3:00.0-nvme-1 -> ../../nvme2n1
lrwxrwxrwx. 1 root root 13 Jun 30 08:07 pci-0000:b2:00.0-nvme-1 -> ../../nvme1n1
lrwxrwxrwx. 1 root root 13 Jun 30 08:07 pci-0000:b1:00.0-nvme-1 -> ../../nvme0n1
lrwxrwxrwx. 1 root root 13 Jun 30 08:07 pci-0000:e5:00.0-nvme-1 -> ../../nvme4n1
lrwxrwxrwx. 1 root root 13 Jun 30 08:07 pci-0000:b4:00.0-nvme-1 -> ../../nvme3n1
lrwxrwxrwx. 1 root root  9 Jun 30 08:07 pci-0000:05:00.0-ata-1.0 -> ../../sda
lrwxrwxrwx. 1 root root  9 Jun 30 08:07 pci-0000:05:00.0-ata-1 -> ../../sda
lrwxrwxrwx. 1 root root 10 Jun 30 08:07 pci-0000:05:00.0-ata-1.0-part1 -> ../../sda1
lrwxrwxrwx. 1 root root 10 Jun 30 08:07 pci-0000:05:00.0-ata-1-part1 -> ../../sda1
lrwxrwxrwx. 1 root root 10 Jun 30 08:07 pci-0000:05:00.0-ata-1.0-part2 -> ../../sda2
lrwxrwxrwx. 1 root root 10 Jun 30 08:07 pci-0000:05:00.0-ata-1-part2 -> ../../sda2
lrwxrwxrwx. 1 root root 10 Jun 30 08:07 pci-0000:05:00.0-ata-1.0-part4 -> ../../sda4
lrwxrwxrwx. 1 root root 10 Jun 30 08:07 pci-0000:05:00.0-ata-1-part4 -> ../../sda4
lrwxrwxrwx. 1 root root 10 Jun 30 08:07 pci-0000:05:00.0-ata-1.0-part3 -> ../../sda3
lrwxrwxrwx. 1 root root 10 Jun 30 08:07 pci-0000:05:00.0-ata-1-part3 -> ../../sda3
lrwxrwxrwx. 1 root root  9 Jul 12 22:21 pci-0000:00:14.0-usb-0:14.4.1:1.0-scsi-0:0:0:1 -> ../../sdb
lrwxrwxrwx. 1 root root  9 Jul 12 22:21 pci-0000:00:14.0-usb-0:14.4.1:1.0-scsi-0:0:0:0 -> ../../sr0
Command line used:
oc get node
oc debug node/bb37-h15-000-r750.rdu3.labs.perfscale.redhat.com
chroot /host
lsblk
ls -ltr /dev/disk/by-path
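To avoid scanning the lsblk output by hand, a small filter can list disks that have no filesystem. The `find_unused_disks` helper below is a hypothetical convenience, not part of any operator, and an empty FSTYPE is only a rough heuristic for "unused" (a disk holding only a partition table also reports an empty FSTYPE). Feed it the output of `lsblk -dno NAME,SIZE,TYPE,FSTYPE` from inside the `oc debug` chroot:

```shell
# Hypothetical helper: print disks with TYPE=disk and an empty FSTYPE
# from `lsblk -dno NAME,SIZE,TYPE,FSTYPE` output on stdin.
find_unused_disks() {
  awk '$3 == "disk" && $4 == "" { print "/dev/" $1 }'
}

# Example with output captured from the node above:
printf '%s\n' \
  'sda     447.1G disk ext4' \
  'nvme4n1   2.9T disk' \
  'nvme5n1   1.5T disk' |
  find_unused_disks
# prints /dev/nvme4n1 and /dev/nvme5n1
```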
Now we can add 1 or more disks, as shown below:
paths:
- /dev/disk/by-path/pci-0000:e5:00.0-nvme-1
This is the full YAML structure:
apiVersion: lvm.topolvm.io/v1alpha1
kind: LVMCluster
metadata:
  name: resnet50-disk
  namespace: openshift-lvms-operator
spec:
  storage:
    deviceClasses:
      - name: lvms-vg1
        fstype: ext4
        default: false
        deviceSelector:
          paths:
            - /dev/disk/by-path/pci-0000:e5:00.0-nvme-1
          forceWipeDevicesAndDestroyAllData: true
        thinPoolConfig:
          name: thin-pool-1
          sizePercent: 90
          overprovisionRatio: 10
        nodeSelector:
          nodeSelectorTerms:
            - matchExpressions:
                - key: kubernetes.io/hostname
                  operator: In
                  values:
                    - cc37-h13-000-r750.rdu3.labs.perfscale.redhat.com
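Once the LVMCluster reconciles, LVMS creates a storage class named after the device class (here, lvms-vg1). A dataset PVC against it could then look like the sketch below; the claim name and size are placeholders:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: resnet50-dataset   # placeholder name
spec:
  accessModes:
    - ReadWriteOnce   # LVMS volumes are node-local, so RWO fits the single-node case
  resources:
    requests:
      storage: 100Gi   # placeholder size
  storageClassName: lvms-vg1
```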
Now that we have provisioned a PV on that disk, we can create an NFS volume on top of it. Note that you can also use Red Hat OpenShift Data Foundation if data redundancy is needed.
NFS Operator (2+ nodes)
As previously mentioned, the NFS Operator is not officially supported, but we can run it on top of the OpenShift LVM Operator when we only have 2 nodes and can't spare the storage to hold 2 copies of the data. In that case, the NFS Operator lets us create a volume that can be mounted on all the PyTorch pods. If NFS is needed, note that OpenShift Data Foundation supports it as well. See below:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: resnet50-dataset
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 100Gi
  storageClassName: nfs-provisioner
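Every training pod can then mount the same ReadWriteMany claim instead of carrying its own copy of the data. A sketch of the relevant pod-spec fragment follows; the container image and mount path are illustrative placeholders:

```yaml
spec:
  containers:
    - name: pytorch
      image: quay.io/example/train:latest   # placeholder image
      volumeMounts:
        - name: dataset
          mountPath: /data   # where the training code reads the dataset
  volumes:
    - name: dataset
      persistentVolumeClaim:
        claimName: resnet50-dataset   # the RWX claim defined above
```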
OpenShift Data Foundation (3+ nodes)
In our case, we can use OpenShift Data Foundation, which requires a minimum of 3 hosts. OpenShift Data Foundation is software-defined storage for containers. It helps teams develop and deploy applications quickly and efficiently. The steps below describe how to deploy the storage system once the operator has been deployed:
- Access your OpenShift dashboard.
- In the left-hand menu, navigate to the Installed Operators tab.
- Select OpenShift Data Foundation.
- Click Create StorageSystem. This will initiate a 5-step sequence; proceed through it as described below.
- From the Select StorageClass dropdown menu, select local-disk.
- Click Next.
- Make sure that the checkbox next to Default (OVN) is selected and click Next again.
- Click Next.
- Click Create StorageSystem.
- Navigate to the Storage tab and view the installed operator.
This example shows how we created a volume (PVC) using OpenShift Data Foundation.
Command line used:
oc get sc
NAME                                  PROVISIONER                             RECLAIMPOLICY   VOLUMEBINDINGMODE      ALLOWVOLUMEEXPANSION   AGE
localblock-sc                         kubernetes.io/no-provisioner            Delete          WaitForFirstConsumer   false                  8h
ocs-storagecluster-ceph-rbd           openshift-storage.rbd.csi.ceph.com      Delete          Immediate              true                   8h
ocs-storagecluster-ceph-rgw           openshift-storage.ceph.rook.io/bucket   Delete          Immediate              false                  8h
ocs-storagecluster-cephfs (default)   openshift-storage.cephfs.csi.ceph.com   Delete          Immediate              true                   8h
openshift-storage.noobaa.io           openshift-storage.noobaa.io/obc         Delete          Immediate              false                  8h
vim odf_volume.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: workspace
  namespace: default
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 1024Gi
  storageClassName: ocs-storagecluster-cephfs
oc create -f odf_volume.yaml
persistentvolumeclaim/workspace created
oc get pvc
NAME        STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS                AGE
workspace   Bound    pvc-dcf0a6ff-762c-4628-a38d-e31b31863645   1Ti        RWX            ocs-storagecluster-cephfs   3s
Red Hat OpenShift AI and OpenShift Service Mesh Operators
Now we just need to deploy the last 2 operators: Red Hat OpenShift AI and Red Hat OpenShift Service Mesh. For this purpose, we only require the Service Mesh Operator itself, without any additional cluster policy.
Once the Service Mesh Operator has been deployed, we can proceed with the Red Hat OpenShift AI Operator deployment, followed by the DataScienceCluster creation, which gives us access to the Training Operator and its PyTorchJob and Kubeflow APIs. The steps below show how:
- Access your OpenShift dashboard.
- In the left-hand menu, navigate to the Installed Operators tab.
- Select Red Hat OpenShift AI.
- Click Create DataScienceCluster.
- Unfold the Components dropdown.
- Unfold the trainingoperator dropdown.
- Set the managementState to Managed.
- Click Create.
Or just use this DataScienceCluster YAML file:
kind: DataScienceCluster
apiVersion: datasciencecluster.opendatahub.io/v1
metadata:
  name: default-dsc
  labels:
    app.kubernetes.io/name: datasciencecluster
    app.kubernetes.io/instance: default-dsc
    app.kubernetes.io/part-of: rhods-operator
    app.kubernetes.io/managed-by: kustomize
    app.kubernetes.io/created-by: rhods-operator
spec:
  components:
    codeflare:
      managementState: Managed
    dashboard:
      managementState: Removed
    datasciencepipelines:
      managementState: Removed
    kserve:
      managementState: Removed
      serving:
        ingressGateway:
          certificate:
            type: SelfSigned
        managementState: Managed
        name: knative-serving
    kueue:
      managementState: Managed
    modelmeshserving:
      managementState: Managed
    ray:
      managementState: Managed
    trainingoperator:
      managementState: Managed
    workbenches:
      managementState: Removed
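To preview how the Training Operator consumes the storage we just set up, the sketch below shows a minimal PyTorchJob skeleton that mounts the workspace PVC on every replica. The image, command-free container, and replica counts are placeholders; the actual training job is covered when we run the workload:

```yaml
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: resnet50-training   # placeholder name
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch   # the Training Operator expects this container name
              image: quay.io/example/train:latest   # placeholder image
              volumeMounts:
                - name: workspace
                  mountPath: /workspace
          volumes:
            - name: workspace
              persistentVolumeClaim:
                claimName: workspace   # the ODF-backed RWX claim created earlier
    Worker:
      replicas: 2   # placeholder count
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: quay.io/example/train:latest
              volumeMounts:
                - name: workspace
                  mountPath: /workspace
          volumes:
            - name: workspace
              persistentVolumeClaim:
                claimName: workspace
```

Because the claim is ReadWriteMany, the Master and every Worker see the same files under /workspace, which is why a single shared PVC avoids duplicating the dataset per pod.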
Now that we have successfully deployed the necessary operators and set up the storage environment, let’s move on to running and automating the distributed AI training process on OpenShift.