Overview: RoCE multi-node AI training on Red Hat OpenShift
In this learning path, I'll demonstrate how anyone can run a distributed AI workload on Red Hat OpenShift using just a few nodes and GPUs. We'll start with a straightforward manual training setup to cover the basics, then move on to a fully automated training procedure. This gives you a solid foundation that you can build on to tailor your infrastructure to your specific needs.
This path will guide you through training the ResNet-50 model using PyTorch on Red Hat OpenShift. I chose ResNet-50 for its balance of training speed and accuracy.
We'll be using the CIFAR-10 dataset from the University of Toronto for this training. While I'll demonstrate how to set up the remote direct memory access (RDMA) over Converged Ethernet (RoCE) training environment, the same process can be applied using the Transmission Control Protocol (TCP) with any Ethernet setup.
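To give you a feel for the workload before we dive into the cluster setup, here is a minimal sketch of a PyTorch DistributedDataParallel script for ResNet-50 on CIFAR-10. It is not the exact script used later in this path (that lives in the GitHub repository); the environment variables and the NCCL backend choice are assumptions based on how torchrun launches distributed jobs. NCCL uses RoCE when an RDMA-capable fabric is configured and falls back to TCP over plain Ethernet otherwise, which is why the same Python code works in both environments.

```python
# minimal_ddp_resnet50.py -- illustrative sketch, not the exact script from the repo.
# Assumes it is launched with torchrun, which sets RANK, LOCAL_RANK, and WORLD_SIZE.
import os

import torch
import torch.distributed as dist
import torch.nn as nn
import torchvision
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler


def main():
    # NCCL carries the GPU-to-GPU traffic; it rides on RoCE when the RDMA NIC
    # is configured and falls back to TCP over Ethernet otherwise.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # CIFAR-10 images are 32x32; resize them to the input size ResNet-50 expects.
    transform = torchvision.transforms.Compose([
        torchvision.transforms.Resize(224),
        torchvision.transforms.ToTensor(),
    ])
    dataset = torchvision.datasets.CIFAR10(
        root="./data", train=True, download=True, transform=transform
    )
    sampler = DistributedSampler(dataset)  # shards the data across ranks
    loader = DataLoader(dataset, batch_size=64, sampler=sampler, num_workers=4)

    model = torchvision.models.resnet50(num_classes=10).cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

    for epoch in range(2):  # kept short for the sketch; real runs train longer
        sampler.set_epoch(epoch)  # reshuffle the shards differently each epoch
        for images, labels in loader:
            images, labels = images.cuda(local_rank), labels.cuda(local_rank)
            optimizer.zero_grad()
            loss = criterion(model(images), labels)
            loss.backward()  # DDP all-reduces gradients across nodes here
            optimizer.step()

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

On each node, a launch would look something like `torchrun --nnodes=2 --nproc_per_node=1 --rdzv_backend=c10d --rdzv_endpoint=<master>:29500 minimal_ddp_resnet50.py`; the exact launch mechanics on OpenShift are covered in the later modules of this path.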
The instructions in this path assume that you already have a running OpenShift cluster with one or more nodes. You may want to deploy the cluster with the Assisted Installer, an interactive installation wizard. Another option is JetLag, a tool that uses installer-provisioned infrastructure (IPI).
With the cluster up and running, we’ll deploy a few operators, which can be done via the command-line interface (CLI) with YAML files or the OpenShift user interface (UI).
All the files used in this training tutorial can be found in this GitHub repository. This entire tutorial should take around 120 minutes for those with basic OpenShift experience.
Prerequisites:
- A running OpenShift cluster with one or more nodes.
- Access to the files in this GitHub repository.
In this learning path, you will:
- Run a distributed AI workload on OpenShift.
- Train the ResNet-50 model using PyTorch on OpenShift.
- Set up the RoCE training environment.
- Install a few operators on the cluster.