Note: The following is an excerpt from Operating OpenShift: An SRE Approach to Managing Infrastructure by Rick Rackow and Manuel Dewald (O'Reilly Media, September 2023). Download the e-book to learn best practices and tools that can help reduce the effort of deploying a Kubernetes platform.
Operating distributed software is a difficult task. It requires humans with a deep understanding of the system they maintain. No matter how much automation you create, it will never replace highly skilled operations personnel.
OpenShift is a platform built to help software teams develop and deploy their distributed software. It comes with a large set of tools that are built in or can be deployed easily. While it can be of great help to its users and can eliminate a lot of traditionally manual operations burdens, OpenShift itself is a distributed system that needs to be deployed, operated, and maintained.
Many companies have platform teams that provide development platforms based on OpenShift to software teams, so the maintenance effort is centralized and the deployment patterns are standardized across the organization. These platform teams are shifting more and more toward becoming Site Reliability Engineering (SRE) teams, where software development practices are applied to operations tasks. Scripts are replaced by proper software solutions that can be tested more easily and deployed automatically using continuous integration/continuous delivery (CI/CD) systems. Alerts are transformed from simple cause-based alerts like “a high amount of memory is used on Virtual Machine 23” into symptom-based alerts based on Service Level Objectives (SLOs) that reflect customer experience, like “processing of requests takes longer than we expect it to.”
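To make the distinction concrete, here is a minimal sketch of what a symptom-based alert can look like as a PrometheusRule resource. The metric name, threshold, and namespace are hypothetical placeholders, not part of OpenShift’s built-in rules:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: slo-alerts                # hypothetical rule name
  namespace: example-monitoring   # hypothetical namespace
spec:
  groups:
    - name: request-latency-slo
      rules:
        # Symptom-based: fires when user-visible latency breaches the SLO,
        # no matter which VM or component is the cause.
        - alert: RequestLatencySLOBreach
          expr: |
            histogram_quantile(0.99,
              sum(rate(http_request_duration_seconds_bucket[5m])) by (le)) > 0.5
          for: 15m
          labels:
            severity: critical
          annotations:
            summary: "99th percentile request latency has exceeded 500ms for 15 minutes."
```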
OpenShift provides all the tools you need to run software on top of it with SRE paradigms, from a monitoring platform to an integrated CI/CD system, which you can use to observe and run both the software deployed to the OpenShift cluster and the cluster itself. But building the automation, implementing a good alerting strategy, and, finally, debugging issues that occur when operating an OpenShift cluster are still difficult tasks that require skilled operations or SRE staffing.
Even in SRE teams, a good portion of the engineers’ time is traditionally dedicated to manual operations tasks, often called toil. This operations time should be capped, though, as the main goal of SRE is to tackle toil with software engineering.
O’Reilly published a series of books written by site reliability engineers (SREs) at Google covering the core SRE concepts. We encourage you to take a look at these books if you’re interested in the details of these principles. In the first book, Site Reliability Engineering, the authors speak mostly from their experience as SREs at Google and suggest limiting the time a team works on toil to 50% of its engineering time.
Traditional Operations Teams
The goal of having an upper limit for toil is to avoid shifting back into an operations team where people spend most of their time working down toil that accumulates with both the scale of service adoption and software advancement.
Part of the toil that accumulates as service adoption grows is the number of alerts an operations team receives if the alerting strategy isn’t ready to scale. If you’re maintaining software that creates one alert per day per tenant, and running 10 tenants keeps one engineer busy, you will need to scale the number of on-call engineers linearly with the number of tenants the team operates. That means in order to double the number of tenants, you need to double the number of engineers dedicated to reacting to alerts. While working down the toil and investigating the issues, these engineers will effectively be unable to work on reducing the toil the alerts create.
In a traditional operations team that runs OpenShift as a development platform for other departments of the company, onboarding new tenants is often a manual task. It may be initiated by the requesting team opening a ticket that asks for a new OpenShift cluster. Someone from the operations team will pick up the ticket and start creating the required resources, kick off the installer, configure the cluster so the requesting team gets access, and so forth. A similar process may be set up for turning down clusters when they are not needed anymore. Managing the lifecycle of OpenShift clusters can be a huge source of toil, and as long as the process is mainly manual, the amount of toil will scale with the adoption of the service.
In addition to being toil-packed processes, manual lifecycle and configuration management are error-prone. When an engineer runs the same procedure several times during a week, as documented in a team-managed wiki, chances are they will miss an important step or pass a wrong parameter to one of the scripts, resulting in a broken state that may not be discovered immediately.
When managing multiple OpenShift clusters, having one that is slightly different from the others due to a mistake in the provisioning or configuration process, or even due to a customer request, is dangerous and usually generates more toil.
Automation that the team generated over time may not be tailored to the specifics of a single snowflake cluster. Running that automation may just not be possible, causing more toil for the operations team. In the worst case, it may even render the cluster unusable.
Automation in a traditional ops team can often be found in a central repository that engineers check out on their own devices so they can run the scripts they need as part of a documented process. This is problematic not only because it still needs manual interaction, and hence doesn’t scale well, but also because engineers’ devices are often configured differently. They can differ in the OS they use, for example, which means the tooling has to support different vendors unless a standardized environment, such as a container image, is provided to run the automation.
But even then, the version of the scripts may differ from engineer to engineer, or a script may not have been updated when it should have been, for example after a new version of OpenShift was released. Automated testing is seldom implemented for operations scripts written to quickly get rid of a piece of toil. All this makes automation that lives in scripts running on developer machines brittle.
How Site Reliability Engineering Helps
In an SRE team, the goal is to replace such scripts with actual software that is properly versioned, has a mature release strategy and a continuous integration and delivery process, and runs from the latest released version on dedicated machines, for example, an OpenShift cluster.
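As a sketch of that idea, the following hypothetical CronJob runs a versioned, CI/CD-built automation image on the cluster itself instead of a script on an engineer’s laptop; all names and the image reference are assumptions:

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: cluster-config-check        # hypothetical automation task
  namespace: sre-automation         # hypothetical namespace
spec:
  schedule: "0 * * * *"             # run once per hour
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: config-check
              # A released, versioned image built by the team's CI/CD
              # pipeline, not a script checked out on someone's device.
              image: quay.io/example/cluster-config-check:1.2.3
          restartPolicy: OnFailure
```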
OpenShift SRE teams treat the operations of OpenShift clusters, from setting them up to tearing them down, as a software problem. By applying best practices that have evolved in the software engineering world to cluster operations, many of the problems mentioned earlier can be solved. The software can be unit-tested to ensure that new changes won’t break existing behavior. Additionally, a set of integration tests can ensure it works as expected even when the environment changes, such as when a new version of OpenShift is released.
Instead of reacting to more and more requests from customers as service adoption grows, the SRE team can provide a self-service process that customers use to provision and configure their clusters. This also reduces the risk of snowflakes, as less manual interaction by the SRE team is needed. What can and cannot be configured should be part of the UI provided to the customer, so a request to treat a single cluster differently from all the others should turn into a feature request for the automation or UI. That way, it ends up as a supported state rather than a manual configuration update.
To make the alerting strategy scale, SRE teams usually move from cause-based to symptom-based alerting, so that only problems that risk impacting the user experience reach their pager. Smaller problems that do not need to be resolved immediately can move to a ticket queue to be worked on as time allows.
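One common way to implement this split is in the Alertmanager routing configuration. The following sketch, with hypothetical receiver names and endpoints, pages the on-call engineer only for critical symptoms and turns everything else into tickets:

```yaml
# Sketch of an Alertmanager routing tree; receiver names and endpoints
# are hypothetical.
route:
  receiver: ticket-queue            # default: open a ticket, no page
  routes:
    - matchers:
        - severity = "critical"     # user-impacting symptoms only
      receiver: pager
receivers:
  - name: pager
    pagerduty_configs:
      - routing_key: <your-integration-key>
  - name: ticket-queue
    webhook_configs:
      - url: https://ticket-system.example.com/api/alerts
```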
Shifting to an SRE culture means enabling people to watch their own software, taking the operations burden away from the team one step at a time. It’s a shift that will take time, but it’s a rewarding process. It turns a team that runs software someone else wrote into a team that runs software it writes itself, with the goal of automating the lifecycle and operations of the software under its control. An SRE culture enables service growth through true automation and through observation of the customer experience rather than internal state.
OpenShift as a Tool for Site Reliability Engineers
This book will help you utilize the tools that are already included with OpenShift, or that can be installed with minimal effort, to operate both software running on OpenShift and OpenShift itself the SRE way.
We expect you to have a basic understanding of how containers, Kubernetes, and OpenShift work to be able to understand and follow all the examples. Fundamental concepts like pods will not be explained in full detail, but you may find a quick refresher where we thought it helpful for understanding a specific aspect of OpenShift.
We show you the different options for installing OpenShift, helping you automate the lifecycle of OpenShift clusters as needed. Lifecycle management includes not only installing and tearing down clusters but also managing their configuration in a GitOps fashion. Even if you need to manage the configuration of many clusters, you can use Argo CD on OpenShift to manage a multitude of OpenShift clusters from a single place.
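As a small taste of the GitOps approach, an Argo CD Application resource like the following hypothetical sketch keeps a cluster’s configuration in sync with a Git repository (the repository URL and paths are placeholders):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: cluster-config              # hypothetical application name
  namespace: openshift-gitops       # namespace used by the OpenShift GitOps operator
spec:
  project: default
  source:
    repoURL: https://github.com/example/cluster-config.git  # hypothetical repo
    targetRevision: main
    path: clusters/production
  destination:
    server: https://kubernetes.default.svc  # the cluster Argo CD runs on
    namespace: default
  syncPolicy:
    automated:
      prune: true                   # delete resources removed from Git
      selfHeal: true                # revert manual changes on the cluster
```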
This book shows you how to run workloads on OpenShift using a simple example application. You can use this example to walk through the chapters and try out the code samples. However, you should be able to use the same patterns to deploy more serious software, like automation that you built to manage OpenShift resources—for example, an OpenShift operator.
OpenShift also provides the tools you need to automate the building and deployment of your software, from simple automated container builds whenever you check a new change into version control, to full-fledged custom pipelines using OpenShift Pipelines.
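A simple example of the first option is an OpenShift BuildConfig that rebuilds a container image whenever a change is pushed to a Git repository; the repository URL and names in this sketch are placeholders:

```yaml
apiVersion: build.openshift.io/v1
kind: BuildConfig
metadata:
  name: example-app
spec:
  source:
    git:
      uri: https://github.com/example/example-app.git  # hypothetical repository
  strategy:
    dockerStrategy: {}              # build the Dockerfile found in the repository
  output:
    to:
      kind: ImageStreamTag          # push into an existing ImageStream
      name: example-app:latest
  triggers:
    - type: GitHub                  # a push to the repository starts a build
      github:
        secretReference:
          name: example-webhook-secret
    - type: ConfigChange            # also rebuild when this BuildConfig changes
```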
In addition to automation, the SRE way of managing OpenShift clusters includes a proper alerting strategy that allows you to scale. OpenShift comes with a lot of built-in alerts that inform you when something goes wrong with a cluster. This book will help you understand the severity levels of those alerts and show you how to build your own alerts based on metrics that are available in OpenShift’s built-in monitoring system.
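As one example, custom alerts for your own workloads require monitoring for user-defined projects to be enabled; at the time of writing, this is done with a ConfigMap like the following (check the documentation for your OpenShift release):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  # Enables the user workload monitoring stack so user projects can
  # define their own metrics and alerting rules.
  config.yaml: |
    enableUserWorkload: true
```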
Having worked together as OpenShift SREs at Red Hat for more than two years, we both learned a lot about the different kinds of alerts OpenShift emits and how to investigate and solve the underlying problems. The benefit of working closely with OpenShift Engineering is that we can even contribute to alerts in OpenShift when we find problems with them during our work.
Over time, a number of people have reached out to us, interested in how we work as a team of SREs. We realize there is growing interest in all the different topics related to our work: from how we operate OpenShift to how we build custom operators, people ask about these topics at conferences or reach out to us directly.
This book aims to help you take some of our learnings and use them to run OpenShift in your specific environment. We believe that OpenShift is a great distribution of Kubernetes that brings a lot of additional comfort with it, comfort that will allow you to get started quickly and thrive at operating OpenShift.
Individual Challenges for SRE Teams
OpenShift comes with a lot of tools that can help you in many situations as a developer or operator. This book can cover only a few of those tools and does not aim to provide a full overview of all OpenShift features. Instead of trying to replicate the OpenShift documentation, this book focuses on highlighting the things we think will help you get started operating OpenShift. With more features being developed and added to OpenShift over time, it is a good idea to follow the OpenShift blog and the OpenShift documentation for a more holistic view of what’s included in a given release.
Many of the tools this book covers are under active development, so you may find them behaving slightly differently from how they worked when this book was published. Each section references the documentation for a more detailed explanation of how to use a specific component. This documentation is usually updated frequently, so you can find up-to-date information there.
When you use Kubernetes as a platform, you probably know that many things are automated for you already: you only need to tell the control plane what resources your deployment needs, and Kubernetes will find a node to place it on. You don’t need to perform a rolling upgrade to a new version of your software manually, because Kubernetes can handle that for you. All you need to do is configure the Kubernetes resources according to your needs.
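A minimal Deployment manifest illustrates this: you declare the desired replica count, resource requests, and update strategy, and Kubernetes takes care of placement and rolling upgrades (the names and image are placeholders):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-app                 # hypothetical application
spec:
  replicas: 3                       # desired capacity; Kubernetes picks the nodes
  selector:
    matchLabels:
      app: example-app
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1             # replace one pod at a time during upgrades
  template:
    metadata:
      labels:
        app: example-app
    spec:
      containers:
        - name: example-app
          image: quay.io/example/example-app:1.0.0  # hypothetical image
          resources:
            requests:
              cpu: 100m             # tell the scheduler what the pod needs
              memory: 128Mi
```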
OpenShift, being based on Kubernetes, adds more convenience, like routing traffic to your web service from the outside world: exposing your service at a specific DNS name and routing traffic to the right place is done via the OpenShift router.
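A Route resource like the following hypothetical sketch is all it takes to expose a service at a DNS name, with TLS terminated at the router (the host name and service are placeholders):

```yaml
apiVersion: route.openshift.io/v1
kind: Route
metadata:
  name: example-app
spec:
  host: example-app.apps.example.com  # hypothetical cluster domain
  to:
    kind: Service
    name: example-app                 # the Service receiving the traffic
  port:
    targetPort: 8080
  tls:
    termination: edge                 # the OpenShift router terminates TLS
```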
These are only a few of the tasks that used to be done by operations personnel but can be automated in OpenShift by default.
However, depending on your specific needs and the environment you’re running OpenShift in, there are probably some very specific tasks you’ll need to solve on your own. This book cannot tell you step by step what you need to do to fully automate operations. If one solution fit every environment that easily, it would most probably be part of OpenShift already. So, please treat this book as a set of informed guidelines, but know that you will still need to solve some problems yourself to make OpenShift fit your operations strategy.
Part of your strategy will be to decide how and where you want to install OpenShift. Do you want to use one of the public cloud providers? That may be the easiest to achieve, but you may also be required to run OpenShift in your own data center for some workloads.
The first step in operating OpenShift is setting it up, and when you find yourself in a place where you need to run multiple OpenShift clusters, you’ll probably want to automate this part of the cluster lifecycle. Chapter 2 discusses different ways to install an OpenShift cluster, from running it on a developer machine, which can be helpful when developing software that needs a running OpenShift cluster, to a publicly reachable OpenShift deployment using a public cloud provider.
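To give you an idea of what such an installation looks like, here is a minimal install-config.yaml sketch for openshift-install targeting AWS; the domain, region, and replica counts are placeholders, and Chapter 2 covers the details:

```yaml
apiVersion: v1
baseDomain: example.com             # hypothetical base domain
metadata:
  name: my-cluster                  # becomes part of the cluster's DNS name
platform:
  aws:
    region: eu-central-1
controlPlane:
  name: master
  replicas: 3
compute:
  - name: worker
    replicas: 3
pullSecret: '...'                   # obtained from the Red Hat console
sshKey: ssh-ed25519 AAAA...         # public key for debugging node access
```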
Download Operating OpenShift
Ready to dive into running and operating OpenShift clusters more efficiently using an SRE approach? Download the full e-book from Red Hat Developer.