Several organizations are wondering about (and sometimes struggling with) how to port their current workloads to cloud environments.
One of the main characteristics of a cloud environment is that the infrastructure is provisioned dynamically. This implies, for example, that we don’t know a priori where our resources are being allocated (though we can find that out). VMs or containers will receive dynamic IPs, storage will be allocated somewhere and attached to our VMs or containers, and so on.
So, how should we design our applications to cope with this dynamicity?
Several companies have struggled with this issue, but, in my opinion, two stand out, mainly because they shared their findings with the community: Netflix, which pioneered microservices on Amazon AWS and shared reusable cloud-native components and libraries via the Netflix OSS site, and Heroku, a cloud PaaS that supports many platforms but started with Ruby on Rails and captured a series of guidelines on creating cloud-native apps on the Twelve-Factor App site.
Standing on the shoulders of these giants, here is a list of cross-cutting concerns that a cloud-native solution should address:
Service discovery: service instances are created dynamically, so we need to discover them.
The ingredients of a discovery process are a service registry and a discovery protocol. The process involves registering and removing service endpoints as they are created and destroyed, and executing service lookups.
There are two major approaches to this problem:
- Explicit discovery management: Netflix OSS and other stacks use a service registry (Eureka, Consul, ZooKeeper) to register and discover services. You have to explicitly install your service registry and have your services register and deregister themselves. These tools usually also expose a proprietary discovery protocol. This approach works well when you control the code and can put the registration logic in your service providers and the discovery logic in your consumers. It does not work with legacy applications or applications whose code you don’t own.
- Implicit discovery management: with this approach, the cloud cluster manages the service registry and updates its entries when new service instances are created. In this case the cluster manager will also likely expose the service registry via DNS. This approach works with new and old software, because all applications that use the IP protocol to communicate understand how to use DNS. Kubernetes, OpenShift and Docker Swarm use this approach (see the sketch below). In my opinion this approach is superior because it is less intrusive, and it will become the de facto standard.
Note that the two approaches can coexist within the same environment.
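To make the implicit approach concrete, here is a minimal sketch in Java. It assumes a Kubernetes-style cluster that exposes a headless Service under the hypothetical DNS name catalog.shop.svc.cluster.local; a plain DNS lookup is all the discovery logic the application needs:

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

public class DnsDiscoveryExample {
    public static void main(String[] args) throws UnknownHostException {
        // The stable DNS name is managed by the cluster; the records
        // behind it change as service instances come and go.
        String serviceName = "catalog.shop.svc.cluster.local";

        // A plain DNS lookup returns the current endpoints; no explicit
        // registration code is needed in the application itself.
        for (InetAddress address : InetAddress.getAllByName(serviceName)) {
            System.out.println("endpoint: " + address.getHostAddress());
        }
    }
}
```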
Load balancing: there will be multiple instances of a service, to ensure high availability and to support the load. There are essentially two strategies for load balancing requests over a cluster:
- Client-side load balancing: in this case the client knows of all the endpoints and chooses which one to call. This approach requires the client to be designed to handle load balancing. A popular load balancing library is Ribbon from the Netflix OSS stack. In Spring Cloud, Ribbon can be configured to use different discovery mechanisms to obtain the list of available endpoints. A minimal sketch of this strategy appears after this list.
- Infrastructure-based load balancing: with this approach the infrastructure takes care of load balancing. The client application knows of one stable endpoint, which can be passed in as a configured environment variable, and the infrastructure load balances all requests across the currently available endpoints. Again, Kubernetes and Docker Swarm use this approach. It works better with "older" pre-cloud-native applications that do not have intelligent client-side libraries.
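To illustrate the client-side strategy, here is a minimal round-robin sketch in plain Java. The endpoint list is hard-coded for brevity; a library like Ribbon would instead obtain it from a discovery mechanism:

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

public class RoundRobinBalancer {
    private final List<String> endpoints;
    private final AtomicInteger next = new AtomicInteger();

    public RoundRobinBalancer(List<String> endpoints) {
        this.endpoints = endpoints;
    }

    /** Returns the endpoint to use for the next request. */
    public String choose() {
        // floorMod keeps the index valid even if the counter overflows.
        int index = Math.floorMod(next.getAndIncrement(), endpoints.size());
        return endpoints.get(index);
    }

    public static void main(String[] args) {
        RoundRobinBalancer balancer = new RoundRobinBalancer(
                List.of("10.0.0.1:8080", "10.0.0.2:8080", "10.0.0.3:8080"));
        for (int i = 0; i < 5; i++) {
            System.out.println("calling " + balancer.choose());
        }
    }
}
```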
Configuration management: following the principles of immutable infrastructure, once an app is built it is crystallized in an image (be it a VM or container image), and we cannot change it anymore. And yet we need to deploy it to several environments as it follows its promotion process. How do we deal with environment-dependent properties and other properties that we may want to tweak? There must be a way to inject environment-dependent properties into the image; at the very least, environment variables should be supported as a way to inject properties. Spring Boot has a very nice way of managing configuration: it accepts configuration through many different channels (including environment variables and even a Git repo), aggregates the entire configuration and makes it available to the code, or even to libraries imported as dependencies. Archaius from the Netflix OSS stack extends the Apache Commons Configuration library, adding the ability to poll for configuration changes and dynamically update the runtime configuration.
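As a small illustration, assuming a Spring Boot application and a hypothetical database.url property, a value can be injected from any supported property source (an application.properties file, a DATABASE_URL environment variable, a command-line argument, and so on) without rebuilding the image:

```java
import org.springframework.beans.factory.annotation.Value;
import org.springframework.stereotype.Component;

@Component
public class DataSourceSettings {

    // Spring Boot resolves this from any configured property source;
    // thanks to relaxed binding, a DATABASE_URL environment variable
    // also matches. The value after the first colon is the default,
    // applied when no source provides the property.
    @Value("${database.url:jdbc:postgresql://localhost:5432/dev}")
    private String databaseUrl;

    public String getDatabaseUrl() {
        return databaseUrl;
    }
}
```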
Data and state management: this includes any application component that manages application state, including databases, message queues, in-memory caches and the like. In a cloud environment, virtual machines and containers are usually ephemeral and come and go, taking their present state with them. To ensure durable data management, there are two common approaches: either use external storage where data files are kept, or replicate the state among multiple instances and use a consensus algorithm to ensure that instances are aware of each other. An in-depth treatment of this complex topic is out of the scope of this article.
Log aggregation: not a new issue, log aggregation becomes mandatory in a cloud environment because VMs and containers are ephemeral, and when they are destroyed their logs may be lost. You want a log aggregator solution that peels the logs off each VM/container instance and places them in a central, persistent location. Following the twelve-factor guidance on logs, applications should log to stdout, at which point the cloud infrastructure should be able to automatically collect and correctly classify the logs. At the moment, as far as I know, only OpenShift does this (using an EFK stack). For legacy applications that log to one or more files and cannot be refactored, I generally suggest building a sidecar container that watches the logs and forwards them to the enterprise log aggregator, as in the sketch below.
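Here is a minimal sketch of that sidecar idea: a small companion process tails the legacy application's log file (the path is hypothetical) and echoes each line to stdout, where the platform's collector can pick it up. A real forwarder would also handle log rotation and could ship directly to the enterprise aggregator instead:

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class LogTailSidecar {
    public static void main(String[] args) throws IOException, InterruptedException {
        try (BufferedReader reader =
                new BufferedReader(new FileReader("/var/log/legacy-app.log"))) {
            while (true) {
                String line = reader.readLine();
                if (line == null) {
                    Thread.sleep(500); // wait for the application to write more
                } else {
                    System.out.println(line); // stdout is collected by the platform
                }
            }
        }
    }
}
```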
Distributed tracing: this is the ability to follow a request as it traverses the various layers of our solution and determine how time is spent during that journey. It is a fundamental tool for profiling distributed applications, and almost mandatory for solutions with multiple architectural layers. There is an ongoing effort by the Cloud Native Computing Foundation, the OpenTracing initiative, to standardize how this data should be collected, so as to decouple the code that generates the tracing data from the product that collects and displays it. Zipkin has historically been the de facto reference implementation for this capability in the open source space. As far as I know, no cluster manager takes care of this aspect, but it is easy to predict that when a standard emerges, cluster managers will start to provide some support for this capability. Distributed tracing is usually linked to application monitoring (which is not a new concern). Software such as Hawkular APM (and many other commercial packages) provides both distributed tracing and application monitoring in a single tool.
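As a minimal sketch of what instrumentation looks like with the OpenTracing API (the operation name and tag are hypothetical; the actual tracer implementation, for example one backed by Zipkin, is registered separately and GlobalTracer falls back to a no-op otherwise):

```java
import io.opentracing.Span;
import io.opentracing.Tracer;
import io.opentracing.util.GlobalTracer;

public class TracingExample {
    public static void main(String[] args) {
        Tracer tracer = GlobalTracer.get();

        // Each traced operation becomes a span; spans emitted by
        // different services are stitched into one distributed trace.
        Span span = tracer.buildSpan("fetch-order").start();
        try {
            span.setTag("order.id", "42"); // hypothetical tag
            // ... call downstream services, propagating the span context ...
        } finally {
            span.finish(); // records the time spent in this operation
        }
    }
}
```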
Fault and latency tolerance: networks will fail or slow down. The circuit breaker and bulkhead patterns help greatly in managing these types of errors. Netflix led the way in this space by implementing these patterns in a Java library called Hystrix. Adding the circuit breaker pattern to your outbound calls is now just as simple as adding an annotation. Ports of the Hystrix library exist for JavaScript and .NET (and other languages). Netflix has embraced failure in an even more fundamental way by adopting techniques from the antifragility concepts developed by Nassim Taleb. This work has led to the creation of Chaos Monkey and eventually the Simian Army. While I don’t think a cloud-native application should necessarily adopt these strategies, the idea of injecting controlled failures into a system to make it stronger is interesting, and should be considered by companies for which availability and resiliency are critical KPIs.
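Here is a minimal sketch of the circuit breaker pattern using the plain Hystrix library (the command name and return values are hypothetical; the annotation-based style is a thin layer over this): the remote call is wrapped in a command, and if it fails or times out the fallback is returned, with repeated failures tripping the circuit open.

```java
import com.netflix.hystrix.HystrixCommand;
import com.netflix.hystrix.HystrixCommandGroupKey;

public class GreetingCommand extends HystrixCommand<String> {

    public GreetingCommand() {
        super(HystrixCommandGroupKey.Factory.asKey("GreetingGroup"));
    }

    @Override
    protected String run() {
        // The potentially failing or slow remote call would go here.
        return "hello from the remote service";
    }

    @Override
    protected String getFallback() {
        // Served when the call fails, times out, or the circuit is open.
        return "hello from the fallback";
    }

    public static void main(String[] args) {
        System.out.println(new GreetingCommand().execute());
    }
}
```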
Feature toggles: the feature toggles pattern is about having the ability to deploy code that implements an incomplete capability while keeping it disabled via configuration flags. This allows a development team to avoid feature branches and do trunk-based development exclusively. Jez Humble includes this practice in his definition of continuous integration. Ultimately, the trunk-based approach lets you deliver faster because no time is spent reconciling feature branches. This marries well with continuous delivery, which is almost a mandatory technique when developing cloud-native applications. I find this space to be still a little green, but here are two frameworks that implement this pattern: FF4J and Togglz.
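The essence of the pattern fits in a few lines. Here is a minimal sketch based on a hypothetical NEW_CHECKOUT_ENABLED environment variable; libraries such as FF4J and Togglz add persistence, management consoles and per-user activation strategies on top of this basic idea:

```java
public class CheckoutService {

    private static boolean isEnabled(String flag) {
        return Boolean.parseBoolean(System.getenv().getOrDefault(flag, "false"));
    }

    public static void main(String[] args) {
        // The incomplete feature ships dark on trunk; flipping the flag
        // in one environment enables it without a new build.
        if (isEnabled("NEW_CHECKOUT_ENABLED")) {
            System.out.println("using the new checkout flow");
        } else {
            System.out.println("using the legacy checkout flow");
        }
    }
}
```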
Health checks: there should be a way to know whether an instance of a component is in good health or not: something beyond checking that the process is up, something that tells us that the particular instance is still performing well. All cloud-native applications should expose an HTTP endpoint for checking the health of the app or, if HTTP is not viable, at least describe a way by which health can be checked. This information can be used by the cluster manager (and potentially other pieces of the infrastructure) to make decisions such as evicting the instance or removing the corresponding endpoint from the service registry. Ultimately, exposing health checks allows the system to implement self-repairing strategies (one aspect of antifragility). A good example of a framework that lets you easily create health checks is Spring Boot Actuator, as sketched below.
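As a minimal sketch, assuming a Spring Boot application with the Actuator dependency, a custom check can be contributed by implementing HealthIndicator (the downstream dependency being checked here is hypothetical); the result is aggregated into Actuator's health endpoint, which a cluster manager can probe:

```java
import org.springframework.boot.actuate.health.Health;
import org.springframework.boot.actuate.health.HealthIndicator;
import org.springframework.stereotype.Component;

@Component
public class DownstreamHealthIndicator implements HealthIndicator {

    @Override
    public Health health() {
        boolean reachable = pingDownstream(); // hypothetical check
        if (reachable) {
            return Health.up().build();
        }
        return Health.down().withDetail("downstream", "unreachable").build();
    }

    private boolean pingDownstream() {
        // A real check would attempt a cheap call to the dependency.
        return true;
    }
}
```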
Conclusions
None of these cross-cutting concerns needs to be addressed immediately when your application is migrated to the cloud. It is therefore possible to organize the migration of workloads to the cloud as a set of incremental steps, in each of which more architectural concerns are addressed and more benefits are gained.