Making Microservices Resilient

Introduction

This repository contains instructions and tools to improve the availability and scalability of the BlueCompute sample application available at the following link

It is recommended to complete the deployment of all components of the BlueCompute application in at least one Bluemix region before following the instructions in this document to set up a resilient environment.

If you are not interested in understanding aspects like Disaster Recovery or scalability at a global level, you can ignore this project.

High Availability and Disaster Recovery

When dealing with improved resilience, it is important to distinguish between High Availability (HA) and Disaster Recovery (DR).

HA is mainly about keeping the service available to end users while "ordinary" activities are performed on the system, such as deploying updates, rebooting the hosting virtual machines, or applying security patches to the hosting OS. For our purposes, high availability within a single site can be achieved by eliminating single points of failure. The BlueCompute sample application in its current form implements high availability.

HA usually doesn't deal with major unplanned (or planned) issues such as complete site loss due to major power outages, earthquakes, severe hardware failures, full-site connectivity loss, etc. In such cases, if the service must meet strict Service Level Objectives (SLO), you should make the whole application stack (infrastructure, services and application components) redundant by deploying it in at least two different Bluemix regions. This is typically defined as a DR Architecture.

There are many options to implement DR solutions. For the sake of simplicity, we can group the different options in three major categories:

Scalability and Performance considerations

Adding resilience usually implies having redundant deployments. This redundancy can also be used to improve performance and scalability, which is the case for the Active/Active option described in the section above. For global applications, it is possible to redirect users' transactions to the closest location (to improve response time and latency) by using Global Routing solutions (like Akamai or Dyn).
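The "closest location" decision can be illustrated with a minimal Python sketch. Commercial Global Routing products make this decision for you; the code only shows the underlying logic. The function name, region names, and latency figures are assumptions made for illustration and are not part of BlueCompute.

```python
from typing import Mapping, Optional


def pick_region(latency_ms: Mapping[str, float],
                healthy: Mapping[str, bool]) -> Optional[str]:
    """Return the healthy region with the lowest measured latency, or None."""
    # Keep only regions that are currently reported as healthy.
    candidates = {region: ms for region, ms in latency_ms.items()
                  if healthy.get(region, False)}
    if not candidates:
        return None
    # Choose the region whose latency value is smallest.
    return min(candidates, key=lambda region: candidates[region])


# Hypothetical latencies measured from one user's location.
latencies = {"us-south": 120.0, "eu-de": 35.0}

print(pick_region(latencies, {"us-south": True, "eu-de": True}))   # eu-de
print(pick_region(latencies, {"us-south": True, "eu-de": False}))  # us-south
```

The second call shows how the same logic also provides failover: when the closest region is unhealthy, traffic goes to the next best one.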

Resiliency in BlueCompute

The BlueCompute sample application is designed to provide HA when running in a single location; all services are deployed as redundant ReplicaSets in Kubernetes. Kubernetes continuously monitors all containers and redeploys failed containers in case of problems.

BlueCompute can be deployed in Active/Active mode. This is the most typical scenario for modern applications, from which we demand 99.999% availability and extraordinary levels of scalability.

The diagram below shows the DR topology for the BlueCompute solution in Bluemix.
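To put the 99.999% figure in perspective, the following Python sketch converts an availability target into allowed downtime per year and estimates the combined availability of redundant sites. It assumes that site failures are independent and ignores the availability of the load balancer and of data replication, so it gives an optimistic upper bound. The 99.9% per-site figure is a hypothetical input, not a measured value for Bluemix.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def downtime_minutes_per_year(availability: float) -> float:
    """Allowed downtime per year for a given availability (0.0 to 1.0)."""
    return (1.0 - availability) * MINUTES_PER_YEAR


def combined_availability(site_availability: float, sites: int) -> float:
    """Availability when the service is up if at least one site is up.

    Assumes every site has the same availability and fails independently.
    """
    return 1.0 - (1.0 - site_availability) ** sites


# "Five nines" leaves a little over five minutes of downtime per year.
print(round(downtime_minutes_per_year(0.99999), 2))  # 5.26

# Two independent sites at a hypothetical 99.9% each.
print(round(combined_availability(0.999, 2), 6))  # 0.999999
```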

Architecture

Much of the guidance comes from this article.

Implementing Active/Active DR for BlueCompute

This section provides a step-by-step guide to implementing the Active/Active DR solution for BlueCompute.

The main steps are the following:

  1. Deploy BlueCompute to a new Bluemix region. Assuming you have already deployed BlueCompute to the Bluemix US South region, you can deploy a second instance in the Bluemix EU-DE region by following the instructions at this link again. It is strongly recommended to keep the same naming conventions in both deployments (Bluemix spaces, application names, Kubernetes service names, etc.).

  2. Configure database replication for both MySQL and Cloudant DB as described in the documents available at the links below:

  3. Configure the load balancer. For a reliable load balancing solution that routes calls to each instance, we recommend commercial solutions like Akamai Global Traffic Management or Dyn for production environments. For development (or proof of concept) environments, it is also possible to use cheaper solutions like NGINX. With NGINX it is also possible to experiment with location-based routing as documented here. Keep in mind, however, that in this case NGINX is a Single Point Of Failure (SPOF). To set up NGINX you have to:

  4. Configure automated backup. For disaster recovery, ensure that a site can recover using automated backups.

  5. Align shared secrets across sites. When BlueCompute is deployed to two separate Bluemix Public environments, it is important to keep the shared secret configuration aligned in both locations, so that calls made by clients to OAuth protected REST APIs can be routed seamlessly to either location by the front-end load balancer. For login protected pages and OAuth protected APIs, the same HS256 key must be used so that the same token is accepted by either deployment.

  6. Configure the BlueCompute Web Application and Mobile Application to point to the load balancer in front of the two deployments of BlueCompute.

  7. Test availability of the app. Tests should include taking individual worker nodes offline in one location, as well as taking an entire cluster offline.
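The requirement to share the HS256 key across sites can be illustrated with a minimal Python sketch that uses only the standard library. It shows that a token signed in one deployment verifies in the other only when both hold the same secret. The function names and secret values are assumptions made for illustration; this is not the BlueCompute implementation, it does not check expiry or other claims, and a maintained JWT library should be used in practice.

```python
import base64
import hashlib
import hmac
import json


def _b64url(data: bytes) -> str:
    """Base64url-encode without padding, as used in JWTs."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def sign_hs256(claims: dict, secret: bytes) -> str:
    """Build a compact token: header.payload.signature (HMAC-SHA256)."""
    header = {"alg": "HS256", "typ": "JWT"}
    signing_input = ".".join(
        _b64url(json.dumps(part, separators=(",", ":")).encode("utf-8"))
        for part in (header, claims)
    )
    digest = hmac.new(secret, signing_input.encode("utf-8"), hashlib.sha256).digest()
    return signing_input + "." + _b64url(digest)


def verify_hs256(token: str, secret: bytes) -> bool:
    """Check only the signature; claims such as expiry are not validated."""
    signing_input, sep, signature = token.rpartition(".")
    if not sep or not signing_input:
        return False
    expected = hmac.new(secret, signing_input.encode("utf-8"), hashlib.sha256).digest()
    # Constant-time comparison of the encoded signatures.
    return hmac.compare_digest(_b64url(expected).encode("ascii"),
                               signature.encode("utf-8"))


# Hypothetical secrets: both sites must be configured with the same value.
us_south_secret = b"hypothetical-shared-secret"
eu_de_secret = b"hypothetical-shared-secret"

token = sign_hs256({"sub": "user"}, us_south_secret)   # issued in one site
print(verify_hs256(token, eu_de_secret))               # True: secrets match
print(verify_hs256(token, b"a-different-secret"))      # False: secrets differ
```

If the two sites held different keys, a user whose request is routed to the other location after login would be rejected, which is why the secrets must be aligned before the load balancer is put in front of both deployments.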

At this point, it should be possible to use BlueCompute Mobile App and BlueCompute Web Application even when one of the two sites is unavailable.
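The availability test can be partly automated. The Python sketch below probes one health URL per site and reports whether at least one site still answers, so it can be run while worker nodes or a whole cluster are taken offline. The URLs, the `/health` path, and the function names are hypothetical; substitute the endpoints your deployments actually expose.

```python
import urllib.error
import urllib.request
from typing import Callable, Dict


def http_probe(url: str, timeout: float = 3.0) -> bool:
    """Return True if the URL answers with a 2xx status within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, timeout, or HTTP error status.
        return False


def check_sites(urls: Dict[str, str],
                probe: Callable[[str], bool] = http_probe) -> Dict[str, bool]:
    """Probe every site once and return its up/down status by name."""
    return {name: probe(url) for name, url in urls.items()}


def service_survives(status: Dict[str, bool]) -> bool:
    """The service is reachable if at least one site is up."""
    return any(status.values())


# Hypothetical endpoints for the two deployments.
SITES = {
    "us-south": "https://bluecompute-us.example.com/health",
    "eu-de": "https://bluecompute-eu.example.com/health",
}

# Simulate an outage of one site with a stub probe (no network access needed).
down = {SITES["us-south"]}
status = check_sites(SITES, probe=lambda url: url not in down)
print(status)                    # {'us-south': False, 'eu-de': True}
print(service_survives(status))  # True
```

Calling `check_sites(SITES)` without the stub uses real HTTP requests. Probing the sites directly, and not only the load balancer address, makes it visible which location is actually serving traffic during the drill.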