Have you thought about how much downtime is acceptable to your organization? How long will it take before your customers notice and start calling your help desk? How much will it cost you to respond to and manage the outage? The cloud service providers do a great job of handling high availability within their data centers. Most even provide options for automatic scalability as your demand increases. But your application's uptime within a data center is irrelevant if the data center itself becomes unavailable.
The Amazon EC2 SLA, Microsoft Azure App Service SLA, and Google Cloud App Engine SLA all commit to 99.95% availability, which allows for roughly 22 minutes of downtime per month. For larger enterprises and highly transactional businesses, this may not be acceptable.
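To make the arithmetic behind an availability percentage concrete, here is a minimal sketch that converts an SLA figure into a downtime budget. The function name and the 30-day default are assumptions for illustration; the exact budget depends on the length of the billing month (a 30-day month at 99.95% works out to about 21.6 minutes).

```python
def downtime_budget_minutes(availability_pct: float, days: int = 30) -> float:
    """Minutes of downtime permitted over `days` at the given availability."""
    total_minutes = days * 24 * 60
    return total_minutes * (1 - availability_pct / 100)


# Compare a few common availability targets over a 30-day month.
for sla in (99.9, 99.95, 99.99):
    minutes = downtime_budget_minutes(sla)
    print(f"{sla}% availability allows {minutes:.1f} minutes of downtime")
```

Running the same calculation against your own tolerance for downtime is a quick way to see whether a single-region SLA is enough.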
During one of our early cloud projects, we touted the benefits of the cloud infrastructure to the stakeholders and achieved their buy-in. Within a couple of weeks of our initial rollout, we were hit with the inevitable outage. Although the outage was relatively short-lived, the end users noticed it and we had some egg on our faces. After the problem was resolved, we met with our stakeholders to perform a post-mortem analysis. While the timing was unfortunate, we had all experienced similar outages in our own data centers in the past, and we agreed that the public cloud still provided superior service and cost advantages over internal hosting. We accepted that outages were going to be a part of life in the cloud, but we did not accept that we were helpless to mitigate them, and we committed to finding a solution.
The only true solution to a regional outage is to run the application in multiple regions. This requires a financial commitment, because bringing the second region online effectively doubles your cost. Fortunately, cloud costs are very reasonable, and the additional resources available to your solution can provide an improved level of performance for your end users while both regions are online.
Now that you are running in more than one region, how do you gracefully handle a regional outage? This is where advanced DNS services come in.
These services allow you to configure a set of equivalent endpoints to which clients can be routed. Under normal operations, you can use such a service to optimize your end users' experience by routing them to the closest server. But in a regional outage scenario, the service will detect that one of the endpoints is offline and stop routing traffic to it. This feature is achieved by using endpoint monitoring.
Endpoint monitoring is pretty basic stuff: it requests a given path and simply checks for an HTTP 200 OK status. We are now prepared to handle a regional outage, but these advanced DNS services do nothing to protect against any other application-level outages that may occur. At least not by default.
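The kind of check a DNS service performs can be approximated in a few lines. This is only a sketch of the idea, not any provider's actual implementation: `probe`, `healthy_endpoints`, and the example URLs in the comment are all hypothetical names chosen for illustration.

```python
import urllib.error
import urllib.request


def probe(url, timeout=5.0, opener=urllib.request.urlopen):
    """Return True only if the endpoint answers with HTTP 200."""
    try:
        with opener(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # urlopen raises for 4xx/5xx responses, refused connections,
        # DNS failures, and timeouts; all of them count as unhealthy.
        return False


def healthy_endpoints(urls, check=probe):
    """Keep only the endpoints that currently pass the probe."""
    return [url for url in urls if check(url)]


# Example usage (hypothetical URLs):
# healthy_endpoints([
#     "https://east.example.com/status",
#     "https://west.example.com/status",
# ])
```

Traffic is then routed only to the endpoints that remain in the list. Note that the probe knows nothing about the application beyond the status code it gets back.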
We decided to create a custom application status page on our website. When this page was hit, it would check the availability of application dependencies – databases, storage, and service bus. If any of these services were unavailable, the page would return an HTTP 503 Service Unavailable response. This allowed us to use this custom page as the DNS monitoring endpoint and provide additional context under which to take an endpoint offline.
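A status page along these lines might look like the following sketch, written with Python's standard `http.server` module rather than whatever framework your site uses. The three `check_*` functions are hypothetical placeholders: in a real application each would attempt a cheap operation against the dependency and raise an exception on failure. The `/status` path is also an assumption.

```python
from http.server import BaseHTTPRequestHandler


def check_database():
    """Placeholder: run a trivial query; raise if it fails."""


def check_storage():
    """Placeholder: read a known blob or file; raise if it fails."""


def check_service_bus():
    """Placeholder: open a connection to the bus; raise if it fails."""


CHECKS = {
    "database": check_database,
    "storage": check_storage,
    "service_bus": check_service_bus,
}


def check_dependencies(checks):
    """Run every check and record True (available) or False (failed)."""
    results = {}
    for name, check in checks.items():
        try:
            check()
            results[name] = True
        except Exception:
            results[name] = False
    return results


def status_code_for(results):
    """200 when every dependency is available, otherwise 503."""
    return 200 if all(results.values()) else 503


class StatusHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/status":
            self.send_error(404)
            return
        code = status_code_for(check_dependencies(CHECKS))
        body = b"OK" if code == 200 else b"UNAVAILABLE"
        self.send_response(code)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


# To serve it locally:
# from http.server import HTTPServer
# HTTPServer(("", 8080), StatusHandler).serve_forever()
```

Because the DNS monitor only looks at the status code, pointing it at this path is enough to take a region out of rotation when one of its dependencies fails.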
As we used this model, we made several improvements to get additional value from this page. First, we formatted the page results as JSON so we could easily consume them from our monitoring systems. Second, we added some basic system information, such as library versions, to the output to track running versions (be careful not to divulge any sensitive information). Finally, we added caching to retain the status in memory for a couple of minutes. Checking all of these resources ended up being an expensive call, so the cache also became a safeguard against a malicious actor hitting the page repeatedly to overload the application.
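The JSON output and the in-memory cache can be sketched as follows. The class and function names, the JSON field names, and the 120-second lifetime are assumptions made for illustration; the only version reported here is the Python runtime, standing in for whatever library versions you choose to expose.

```python
import json
import platform
import threading
import time


class CachedStatus:
    """Recompute the status at most once per `ttl_seconds`."""

    def __init__(self, compute, ttl_seconds=120.0, clock=time.monotonic):
        self._compute = compute
        self._ttl = ttl_seconds
        self._clock = clock
        self._lock = threading.Lock()
        self._expires_at = None
        self._value = None

    def get(self):
        # The lock keeps concurrent requests from all running the
        # expensive dependency checks at the same moment.
        with self._lock:
            now = self._clock()
            if self._expires_at is None or now >= self._expires_at:
                self._value = self._compute()
                self._expires_at = now + self._ttl
            return self._value


def build_status_json(results):
    """Serialize dependency results plus basic version information."""
    return json.dumps({
        "healthy": all(results.values()),
        "dependencies": results,
        "versions": {"python": platform.python_version()},
    })
```

With the cache in front of the checks, a flood of requests to the status page triggers at most one round of dependency checks per cache lifetime. The trade-off is that an outage can go unreported for up to that long, so the lifetime should be chosen with the DNS service's probing interval in mind.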
Public cloud offerings have matured significantly over the last few years and generally provide much better results than any internal department could deliver. Even so, it is important to assess your availability requirements and plan for an outage before it happens. Current service offerings allow you to easily run in multiple regions and ride out the occasional outages that will hit each provider from time to time.