High Availability With A Hosted SaaS Solution
In a real-time, extremely demanding industry like ours, time truly is money. For CAKE, a service interruption would not only damage our reputation, as it would for any online service, but would also translate into quantifiable dollars lost.
So, what are some of the ways modern “always on” solutions are architected? Let’s take a look at some of the considerations of creating a highly available online service from scratch.
At the bottom of our stack we have the actual server and network hardware the software runs on. A very common way of providing both fault tolerance and the ability to scale horizontally is the use of a load balancer. These devices sit between your users and your web servers. They distribute each incoming request to one of several servers that are equipped to field that request. For example, let’s say you host a website and you would like to ensure you can continue to serve your customers even if an entire web server goes offline. A load balancer can automatically detect that a web server has failed and stop routing incoming requests to it. This is seamless to your users, so your service continues uninterrupted. One thing to consider here is how much spare capacity you have in the remaining servers if one or more fail. If you had four web servers handling all the load for your web service and one fails unexpectedly, are the remaining three capable of handling the entire load? How quickly can you scale back out?
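To make the load balancer behavior described above concrete, here is a minimal sketch in Python. It is an illustration only, not how any particular load balancer product works: the class name, the server names, and the simple round-robin policy are all assumptions, and a real device would run its own health probes rather than being told which servers are down.

```python
class LoadBalancer:
    """Toy round-robin load balancer that skips servers marked unhealthy."""

    def __init__(self, servers):
        self.servers = list(servers)      # fixed pool, in rotation order
        self.healthy = set(self.servers)  # servers currently passing health checks
        self._next = 0                    # index of the next server to try

    def mark_down(self, server):
        # In a real system a failed health probe would trigger this.
        self.healthy.discard(server)

    def mark_up(self, server):
        if server in self.servers:
            self.healthy.add(server)

    def route(self):
        """Return the next healthy server, or raise if none are left."""
        for _ in range(len(self.servers)):
            server = self.servers[self._next]
            self._next = (self._next + 1) % len(self.servers)
            if server in self.healthy:
                return server
        raise RuntimeError("no healthy servers available")


# Hypothetical pool of four web servers; web2 fails and is skipped.
lb = LoadBalancer(["web1", "web2", "web3", "web4"])
lb.mark_down("web2")
picks = [lb.route() for _ in range(6)]
```

In this sketch the requests simply rotate across web1, web3, and web4 once web2 is marked down, which is the "seamless to your users" behavior described above. Note that the three survivors now share the traffic that four servers used to handle.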
Taking another step back, what would happen if the data center your servers are located in experiences a major environmental or network-related outage? All your meticulous planning, server redundancy, and load balancing go right out the window if your users aren’t able to reach any of it. Geographic separation is another important piece of the puzzle. Ideally, you would have your entire redundant server stack replicated in at least one other data center.
A commonly overlooked aspect of this part of the design is just how much geographic separation is enough. There have been many instances where solutions are deployed in different locations in the same city or state. This makes management easier if you need staff to physically service both data centers, but it isn’t quite as resilient as one might think. While rare, it is not unheard of for an upstream network provider to experience a major regional problem. This can result, and has resulted, in multiple, seemingly separate data centers being unreachable simultaneously. Additionally, the same capacity issue arises. If both data centers were sharing the load for your service and one is suddenly not available, will the doubled demand on the remaining site overload it?
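The capacity question, whether for four web servers or two data centers, comes down to simple arithmetic. The sketch below assumes load is spread evenly across identical units and is redistributed evenly among the survivors; the function names and the utilization figures are made up for illustration.

```python
def utilization_after_failure(units, utilization, failed=1):
    """Utilization of each surviving unit once `failed` units drop out.

    `utilization` is the fraction of capacity each unit uses before the
    failure (0.70 means 70% busy). Assumes identical units and even load.
    """
    if failed >= units:
        raise ValueError("at least one unit must survive")
    return units * utilization / (units - failed)


def max_safe_utilization(units, failed=1):
    """Highest pre-failure utilization that survivors can still absorb."""
    if failed >= units:
        raise ValueError("at least one unit must survive")
    return (units - failed) / units


four_servers = utilization_after_failure(4, 0.70)  # about 0.93, still under 1.0
two_sites = utilization_after_failure(2, 0.60)     # 1.2, i.e. overloaded
```

Under these assumptions, four servers running at 70% can lose one and still cope, but two data centers each running at 60% cannot: the surviving site would need 120% of its capacity. With two sites sharing the load, each has to stay at or below 50% utilization to absorb the loss of the other.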
Now let’s take a look at the top layer. You’ve got your multiple data centers full of load balancers and redundant servers and everything is working great. Your next decision is what type of failover model to use. There are a few strategies to choose from that would be appropriate here. The most common for this use case are active-active and active-passive (warm stand-by). In an active-active scenario, all of our data centers participate in serving our product. In an active-passive scenario, as the name suggests, one data center serves all the traffic while the other runs in stand-by mode with no traffic going to it. If there were a problem at the active site, we could change our DNS records to point traffic to the stand-by data center instead.
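As a rough sketch of the active-passive model, the Python below tracks health checks against the primary data center and switches to the stand-by after several consecutive failures. Everything here is hypothetical: the class, the data center names, and the threshold of three are assumptions, and the point where a real system would update its DNS records is only marked with a comment.

```python
class ActivePassiveFailover:
    """Toy failover controller for one primary and one warm stand-by site."""

    def __init__(self, primary, standby, failure_threshold=3):
        self.primary = primary
        self.standby = standby
        self.failure_threshold = failure_threshold
        self.active = primary  # site that DNS currently points to
        self._failures = 0     # consecutive failed checks of the primary

    def record_check(self, primary_healthy):
        """Record one health check result and return the active site."""
        if primary_healthy:
            self._failures = 0
        else:
            self._failures += 1
        if self.active == self.primary and self._failures >= self.failure_threshold:
            # A real deployment would change its DNS records here.
            self.active = self.standby
        return self.active


failover = ActivePassiveFailover("dc-primary", "dc-standby")
history = [failover.record_check(ok) for ok in (False, False, True, False, False, False, True)]
```

Requiring several consecutive failures avoids failing over on a single dropped check. This sketch deliberately does not fail back automatically when the primary recovers; that is a design choice many teams prefer to make by hand.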
The option we will go with here is an active-active solution. This offers us an additional feature that comes in really handy in a latency-sensitive product like a tracking solution. If we have both of our data centers fielding requests, we can route end users to the data center closest to them. There are a number of enterprise-level DNS providers that offer this latency-based routing feature. When someone attempts to access one of our URLs, the DNS service looks at their IP address and then answers their request with the IP address of our site closest to them. Another really cool technology we can leverage here is the ability of the DNS service to monitor the health of our data centers. Just like the load balancer can detect failed web servers and stop routing to them, the DNS service can detect failed data centers and stop sending traffic there.
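Here is a small Python sketch of that idea: answer each lookup with the closest data center that is currently healthy. It is not any DNS provider's actual API. The data center names, client locations, and latency figures are invented, and the IP addresses come from ranges reserved for documentation.

```python
# Hypothetical data centers and their public IP addresses.
DATA_CENTERS = {
    "us-east": "192.0.2.10",
    "eu-west": "198.51.100.10",
}

# Hypothetical measured latency (ms) from each client location to each site.
LATENCY_MS = {
    "new-york": {"us-east": 12, "eu-west": 85},
    "london": {"us-east": 80, "eu-west": 9},
}


def resolve(client_location, healthy):
    """Return the IP of the lowest-latency data center that is healthy."""
    candidates = {
        dc: ms for dc, ms in LATENCY_MS[client_location].items() if dc in healthy
    }
    if not candidates:
        raise RuntimeError("no healthy data centers available")
    closest = min(candidates, key=candidates.get)
    return DATA_CENTERS[closest]


normal = resolve("london", healthy={"us-east", "eu-west"})  # eu-west's IP
during_outage = resolve("london", healthy={"us-east"})      # falls back to us-east
```

When both sites are healthy, the London user is sent to eu-west. If eu-west fails its health checks, the same lookup falls back to us-east, trading higher latency for continued service.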
As you can tell from this high-level overview, there is an ever-evolving landscape of new technologies and techniques. Like any good service provider, at CAKE we love to stay abreast of the most reliable ways of providing best-in-class service to our customers!
Happy architecting!