Most developers understand the importance of redundancy at different levels in a web application’s architecture.
However, events like the Sept. 20 Amazon Web Services (AWS) outage prove that leveraging tools in just a single region, even with the redundancy they provide across availability zones, is still not enough. The only method of absorbing these types of outages is to spread the load across two or more regions. While cross-region redundancy requires major architectural work and increased cost, any service claiming to maintain even a 99.9% SLA must build applications with full regional failures in mind.
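To put the 99.9% figure in perspective, the short sketch below converts an SLA percentage into the downtime it allows. The function name and the 30-day billing period are assumptions chosen for illustration, not part of any provider's SLA definition. Under these assumptions, 99.9% leaves roughly 43 minutes per month, so a single regional outage that lasts longer than that exhausts the whole budget.

```python
def downtime_budget_minutes(sla_percent: float, period_days: float = 30) -> float:
    """Minutes of downtime allowed in the period before the SLA is breached."""
    total_minutes = period_days * 24 * 60
    return total_minutes * (1 - sla_percent / 100)


# Compare a few common SLA levels over an assumed 30-day period.
for sla in (99.0, 99.9, 99.99):
    print(f"{sla}% over 30 days: {downtime_budget_minutes(sla):.1f} minutes")
```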
Redundancy at the data layer has traditionally received the most emphasis, as data loss is one of the worst-case scenarios. For this reason many database platforms support a combination of clustering, replication and mirroring to ensure at least two copies of data are maintained at all times.
Built-in redundancy in hosting/serving static content has been trivial for many years with the proliferation of CDN providers and storage frameworks like Amazon’s S3.
Finally, redundancy at the services/processing layer has been made easier by cloud providers with tools like load balancers, auto-scaling, distributed caching and NoSQL database options, as well as technologies like Docker and Amazon’s Lambda. It’s great to see these tools becoming the standard, as it means a more resilient and reliable Internet for all.
The main decision to be made at each tier and for each application in a cross-region redundant architecture is whether it should operate in an active/passive manner or active/active.
Active/passive means processing occurs in only one location at a time; the secondary location takes over only if the primary has failed.
Active/active means processing can occur in two or more locations in tandem. Certain application requirements and technologies can limit the use of an active/active architecture, but wherever possible, we highly recommend this structure. It not only makes more effective use of resources (idle resources have a massive opportunity cost), but it also removes the need for separate, regularly scheduled failover tests for regional disaster recovery, since every region is already serving production traffic. In addition, it opens up the opportunity to reduce latency worldwide by applying latency-based DNS routing to send users to their closest datacenter.
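The difference between the two modes can be sketched as a routing decision. This is a simplified model, not how any particular DNS service is implemented: the Region class, the health flag, the latency values and the us-east-1 name are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Region:
    name: str
    healthy: bool      # assumed to come from some external health check
    latency_ms: float  # assumed latency measured from the user's location


def route_active_passive(primary: Region, secondary: Region) -> str:
    """Send all traffic to the primary; use the secondary only if the primary fails."""
    if primary.healthy:
        return primary.name
    if secondary.healthy:
        return secondary.name
    raise RuntimeError("no healthy region available")


def route_active_active(regions: list[Region]) -> str:
    """Send the user to the lowest-latency region that is currently healthy."""
    healthy = [r for r in regions if r.healthy]
    if not healthy:
        raise RuntimeError("no healthy region available")
    return min(healthy, key=lambda r: r.latency_ms).name


# Hypothetical example: the closer region is down, so traffic goes to the other one.
east = Region("us-east-1", healthy=False, latency_ms=20)
west = Region("us-west-2", healthy=True, latency_ms=80)
print(route_active_active([east, west]))  # us-west-2
```

In the active/active case every region is a valid target at all times, so a regional failure is handled by the same logic that handles ordinary requests.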
Furthermore, regardless of active/active vs. active/passive, a DNS service that supports health checks with automatic failover is imperative. The main complication of active/active in the event of a full regional outage is that your other region(s) will need to be able to handle the full load of both regions. Fine-tuned auto-scaling makes this much easier, but we recommend load tests that double traffic volume in a short period to ensure applications can scale quickly in those worst-case scenarios.
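The reason for doubling traffic in a load test follows from simple arithmetic. The sketch below assumes traffic is split evenly across regions, which real deployments rarely match exactly; the function name is hypothetical.

```python
def failover_scale_factor(regions: int, failed: int = 1) -> float:
    """Factor by which each surviving region's load grows after a failure,
    assuming traffic was split evenly across all regions beforehand."""
    surviving = regions - failed
    if surviving <= 0:
        raise ValueError("at least one region must survive")
    return regions / surviving


# With two regions, losing one doubles the load on the other.
for n in (2, 3, 4):
    print(f"{n} regions, 1 failed: x{failover_scale_factor(n):.2f} load per survivor")
```

Under this assumption, two regions give the worst case of a 2x jump, which is the scenario the load test should cover; adding a third region lowers the jump to 1.5x.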
While architecting for regional redundancy is more complex and inhibits some organizations from implementing it, we recognize that cost is the main deterrent for a majority of companies. Many people believe cost grows linearly when adding new datacenters, but this is only the case if all services are active/passive and there is no ability to auto-scale.
It should actually be a relatively small increase in cost to add new datacenters if applications are architected accordingly. And when considering the performance and reliability enhancements inherent with latency-based DNS routing, the cost is more than justified. Additionally, in the event that an outage does occur, the (opportunity) cost during the outage period can quickly dwarf the hosting cost increases discussed.
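A toy cost model illustrates why the growth need not be linear. Everything here is an assumption made for the sake of the example: the split into compute cost and a fixed per-region overhead (load balancers, replication and so on), the idea that auto-scaling adds capacity only during a failover, and the dollar figures, which are invented and not taken from any real bill.

```python
def cost_active_passive(regions: int, compute: float, overhead: float) -> float:
    """Every region is fully provisioned for the entire load at all times."""
    return regions * (compute + overhead)


def cost_active_active(regions: int, compute: float, overhead: float) -> float:
    """Normal load is split across regions, so total compute stays roughly
    constant; only the fixed per-region overhead is multiplied."""
    return compute + regions * overhead


# Hypothetical monthly figures: 10,000 for compute, 500 fixed overhead per region.
for n in (1, 2, 3):
    passive = cost_active_passive(n, 10_000, 500)
    active = cost_active_active(n, 10_000, 500)
    print(f"{n} region(s): active/passive={passive:,.0f} active/active={active:,.0f}")
```

The model ignores cross-region data transfer and the temporary cost of scaled-up capacity during a failover, both of which would narrow the gap somewhat.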
Whether you’re a SaaS company that must credit customers for a breach of SLA or a site driven by ad revenue that misses out on hours of demand, the insurance of multiple regions cannot be ignored.
The CAKE Approach
At CAKE, we leverage DynamoDB for a few of our applications and were able to respond immediately Sunday morning when the outage occurred. Because our applications are built to be active/active, we were able to quickly shift all traffic from Virginia to Oregon (us-west-2), avoiding any major downtime for our clients.
With a digital marketing tracking platform that runs 24/7, CAKE understands the damage done when our clients lose service and as such, we take reliability and uptime very seriously. We know our clients have little patience for outages, so if we want to retain our customers we must stay up. Bain & Company has stated that “…a 5% increase in customer retention can mean a 30% increase in profitability for the company.”
Maintaining our SLA is the CAKE engineering team’s highest priority, and that was proven Sunday morning. We encourage SaaS customers not to accept it when a vendor blames its hosting provider for a regional outage and treats the loss as unavoidable. While we’ve heard other tracking platforms take this approach, this is something CAKE will never do. When events like the one on Sunday occur, it becomes clear that not all platforms can really back up their claims of true reliability, but we at CAKE are proud to show off our continuous effort to keep our service running at all costs.