How CAKE Was Put to the Test and Passed – The AWS Outage
Last week Amazon Web Services (AWS) had a 16-hour regional outage. Though this widespread incident sent many companies into high-gear emergency drills, CAKE was well prepared and ready to handle the situation. The outage highlighted the value CAKE’s team and technology deliver to our clients: we quickly took action to mitigate the issue, and our technology enabled us to maintain business continuity.
What happened
AWS is the largest cloud-computing services and infrastructure provider, holding 45 percent of the global market in 2019. The Nov. 25 outage left some businesses offline entirely and prevented consumers from accessing services. Roku, iRobot’s Roomba app, the Amazon-owned Ring security service, Target-owned Shipt, transportation services like the New York City Subway, and online publications like The Washington Post were among the many AWS customers that suffered outages or degraded service.
As an AWS customer, CAKE was also directly affected by the outage. AWS, having maintained reliable service for many years, clearly knows what it is doing. However, no company or technology platform is immune to problems. At CAKE we plan for this. Preparing for this type of catastrophic event is a key factor in our platform design and architecture decisions. As a result, only a portion of our customers were affected, and they experienced minimal disruption compared to the many other companies that faced hours of inoperability.
CAKE’s response – A proactive approach by design
CAKE’s regional-resiliency model protected our customers even though AWS’s largest, most populous region (and likewise, CAKE’s most populous service region) was inoperable. This model hosts each customer’s data in disparate regions in an “active” state, which provides immediately available, up-to-date hosting and access to the client system in multiple regions around the world. The extensive time and resources we’ve put into this powerful architectural design go beyond the hardware-redundancy model many businesses employ. Rather than simply protecting against a single hardware failure, it also protects against the effects of regional disruptions such as natural disasters, war, power failures, outages of major internet backbones, and other types of regional catastrophes. In practice, this design gives CAKE the ability to quickly mitigate regional problems and protect our customers’ businesses by shifting traffic and services for all clients to other global regions in near real time.
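To make the idea of shifting traffic between active regions concrete, here is a minimal Python sketch. It is illustrative only and is not CAKE’s actual routing code: the region names, the percentage weights, and the `shift_traffic` function are all assumptions made for this example. It shows one simple policy, in which the share of traffic served by an unhealthy region is redistributed proportionally across the regions that remain healthy.

```python
from typing import Dict


def shift_traffic(weights: Dict[str, float], healthy: Dict[str, bool]) -> Dict[str, float]:
    """Redistribute traffic from unhealthy regions onto healthy ones.

    `weights` maps a (hypothetical) region name to its current share of
    traffic; `healthy` maps the same names to a health flag. Healthy
    regions keep their relative proportions and absorb the full load.
    """
    total = sum(weights.values())
    healthy_total = sum(w for region, w in weights.items() if healthy.get(region, False))
    if healthy_total == 0:
        # Nothing left to absorb the traffic; a human has to step in.
        raise RuntimeError("no healthy region available")

    return {
        region: (w / healthy_total * total if healthy.get(region, False) else 0.0)
        for region, w in weights.items()
    }


# Hypothetical example: the largest region goes down.
current = {"region-a": 60, "region-b": 30, "region-c": 10}
status = {"region-a": False, "region-b": True, "region-c": True}
print(shift_traffic(current, status))
```

Because every region already holds an active, up-to-date copy of the data, a policy like this only has to change where requests are sent; nothing needs to be restored first.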
In many cases, these protective actions, which include shifting traffic, happen automatically based on event triggers and alarms. At CAKE we have hundreds of predetermined triggers to avoid any disruption to our customers’ businesses. Our extensive monitoring systems automatically shift traffic or scale up infrastructure in response to errors, increases in latency, or system resource pressure. However, some scenarios require manual intervention, either because the action has not yet been automated or because the services our event triggers rely on are themselves down. At CAKE, we have measures in place to step in proactively whenever manual intervention is required. Our clients benefit from a 24/7 team of Live Operations (LiveOps) engineers who actively monitor our platform and respond in real time to any potential threats or issues that could interfere with our infrastructure.
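The trigger-based approach described above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not CAKE’s monitoring system: the metric names, the thresholds, the action labels, and the `Trigger` and `evaluate` names are all hypothetical. The sketch checks a metrics snapshot against a list of predetermined triggers, and falls back to paging a human when the metrics themselves are unavailable.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

Metrics = Dict[str, float]


@dataclass
class Trigger:
    name: str
    condition: Callable[[Metrics], bool]  # returns True when the trigger fires
    action: str                           # hypothetical action label


# Hypothetical triggers and thresholds, chosen only for illustration.
TRIGGERS = [
    Trigger("high_error_rate", lambda m: m["error_rate"] > 0.05, "shift_traffic"),
    Trigger("high_latency", lambda m: m["p95_latency_ms"] > 800, "shift_traffic"),
    Trigger("cpu_pressure", lambda m: m["cpu_utilization"] > 0.85, "scale_up"),
]


def evaluate(metrics: Optional[Metrics], triggers: List[Trigger] = TRIGGERS) -> List[str]:
    """Return the de-duplicated list of actions to take for one snapshot."""
    if metrics is None:
        # The monitoring source itself is down, so automation cannot decide.
        return ["page_liveops"]

    actions: List[str] = []
    for trigger in triggers:
        if trigger.condition(metrics) and trigger.action not in actions:
            actions.append(trigger.action)
    return actions


snapshot = {"error_rate": 0.20, "p95_latency_ms": 1200, "cpu_utilization": 0.50}
print(evaluate(snapshot))  # both error and latency triggers map to one action
print(evaluate(None))      # no metrics available: hand off to a person
```

The `None` branch mirrors the case where the services the triggers depend on are unavailable, which is where a staffed operations team becomes the safety net.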
This scenario played out on Wednesday. Thanks to the forward-thinking vision and design of our platform and the effective action of our LiveOps team, only a portion of customers were affected by the AWS outage, and the resulting impact was minimal. Additionally, when traffic was moved to other regions, the LiveOps team responded to the new infrastructure demand in those regions by scaling our system accordingly within minutes.
I’m extremely proud of our CAKE team. Their deep knowledge and technical vision consistently deliver revenue-driving benefits to our customers. This past week’s large-scale outage is a prime example of our CAKE team’s efficiency, effectiveness, and innovation. While others flatlined, CAKE kept going by taking action to quickly mitigate the issue and maintain uptime for our clients’ businesses.
To learn more about CAKE’s business and technology stability, read our blog post “How to Prepare for Growth by Relying on Stable Technology.”