The Calm After the Cloud Storm – Our Take on the AWS S3 Outage
I have been using the public cloud since 2008. In the early AWS years, it was not uncommon for AWS to experience 2 or 3 major outages a year. Recently, outages have been very rare. Here is a list of major issues over the last few years.
- In 2013, the EBS (Elastic Block Store) API was unavailable, essentially taking many applications and companies down for several hours.
- In 2015, DynamoDB was unavailable, causing outages for companies that leveraged its API.
- In 2016, a DDoS attack affected Dyn, a DNS provider that AWS quickly stopped using, routing traffic to other providers instead. Users experienced minimal collateral damage.
- On February 28, 2017, human error caused a major disruption of S3 (Amazon’s Simple Storage Service) and cascading performance issues on a number of other AWS APIs. Many companies experienced outages.
This latest outage had the biggest customer impact in four years. So naturally, the response from social media and the blogosphere was that hybrid clouds or data centers are the answer. Server huggers everywhere rejoiced and said “See, I told you so” as they prepared to recycle millions of dollars of infrastructure while adding no business value.
I totally disagree with these assertions. The answer is always to “design for failure” on the cloud provider you have put your chips on.
There is nothing wrong with pursuing a hybrid or multi-cloud strategy. However, suggesting that the best way to combat outages like the S3 event we just witnessed is to pursue multiple cloud providers is short-sighted. Before investing the time, money, and resources to pursue a multi-cloud strategy for the sole purpose of redundancy and HA (high availability), consider these things:
- Make sure your current architecture is designed for failure. For example, AWS recommends cross-region replication for S3 for HA and DR (see the replication sketch after this list).
- Test your systems for failure. If you just assume every AWS API will always work and don’t design and test for use cases where they fail, you get what you deserve when the service is down. Simulate outages, test HA and recovery solutions, leverage the Simian Army, etc. (a minimal failure-simulation sketch also follows this list).
- Understand the true TCO (total cost of ownership) of adding another cloud vendor to the mix and don’t underestimate the effort.
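For the first point, here is a minimal sketch of what S3 cross-region replication setup can look like with boto3. The bucket names, account ID, and role ARN are hypothetical placeholders, and replication requires versioning on both buckets, so the sketch enables that first:

```python
import boto3

s3 = boto3.client("s3")

# Cross-region replication requires versioning on both the source and the
# destination bucket. Both bucket names are hypothetical placeholders.
for bucket in ("my-app-data", "my-app-data-replica"):
    s3.put_bucket_versioning(
        Bucket=bucket,
        VersioningConfiguration={"Status": "Enabled"},
    )

# Replicate every object to a bucket in another region, so a regional S3
# disruption does not leave you without a second copy of your data.
s3.put_bucket_replication(
    Bucket="my-app-data",
    ReplicationConfiguration={
        "Role": "arn:aws:iam::123456789012:role/s3-replication",  # placeholder ARN
        "Rules": [{
            "ID": "replicate-everything",
            "Prefix": "",  # empty prefix = all objects
            "Status": "Enabled",
            "Destination": {"Bucket": "arn:aws:s3:::my-app-data-replica"},
        }],
    },
)
```

The destination bucket must live in a different region for this to protect you against a regional outage.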
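For the second point, you don’t need the full Simian Army to start. As one minimal illustration (my example, not a prescribed tool), botocore’s Stubber can force an S3 client to return the same errors a real outage produces, letting you verify your application degrades gracefully instead of assuming the API is always up:

```python
import boto3
from botocore.stub import Stubber

s3 = boto3.client("s3", region_name="us-east-1")
stubber = Stubber(s3)

# Force the next GetObject call to fail the way an overloaded or downed
# service would, so the test exercises the application's fallback path.
stubber.add_client_error(
    "get_object",
    service_error_code="ServiceUnavailable",
    service_message="Service is unavailable.",
    http_status_code=503,
)

with stubber:
    try:
        s3.get_object(Bucket="my-app-data", Key="config.json")  # placeholder names
    except Exception as exc:
        # A real application would fall back here: read from a cache, a
        # replica bucket in another region, or enter a degraded mode.
        print(f"S3 call failed as simulated: {exc}")
```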
Is multi-cloud worth it just as a safety net? I think with all the buzz around containers and tools like Terraform, there is a perception that whatever you build on one cloud is easy to port to another. Let me dispel that myth.
I agree that the OS and virtual machines can be very portable. Wrap them in a container and they should run easily on any cloud or non-cloud provider (assuming your setup is cloud agnostic). But the OS and virtual machines are only a small part of the story. What about identity and access management (IAM), Virtual Private Cloud (VPC) design, security tooling, networking, etc.?
I have yet to find a viable solution that abstracts the cloud vendor-specific APIs around security and networking so that you can design once and port many times. You could argue that a PaaS solution handles that for you, but there is still a one-time integration that takes place for each cloud. There are many good arguments for and against pure-play PaaS solutions, which I’ll leave for another post on another day.
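To make that concrete, here is a sketch of just the AWS half of a very common requirement: letting compute instances read an object storage bucket. The role, policy, and bucket names are hypothetical. On GCP the same intent is expressed with service accounts and IAM policy bindings, and on Azure with managed identities and role assignments; none of the objects below has a direct, portable equivalent.

```python
import json
import boto3

iam = boto3.client("iam")

# AWS models "this compute can read that bucket" as a role with a trust
# policy plus an inline permissions policy. All names are hypothetical.
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "ec2.amazonaws.com"},
        "Action": "sts:AssumeRole",
    }],
}

iam.create_role(
    RoleName="app-bucket-reader",
    AssumeRolePolicyDocument=json.dumps(trust_policy),
)

iam.put_role_policy(
    RoleName="app-bucket-reader",
    PolicyName="read-app-bucket",
    PolicyDocument=json.dumps({
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Action": ["s3:GetObject"],
            "Resource": "arn:aws:s3:::my-app-bucket/*",
        }],
    }),
)
```

Every line of this is AWS-specific. Porting it to another cloud means rewriting it, not redeploying it.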
It takes a great amount of upfront work to harden what AWS calls the “landing zone,” the infrastructure required to run your apps. When moving to your next cloud provider, much of the landing zone buildout is not portable. Each cloud vendor has its own APIs for security and networking capabilities. There are also one-time integrations with the various third-party security, monitoring, and logging tools. Many SaaS implementations of those third-party tools are either too expensive or lack the required feature set of their non-SaaS versions, so companies must install and manage those tools on the cloud provider’s infrastructure. Of course, the implementation of each of these tools is different on each cloud provider because of their differing security and networking APIs.
Then there is the cost. If you have hundreds of terabytes of data stored on S3, do you really want to duplicate that on another cloud provider just as a safety net for the once a year that an outage might impact a storage API? Maybe you do, but I would explore every last solution for providing redundancy and HA on the existing cloud provider first.
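A back-of-the-envelope calculation shows the scale. The prices below are illustrative assumptions, roughly in line with published list prices, not quotes:

```python
# Rough cost of mirroring S3 data to a second cloud provider.
# Prices are illustrative assumptions, not quotes.
data_tb = 500                     # "hundreds of terabytes"
data_gb = data_tb * 1024

storage_per_gb_month = 0.023      # assumed object-storage list price
egress_per_gb = 0.09              # assumed one-time data-transfer-out price

monthly_duplicate_storage = data_gb * storage_per_gb_month
one_time_egress = data_gb * egress_per_gb

print(f"Duplicate storage: ~${monthly_duplicate_storage:,.0f}/month")
print(f"Initial copy egress: ~${one_time_egress:,.0f} one-time")
# -> roughly $11,776/month to store the copy, plus ~$46,080 to move it once
```

And that is before counting the engineering time to build and operate the second environment.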
Then there is the time. Remember how long it took and the number of iterations required until the existing landing zone was actually usable in production? Do you want to go through all of that again with the next cloud provider, whose APIs and architecture are completely different? What does the business give up when you make that investment? What features and services could you deliver in that same time frame that could add to the company’s bottom line?
Let me be clear here. I am not advocating that hybrid or multi-cloud strategies are bad. What I am saying is that if the sole purpose of the strategy is failover, I believe that is a misguided strategy. The best reason for a hybrid or multi-cloud strategy is to leverage the best cloud APIs to solve your business problem. For example, it is not uncommon for a company to use AWS for most business problems but leverage Google for its big data and analytics capabilities. With that in mind, here are my takeaways:
- Cloud providers will experience outages to services on rare occasions that may impact your ability to keep your applications running. Do keep in mind that this happens significantly less often than what almost every company in the world experiences in its own data center. So let’s stop thinking we can run data centers better than Amazon, Google, and Microsoft.
- Before embarking on an expensive and time-consuming journey to stand up a second cloud for HA, explore every possible option to provide HA on your current cloud provider. Every time there is an outage, you see the stories of companies that went down for hours. There are also many companies that don’t go down, because they anticipated that every API can and will fail at some point.
- Ignore the noise. When these issues arise, all of the vendors who are getting their butts kicked by AWS come out of the woodwork to claim they told you so and that you should use their stuff. Private cloud and data center-motivated companies and people all start shouting in caps. But when the storm passes (usually within a day), businesses go back to normal. Companies who have successfully implemented software and services on the public cloud are creating competitive advantages over those that are not. Noise is just noise.
- Study the post-mortem from the cloud provider. More importantly, study the lessons learned from the companies that survived the outage and those that did not. Each hiccup will make both the cloud provider and its customers stronger if we all learn from it.
- Keep calm and cloud on!
This article originally appeared on The Doppler and has been reposted here.