Commentary: Pointing Fingers in the Wrong Direction After AWS Outage
Sites and services that went down when an Amazon Web Services region failed shouldn't blame Amazon. They should blame their own poorly architected cloud infrastructure.
If you missed the headlines last week you were probably too engrossed in fake news. Amazon Web Services (AWS) had an outage of its S3 service in the US-East-1 region of its network. The differences in how this outage was portrayed in the media are both telling and disturbing.
First the "real story" from AWS, the summary of which is quite simple: An authorized user mistakenly kicked off a much wider scope of infrastructure processing than they intended, which caused a cascade of problems.
This has been summarized as a "typo" by the media, in stories such as this one. These stories are full of drama: "thousands of websites brought to their knees by a typo." That's a great headline for mass news media outlets that seem unable to print success stories, and that hang like vultures looking for any anomaly they can extrapolate into a global disaster.
AWS certainly did have a major issue last week. But it should not have caused any service outages if its customers had half a clue what they were buying from AWS.
Indeed, the following day I gave a talk about high availability through virtualisation at Streaming Forum in London. It's always good to begin a presentation with a cliffhanger, and I used that opportunity to highlight that none of my own company's AWS customers had any significant production-affecting issues during last week's outage. But then we have been building "cloud scale" live and on-demand media services for almost as long as AWS has existed, so we know what we are doing.
We also know that AWS offers a 99.95% SLA for its EC2 services, and 99.9% SLA for S3 (the affected service last week).
But the rumor mill distorts the public expectation: Even the Fortune article linked above states that AWS "also promises '99.999999999% durability.'" Well, yes, AWS does mention this in its FAQ. However, that figure relates to data loss, not to whether the service is reachable. Availability is different from durability, and to mix these figures up to try to convey that AWS has somehow underperformed is misdirection by the media.
So if you think that simply copying a virtual machine image to AWS is the end of your infrastructure design considerations, since AWS then apparently offers an 11x9 SLA behind the image, I am not surprised that an outage in a single region of AWS knocked you out last week. Had you actually read your agreement with AWS, you would have realised that in expecting your application to run with 11x9 availability on an infrastructure that only offers a 99.95% SLA, you were on thin ice.
AWS does NOT promote itself as a concrete floor. It fully acknowledges that its SLA is "thin ice" from the outset.
And so, like a mud-skipper or crane-fly, you should be designing your applications to be hugely distributed, highly resilient, and have fault tolerance at every conceivable level. And this is your problem, not AWS's.
Personally I would have preferred to see the headlines reading "Normal fluctuations in AWS service availability expose that many cloud application developers are still architecting their applications as if they were not virtualised."
The accusative media finger should not point at AWS: It should point at its customers. Those customers are using AWS as if it were in some way magically immune to platform problems.
Despite talking about "chaos monkeys" and "disaster recovery strategies" to their bosses, they are often still building cloud-scale services as if they were building on private managed tin. Anyone can build a service that is good on a good day. But when deploying virtual capability to the cloud you should embrace failure, not try to prevent the inevitable. We market our own capability under the phrase "Good on a Bad Day."
If you multiply SLAs together, through diversity, hybridisation, treating the infrastructure as abundant, and many other techniques, you can deliver Carrier Class 5x9 SLA even on 2x9 SLA infrastructure.
But if you boot up a single S3 bucket backed by a 99.9% SLA and run the application handling that stored data on a single EC2 instance underpinned by a 99.95% SLA, then you can’t expect more than the 99.9% that the S3 bucket introduces. This means you should be prepared for at least 8.8 hours of downtime each year.
Even the ~5 hours of outage that most AWS customers had last week is still within that amount.
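To make that arithmetic concrete, here is a minimal sketch. It assumes the SLA percentages can be read as real availability figures (an SLA is only a contractual floor, not a measurement) and that the S3 bucket and the EC2 instance fail independently. The function names are my own, purely for illustration.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def serial_availability(*availabilities):
    """Availability of a chain in which every component must be up.

    Assumes the components fail independently of each other.
    """
    result = 1.0
    for availability in availabilities:
        result *= availability
    return result


def downtime_hours_per_year(availability):
    """Hours per year a service at this availability may be down."""
    return (1.0 - availability) * HOURS_PER_YEAR


# One S3 bucket on its own at 99.9%: roughly 8.8 hours a year.
s3_only_downtime = downtime_hours_per_year(0.999)

# One S3 bucket plus one EC2 instance, both required: roughly 99.85%.
chain = serial_availability(0.999, 0.9995)
chain_downtime = downtime_hours_per_year(chain)

print(f"S3 alone:  {s3_only_downtime:.2f} hours/year")
print(f"S3 + EC2:  {chain * 100:.3f}% -> {chain_downtime:.2f} hours/year")
```

Under these assumptions the single-bucket, single-instance design budgets for around 13 hours of downtime a year, so a 5-hour outage sits comfortably inside what was paid for.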
Of course, if you replicate that infrastructure across the abundant cloud resources available both within AWS and between AWS and a growing myriad of other public cloud providers, then you can multiply away that risk until you are absolutely carrier class with 5x9s or more.
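As a rough illustration of "multiplying away" the risk: if replicas fail independently, the service is only down when all of them are down at once, so the failure probabilities multiply. The sketch below rests on that independence assumption, which a shared region, a shared control plane, or one operator command hitting every replica would break. The function names are hypothetical.

```python
def parallel_availability(single, replicas):
    """Availability of N independent replicas where any one can serve.

    The service is down only when every replica is down simultaneously.
    """
    return 1.0 - (1.0 - single) ** replicas


def replicas_needed(single, target):
    """Smallest number of independent replicas that meets the target."""
    count = 1
    while parallel_availability(single, count) < target:
        count += 1
    return count


FIVE_NINES = 0.99999

# 2x9 infrastructure (99%) needs three independent copies for 5x9.
print(replicas_needed(0.99, FIVE_NINES))

# A 99.9% bucket plus a 99.95% instance is about 99.85% as a stack;
# two independent copies of that stack already clear 5x9.
print(replicas_needed(0.9985, FIVE_NINES))
print(f"{parallel_availability(0.9985, 2) * 100:.5f}%")
```

The numbers are only as good as the independence assumption, which is exactly why the diversity has to be real: different regions, and ideally different providers.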
As I always say, "cloud is really an economic expression, and not a technical one." If you do architect properly then it makes very little difference to your operating expenses: You just use the infrastructure you need at the time you need it, and turn it off when you don’t.
So there really is no excuse for pointing the finger at AWS or the public cloud. The real risk here is in the application developers' space: Most of these guys should point the finger at themselves and do a little soul-searching or self-education.
And every time I see them complaining about public cloud, it really only serves to highlight to me how immature their own abilities are.