Comment: While AWS May Have Caused Netflix's Outages, They're Really Netflix's Fault
It's about time that companies with poorly engineered services stop blaming AWS for their own engineering incompetence.
Once again Amazon Web Services (AWS) was pushed to the top of the tech headlines and pilloried as the cause of another outage, most recently for the high-profile Netflix platform (on the heels of another outage affecting Salesforce.com).
There is no denying that the outage at AWS caused Netflix to suffer significant service problems on Christmas Eve. Judging by comments on various technology forums, this has hit Netflix hard: its profile has taken a beating, the comments ring with criticism of the service, and angry commenters are inciting a "stampede back to cable," which is itself shortsighted and overly critical, but that is a different issue.
However, the outage at AWS is to be expected. Indeed, AWS has operated well within its SLA for a long time. The problem is that many so-called "cloud" services built on AWS are actually far from being genuine cloud services, and are in fact using AWS as a virtual hosting provider. The difference is nuanced, but of critical importance.
I also feel I speak from some authority, since I architected and my company subsequently built one of AWS's largest digital media live streaming workflows, and I absolutely know the difference between a cloud service and a virtual hosting service.
In a virtual hosting model, one can pay by the hour for a virtual server. You may not know exactly where in the data center that server is located, and you may not know the brand of hardware it runs on, although you will probably know the memory, storage, CPU and NIC speeds. Once you pay for that service the server is instantiated (turned on) and you can subsequently remotely access it. Should the hardware fail, you can use your own backup and restore process to bring it back up on a new hardware option, and away you go. But while you pay for the provider to look after the hardware so you don't have to visit the data center and so on, the fact is it's still just a server.
If you build a website on it and the hardware fails, then your website is unavailable until your restore process is completed.
That is a virtual hosting strategy.
If you worry about the hardware failing and pay for two virtual servers to give you resilience, and both of those are in the same data center operated by the same provider, then if the first one fails you can maintain the availability of your service using the remaining virtual server, giving you continuity while you bring up the restored version.
That is a resilient virtual hosting strategy.
Let's look at the term "cloud," possibly the most over-used and widely re-interpreted term. For technicians, "cloud" gives the impression that the service user, and indeed the service operator, is "abstracted" from the underlying hardware fabric. You don't actually know where the server that is serving you is.
Well, that sounds like virtual hosting, right? Yes. Most (all?) cloud models use virtual hosting for the individual server instances, but there are two critical differences that make AWS' EC2 a platform for Infrastructure as a Service (IaaS) cloud services.
- Best practice is to develop your service using multiple data centers, in multiple regions.
- The ad-hoc nature of the commercial and operational model means that you should be designing your service to be continually moving the service among virtual hosts.
And so, while Netflix, Salesforce, Instagram, Reddit, and others have all had high-profile outages on AWS, and "the cloud" has been perceived by some enterprises as unstable and unsuitable for high-value, high-availability service models, the truth is different. If a single availability zone (AZ) or region (Amazon's terms for clusters of data centers) fails and their service fails with it, then Netflix, Salesforce, Instagram, Reddit, and others have not properly architected their platforms within AWS to exist at all times in multiple availability zones and regions. An outage in one area should programmatically, instantaneously, and seamlessly switch users to an alternative. The virtual hosts in the failed data center will have stopped being billed for (they as good as "don't exist"), and replacements should be online in alternative places in AWS, or even in other providers' clouds, within a minute. If they don't do that, then the engineers in those companies have failed to grasp cloud architecture, and no amount of pointing the finger at AWS will obscure that.
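To make the idea of programmatic switching concrete, here is a minimal sketch in Python. It is only an illustration, not how Netflix or anyone else actually routes traffic: the endpoint names, the `pick_endpoint` helper, and the simulated health check are all hypothetical, and a real system would probe the endpoints over the network and repeat the check continuously.

```python
from typing import Callable, Optional, Sequence


def pick_endpoint(
    endpoints: Sequence[str],
    is_healthy: Callable[[str], bool],
) -> Optional[str]:
    """Return the first endpoint that passes its health check, or None."""
    for endpoint in endpoints:
        if is_healthy(endpoint):
            return endpoint
    return None


# Hypothetical endpoints, ordered by preference: two zones in one
# region, then a zone in a second region.
ENDPOINTS = [
    "us-east-1a.example.com",
    "us-east-1b.example.com",
    "eu-west-1a.example.com",
]

# Simulate a whole-region outage: both zones in the first region are down.
down = {"us-east-1a.example.com", "us-east-1b.example.com"}

# The health check here is a stand-in; a real one would probe the host.
chosen = pick_endpoint(ENDPOINTS, lambda endpoint: endpoint not in down)
print(chosen)
```

With the first region marked as down, the selection falls through to the endpoint in the second region, which is the behavior the paragraph above asks for.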
We don't care if a post van breaks down. We only care if the sender puts the wrong address on the packet. The same applies here.
In the three serious AWS outages I recall (again, all within SLA), we only found out several days later, after looking at our logs. Our systems fail over the live audio and video encoding we do in microseconds, and with buffering the handover is frame-accurate, so it is almost impossible for our client to know if AWS has an outage. And this is critical market information; we can't afford to lose packets.
By combining the relatively low SLAs of multiple AZs, we get a higher SLA out of AWS than AWS itself offers.
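The arithmetic behind that claim can be sketched as follows. This is a simplified model under two loud assumptions: zone failures are treated as independent (real outages are sometimes correlated), and the 99.9% per-zone figure is purely illustrative, not a number taken from any AWS SLA. The function name `combined_availability` is likewise just for this example.

```python
def combined_availability(per_zone: float, zones: int) -> float:
    """Availability of a service that stays up if at least one zone is up.

    Assumes zone failures are independent, so the chance that every
    zone is down at once is (1 - per_zone) ** zones.
    """
    if not 0.0 <= per_zone <= 1.0:
        raise ValueError("per_zone must be between 0 and 1")
    if zones < 1:
        raise ValueError("zones must be at least 1")
    return 1.0 - (1.0 - per_zone) ** zones


# Illustrative per-zone availability (an assumption, not an AWS figure).
PER_ZONE = 0.999

for n in (1, 2, 3):
    print(f"{n} zone(s): {combined_availability(PER_ZONE, n):.7%}")
```

Under these assumptions, two zones at 99.9% each give roughly 99.9999% combined, which is why spreading a service across zones can beat the figure quoted for any single one.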
THAT is cloud architecture.
It seems Netflix and the others need to look at their design a little more. In the meantime, I would like to commend AWS for offering the service they commit to and doing so well within SLA.
And the remaining, cynical critics out there should a.) learn a bit about what they are talking about before they criticise, and b.) stop blaming the tools and blame the workmen.