Skynet, EC2, and Zencoder

Yesterday was April 21, 2011, the day that Skynet becomes self-aware and destroys the earth in the Terminator universe. Yesterday, the internet saw the worst outage in cloud history when Amazon EC2 experienced a catastrophic failure. Unfortunately for our customers, Zencoder was affected by this EC2 failure. We were unavailable for much of the day due to this outage. This is a serious problem. We know that our customers couldn't encode video for much of yesterday, and experienced unacceptably long delays in getting their videos processed. We sincerely apologize for the downtime, and we are making changes to our system that will ensure that it never happens again.

What went wrong

Details are still coming out, but as we understand it, Amazon's EBS (Elastic Block Store) storage system in its US East region failed spectacularly. The Amazon status page puts it this way:
A networking event early this morning triggered a large amount of re-mirroring of EBS volumes in US-EAST-1. This re-mirroring created a shortage of capacity in one of the US-EAST-1 Availability Zones, which impacted new EBS volume creation as well as the pace with which we could re-mirror and recover affected EBS volumes. Additionally, one of our internal control planes for EBS has become inundated such that it's difficult to create new EBS volumes and EBS backed instances.
In other words, problems with some EBS volumes cascaded and grew until all EBS volumes were taken down. The capacity issues compounded in such a way that restoring the system to health was extremely time-consuming. In fact, now, 40 hours after the first problem, Amazon is still in the recovery process, with no immediate end in sight.

Zencoder uses Amazon in two ways. First, we encode most of our video using EC2. We're big proponents of EC2, and believe it is a great system for large-scale video encoding. It's important to note that the Amazon outage didn't affect our ability to run encoding servers; if it weren't for the next point, we would have had no problems yesterday. Second, we run our dashboard and API on Amazon EC2, and depend on EBS. This was the cause of yesterday's problem. The EBS outage took down our database, and rebuilding took significantly longer than anyone expected. We worked to get everything back online, but Amazon provided very little information, and eventually it became clear that we had no idea if, when, or how service would be restored.

EBS was a single point of failure for us. Not in the sense that we couldn't tolerate a single EBS failure; the failure of a running EBS volume wouldn't have caused problems like this. Rather, our point of failure was reliance on the EBS system as a whole. We anticipated "what if our EBS volume has trouble?", but not "what if the entire EBS platform becomes unavailable?". Beyond that, we didn't have the right plan in place to deal with a catastrophic outage. We do now; more on that in a bit.

What we did

We did a number of things during the outage. We updated our status page (http://status.zencoder.com) and posted updates throughout the incident; you can see the full history at http://status.zencoder.com/events/6. We tried our best to be transparent and responsive in these messages.

We also redeployed our application to a new set of servers outside of EC2 and helped customers transition to this backup environment. We held off on pointing our core application at the new environment until we could verify application and data integrity, but we had many of our customers up and running on it by mid-afternoon yesterday. When we could, we brought our main site back online using these new servers.

The whole technical team dropped what they were working on to focus on this. We made ourselves available in our customer chat room, responded to emails quickly, focused on the migration, and helped customers move over to the workaround. Finally, as best as we could, we monitored the situation at our service providers and stayed in touch with them.

Why this will never happen again

We are taking several steps to ensure that this never happens again. First, we are migrating our API off Amazon, so problems with EC2 or EBS will no longer affect our ability to operate or encode video. As we do this, we will ensure that servers and data are replicated in a way that minimizes downtime. We will continue to encode video on EC2, which (again) was unaffected by these problems, but we will also add non-EC2-based encoding options for another level of redundancy. Second, we will set up, test, and maintain an alternate server environment that we can gracefully switch to if our primary system becomes unavailable. Third, we will improve our internal processes and planning for dealing with a catastrophic outage. This planning is still in progress, but it will involve additional communication channels with customers, expanded 24/7 monitoring, and protocols for emergency migrations.
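To make the monitoring and failover piece concrete, here is a rough sketch of the kind of health check and switchover trigger we have in mind. This is not our production code; the endpoint URL, thresholds, and failover hook are placeholders. The shape is simple: poll the primary environment, and if it fails several checks in a row, repoint traffic at the standby environment and page someone.

import time
import urllib.error
import urllib.request

# Placeholder values -- real endpoints, thresholds, and the failover hook
# would come from the actual environments and runbooks.
PRIMARY_HEALTH_URL = "https://api.example.com/health"
CHECK_INTERVAL_SECONDS = 30
FAILURE_THRESHOLD = 5  # consecutive failed checks before failing over

def primary_is_healthy(timeout=10):
    """Return True if the primary environment answers its health check with HTTP 200."""
    try:
        with urllib.request.urlopen(PRIMARY_HEALTH_URL, timeout=timeout) as response:
            return response.getcode() == 200
    except (urllib.error.URLError, OSError):
        return False

def fail_over_to_standby():
    """Placeholder: a real version would repoint DNS or a load balancer at the
    standby environment and page the on-call engineer."""
    print("Primary environment unhealthy; switching traffic to standby")

def main():
    consecutive_failures = 0
    while True:
        if primary_is_healthy():
            consecutive_failures = 0
        else:
            consecutive_failures += 1
            if consecutive_failures >= FAILURE_THRESHOLD:
                fail_over_to_standby()
                return
        time.sleep(CHECK_INTERVAL_SECONDS)

if __name__ == "__main__":
    main()

The failover step is the hard part, of course; it only works if the standby environment is kept deployed, tested, and in sync, which is exactly what the second point above is about.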

The good news...

If there is any good news, it is this. First, we believe that we experienced no data loss. We were able to recover everything from our failed EBS volume, and beyond that, we have hourly off-site backups that would have helped minimize data loss if any had occurred. Second, the engineering team was notified of the problem within one minute. That didn't prevent the outage from happening, or from lasting far too long, but it meant that we were able to work on the situation the entire time. Third, when service was restored, jobs started processing immediately and automatically, so the end result was job latency, not job failure. (We know that customers differ in how much latency they can tolerate, so this is good news for some, but for others the latency was just as bad as a failure.) Finally, we offer a 99.9% uptime SLA. The SLA explicitly excludes Amazon Web Services unavailability, but we are going to ignore that exclusion and give customers an encoding credit of 20% of their monthly bill. We are also going to remove the AWS exclusion from the SLA going forward.

Again, we apologize for this problem, and we take it as seriously as you do. We will work through the weekend to ensure that nothing more goes wrong and to put these new safeguards in place. And as always, let us know if there is anything else we can do to help.
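One last technical note, for anyone curious what hourly off-site backups can look like in practice. The sketch below assumes a PostgreSQL database and an off-site host reachable over SSH; the database name, host, and paths are made up, and this is an illustration rather than a description of our actual setup. The idea is simply to dump the database every hour and get the dump out of the affected infrastructure entirely.

import datetime
import os
import subprocess

# Made-up names -- substitute a real database and off-site destination.
DATABASE = "app_production"
OFFSITE_HOST = "backups.example.com"
OFFSITE_DIR = "/backups/hourly"

def run_hourly_backup():
    """Dump the database, compress it, and copy it to an off-site host.
    Meant to be run from cron once an hour."""
    stamp = datetime.datetime.utcnow().strftime("%Y%m%d%H%M")
    dump_path = "/tmp/db-%s.sql.gz" % stamp

    # pg_dump produces the dump; gzip keeps the transfer small.
    with open(dump_path, "wb") as out:
        dump = subprocess.Popen(["pg_dump", DATABASE], stdout=subprocess.PIPE)
        subprocess.check_call(["gzip", "-c"], stdin=dump.stdout, stdout=out)
        dump.stdout.close()
        if dump.wait() != 0:
            raise RuntimeError("pg_dump failed")

    # Copy the dump somewhere a regional outage can't reach, then clean up.
    subprocess.check_call(["scp", dump_path, "%s:%s/" % (OFFSITE_HOST, OFFSITE_DIR)])
    os.remove(dump_path)

if __name__ == "__main__":
    run_hourly_backup()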