Notes on surviving today’s EC2 outage
Today, Amazon’s US-EAST region suffered a network failure, resulting in at least the us-east-1a availability zone becoming unreachable from the internet.
This, somewhat predictably, took many of the web’s biggest names off the air with it.
Amazon has repeatedly reminded users of its cloud infrastructure that they need to do the hard work in order to take advantage of Amazon’s famed redundancy: your application needs to run across multiple availability zones, rather than simply being deployed at run-time into a single, arbitrary zone.
This is, of course, rather more difficult than it sounds. For Wormly it means keeping a hot standby running in an alternative zone. Today, when our primary web cluster in the us-east-1a zone failed, our system detected this and automatically failed over to the hot replica running in the us-east-1b zone.
Our total website downtime was kept to under 2 minutes, and our globally distributed monitoring system (which does not run on EC2, and is designed to tolerate failures of the web cluster) experienced no downtime at all.
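For the curious, here is the general shape of that kind of health-check-and-failover loop. This is only a sketch, not our production code: it assumes traffic is switched by re-pointing an Elastic IP at the standby, and the health URL, instance IDs and allocation ID are all placeholders.

```python
# Minimal failover sketch: poll the primary web cluster's health endpoint
# and, after several consecutive failures, re-point the service's Elastic IP
# at the hot standby running in another availability zone.
# All identifiers below are hypothetical placeholders.
import time
import urllib.request

import boto3

HEALTH_URL = "http://primary.example.com/health"   # hypothetical health check
PRIMARY_INSTANCE = "i-0primary0000000000"          # primary, us-east-1a (placeholder)
STANDBY_INSTANCE = "i-0standby0000000000"          # standby, us-east-1b (placeholder)
EIP_ALLOCATION_ID = "eipalloc-0123456789abcdef0"   # Elastic IP allocation (placeholder)

ec2 = boto3.client("ec2", region_name="us-east-1")


def primary_is_healthy(timeout: int = 5) -> bool:
    """Return True if the primary answers its health check in time."""
    try:
        with urllib.request.urlopen(HEALTH_URL, timeout=timeout) as resp:
            return resp.status == 200
    except Exception:
        return False


def fail_over_to_standby() -> None:
    """Re-associate the public Elastic IP with the standby instance."""
    ec2.associate_address(
        InstanceId=STANDBY_INSTANCE,
        AllocationId=EIP_ALLOCATION_ID,
        AllowReassociation=True,
    )


if __name__ == "__main__":
    failures = 0
    while True:
        failures = 0 if primary_is_healthy() else failures + 1
        if failures >= 3:   # require consecutive failures to avoid flapping
            fail_over_to_standby()
            break
        time.sleep(10)
```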
Having to keep a hot replica running in (what is hopefully) a separate data center certainly isn’t the most cost-effective approach – and it’s presumably not what DevOps teams were hoping to gain from the cloud revolution.
But for us the extra cost is worth it, not least because when major failures like this occur, our customers appreciate being able to log into Wormly and switch off the alarms that are ringing their phones off the hook.
Availability zones not sufficiently isolated?
Immediately after a failover, our system attempts to bring up a new hot replica in another availability zone, to ensure that we continue to maintain n+1 cluster redundancy.
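As a rough illustration (not our actual tooling), the sketch below launches a replacement standby in whichever availability zone is not currently serving traffic; the AMI, key pair and security group IDs are placeholders.

```python
# Sketch of the "restore n+1" step: after a failover, launch a fresh standby
# in a different availability zone from the one now serving traffic.
# AMI ID, key name and security group below are hypothetical.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")


def launch_replacement_standby(active_zone: str) -> str:
    """Start a new hot-standby instance in a zone other than active_zone."""
    zones = [
        z["ZoneName"]
        for z in ec2.describe_availability_zones()["AvailabilityZones"]
        if z["State"] == "available" and z["ZoneName"] != active_zone
    ]
    if not zones:
        raise RuntimeError("no alternative availability zone available")

    resp = ec2.run_instances(
        ImageId="ami-0123456789abcdef0",          # hypothetical web-cluster AMI
        InstanceType="m5.large",
        MinCount=1,
        MaxCount=1,
        KeyName="ops-key",                        # hypothetical key pair
        SecurityGroupIds=["sg-0123456789abcdef0"],
        Placement={"AvailabilityZone": zones[0]},
    )
    return resp["Instances"][0]["InstanceId"]


if __name__ == "__main__":
    print(launch_replacement_standby(active_zone="us-east-1b"))
```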
During the 30-odd minutes that us-east-1a remained offline, we found that our system was unable to bring up any new EC2 instances in any US-EAST zone. This was quite an alarming revelation, and we await Amazon’s explanation of why it occurred.
Availability zones are supposed to be sufficiently isolated that failures do not cross zone boundaries. Our inability to bring new instances online meant that we were heavily exposed to further failures: there were no further hot replicas available for us to fail over to. Luckily, us-east-1b didn’t experience any failures during this window.
Who else do you rely on?
Another unfortunate side-effect of today’s outage was that it brought down Twilio’s API. We use Twilio to send SMS messages to our US customers, because they offer excellent speed and deliverability to US cell phones. We do have alternate routes in place, so during this outage we were able to automatically route around the problem.
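The pattern is simple enough to sketch: try the primary gateway first, and fall back to the next route if the call fails. The example below is illustrative only; the alternate provider’s endpoint and all credentials are made up.

```python
# Provider-fallback sketch for outbound SMS: attempt Twilio first, then fall
# back to a hypothetical second gateway if the call raises an error.
import json
import urllib.request

from twilio.rest import Client as TwilioClient


def send_via_twilio(to: str, body: str) -> None:
    # twilio-python client; credentials and sender number are placeholders.
    client = TwilioClient("ACCOUNT_SID", "AUTH_TOKEN")
    client.messages.create(to=to, from_="+15551230000", body=body)


def send_via_alternate_route(to: str, body: str) -> None:
    # Hypothetical second SMS provider reached over a plain HTTP API.
    payload = json.dumps({"to": to, "text": body}).encode()
    req = urllib.request.Request(
        "https://sms-backup.example.com/send",   # hypothetical endpoint
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req, timeout=10)


def send_sms(to: str, body: str) -> str:
    """Try each route in order; return the name of the route that succeeded."""
    for name, sender in (("twilio", send_via_twilio),
                         ("alternate", send_via_alternate_route)):
        try:
            sender(to, body)
            return name
        except Exception:
            continue   # provider down or unreachable: try the next route
    raise RuntimeError("all SMS routes failed")
```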
This highlights the importance of having redundancy built into your choice of web services when building applications in the cloud.
We note that RightScale’s website was unavailable during the outage. Given that many major sites use RightScale to manage their EC2 deployments, this presumably caused quite some consternation among their users.
The cloud is fragile. Be prepared!