The Amazon Cloud had problems on 21 April 2011, taking down a lot of other products. Almost simultaneously, Sony’s PlayStation Network went down.
First, here’s the marketing advice, from GigaOm.
So if we accept that clouds go down and that people will be inconvenienced, then really the only thing to do is communicate well and perhaps add a sense of humor to your message–provided that the loss of your service isn’t going to reduce your audience to cursing. Gmail, I’m looking at you! Quora, for instance, provided a cute YouTube video and the following error message, “We’d point fingers, but we wouldn’t be where we are today without EC2.”
Foursquare provided a running display of information on what was happening. Much like the trend in fun 404 pages, the habit of making light of a failure can endear your service to readers.
Update on 9 June: Here are some custom error pages, via WebSitePulse. If you want your own cute “cloud error” page, you could do worse than use lyrics from “Both Sides Now”, by Joni Mitchell (youtube).
Rows and flows of angel hair
And ice cream castles in the air
And feather canyons everywhere
I’ve looked at clouds that way
But now they only block the sun
They rain and snow on everyone
So many things I would have done
But clouds got in my way
An example of how not to do it, from Sony (details at TechCrunch):
Now for the timelines and tech details
Final Update about Amazon on 29 April: It seems the problem started when Amazon needed to upgrade hardware and so switched internal comms onto backup hardware. An engineer made a mistake and isolated cloud servers from each other. Many of the isolated servers assumed a serious error had occurred and each tried to create a backup instance (copy) of itself, before continuing. So they were in effect offline until Amazon installed huge amounts of additional hardware to allow these copies to be made. Of course, it was a bit more complicated than that! Here’s the history as I wrote it:
The problem was (based on justinsb’s posterous):
- The Amazon Cloud is structured by geographical regions and then independent “availability zones” (AZs) within them.
- A typical Website runs in one region and AZ, using a second AZ in the same region as backup.
- Several (all?) AZs in the “us-east” region failed almost simultaneously. I don’t know exactly why (see the links labelled “Tech Details” below for what experts think), but clearly they were not independent enough. This took down a lot of Websites and their backups.
- Fixing this situation took a long time, because clouds are difficult to reboot. Many technologies they use, such as NoSQL databases, get their speed by storing pretty much everything in memory. If you reboot them, they have to reload all this data from disk, which is much slower, or from another server, which can overload the internal network. Thus when engineers are trying to fix a problem by experimenting with config or hardware changes, it feels a bit like one of those nightmares where you have to escape while wading through deep mud.
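The point about backups living in a second AZ of the same region can be sketched in a few lines of Python. This is only an illustration, not how Amazon or any particular site does it: the `Replica` class, the `pick_failover` function and the zone labels are all made up for this example; only the “us-east” region name comes from the story above.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Replica:
    region: str    # "us-east" is from the post; other region names are hypothetical
    zone: str      # hypothetical AZ label within the region
    healthy: bool  # assumed to come from some health check, not shown here


def pick_failover(primary: Replica, replicas: List[Replica]) -> Optional[Replica]:
    """Pick the healthy replica least likely to share the primary's fate."""
    candidates = [r for r in replicas if r.healthy and r != primary]
    if not candidates:
        return None  # nothing left to fail over to

    def rank(r: Replica) -> int:
        if r.region != primary.region:
            return 0  # best: a different region entirely
        if r.zone != primary.zone:
            return 1  # next: a different AZ in the same region
        return 2      # last resort: same AZ as the primary

    return min(candidates, key=rank)


# A region-wide failure: both us-east AZs are down.
primary = Replica("us-east", "a", healthy=False)
same_region_backup = Replica("us-east", "b", healthy=False)
other_region_backup = Replica("us-west", "a", healthy=True)

print(pick_failover(primary, [same_region_backup]))
print(pick_failover(primary, [same_region_backup, other_region_backup]))
```

With only a same-region backup there is nothing to switch to when the whole region misbehaves; with a replica in another region the site has somewhere to go.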
Final Update about Sony on 9 June: Sony’s security was poor; they got hacked; and they had to retrofit better security onto the PSN before it could be brought back up. That took until the first week in June to fully complete. Here is a conference slide showing how PSN was hacked.
Tuesday 19 April 2011
- Amazon Web Services (AWS) Announced Live Streaming For CloudFront [Seems to have been a Coincidence] - Amazon
Thursday 21 April 2011:
- Amazon Status
- Amazon Customers
- Amazon Cloud Goes Down, Takes Every Hot Startup With It - BusinessInsider
- Amazon’s Cloud Crashed Overnight, And Brought Several Other Companies Down Too - AllThingsDigital
- Amazon EC2 troubles bring down Reddit, Foursquare, Quora, Hootsuite and more - TheNextWeb
- Amazon cloud sinks, smothers Web 2.0 darlings - The Register
- Business Details. What to Do When Your Cloud is Down - Bob Warfield
Friday 22 April 2011
- Tabloid Summary - Mail Online
- Tech Details, What Went Wrong - justinsb’s posterous
- Tech Details - The Register
- What We Can Learn From Amazon’s Cloud Collapse - Mashable
- Amazon failure takes down sites across Internet - Yahoo News
- The Day The Cloud Died - CIO Central
- Sony’s PlayStation Network Suffers Prolonged Global Outage [affecting 70 million users; unrelated to Amazon] - TechCrunch
Saturday 23 April 2011
- Sony says issue was “An external intrusion on our system” [unrelated to Amazon] - PlayStation Blog
- 3 Day PlayStation Network Outage [unrelated to Amazon] - TechLand
- Problems continue. Screenshot of Amazon Status - Flickr
- The AWS Outage: The Cloud’s Shining Moment - O’Reilly Community
Sunday 24 April 2011
- “We continue to make progress” - Amazon Status
- How our small startup survived the Amazon EC2 Cloud-pocalypse - Eric Silverberg via @AutomatedTester
- What are the lessons learned from April 21, 2011 AWS failure and how can downtime be avoided? - Quora
- What caused Amazon AWS’s east coast cluster to go down on April 21, 2011? - Quora
- Sony “rebuilding” PlayStation Network after attack [unrelated to Amazon; I wonder if Sony didn’t have backups?!!!] - Network World
Monday 25 April 2011
- Green across the board. Seems fixed - Amazon Status
- How SmugMug survived the Amazonpocalypse - co-founder Don MacAskill
- Chaos Monkey: How Netflix Uses Random Failure to Ensure Success - ReadWriteCloud
- Why transparency matters in the cloud - Cloud Pundit
- Stop Blaming the Customers - the Fault is on Amazon Web Services - ReadWriteCloud
- PlayStation online crippled for FIFTH day by hackers [unrelated to Amazon] - The Sun
Summary and Lessons for Amazon
- AWS outage timeline & downtimes by recovery strategy - Random Hacks (who also provided the next three links)
- Amazon EC2 / EBS Outage: Lessons learned. A good overall analysis, with recommendations.
- On Cascading Failures and Amazon’s Elastic Block Store. How emergency fail-over code can actually make an outage worse.
- Amazon EC2 outage: summary and lessons learned. RightScale has posted an excellent post-mortem. They note that the outage actually spread to more EBS volumes over time, and link to a long list of related posts.
- Update on 29 April: Summary of the Amazon EC2 and Amazon RDS Service Disruption in the US East Region - Amazon
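The copy storm described in the Final Update above (many isolated servers all trying to make a backup copy at once, and staying stuck until more hardware arrived) can be shown with a toy model. This is a rough sketch under invented assumptions, not Amazon's actual mechanism: the function name `remirror_storm` and every number below are hypothetical.

```python
from typing import Optional


def remirror_storm(isolated_nodes: int, spare_slots: int,
                   added_per_step: int, max_steps: int = 1000) -> Optional[int]:
    """Count the steps until every isolated node has found room for its copy.

    Each step, waiting nodes grab whatever spare capacity exists, then
    operators add `added_per_step` new slots. Returns None if the backlog
    has not cleared within `max_steps`.
    """
    waiting = isolated_nodes
    free = spare_slots
    steps = 0
    while waiting > 0 and steps < max_steps:
        steps += 1
        served = min(waiting, free)  # only as many copies as there is room for
        waiting -= served
        free -= served
        free += added_per_step       # new hardware comes online
    return steps if waiting == 0 else None


print(remirror_storm(isolated_nodes=100, spare_slots=100, added_per_step=0))
print(remirror_storm(isolated_nodes=100, spare_slots=10, added_per_step=30))
print(remirror_storm(isolated_nodes=100, spare_slots=10, added_per_step=0))
```

In this model the backlog clears in one step when there is enough spare capacity, takes several steps when capacity has to be added as it goes, and never clears when no hardware is added.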
Update on 9 June 2011: Sony’s problems continued [Unrelated to Amazon]:
- Did Sony shut down PSN because hackers learned to pretend their PlayStations were development machines, bypass security, and freely download games and account info? - TorrentFreak
- Sony announced on 26 April 2011 that the PSN was compromised and that user information, including names, passwords, addresses and most likely credit card numbers, had been stolen - TNW Industry, Sony Blog
- How Sony annoyed the hacker community - Wikipedia
- PlayStation Network Hack FAQ - Kotaku
- Update on 06 May: The attack revealed security flaws so gaping that Sony has been working for the past two weeks on a completely redesigned system - Mashable
- Update on 14 May: Sony has developer-only access working, just about - Online Social Media
- 2 June: PlayStation Store comes back online, all PSN due back today - electronista
- 6 June: The “Welcome Back” free gifts Sony offered consumers as recompense for the nearly two-month PSN outage are now available - International Business Times
- PlayStation Home Sees Record Activity Levels After PSN’s Return - Gamasutra
- How PSN was hacked - PlaystationLife
- A concise history of recent Sony hacks [17, running at about 2 per week!] - Security Curmudgeon
- Why Does Sony Keep Getting Hacked? - HuffPost
I think that wraps things up.