The cloud is growing, but cloud outages are nothing new. And neither are we. InformationWeek was founded in 1985, and our online archives go back to 1998. Here are just a few lowlights from the cloud’s worst moments, dug up from our archives.
Apr. 17, 2007 / In Web 2.0 Keynote, Jeff Bezos Touts Amazon’s On-Demand Services, by Thomas Claburn — “Asked by conference founder Tim O’Reilly whether Amazon was making any money on this, Bezos answered, ‘We certainly intend to make money on this,’ before finally admitting that AWS wasn’t profitable today.”
(As a reminder, folks, here in 2022, AWS is now worth a trillion dollars.)
Aug. 12, 2008 / Sorry, the Internets are Broken Today, by Dave Methvin, and Google Apologizes for Gmail Outage, by Thomas Claburn — After a string of clunky disruptions across Microsoft MSDN forums, Gmail, Amazon S3, GoToMeeting, and SiteMeter, Methvin laments: “When you use a third-party service, it becomes a black box that is hard to verify, or even know if or when something has changed. Welcome to your future nightmare.”
Oct. 17, 2008 / Google Gmail Outage Brings Out Cloud Computing Naysayers, by Thomas Claburn — Because the outage appears to have lasted more than 24 hours for some, affected paying Gmail customers appear to be owed service credits, as per the terms of the Gmail SLA. As one customer said: “This is not a temporary problem if it lasts this long. It is frustrating to not be able to expedite these issues.”
June 11, 2010 / The Cloud’s Five Biggest Weaknesses, by John Soat — “The recent problems with Twitter (“Fail Whale”) and Steve Jobs’ embarrassment at the network outage at the introduction of the new iPhone don’t exactly impart warm fuzzy feelings about the Internet and network performance in general. An SLA can’t guarantee performance; it can only punish bad performance.”
[In 2022, a cloud SLA can accomplish basically nothing at all. As Richard Pallardy and Carrie Pallardy wrote this week, “Industry standard service level agreements are remarkably restrictive, with most companies assuming little if any liability.”]
April 21, 2011 / Amazon EC2 Outage Hobbles Websites, by Thomas Claburn / April 22, 2011 / Cloud Takes a Hit, Amazon Must Fix EC2, by Charles Babcock / April 29, 2011 / Post-Mortem: When Amazon’s Cloud Turned on Itself, by Charles Babcock — The “Easter Weekend” Amazon outage that impacted Yard, Foursquare, Hootsuite, Heroku, Quora, and Reddit among others. Babcock writes: “In building high availability into cloud software, we’ve escaped the confines of hardware failures that brought running systems to a halt. In the cloud, the hardware may fail and everything else keeps running. On the other hand, we’ve discovered that we’ve entered a higher atmosphere of operations and larger plane on which potential failures may occur.
“The new architecture works great when only one disk or server fails, a predictable event when running tens of thousands of devices. But the solution itself doesn’t work if it thinks hundreds of servers or thousands of disks have failed all at once, taking valuable data with them. That’s an unanticipated event in cloud architecture because it isn’t supposed to happen. Nor did it happen last week. But the governing cloud software thought it had, and triggered a massive recovery effort. That effort in turn froze EBS and Relational Database Service in place. Server instances continued running in U.S. East-1, but they couldn’t access anything, more servers couldn’t be initiated and the cloud ceased functioning in one of its availability zones for all practical purposes for over 12 hours.”
Aug. 9, 2011 / Amazon Cloud Outage: What Can Be Learned? by Charles Babcock — A lightning strike in Dublin, Ireland knocked Amazon’s European cloud services offline Sunday and some customers were expected to be down for up to two days. (Lightning will make an appearance in other outages in the future.)
July 2, 2012 / Amazon Outage Hits Netflix, Heroku, Pinterest, Instagram, by Charles Babcock — Amazon Web Services data center in the US East-1 region loses power because of violent electrical storms, knocking out many website customers.
July 26, 2012 / Google Talk, Twitter, Microsoft Outages: Bad Cloud Day, by Paul McDougall / July 26, 2012 / Microsoft Investigates Azure Outage in Europe, by Charles Babcock / March 1, 2012 / Microsoft Azure Explanation Doesn’t Soothe, by Charles Babcock — Google reported its Google Talk IM and video chat service was down in parts of the United States and across the globe on the same day Twitter was also offline in some areas, and Microsoft’s Azure cloud service was out across Europe. Microsoft leader’s post-mortem on Azure cloud outage cites a human error factor, but leaves other questions unanswered. Does this remind you of how Amazon played its earlier lightning strike incident?
Oct. 23, 2012 / Amazon Outage: Multiple Zones a Smart Strategy, by Charles Babcock — Traffic in Amazon Web Services’ most heavily used data center complex, U.S. East-1 in Northern Virginia, was tied up by an outage in one of its availability zones. Damage control got underway immediately but the effects of the outage were felt throughout the day, said Adam D’Amico, Okta’s director of technical operations. Savvy customers, such as Netflix, who’ve made a major investment in use of Amazon’s EC2, can sometimes avoid service interruptions by using multiple zones. But as reported by NBC News, some Netflix regional services were affected by Monday’s outage.
Okta’s director of technical operations told Babcock that they use all five zones to hedge against outages. “If there’s a sixth zone tomorrow, you can bet we’ll be in it within a few days.”
Jan. 4, 2013 / Amazon’s Dec. 24 Outage: A Closer Look, by Charles Babcock — Amazon Web Services once again cites human error spread by automated systems for loss of load balancing at key facility Christmas Eve.
Nov. 15, 2013 / Microsoft Pins Azure Slowdown on Software Fault, by Charles Babcock — Microsoft Azure GM Mike Neil explains the Oct. 29-30 slowdown and the reason behind the widespread failure.
May 23, 2014 / Rackspace Addresses Cloud Storage Outage, by Charles Babcock — Solid state disk capacity shortage disrupts some Cloud Block storage customers’ operations in Rackspace’s Chicago and Dallas data centers. Rackspace’s status reporting service said the problem “was due to higher than expected customer growth.”
July 20, 2014 / Microsoft Explains Exchange Outage, by Michael Endler — Some customers were unable to reach Lync for several hours Monday, and some Exchange users went nine hours Tuesday without access to email.
Aug. 15, 2014 / Practice Fusion EHR Caught in Internet Brownout, by Alison Diana — A number of small physician practices and clinics sent home patients and staff after cloud-based electronic health record provider Practice Fusion’s site was part of a global two-day outage.
Sept. 26, 2014 / Amazon Reboots Cloud Servers, Xen Bug Blamed, by Charles Babcock — Amazon tells customers it has to patch and reboot 10% of its EC2 cloud servers.
Dec. 22, 2014 / Microsoft Azure Outage Blamed on Bad Code, by Charles Babcock — Microsoft’s analysis of Nov. 18 Azure outage indicates engineers’ decision to widely deploy misconfigured code triggered major cloud outage.
Aug. 20, 2015 / Google Loses Data: Who Says Lightning Never Strikes Twice? by Charles Babcock — Google experienced high read/write error rates and a small data loss at its Google Compute Engine data center in St. Ghislain, Belgium, Aug. 13-17 following a storm that delivered four lightning strikes on or near the data center.
Sept. 22, 2015 / Amazon Disruption Produces Cloud Outage Spiral, by Charles Babcock — Amazon DynamoDB failure early Sunday set off cascading slowdowns and service disruptions that illustrate the highly connected nature of cloud computing. A number of Web companies, including Airbnb, IMDb, Pocket, Netflix, Tinder, and Buffer, were affected by the service slowdown and, in some cases, service disruption. The incident began at 3 a.m. PT Sunday, or 6 a.m. in the location where it had the greatest impact: Amazon’s most heavily trafficked data center complex in Ashburn, Va., also known as US-East-1.
May 12, 2016 / Salesforce Outage: Can Customers Trust the Cloud?, by Jessica Davis — The Salesforce service outage started on Tuesday with the company’s NA14 instance, affecting customers on the US West Coast. And while service was restored on Wednesday after nearly a full day of downtime, the instance has continued to experience a degradation of service, according to Salesforce’s online status site.
March 7, 2017 / Is Amazon’s Growth Running a Little Out of Control? by Charles Babcock — After a five-hour S3 outage in US East-1 Feb. 28, AWS operations explains that it was tougher to restart its S3 index system this time than the last time they tried to restart it.
Writes Babcock: “Given the fact that the outage started with a data entry error, much reporting on the incident has described the event as explainable as a human error. The human error involved was so predictable and common that this is an inadequate description of what’s gone wrong. It took only a minor human error to trigger AWS’ operational systems to start working against themselves. It’s the runaway automated nature of the failure that’s unsettling. Automated systems operating in an inevitably self-defeating manner is the mark of an immature architecture.”
Fast Forward to Today
As Sal Salamone detailed neatly this week, in his piece about lessons learned from recent major outages: Cloudflare, Fastly, Akamai, Facebook, AWS, Azure, Google, and IBM have all had calamities similar to this in 2021-22. Human errors, software bugs, power surges, automated responses having unexpected consequences, all causing havoc.
What will we be writing 15 years from now about cloud outages?
Maybe more of the same. But you might not be able to read it if there’s lightning in Virginia.