There’s never a good time for a cloud outage. But when an eight-hour cloud outage hits in December, at the height of the holiday shopping season, there are going to be some discussions among executives, especially those at retail companies. CIOs are going to be asking what went wrong, how quickly it can be fixed, and what needs to change so it never happens again.
Such was the case on Tuesday, Dec. 7, 2021, when Amazon Web Services experienced an outage and took down much of the infrastructure around holiday shopping with it. Certainly, Amazon’s own enormous retail operation was impacted, including the company’s Whole Foods grocery business. So were all the vendors that rely on Amazon’s marketplace to sell and ship their products. But any number of other companies that rely on AWS for other purposes were hit by the outage, too.
The outage came at a time when many organizations had shifted more workloads to the cloud as a way to deal with the complications of the pandemic. That made this cloud outage hurt even more, and it will do the same for future outages.
“We definitely had a lot of customers who panicked,” says Brent Ellis, Forrester Research VP and analyst who follows cloud resiliency. “This was in the middle of Christmas season. There were retailers who couldn’t sell things. There were banks that couldn’t process transactions, largely because mobile was down.”
Enterprises’ Pandemic Pivot to Cloud
Analyst firm Omdia estimates that in 2019 about 25% of workloads were running in the cloud. When the pandemic hit, that number shot up to nearly 50%. Today it’s dropped back to about 44%, according to Roy Illsley, chief analyst for IT ecosystems and operations at Omdia. It makes sense that CIOs are paying more attention to cloud resiliency now that so much more of their operations run there.
That attention was heightened further after the Dec. 7, 2021, AWS outage, says Ellis.
“We got a number of inquiries asking the question: Do we need to be resilient across multiple clouds or multiple regions?” Ellis says. “For most, probably 90-95% of businesses, the costs and technical effort in doing that scares them away from it.”
For instance, he says, moving a Windows Server from AWS to Azure requires changing the virtual machine footprint, connecting to different types of resources, and reconfiguring cloud-based networking. These things are not the same across every cloud.
“There are a lot of infrastructure primitives that are just managed differently between the clouds,” he says.
Another option is to use an abstraction layer such as VMware Cloud Foundation (VCF) or a container platform such as Kubernetes.
“VCF is probably the easiest one to implement because you don’t have to do a lot of change to your actual servers, but it’s also pretty costly because you’re not only paying for the compute resources in the cloud. You are also paying for the VMware licensing,” Ellis says.
The bottom line is that it can be much more expensive and much more complex to set up your operation to use multiple clouds to protect yourself from a catastrophic failure of one of those clouds. After investigating what it all entails, some CIOs may choose to take the hit of an outage instead, particularly if revenue is not impacted by the outage.
Cloud Providers Improve Resiliency
Cloud providers themselves have also been working to improve their own resiliency. It is possible to set up your operation to fail over from one AWS region to another if the first region fails. But that will cost you extra, and it won’t be included in your basic cloud contract.
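To make the idea of region failover concrete, here is a minimal Python sketch of the client-side version: try a primary region and fall back to a secondary one if the first is unreachable. The function name `call_with_failover`, the `request` callable, and the region labels are all hypothetical and are not part of any AWS API. In practice, failover is more often handled by DNS routing or load balancers than by application code, so treat this as an illustration of the logic only.

```python
from typing import Callable, Optional, Sequence, TypeVar

T = TypeVar("T")


def call_with_failover(regions: Sequence[str], request: Callable[[str], T]) -> T:
    """Try `request` against each region in order and return the first success.

    `regions` is an ordered list of region labels (primary first).
    `request` is a hypothetical callable that performs the work in one region
    and raises ConnectionError when that region is unavailable.
    """
    if not regions:
        raise ValueError("at least one region is required")

    last_error: Optional[Exception] = None
    for region in regions:
        try:
            return request(region)
        except ConnectionError as exc:
            # This region is down; remember the error and try the next one.
            last_error = exc

    # Every region failed, so surface the outage to the caller.
    raise RuntimeError("all regions failed") from last_error


if __name__ == "__main__":
    def fake_request(region: str) -> str:
        # Simulate an outage in the primary region only.
        if region == "primary-region":
            raise ConnectionError("primary region unavailable")
        return f"order processed in {region}"

    print(call_with_failover(["primary-region", "secondary-region"], fake_request))
```

Even in this toy form, the cost is visible: the secondary region has to be running, and paid for, before the primary one fails.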
For retailers operating ecommerce sites in December, such a setup likely makes sense. They stand to lose significant revenue for every hour their sites can’t process orders, says Gartner VP and analyst Sid Nag.
However, other organizations that don’t stand to lose significant revenue from an outage may make a different choice. As organizations weather more outages, they gain a clearer understanding of the tradeoffs between cost and risk when it comes to cloud resiliency.
“One of the frustrations that CIOs find is that they seem to expect resiliency and then get a bit upset when they haven’t gotten what they expected,” Illsley says. “There’s a balancing of expectations of what’s included in the cloud, what’s excluded, and where additional costs and risks come in.”
But Illsley says that IT leaders’ level of understanding is maturing when it comes to cloud resiliency.
What You Get for Your Money
“CIOs are becoming more aware of what you get for your money in the cloud, but equally, what are the options you’ve got to make yourself more resilient in the cloud,” Illsley says. “That’s the journey that they’ve been on.”
That could mean hosting operations in your own data center and setting up a failover to the cloud. It could mean that you pay more for a failover within the regions of a single cloud service provider such as AWS. It could mean that you use a particular disaster recovery vendor that works with your chosen public cloud provider.
“Which one is the best? The answer is, it depends on what you want, what your issues are, where you are in the world, and what you are doing,” says Illsley. “You’ve got to make a strategy for yourself that fits your budget at that time.”
Negotiating Cloud Contracts for Resilience
One of the things that CIOs may want to pay close attention to going forward is the negotiation of their contracts with cloud providers. Every cloud provider has an SLA (service level agreement), says Ellis. If their services are down for 8 hours, you won’t need to pay them for those 8 hours. (The process of getting compensated for the cost of that downtime is different for each provider.) However, they are not going to pay you for revenue you lost out on while their services were down. That’s not part of the deal for most organizations.
“One of the things that very large businesses try to do when they are negotiating with the cloud provider is to negotiate some sort of shared risk,” Ellis says. “Maybe it would be that the cloud provider would be responsible for revenue loss up to 20% or something like that. Whether the cloud provider agrees to that provision usually depends on the scale of the contract. But that’s what people at the enterprise scale are trying to do to mitigate against a cloud-based outage.”
Still Better Than Your On-Premises Datacenter
Given the attention to cloud resilience, one might think that cloud outages represent a big problem. If you are the CIO of Target in December, maybe they do. But it helps to keep things in perspective.
Ellis notes that there are 8,760 hours in a given year, and the AWS outage in December was roughly eight hours. Do the math and “you end up with 99.9% availability for the year, which is still better than most internal private data centers.”
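The arithmetic behind that figure is easy to check. The short Python sketch below assumes a non-leap year of 8,760 hours and a single eight-hour outage, as in Ellis’s example; the function name `availability_percent` is illustrative only.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours in a non-leap year


def availability_percent(downtime_hours: float,
                         period_hours: float = HOURS_PER_YEAR) -> float:
    """Return availability over the period as a percentage."""
    if not 0 <= downtime_hours <= period_hours:
        raise ValueError("downtime must be between 0 and the period length")
    return 100.0 * (1 - downtime_hours / period_hours)


if __name__ == "__main__":
    # One eight-hour outage in a year.
    print(f"{availability_percent(8):.2f}%")  # prints "99.91%"
```

Eight hours out of 8,760 is about 0.09% downtime, which rounds to the 99.9% availability Ellis cites.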
That’s probably why no one is talking about pulling out of AWS.
Rather, the conversation today is “How do we architect around it?” Ellis says.