What can we learn from Amazon's cloud failure?
Amazon's had a rough couple of days. The server outage that started late Wednesday night caused a host of service failures for customers who relied on Amazon infrastructure. The impact was wide-ranging, from portions of the NYTimes and ProPublica to social media sites Quora and Foursquare, and from cloud-based companies like Heroku to Silicon Valley startups.
The collapse is a rarity for the company, which is known for its reliability and whose cloud services provide businesses with a scalable, flexible and affordable way to store and deliver content.
Amazon seems to have the problem (mostly) under control at this point (many sites are back up or have found ways to work around the issue), but the event highlights a fundamental problem pointed out by critics -- that, while a shift to the cloud saves money in IT and infrastructure costs, it leaves businesses vulnerable if and when servers crash, as we have seen over the last day and a half.
And, as Ben Parr from Mashable pointed out, the event revealed that Amazon’s cloud redundancies failed to stop a mass outage. "Its Availability Zones are supposed to be able to fail independently without bringing the whole system down. Instead, there was a single point of failure that shouldn’t have been there," said Parr.
So, now that many sites are back on their feet, what can companies learn from this experience?
Cheezburger Network CEO Ben Huh said that outages like this can be a learning opportunity for companies.
"It's not a catastrophe unless something valuable (like user data) was lost," said Huh. "It's an opportunity to learn about the service provider's weakness and how to design more stable, reliable systems. Services recover very quickly from outages as long as they are relatively short. Long-term outages are another beast."
I also spoke with Margaret Dawson, VP of product management, and Ian Huynh, VP of engineering, at Hubspan, a cloud-based service provider here in Seattle.
Dawson and Huynh both said one of the greatest takeaways from the Amazon event is a reminder that companies must take it on themselves to understand the services their cloud provider offers, and to build redundancy (the concept of storing content in, and being able to access it from, more than one location) into their applications.
"While the Amazon cloud provides a lot of redundancy functionalities specifically, I don’t believe Amazon cloud ever said they were 100 percent redundant," said Huynh. "So, one question you might want to ask is, 'Was the application built on top of Amazon cloud, is it able to take advantage of some of the failover functionalities that the cloud provides?' Just because the infrastructure itself is redundant, if you build your application without a mindset of being resilient and being able to failover, it’s a moot point there in being able to say that your application is redundant."
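To make the idea of application-level failover concrete, here is a minimal Python sketch, not a description of how any of the affected sites or Amazon itself handle it. The names `fetch_with_failover`, `primary_zone` and `secondary_zone` are hypothetical stand-ins for the same content being served from two locations; the application simply tries each location in turn instead of assuming the first one is always up.

```python
from typing import Callable, Sequence


class AllBackendsFailed(Exception):
    """Raised when every configured backend has failed."""


def fetch_with_failover(backends: Sequence[Callable[[], str]]) -> str:
    """Try each backend in order and return the first successful result."""
    errors = []
    for backend in backends:
        try:
            return backend()
        except Exception as exc:  # a real system would catch narrower errors
            errors.append(exc)
    raise AllBackendsFailed(f"{len(errors)} backend(s) failed: {errors}")


# Hypothetical stand-ins for one piece of content stored in two locations.
def primary_zone() -> str:
    raise ConnectionError("primary zone unreachable")


def secondary_zone() -> str:
    return "content from secondary zone"


# The primary fails, so the request is served from the secondary location.
result = fetch_with_failover([primary_zone, secondary_zone])
```

The point of the sketch is the one Huynh makes: the second location only helps if the application is written to use it when the first one fails.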
"As the hype around the cloud has become so loud, people forget to look under the covers," said Dawson. "They're just thinking 'Oh I'll just throw my storage up there, I’m just going to run this application' and they really need to do due diligence around the company running the application, (and ask) 'do they also run the infrastructure or run the data center?'"
Dawson said businesses looking to move to the cloud need to know who is running the infrastructure and understand the provider's Service Level Agreements (SLAs) around uptime, reliability and business continuity. And, of course, companies should have a backup plan to keep their business running if there is a cloud failure -- such as hosting an application on two separate providers. (However, FathomDB founder Justin Santa Barbara pointed out that each Amazon global region or data center has its own rules and features, which can make a switch difficult.)
In the end, Dawson said people need to think carefully about their business strategy and how the cloud can be useful to them, build redundancy into their applications, and then step back and not be afraid to make the move.
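For a rough sense of what an uptime figure in an SLA means in practice, the short Python sketch below converts a percentage into allowed downtime and estimates the combined availability of two providers. The percentages are illustrative, not Amazon's actual SLA terms, and the two-provider calculation assumes the providers fail independently, which the outage itself shows is not something to take for granted.

```python
HOURS_PER_YEAR = 365 * 24


def allowed_downtime_hours(uptime_percent: float,
                           period_hours: float = HOURS_PER_YEAR) -> float:
    """Hours of downtime a given uptime percentage permits over a period."""
    return period_hours * (1 - uptime_percent / 100)


def combined_availability(a_percent: float, b_percent: float) -> float:
    """Availability when either of two providers can serve the application.

    Assumes the two providers fail independently of each other.
    """
    both_down = (1 - a_percent / 100) * (1 - b_percent / 100)
    return 100 * (1 - both_down)


# Illustrative figures only: 99.9 percent uptime allows under nine hours
# of downtime per year.
yearly_downtime = allowed_downtime_hours(99.9)

# Two independent providers at 99 percent each combine to about 99.99 percent.
two_provider_uptime = combined_availability(99.0, 99.0)
```

By this arithmetic, an outage lasting a day and a half would exceed a full year's downtime allowance under a 99.9 percent figure.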
"The reality is most cloud providers -- and Amazon would definitely be included in this group -- are going to provide you with an industry best-class capability that would be very challenging for a lot of companies to build and manage themselves."
Any thoughts? Know of anyone locally who was affected? We'd love to hear about it.
(See previous coverage, with updates from Amazon: Amazon servers take down Reddit, Foursquare and others)