I recently released the first, second and third installments of a six-part blog series about what it takes to have an enterprise-ready cloud service, and the three characteristics that differentiate an enterprise-grade cloud service from a typical consumer service: security, reliability and trust. As a quick refresher, there are five categories to look at when evaluating a cloud service for security, reliability and trustworthiness:
I’ll focus on “endless 9s” reliability in this post. Availability of a cloud service is important. People I know get quite agitated when Facebook or Twitter is unavailable, but imagine the repercussions when a big enterprise such as Genomic Health can’t access a critical application, such as CRM. The phrase “four 9s availability” has been the benchmark for providers delivering critical online or hosted cloud services. Whether it’s five 9s or three 9s, the punch line is really how providers actually deliver those 9s.
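To put those 9s in perspective, here is a minimal sketch of the arithmetic behind them. It assumes the common convention that "N 9s" means an availability of 99.9...% with N nines in total, measured over a 365-day year. The function name `allowed_downtime_minutes` is my own, purely for illustration, and is not taken from any provider's SLA tooling.

```python
# Rough downtime budget implied by "N 9s" of availability.
# Assumption: a 365-day year (leap years ignored) and downtime measured in minutes.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes


def allowed_downtime_minutes(nines: int) -> float:
    """Return the yearly downtime budget, in minutes, for N nines of availability."""
    if nines < 1:
        raise ValueError("nines must be at least 1")
    unavailability = 10 ** -nines  # e.g. 4 nines -> 0.0001 (99.99% available)
    return MINUTES_PER_YEAR * unavailability


if __name__ == "__main__":
    for n in (3, 4, 5):
        minutes = allowed_downtime_minutes(n)
        print(f"{n} nines: about {minutes:.2f} minutes of downtime per year")
```

Under these assumptions, three 9s leaves room for roughly 8.76 hours of downtime a year, four 9s for about 52.6 minutes, and five 9s for a little over 5 minutes. The number is only a budget; it says nothing about how a provider actually stays within it.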
The East Coast is still reeling from a powerful storm that swept through the region on Friday, but the storm’s effects reverberated beyond the devastation in the Mid-Atlantic. On Friday, a lightning storm partially knocked out one of Amazon’s AWS availability zones in Virginia, and companies all over the globe felt the pinch. Netflix, Instagram, Pinterest and many other cloud-based companies that are built on AWS were down for significant periods of time or had sporadic availability issues.
The fact is that the downtime these companies — and their customers — experienced didn’t need to occur. And the blame shouldn’t be placed on the infrastructure that Amazon provides, nor is it an indicator that the public cloud is any less reliable than any other IT infrastructure.
The reaction to the AWS outage is a reminder that thousands of businesses and millions of users are consuming public cloud services based on AWS. As service providers that are running on that AWS infrastructure, we all have the responsibility to make the appropriate level of investment in our software and operations to ensure an acceptable level of reliability for our customers.
So, you’ve found the cloud application of your dreams. It does everything you ever thought you could want and ten things you didn’t know you wanted but now can’t imagine living without. It took less than 13 seconds to fully configure, and after rolling it out you found that several users had placed pictures of your IT team on their desks next to pictures of their kids and spouses. You feel pretty good about your purchase.
Then it happens…
At 3:30 ET your helpdesk starts lighting up. Your cloud app is serving error pages intermittently, and 20 minutes later the app goes down completely. Users are asking you for answers, the vendor’s support page is silent, and after you sit on hold for 20 minutes, a tech tells you that they hope to have more information soon. Eight hours later, you get a form email telling you that the service is partially up, but your users’ data won’t be available for another eight hours.