Defining the Enterprise Cloud Service – Part 4: Endless 9s Reliability

I recently released the first, second and third installments of a six-part blog series about what it takes to have an enterprise-ready cloud service, and the three characteristics that differentiate an enterprise-grade cloud service from a typical consumer service: security, reliability and trust. As a quick refresher, there are five categories to look at when evaluating a cloud service for security, reliability and trustworthiness:

  • Development for the enterprise
  • Endless 9s reliability
  • Benchmarked and audited service
  • Strong encryption throughout
  • Singular focus on the customer

I’ll focus on “endless 9s” reliability in this post. Availability of a cloud service is important. People I know get quite agitated when Facebook or Twitter is unavailable, but imagine the repercussions when a big enterprise such as Genomic Health can’t access a critical application, such as CRM. The phrase “four 9s availability” has been the benchmark for providers delivering critical online or hosted cloud services. Whether it’s five 9s or three 9s, the punch line is really how providers actually deliver those 9s.

Here is some simple math: I have a choice between 99.9 percent uptime for service A and 99.99 percent uptime for service B. Out of the 8,765 hours (or 525,900 minutes) in a year, service B guarantees 39.44 more minutes of uptime per month, on average. If I am relying on key authentication into my apps, then the decision between service A and service B looks pretty simple.
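
To make that arithmetic easy to check, here is a minimal Python sketch. It assumes each service uses its full downtime allowance and takes the 8,765-hour year from above; the function and variable names are illustrative only and do not come from any real SLA tooling.

```python
# Illustrative sketch; names are assumptions, not part of any real SLA tool.
HOURS_PER_YEAR = 8_765                    # figure used in the post
MINUTES_PER_YEAR = HOURS_PER_YEAR * 60    # 525,900 minutes


def sla_downtime_per_month(uptime_percent: float) -> float:
    """Average minutes per month a service may be down under its uptime SLA."""
    return MINUTES_PER_YEAR * (100 - uptime_percent) / 100 / 12


service_a = sla_downtime_per_month(99.9)    # about 43.8 minutes
service_b = sla_downtime_per_month(99.99)   # about 4.4 minutes
extra_uptime = service_a - service_b        # about 39.44 minutes

print(f"Service B is up about {extra_uptime:.2f} more minutes per month")
```

Put differently, three 9s allows a bit under 44 minutes of downtime a month, while four 9s allows a bit over four.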

However, that extra 0.09 percent of uptime typically comes with some hefty service premiums (sometimes at 10x the cost). Let’s ignore cost differentials, and instead complicate the math a bit with maintenance windows, the periods when a service is allowed to be down for patches, upgrades, etc. Now, service A provides 99.9 percent uptime with zero maintenance windows. Service B provides 99.99 percent uptime with 30-minute weekly maintenance windows. Thirty-minute windows are pretty typical, and often they are longer. Using the same available hours per year (8,765), service A seems to be the better choice, as it guarantees about 90 more minutes of uptime every month than service B. That’s a significant number, especially for key authentication systems that users all over the world access. A 30-minute weekly maintenance window could black out an entire office in Singapore at a critical time each week. Even if the weekly maintenance window is 10 minutes, service A still wins, though only by about four minutes a month.
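
The comparison with maintenance windows can be sketched the same way. This Python snippet assumes the weekly window counts as downtime on top of the SLA allowance and that a year has 8,765 / 168 (about 52.2) weeks; the helper name `monthly_downtime` is hypothetical.

```python
# Illustrative sketch; names and the way windows are counted are assumptions.
MINUTES_PER_YEAR = 8_765 * 60          # 525,900, the figure used in the post
WEEKS_PER_YEAR = 8_765 / (24 * 7)      # about 52.2 weeks


def monthly_downtime(uptime_percent: float, weekly_window_minutes: float = 0) -> float:
    """Average minutes of downtime per month: SLA allowance plus maintenance."""
    sla_minutes = MINUTES_PER_YEAR * (100 - uptime_percent) / 100
    maintenance_minutes = weekly_window_minutes * WEEKS_PER_YEAR
    return (sla_minutes + maintenance_minutes) / 12


service_a = monthly_downtime(99.9)                 # no maintenance windows
gap_30 = monthly_downtime(99.99, 30) - service_a   # about 91 minutes
gap_10 = monthly_downtime(99.99, 10) - service_a   # about 4 minutes

print(f"30-minute windows: service A is up about {gap_30:.0f} more minutes per month")
print(f"10-minute windows: service A is up about {gap_10:.0f} more minutes per month")
```

Under these assumptions, the weekly window has to shrink to roughly nine minutes before service B's extra 9 starts to pay off.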

One of the coolest things about working at Okta is getting to work with Adam D’Amico, who is the genius behind our operations architecture. Adam has not only built a scalable architecture with no maintenance windows, but he has built it entirely virtualized in AWS. We’ve had two minutes of total downtime in 2012 so far. We upgrade and manage our production service weekly and don’t burden our customers with maintenance windows. What’s more, we provide a better uptime guarantee to our customers than what AWS provides to us. Remember the thunderstorms along the East Coast in July that caused an AWS outage? We lost an entire data region, yet Okta’s service was available the entire time. Our customers didn’t experience any downtime.

Uptime SLAs are important because they provide confidence for customers, but also because reliability is security. Users expect their cloud services (especially critical applications such as identity) to work like a utility: Flip the switch and the lights come on; click a button and users are logged in. The number of 9s in that SLA is important, but what’s equally important is how cloud providers deliver them.