So, you’ve found the cloud application of your dreams. It does everything you ever thought you could want and ten things you didn’t know you wanted but now can’t imagine living without. It took less than 13 seconds to fully configure, and after rolling it out you found that several users had placed pictures of your IT team on their desk next to pictures of their kids and spouses. You feel pretty good about your purchase.
Then it happens…
At 3:30 ET your helpdesk starts lighting up. Your cloud app is serving error pages intermittently and 20 minutes later, the app goes down completely. Users are asking you for answers, the vendor’s support page is silent and after sitting on hold for 20 minutes, a tech tells you that they hope to have more information soon. Eight hours later, you get a form email telling you that the service is partially up, but your users’ data won’t be available for another 8 hours. Twenty-four hours after that, the data returns, you’ve lost two days of user productivity, and your IT team’s picture moves from desks to dart boards.
In the cloud world, you as the IT person hand over control of critical application infrastructure to your partner, expecting that the cloud vendor can manage that infrastructure better and more cheaply than you can.
But not every cloud vendor delivers on this promise, so it’s just as important to evaluate each application’s architecture as its feature set.
Now imagine you are evaluating not just one application, but a core cloud infrastructure service that will be the central point through which you secure and manage access to ALL your applications.
An on-demand identity & access management service is one of those core cloud infrastructure services and should be:
- Built for Web Scale – the service must scale up and down seamlessly with your needs.
- Always Available – the service must be architected for zero downtime. No maintenance windows required.
- Secure – the service must be more secure than anything you could build and operate on your own.
- Constantly Evolving – the service must deliver rapid innovation that enables new capabilities and insulates you from the constantly changing IT landscape.
At Okta, we take this to heart and have built the software, operationalized the processes and hired the people it takes to deliver on all of these fronts.
Emiliano wrote a great post about the infrastructure we have put in place that enables us to deliver innovative feature updates and new application integrations weekly.
In this post, I am going to focus on the investments we have made around scalability and availability, and in a future post we will dive into specifics on what we do across the board to ensure the highest level of security in our service.
One of the most critical aspects of Okta’s architecture is that it is completely multitenant. With multitenancy, all of our customers share the same underlying environment. Because it is shared, we can make the infrastructure extremely robust in terms of scale, redundancy, monitoring and processes.
The entire company is focused on making this one environment perfect.
The overall picture of the service looks like this:
The system consists of a front-end tier containing proxy load balancers and firewall services, an app tier, where our software mostly runs, a set of databases, and of course hosting in AWS. It is designed for high scale, high throughput, and 100% availability.
The core design elements of the system are:
Every component other than the databases is completely stateless. As a result, above the database tier, any server in the stack can handle any request.
That means all of the stateless components of our system can be scaled up at will simply by spinning up new VMs in our Amazon Web Services hosting platform. Any individual component can fail at any time, and its traffic will simply be routed to one of several other active systems.
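As a rough illustration of why statelessness makes this possible, here is a minimal sketch of a round-robin pool that skips unhealthy servers. The class name, server names, and health-tracking scheme are hypothetical and are not taken from Okta's actual load balancers.

```python
from itertools import count


class StatelessPool:
    """Round-robin over interchangeable servers, skipping ones marked down."""

    def __init__(self, servers):
        self.servers = list(servers)
        self.healthy = set(self.servers)
        self._counter = count()

    def mark_down(self, server):
        # A failed server is simply removed from the healthy set.
        self.healthy.discard(server)

    def add(self, server):
        # Scaling up is just registering another identical server.
        if server not in self.servers:
            self.servers.append(server)
        self.healthy.add(server)

    def route(self):
        # Try each server at most once per request, starting at the next slot.
        for _ in range(len(self.servers)):
            candidate = self.servers[next(self._counter) % len(self.servers)]
            if candidate in self.healthy:
                return candidate
        raise RuntimeError("no healthy servers available")


pool = StatelessPool(["app-1", "app-2", "app-3"])
pool.mark_down("app-2")  # simulate a failed instance
picked = [pool.route() for _ in range(4)]
```

Because no server holds session state, the pool never needs to remember which server handled a user's previous request. Any healthy server will do.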
Functionally Optimized Databases
Throughout the system we generally like to use the right tool for the job. When it comes to databases we follow that rule: we have segmented the databases based on access patterns and the types of data being stored, ensuring that we can meet stringent requirements in each dimension without compromising the others. In our case that breaks out as follows:
- Entity DB: Holds extensible configuration and application meta-data – medium data size, medium throughput requirements, needs a flexible schema.
- Session DB: Small data size, high throughput requirements, needs to support schema changes without going down.
- Transaction/Logging DB: High data size, high write throughput / medium read throughput.
- Reporting: High data size, medium write throughput, medium read throughput, de-normalized to support querying on many dimensions.
It’s complicated, but because Okta is multitenant, we can do this once and do it right.
Data Replication and Backup
While it’s possible to add and remove stateless components above the data tier at will, that cannot be true of the databases. Because we built this tier on AWS EC2, where instances can disappear at any time, this is especially challenging. So how do we account for that?
- We run a master-master configuration with read replicas so there is no single point of failure. If one master goes down, the other takes over.
- Replicas are live across six availability zones and we have a time delayed replica in a seventh.
- For further redundancy a full replica of the entire system is running in a geographically separate datacenter.
- For backups we do incremental EBS snapshotting to S3 and take full portable backups in case we need to restore outside of AWS.
Zero Planned Downtime
Okta must be available for any other app to be accessed and thus there’s no good time to be down. With this in mind, we designed Okta for zero planned downtime. All upgrades happen in place with minimal impact on administrators and no impact on end users.
Most services try to solve for zero downtime in one of two ways: they either require engineers to write code that can handle reading and writing to multiple versions of the DB, or they have a read-only mode. The first approach creates a lot of inefficiencies in the code and reduces agility.
We have taken the second approach, having a read-only mode, a step further by supporting a read-only mode at all layers of the stack. Combining that with a deployment orchestration process that ensures upgraded app servers only talk to upgraded versions of the DB allows us to maintain service availability while also delivering continuous innovation.
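To make the version-gating idea concrete, here is a minimal sketch of an upgrade in progress. The `Database` and `AppServer` types, the version numbers, and the routing rule are all assumptions made for illustration. The post does not describe the actual orchestration process in this much detail.

```python
from dataclasses import dataclass


@dataclass
class Database:
    name: str
    schema_version: int
    read_only: bool = False


@dataclass
class AppServer:
    name: str
    schema_version: int


def pick_db(server, databases):
    """Return the database whose schema version matches the server, if any."""
    for db in databases:
        if db.schema_version == server.schema_version:
            return db
    return None


def handle(server, databases, op):
    """Serve a 'read' or 'write' request, honoring read-only mode."""
    db = pick_db(server, databases)
    if db is None:
        return "unavailable"
    if op == "write" and db.read_only:
        return "rejected: read-only"
    return f"{op} via {db.name}"


# Hypothetical mid-upgrade snapshot: both schema versions exist and the
# service is in read-only mode while app servers are swapped over.
old_db = Database("db-v1", schema_version=1, read_only=True)
new_db = Database("db-v2", schema_version=2, read_only=True)
databases = [old_db, new_db]
old_server = AppServer("app-old", schema_version=1)
new_server = AppServer("app-new", schema_version=2)

during_read = handle(new_server, databases, "read")
during_write = handle(new_server, databases, "write")

# Once every app server is upgraded, writes are re-enabled.
new_db.read_only = False
after_write = handle(new_server, databases, "write")
```

In this sketch reads keep working throughout, writes are briefly refused, and an upgraded app server never sees an old schema.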
Able to Handle the Unexpected
Unfortunately problems occur that we can’t plan for. Networks go down, storage fails, software breaks in unexpected ways. Critical cloud services like Okta need to be built and operated with the expectation that these problems will occur.
The first layer of defense against the unexpected is robust instrumentation and monitoring for all components of the system. We split this into two categories: external monitors and internal monitors.
For external monitoring, Okta uses two third-party services with globally distributed test agents to constantly monitor the Okta application. This provides us with a constant feed of real data on how our service is operating. Because we are fully multitenant, the data is real and applicable to all of our customers.
If the monitors say we’re up, we’re definitely up. If they say we’re down or slow, we’re definitely down or slow. Any problem seen from multiple monitors results in a notification to our operations team and an immediate response.
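The "seen from multiple monitors" rule can be sketched as a simple threshold check. The function name, the probe names, and the threshold of two are assumptions for illustration. The post does not say how the real services aggregate results.

```python
def evaluate_probes(results, min_failing=2):
    """Decide whether to notify operations.

    results maps a probe name to True (healthy) or False (down or slow).
    A single failing probe may be a local network blip, so notification
    only fires once at least min_failing probes agree.
    """
    failing = sorted(name for name, healthy in results.items() if not healthy)
    return len(failing) >= min_failing, failing


# Hypothetical probes from two external services in different regions.
probes = {
    "monitor-a/us-east": True,
    "monitor-a/eu-west": False,
    "monitor-b/apac": False,
}
notify, failing = evaluate_probes(probes)
```

Requiring agreement between independent vantage points trades a little detection delay for far fewer false alarms.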
We use internal monitors to show us not only when things are having problems, but more importantly, what is having problems. They are more sensitive than the external monitors, so they frequently give us warning of problems before they affect site performance or availability. Okta uses internal monitors on all subsystems, and instruments all of our software components for maximum visibility.
The last line of defense is still the read-only mode that we mentioned earlier, which allows the Okta service to stay up even when unplanned disasters strike. For example, even if AWS suffered an outage across every availability zone where our master DBs were running, the service would still function in read-only mode on the replicas.
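A minimal sketch of that fallback behavior, assuming a data tier that knows which nodes are reachable. The `DataTier` class, the node names, and the error type are hypothetical.

```python
class ReadOnlyModeError(Exception):
    """Raised when a write arrives while no master is reachable."""


class DataTier:
    def __init__(self, masters, replicas):
        # Each dict maps a node name to True (reachable) or False (down).
        self.masters = dict(masters)
        self.replicas = dict(replicas)

    def execute(self, op):
        """Serve a 'read' or 'write', degrading to read-only if needed."""
        live_masters = [n for n, up in self.masters.items() if up]
        if live_masters:
            return f"{op} served by {live_masters[0]}"
        live_replicas = [n for n, up in self.replicas.items() if up]
        if op == "read" and live_replicas:
            return f"read served by {live_replicas[0]}"
        if op == "write" and live_replicas:
            raise ReadOnlyModeError("service is in read-only mode")
        raise RuntimeError("no database nodes reachable")


# Simulate losing every master while one replica survives elsewhere.
tier = DataTier(
    masters={"master-a": False, "master-b": False},
    replicas={"replica-1": True},
)
result = tier.execute("read")
```

In this sketch reads continue to succeed during the outage, while writes fail fast with a clear error instead of hanging.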
Is Your Cloud Service Built Right?
As a customer looking at cloud services, how do you make sure your vendors have built their service “right”?
- Is it multitenant? If not, is your tenant fully monitored? Who gets beeped if it’s slow or down?
- How many data centers or availability zones do they operate in? One is unacceptable. Two is dangerous. Three or more is good.
- How do they replicate data across datacenters? Is it in real time?
- How do they fail over from one DC to another? Will all of your data, including configurations and reports, be fully up to date?
- How do they upgrade the service? Does everyone get upgraded at once? Be concerned if the vendor is trying to manage several patch versions at once.
- How do they scale up? How do they handle spikes in load?
- Is deployment fully automated? Is testing fully automated? Manual processes are more prone to failure. Avoid services without sufficient automation.
- How long does it take to upgrade to a newer version (for all customers)? If it’s more than an hour, get nervous. It means that something is not right.
Be sure to read our CEO Todd McKinnon’s piece today in Forbes about what CIOs should look for when evaluating cloud vendors.
At Okta we have spared no expense to ensure that our service is secure, reliable and will scale with the needs of our customers. In fact, we publish our availability stats monthly. Want to learn more? We’d love to hear from you.