Building your house on a rock

Every great technology company starts with a great product. When you first start the hard work of turning your vision into living, breathing code, you have critical decisions to make about your foundation. It is essential to “build your house on a rock.” In years past the decisions were: desktop or web app? Oracle or SQL Server? Unix or NT? Now they have become: Ruby or Java? Amazon or Rackspace? Single-tenant hosted or multi-tenant?

Part of you knows you'll have time to rework and undo the early choices. In fact this is unavoidable and you just want to move forward. But part of you also knows some of the decisions will be with you for years and years. You know this from your previous products and jobs -- the ones with the code that was “just that way because it used to have to run on HP-UX” or “because Sybase didn't support row-level locking”. You know that the early decisions can make or break the product and the company. You struggle to balance the pressure to move fast and get something working with the time that you think you need to make the right decisions.

Two years ago, when we started building Okta, we faced this situation. In this post I'll talk about the choices we made, the reasons behind them and the lessons learned now that we've been running for a while.

Keep in mind that we made these decisions in the context of our vision - to build a service that is a cloud domain controller for our customers. It has to be massively scalable - it must support tens of thousands of companies and millions of users. It has to be highly reliable. It can't go down, even for planned maintenance. It must be secure; our customers trust it with their most important information.

Cloud Infrastructure-as-a-Service or traditional data center? The first choice we made was between a traditional hosting facility and the more modern Infrastructure-as-a-Service / Cloud environments (including Rackspace Cloud, Amazon Web Services or Microsoft Azure).

We chose to build our service in the cloud using Infrastructure-as-a-Service (IAAS). Many consider this choice and understand the cost advantages of using the cloud. For example, we quickly had production, development and stage environments, totalling nearly 100 hosts at a very manageable monthly cost, especially in the early days.

But most underestimate the primary advantage of building on the cloud and make the wrong assumption about the “time to market and time to value” that IAAS gives you.

The primary reason we chose IAAS was that doing so allows you (and in some cases forces you) to build a more robust and scalable service. IAAS is based on commodity systems and horizontal scale. Nobody is offering “mainframes in the cloud”. It forces you to build your systems to take advantage of the horizontal capabilities available and avoid any supposedly “Highly Available” components like expensive load balancers and shared-state, clustered databases. IAAS vendors offer geographically diverse physical data centers. Providing Disaster Recovery and cross-geography active-active support is far easier in this environment. This is a huge advantage of IAAS. Most people don’t appreciate it when they’re making the decision.

But here’s the incorrect assumption many make about building on IAAS: that it is “faster to get up and running”. The opposite is true. This is worth a future post of its own. I call it the Irony of the Cloud.

IAAS is a relatively new and evolving market and technology. One wise choice we made was to avoid proprietary APIs and systems that locked us into a particular vendor. This gives us flexibility as our requirements evolve.
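
One way to picture this kind of insulation is a thin, provider-neutral interface that application code depends on, with each vendor hidden behind its own adapter. The sketch below is illustrative only: `BlobStore`, `InMemoryBlobStore` and `ReportArchiver` are hypothetical names invented for this example, not Okta's code and not any vendor's API.

```java
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

// Demo entry point, declared first so the file can be run directly.
class VendorNeutralDemo {
    public static void main(String[] args) {
        BlobStore store = new InMemoryBlobStore(); // a vendor adapter would be swapped in here
        ReportArchiver archiver = new ReportArchiver(store);
        archiver.archive("weekly-summary", "all systems normal");
        System.out.println(archiver.load("weekly-summary").orElse("<missing>"));
    }
}

// Hypothetical provider-neutral contract: the only storage type application code sees.
interface BlobStore {
    void put(String key, byte[] data);

    Optional<byte[]> get(String key);
}

// Stand-in adapter for tests and local runs. An adapter for a particular IaaS vendor
// would implement the same interface and keep all vendor SDK calls inside itself.
class InMemoryBlobStore implements BlobStore {
    private final Map<String, byte[]> blobs = new ConcurrentHashMap<>();

    @Override
    public void put(String key, byte[] data) {
        blobs.put(key, data.clone()); // defensive copy so callers cannot mutate stored data
    }

    @Override
    public Optional<byte[]> get(String key) {
        byte[] data = blobs.get(key);
        return data == null ? Optional.empty() : Optional.of(data.clone());
    }
}

// Application code depends on BlobStore only, so changing vendors means writing
// one new adapter rather than touching every caller.
class ReportArchiver {
    private final BlobStore store;

    ReportArchiver(BlobStore store) {
        this.store = store;
    }

    void archive(String reportId, String body) {
        store.put("reports/" + reportId, body.getBytes(StandardCharsets.UTF_8));
    }

    Optional<String> load(String reportId) {
        return store.get("reports/" + reportId)
                    .map(bytes -> new String(bytes, StandardCharsets.UTF_8));
    }
}
```

The trade-off is that the interface can only expose what every candidate vendor supports, so it tends to stay small.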

Multi tenancy at which point on the spectrum? A good way to think about the different ways to implement multi-tenancy is along a spectrum. At one end is a completely separate storage, database, app server and network for every customer. This is very similar to on-premise single tenant, but is run in a third-party data center. Tenant isolation is done physically at every level. At the opposite end of the spectrum, everything is shared and tenant isolation is done high up in the software stack at the application layer.

The latter approach has worked well for the likes of Salesforce.com, giving them tremendous cost advantages and allowing them to innovate quickly. Different vendors implement at different points along this spectrum - sharing storage but not the db, sharing the db at the server level but not at the schema level, sharing application servers but different databases, etc. With the advances in IAAS and virtualization over the last 5 years, there is an important new point on this spectrum: a separate virtual machine for each customer, an option you didn’t have before.

We considered this approach at Okta. The thinking was we could build a system that would manage thousands of virtual machines - a separate one for each individual customer. We’d utilize the virtual machine for tenant isolation. The upside of this is we’d have to spend less time implementing multi tenancy higher in the application layer. However, there are significant downsides to this approach.

First, it requires a significant amount of effort to write software to manage thousands of virtual machines. They need to be provisioned, started, stopped and monitored. And this software would likely be highly coupled to a particular IAAS or virtual machine vendor. (See the comments above about avoiding lock-in.)

Second, virtual machines have a fixed cost in memory, storage and processing; costs that you pay whether you’re buying the physical hardware or using IAAS. When you spin up thousands of virtual machines, this fixed cost is much higher than the relatively small fixed cost for the small number of hosts you need when you implement multi tenancy in the application layer.

Finally, if each customer has their own virtual machine, what do you do when a customer gets too big to run on a single virtual machine and its underlying hardware? The only answer is to split the customer across multiple virtual machines. Now you not only have to write complex, vendor-specific software to manage thousands of virtual machines, but also software to manage a cluster of virtual machines for large customers.

We decided to implement our multi tenancy above the virtual machine. We took an approach similar to Salesforce.com and implemented multi tenancy in the application layer. This decision has served us well. We are able to support our customers with a very affordable cost structure.
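
To make “multi tenancy in the application layer” concrete, here is a minimal sketch. It assumes every stored row carries a tenant identifier and that request-handling code only ever receives an accessor already bound to one tenant. All names (`SharedUserTable`, `TenantScopedUsers`, `UserRecord`, `tenantId`) are hypothetical, and the in-memory list merely stands in for a shared database table; this is not Okta's actual data model.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Objects;

// Demo entry point, declared first so the file can be run directly.
class MultiTenancyDemo {
    public static void main(String[] args) {
        SharedUserTable table = new SharedUserTable();
        table.insert(new UserRecord("acme", "alice"));
        table.insert(new UserRecord("acme", "bob"));
        table.insert(new UserRecord("globex", "carol"));

        TenantScopedUsers acme = new TenantScopedUsers(table, "acme");
        System.out.println(acme.listLogins()); // prints [alice, bob]
    }
}

// One row in a table shared by all tenants; tenantId plays the role of a tenant_id column.
final class UserRecord {
    final String tenantId;
    final String login;

    UserRecord(String tenantId, String login) {
        this.tenantId = Objects.requireNonNull(tenantId);
        this.login = Objects.requireNonNull(login);
    }
}

// Shared storage for every tenant (single-threaded sketch, not safe for concurrent use).
class SharedUserTable {
    private final List<UserRecord> rows = new ArrayList<>();

    void insert(UserRecord row) {
        rows.add(row);
    }

    // Similar in spirit to: SELECT login FROM users WHERE tenant_id = ?
    List<String> loginsForTenant(String tenantId) {
        List<String> result = new ArrayList<>();
        for (UserRecord row : rows) {
            if (row.tenantId.equals(tenantId)) {
                result.add(row.login);
            }
        }
        return result;
    }
}

// The only handle application code receives. Because the tenant is fixed at construction,
// callers cannot forget the tenant filter or read another tenant's rows through it.
class TenantScopedUsers {
    private final SharedUserTable table;
    private final String tenantId;

    TenantScopedUsers(SharedUserTable table, String tenantId) {
        this.table = Objects.requireNonNull(table);
        this.tenantId = Objects.requireNonNull(tenantId);
    }

    List<String> listLogins() {
        return table.loginsForTenant(tenantId);
    }

    void addUser(String login) {
        table.insert(new UserRecord(tenantId, login));
    }
}
```

Binding the tenant once, at the edge of the request, is one common way to keep isolation from depending on every query author remembering a filter.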

SQL or NOSQL at the Data Layer? Five years ago, deciding on the technology for the data layer of your service was easy; just pick your favorite SQL database. Things have changed. Today there are viable NOSQL solutions like HBase, Cassandra and MongoDB.

We looked at them all but decided that, since our service was new and evolving, we should start with a better-known and better-understood relational system. We’d avoid the learning curve while we got our service working.

Now that we have a better idea of the functionality our customers need, we’re looking again at the NOSQL alternatives to give us more scalability and reliability.

Which SQL Database? We never really considered Oracle or any other commercial db. Cost was one factor, but more important was our philosophy to build a horizontally scalable system. When you pay a lot for your relational database, you naturally tend to lean on it for more and make it a single point of failure in your architecture. You wait longer to shard your data and scale things horizontally. You rely on its “scalability” and “Highly Available” features like clustering. These buy you time, but eventually you’ll wish you had gone horizontal from the start.
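
As a rough illustration of what sharding means in practice, the sketch below routes each tenant's data to one of several small database instances by hashing a tenant identifier. It is a sketch under stated assumptions, not a description of Okta's scheme: `ShardRouter` and the shard names are hypothetical, and modulo hashing is only one of several possible routing strategies.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Demo entry point, declared first so the file can be run directly.
class ShardingDemo {
    public static void main(String[] args) {
        // Hypothetical shard names; in a real system these might be connection settings.
        ShardRouter router = new ShardRouter(
                Arrays.asList("users-db-0", "users-db-1", "users-db-2"));
        for (String tenant : Arrays.asList("acme", "globex", "initech")) {
            System.out.println(tenant + " -> " + router.shardFor(tenant));
        }
    }
}

// Maps a tenant to exactly one shard so that no single database holds everything.
class ShardRouter {
    private final List<String> shards;

    ShardRouter(List<String> shards) {
        if (shards.isEmpty()) {
            throw new IllegalArgumentException("at least one shard is required");
        }
        this.shards = Collections.unmodifiableList(new ArrayList<>(shards));
    }

    // String.hashCode() is specified by the language, so the same tenant always
    // lands on the same shard. floorMod keeps the index non-negative even when
    // the hash code is negative.
    int shardIndexFor(String tenantId) {
        return Math.floorMod(tenantId.hashCode(), shards.size());
    }

    String shardFor(String tenantId) {
        return shards.get(shardIndexFor(tenantId));
    }
}
```

A known limitation of plain modulo routing is that changing the number of shards remaps most tenants; a lookup table or consistent hashing are the usual alternatives when shards need to be added over time.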

We considered both MySQL and Postgres and in the end chose MySQL, which has served us well. It has closed the functional gap in recent years and has a broader community of support. Our databases have performed well and we’ve received great support. It’s also easy to find people who are familiar with working on it.

Web app server programming language: Java, Ruby, Scala, Perl, PHP? With this choice we optimized for two things. First was a well-understood, stable runtime environment. Java is at least as good as the alternatives here. I still have nightmares about working with languages with “newer” runtimes a few years ago.

Second was picking a language with broad tool and framework support, and one that most, if not all, great engineers are familiar with. Java is tops on both counts -- especially in the area of testability and test frameworks -- something very important to us.

Looking ahead, Scala is interesting. It has a rock solid runtime - it compiles to JVM bytecode - and powerful, clean syntax. But it still lacks broad tool and framework support, although folks are working to change that.

Web Container: JBoss, Tomcat, Resin? At my previous job we used Caucho Resin and it worked well. That choice predated broad Tomcat adoption though. At Okta we've gone with Tomcat, the most broadly adopted and well understood Java container. We’ve been very happy with our choice.

Web framework: Spring, JSF, Struts, Wicket, Hibernate? This was one of the more interesting decisions we made. The first debate was whether we should use a framework at all, or roll our own. The argument against a framework goes something like -- sure, frameworks make the first 80% of what you’re doing simple, but the last 20% - the really tricky stuff - is so difficult to fit into the framework that you’re better off rolling your own from the start. Another argument is that frameworks take time to learn and are harder to debug - further degrading any productivity advantage they may provide in the first 80%.

The counter argument (pro framework) is that it’s really more like 95% of the work that they greatly simplify. Also, the good frameworks provide hooks and callouts to make the last 5% (again the really tricky stuff) reasonably straightforward.
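
For readers who haven't met the term, a “hook” here means a named extension point that the framework calls at a fixed moment in its own flow. The sketch below is a generic template-method illustration in plain Java. `RequestHandler` and `AuditingHandler` are hypothetical stand-ins and do not correspond to classes in Spring or any other framework.

```java
import java.util.Locale;

// Demo entry point, declared first so the file can be run directly.
class HookDemo {
    public static void main(String[] args) {
        System.out.println(new RequestHandler().handle("hello"));  // prints HELLO
        System.out.println(new AuditingHandler().handle("hello")); // prints [audited] HELLO
    }
}

// Stand-in for framework code: it owns the common flow and exposes two hooks.
class RequestHandler {
    // The common path that most callers never need to change.
    final String handle(String input) {
        String prepared = beforeHandle(input);
        String result = prepared.toUpperCase(Locale.ROOT);
        return afterHandle(result);
    }

    // Hook called before the built-in work; the default does nothing.
    protected String beforeHandle(String input) {
        return input;
    }

    // Hook called after the built-in work; the default does nothing.
    protected String afterHandle(String output) {
        return output;
    }
}

// Stand-in for application code covering the tricky remainder: it customizes
// behavior by overriding hooks instead of rewriting the whole flow.
class AuditingHandler extends RequestHandler {
    @Override
    protected String beforeHandle(String input) {
        return input.trim();
    }

    @Override
    protected String afterHandle(String output) {
        return "[audited] " + output;
    }
}
```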

We first built a prototype of our service without using any frameworks. We wanted to quickly validate some basic assumptions and the first couple of engineers weren’t experts in any of the frameworks. We wanted to avoid the learning curve. Through the prototype phase, we learned that much of the work we needed to do was supported by frameworks (more like 95% vs 80%). As we brought on more engineers, the learning curve for our own framework was equal to or greater than that of the well-established Java frameworks that most people know.

When we decided to adopt frameworks, we first chose Hibernate. This was a no-brainer as it is basically the only choice. It was interesting that there are so many different choices at the MVC / UI layer, but only Hibernate at the object relational mapping layer.

We briefly tried JSF but it wasn’t a good fit. It was too heavyweight and its server-side component model doesn’t fit well with modern AJAX-style applications.

We chose the Spring framework. It is mature, broadly adopted and supported. Its MVC abstraction is clean and lends itself to highly testable code. There is a learning curve for folks that haven’t worked with it, but overall it has met our needs well and we’re happy with the choice.
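
To show why this style of MVC is easy to test, here is a minimal sketch. It deliberately uses plain Java rather than Spring's own annotations or classes, so it only mirrors the shape of the pattern: a controller is an ordinary object whose collaborators are passed in, and whose handler method fills a model and returns a logical view name. `ProfileController`, `UserDirectory` and the view names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Demo entry point, declared first so the file can be run directly.
class ControllerDemo {
    public static void main(String[] args) {
        // A hand-written fake replaces the real directory; no server or database is needed.
        UserDirectory fake = login -> "alice".equals(login) ? "Alice Example" : null;
        ProfileController controller = new ProfileController(fake);

        Map<String, Object> model = new HashMap<>();
        String view = controller.showProfile("alice", model);
        System.out.println(view + " " + model);
    }
}

// Hypothetical collaborator; returns null when the login is unknown.
interface UserDirectory {
    String displayNameFor(String login);
}

// Plain object with constructor-injected dependencies, so a test can build it directly.
class ProfileController {
    private final UserDirectory directory;

    ProfileController(UserDirectory directory) {
        this.directory = directory;
    }

    // Fills the model and returns a logical view name. Rendering happens elsewhere,
    // which is what keeps this method testable without a web container.
    String showProfile(String login, Map<String, Object> model) {
        String displayName = directory.displayNameFor(login);
        if (displayName == null) {
            return "notFound";
        }
        model.put("displayName", displayName);
        return "profile";
    }
}
```

Because the controller never touches the HTTP request or the rendering engine directly, a unit test only has to check the returned view name and the contents of the model.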

The big lesson here is that when choosing a framework, it is important that it makes nearly everything you want to do easier, and everything else possible and reasonable. I believe frameworks need to mature and to have their designs honed over a number of years before they can do this. Pick a newer framework at your peril.

Another lesson - which may be obvious from the choices we’ve made - is to pick an open source framework. It’s invaluable to be able to look at the source code to see exactly what the framework is doing. This removes much of the mystery and speeds your implementation -- especially in the areas where you’re working just at the edge of what the framework supports.

JavaScript Library: jQuery, YUI, GWT? For desktop clients, there are other client technologies besides HTML and JavaScript (Flex, Silverlight, etc.) but it was key that our site had an innovative design and was highly usable - much like a consumer website. HTML and JavaScript are definitely the way to go here. Mobile is a different story with all the innovation going into native platform SDKs.

We looked at YUI and GWT but picked jQuery, and it has worked out well. It is lightweight and easy to learn. It was a better fit because it focuses on the plumbing of AJAX interactions and navigating the DOM rather than on actual UI elements - those we wanted more control over. It also has a good ecosystem of plugins, which we’ve taken advantage of.

This is a brief summary of the decisions we’ve made, the reasoning behind them and a retrospective on how they’ve worked out. One thing I’ll say is that we try really hard not to be religious about technical decisions. We try to pick the best tool after weighing all the options. One of the many advantages of building a cloud service vs installed software is that we have more ability to change course if we make the wrong choice. We haven’t had to make any major changes, but it makes me feel better knowing we’ll be able to once that inevitable time comes. We can not only build our house on a rock, but keep it there over time.