Simple Go TLS, or SGT as we’re calling it, is an osquery endpoint management server written in Go and backed by AWS services, designed to take advantage of the native scaling, performance, and reliability of the AWS cloud environment.
At Okta, osquery has become a powerful part of our security monitoring toolset. For those unfamiliar, osquery is an open-source endpoint monitoring tool that exposes the operating system as a SQL database, so teams can query the operating system to gather information. When we first started using osquery, we managed our deployments and configurations with Chef, logging locally and shipping logs off machines via a log forwarder. As our deployments grew rapidly, so did our need for more granular control, faster configuration changes, and ad-hoc queries.
To meet these needs, we had to change the way we used osquery, moving from static configurations provided by Chef to managing osquery via a TLS server endpoint. In this architecture, osquery clients call back to a central server, which provides them with the configuration, queries, etc. that they will run. Because clients check in frequently, it becomes possible to change configurations, introduce new query packs, or run on-demand queries to meet monitoring needs or perform live forensics.
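To make the check-in model concrete, here is a minimal sketch in Go of a configuration endpoint. It is an illustration only, not SGT's code: the `/config` path, the `knownNodes` map, the example node key, and the sample schedule are all assumptions. The `node_key` and `node_invalid` fields follow osquery's remote API as we understand it. The example exercises the handler in-process with `httptest`, so it runs without certificates; a real server would be served over TLS, for example with `http.ListenAndServeTLS`.

```go
package main

import (
	"encoding/json"
	"fmt"
	"log"
	"net/http"
	"net/http/httptest"
	"strings"
)

// configRequest is the body a client sends when asking for its configuration.
type configRequest struct {
	NodeKey string `json:"node_key"`
}

// knownNodes is a hypothetical stand-in for a datastore of enrolled clients.
var knownNodes = map[string]bool{"example-node-key": true}

// buildConfig returns the payload for a node key.
// Unknown nodes are flagged as invalid so that they re-enroll.
func buildConfig(nodeKey string) map[string]any {
	if !knownNodes[nodeKey] {
		return map[string]any{"node_invalid": true}
	}
	return map[string]any{
		"node_invalid": false,
		"schedule": map[string]any{
			"listening_ports": map[string]any{
				"query":    "SELECT pid, port, address FROM listening_ports;",
				"interval": 60,
			},
		},
	}
}

// configHandler decodes the check-in request and writes the config as JSON.
func configHandler(w http.ResponseWriter, r *http.Request) {
	var req configRequest
	if err := json.NewDecoder(r.Body).Decode(&req); err != nil {
		http.Error(w, "bad request", http.StatusBadRequest)
		return
	}
	w.Header().Set("Content-Type", "application/json")
	if err := json.NewEncoder(w).Encode(buildConfig(req.NodeKey)); err != nil {
		log.Printf("encode config: %v", err)
	}
}

func main() {
	// Simulate one client check-in against the handler, in-process.
	body := strings.NewReader(`{"node_key": "example-node-key"}`)
	req := httptest.NewRequest(http.MethodPost, "/config", body)
	rec := httptest.NewRecorder()
	configHandler(rec, req)
	fmt.Println(rec.Code)
	fmt.Print(rec.Body.String())
}
```

Because every check-in goes through a handler like this, changing what `buildConfig` returns changes what clients run on their next check-in, with no need to push files to each machine.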
Osquery servers are already widely available from a good number of companies, and many are open source, but after running proofs of concept with all of them we found that none was an ideal fit for the requirements we had in mind:
- Built for the cloud
- Keep system maintenance manageable and allow us to describe our infrastructure as code
- Easily scale to thousands of clients, automatically
- Provide a robust API to script automation and integrations against
- Utilize AWS services to allow us to maintain as few servers as possible
- Support multiple environments (such as Prod, Dev, Corp, etc.)
- Provide a logging mechanism for clients outside of the office to send logs securely back to our team
- Easily integrate with other security tooling, such as StreamAlert, to provide analysis and alerting
- Easy rotations of stored secrets with zero downtime
We decided it would be worth it, in the long run, to invest in building something that met our needs explicitly rather than trying to force an already established project into the requirements we had. Today we're releasing the fruits of those labors to the security and open-source communities.
To get started right now with Okta’s osquery server, head over to GitHub. Read on for an overview of the architecture and design, as well as the motivations behind our project.
Design and Architecture
When our team started this project, we had a few specific guidelines and goals to achieve. As a security team, we're constantly trying to improve our security posture, which means we're never really "done". There's always something else we can be improving. Any time we spend maintaining a system or managing databases and servers is therefore time not spent building more and better security.
Bottom line, the more we can push our tooling into managed services or serverless computing, the more we can focus on what’s critical to our business.
Because osquery clients check in with the server so frequently, we needed something that could handle hundreds of requests concurrently without consuming a significant volume of system resources. This performance requirement is the reason we finally settled on Go as our language of choice. Goroutines are lightweight and bring concurrency to a project with an excellent and easy-to-use set of APIs.
Initially, we contemplated using Lambda functions behind API Gateway in AWS. Lambda functions are great for running short-lived tasks, but they carry an inherent amount of overhead because code must be loaded every time a function is initialized, and each call to API Gateway means a 1:1 call to a Lambda function. Due to this limitation, we settled on a more traditional load balancer and EC2 instance architecture to host the API, which allowed us to take advantage of the fantastic performance of goroutines in handling requests.
In osquery's client-server architecture, osquery clients make frequent calls back to the server's API. It’s not uncommon for a single client to call into the server once every 10 seconds looking for new ad-hoc queries to run. These requests are small and require very little compute power to handle, but even at 1,000 clients (a relatively small deployment in cloud architectures) this amounts to 6,000 requests per minute. Add in requests from other automations, another 2,000-3,000 clients, and normal user interaction, and suddenly Lambda + API Gateway starts to look like a major performance bottleneck, and not a cheap one.
Backing our EC2 instances, the rest of SGT is built upon managed services from AWS, giving us excellent performance, scalability, and reliability:
- DynamoDB stores all configurations, packs, queries, clients, users and distributed queries, scaling up and down as needed to meet demand
- S3 stores our binaries, which are copied to our EC2 instances when the instances are launched. S3 also provides secondary log storage.
- SSM Parameter Store provides storage for app secrets
- Elasticsearch Service provides log storage and searching
- Kinesis Data Firehose ships logs over TLS to both S3 and Elasticsearch from anywhere, regardless of network
- A load balancer and autoscaling group complete the package, scaling and managing our instance needs and making sure instances are healthy
The final piece of the project is the code that describes and builds the SGT infrastructure. To maintain and tie together all these pieces, we turned to Terraform to build and update SGT. Every AWS resource SGT uses is described in Terraform code and configured through a single config file for each environment we want to deploy to. Terraform makes standing up new environments simple. However, managing multiple environments and orchestrating the pieces that make up SGT can get a bit complicated. To enable easy deployment, we wrapped our Terraform modules and scripts in a deploy command. This allows us to stand up the entire deployment with a single command:
./sgt -deploy -env prod -all
Today at Okta
We've come to rely on osquery for the visibility it provides across our environments. From monitoring process and iptables changes to tracking installed packages and disk encryption, osquery gives us an incredible amount of insight into what our fleets of machines are doing. Adding SGT has allowed us to move to a fast, flexible, and responsive osquery deployment.
We hope that by releasing SGT, we can help other teams quickly and reliably deploy a management server for osquery, allowing anyone to take advantage of the additional benefits having a TLS server provides.
SGT is still evolving as a project, and we’re actively seeking contributions to improve it. We’d love for anyone interested to fork the project on GitHub and submit pull requests. We hope this will turn into a project other security teams can rely on.