We are looking for an experienced Site Reliability Engineer to join our Technical Operations team. At Okta, we are "Always On." The core of that starts with this team, ensuring that customers never worry about the Okta service. They strive to build the most reliable and performant systems on the planet.
As a member of the FAST team at Okta, you'll be at the center of our commitment to Always On. Our responsibilities span a number of Okta's most crucial services, and we have significant ownership of our customer-facing infrastructure. We're a collaborative, supportive, and highly skilled team of engineers who take our role seriously and craft tooling and playbooks to uphold Okta's legendary reliability.
What You'll Do:
- Be a collaborative member of a team that is responsible for Okta's production infrastructure, with a focus on scaling our impact and lowering our operational overhead.
- Promote and apply best practices for building scalable and reliable tooling across engineering.
- Be a subject matter expert and partner with our team at Amazon Web Services (AWS).
- Design, build, run, and monitor Okta's production infrastructure.
- Drive initiatives to evolve our current platform to increase efficiency and keep it in line with current standards and best practices.
- Respond to production incidents and determine how we can prevent them in the future.
- Identify and automate manual processes.
- Support a 24x7 online environment as part of an on-call rotation.
- Develop and maintain technical documentation, runbooks, and procedures.
Qualifications for the role:
- 3+ years of experience managing large-scale AWS deployments.
- Familiarity with running large codebases in a containerized environment, and the tradeoffs and benefits of such.
- Real-world experience running a modern web stack in production, including HTTP tiers such as haproxy, nginx, or Envoy, application tiers such as Tomcat or Jetty, and NoSQL data or cache tiers such as Elasticsearch or Redis.
- Knowledge and experience with persistent data stores likely to be utilized by a large web application, both SQL and NoSQL.
- Excellent Linux fundamentals.
- Exposure to FedRAMP, SOC 2, or other compliance programs.
- 3+ years of experience with automating systems and infrastructure via Ansible, Chef, or Terraform.
- Experience automating and running large-scale production services in AWS or other cloud providers.
- Ability to code to a good standard in any programming language, especially Ruby, Python, or Go, using source control and Agile methodologies.
- Excellent written and oral communication skills, with the ability to influence others.
Education and Training:
- BS in Computer Science (a plus) or relevant experience.
Okta is rethinking the traditional work environment, giving our employees the flexibility to be their most creative and successful selves, no matter where they are located. We enable a flexible approach to work, meaning you can work from the office or from home, regardless of where you live. Okta invests in the best technologies and provides flexible benefits and collaborative work environments and experiences, empowering employees to work productively in the setting that best suits their needs. Find your place at Okta: https://www.okta.com/company/careers/
Okta is an equal opportunity employer.