About the Team
The Infrastructure Engineering function sits within IT and is responsible for reliably building, deploying, and operating the on-prem and hybrid environments that power internal services and critical R&D.
This is an early, high-leverage technical role focused on applying strong Site Reliability Engineering discipline to environments where uptime, safety, recoverability, and security are non-negotiable. This person helps replace bespoke, one-off infrastructure with standardized infrastructure-as-code building blocks that compound reliability and operational leverage as OpenAI scales.
About the Role
We are looking for an experienced Security Reliability Engineer to design, build, and operate reliable, secure, and scalable infrastructure that underpins identity, access, endpoint, and shared platform services across the company.
In this role, you will be a senior technical owner for infrastructure and identity systems end to end, from architecture and implementation through policy enforcement, upgrades, recovery, and day-two operations. You will build durable, production-grade platforms that remove operational friction, enforce security by default, and enable teams to move faster with confidence.
This role is well suited for a hands-on senior engineer who thrives in ambiguity, enjoys owning complex systems end to end, and raises the reliability and security bar by replacing fragile implementations with standardized, repeatable infrastructure.
This role is based in our San Francisco HQ and requires in-office presence.
In this role, you will
- Design, build, and operate reliable infrastructure across on-prem, hybrid, shared, and product-adjacent environments.
- Establish standardized infrastructure patterns that replace bespoke implementations with repeatable, auditable, secure-by-default systems.
- Own the lifecycle of critical infrastructure platforms, including provisioning, deployment, upgrades, patching, recovery, and long-term reliability.
- Build infrastructure-as-code and configuration management using tools such as Terraform, Chef, and Ansible.
- Mature identity-adjacent and policy-enforced infrastructure, including Microsoft Entra and Azure management patterns.
- Build observability, alerting, and incident response mechanisms that improve availability, recoverability, and operational confidence.
- Automate high-toil and high-risk workflows with guardrails, progressive rollout patterns, and safe rollback paths.
- Translate incidents, design reviews, and operational learnings into durable fixes, reusable patterns, and stronger technical standards.
You might thrive in this role if you
- Have 10+ years of hands-on experience operating and architecting mission-critical infrastructure in high-reliability environments
- Have been the senior technical owner for the design and maturation of complex on-prem, hybrid, or cloud-integrated systems, setting durable architectural patterns used by multiple teams
- Apply Site Reliability Engineering principles at scale, using observability, automation, and incident learnings to materially reduce risk and operational toil
- Operate comfortably in ambiguity, making sound architectural decisions under pressure while staying close to technical detail
- Influence cross-functional partners across security, identity, network, and platform teams through architecture, implementation, operational data, and clear technical writing
- Have experience operating infrastructure for R&D or specialized labs, manufacturing, or other safety-critical environments where uptime and recoverability are essential
- Have experience with fleet, endpoint, or virtual desktop platforms such as FleetDM, Chef, or Azure Virtual Desktop
- Have experience partnering closely with identity or security engineering teams on hardened, policy-enforced infrastructure at scale
About OpenAI
OpenAI is an AI research and deployment company dedicated to ensurin