The Fleet team at OpenAI supports the computing environment that powers cutting-edge research and product development. This role builds systems to manage hardware, configurations, vendors, and the people who interact with the infrastructure, all to advance AI research.
Requirements
- Strong software engineering skills with experience in large-scale infrastructure environments.
- Broad knowledge of cluster-level systems (e.g., Kubernetes, CI/CD pipelines, Terraform, cloud providers).
- Deep expertise in server-level systems (e.g., systemd, containerization, Chef, Linux kernels, firmware management, host routing).
- Passion for optimizing the performance and reliability of large compute fleets.
- Ability to thrive in dynamic environments and eagerness to solve complex infrastructure challenges.
- Commitment to automation, efficiency, and continuous improvement in everything you build.
Responsibilities
- Design and build systems to manage both cloud and bare-metal fleets at scale.
- Develop tools that integrate low-level hardware metrics with high-level job scheduling and cluster management algorithms.
- Leverage LLMs to coordinate vendor operations and optimize infrastructure workflows.
- Automate infrastructure processes, reducing repetitive toil and improving system reliability.
- Collaborate with hardware, infrastructure, and research teams to ensure seamless integration across the stack.
- Continuously improve tools, automation, processes, and documentation to enhance operational efficiency.
Other
- Hybrid work model with 3 days in the office per week.
- Relocation assistance offered to new employees.
- Equal opportunity employer that does not discriminate on the basis of race, religion, color, national origin, sex, sexual orientation, age, veteran status, disability, genetic information, or any other applicable legally protected characteristic.
- Background checks for applicants will be administered in accordance with applicable law.
- Committed to providing reasonable accommodations to applicants with disabilities.