OpenAI is building systems to manage large-scale computing environments for AI research and product development, ensuring high availability, performance, and efficiency while streamlining the research user experience.
Requirements
- Strong software engineering skills with experience in large-scale infrastructure environments.
- Broad knowledge of cluster-level systems (e.g., Kubernetes, CI/CD pipelines, Terraform, cloud providers).
- Deep expertise in server-level systems (e.g., systemd, containerization, Chef, Linux kernels, firmware management, host routing).
Responsibilities
- Design and build systems to manage both cloud and bare-metal fleets at scale.
- Develop tools that integrate low-level hardware metrics with high-level job scheduling and cluster management algorithms.
- Leverage LLMs to coordinate vendor operations and optimize infrastructure workflows.
- Automate infrastructure processes, reducing repetitive toil and improving system reliability.
- Collaborate with hardware, infrastructure, and research teams to ensure seamless integration across the stack.
- Continuously improve tools, automation, processes, and documentation to enhance operational efficiency.
Other
- This role is based in San Francisco, CA.
- We use a hybrid work model of 3 days in the office per week.
- We offer relocation assistance to new employees.
- You are passionate about optimizing the performance and reliability of large compute fleets.
- You thrive in dynamic environments and are eager to solve complex infrastructure challenges.