OpenAI is building systems to manage large-scale computing environments for AI research and product development, ensuring high availability, performance, and efficiency while streamlining the research user experience.
Requirements
- Strong software engineering skills with experience in large-scale infrastructure environments.
- Broad knowledge of cluster-level systems (e.g., Kubernetes, CI/CD pipelines, Terraform, cloud providers).
- Deep expertise in server-level systems (e.g., systemd, containerization, Chef, Linux kernels, firmware management, host routing).
Responsibilities
- Design and build systems to manage both cloud and bare-metal fleets at scale.
- Develop tools that integrate low-level hardware metrics with high-level job scheduling and cluster management algorithms.
- Leverage LLMs to coordinate vendor operations and optimize infrastructure workflows.
- Automate infrastructure processes, reducing repetitive toil and improving system reliability.
- Collaborate with hardware, infrastructure, and research teams to ensure seamless integration across the stack.
- Continuously improve tools, automation, processes, and documentation to enhance operational efficiency.
Other
- This role is based in San Francisco, CA.
- We use a hybrid work model of 3 days in the office per week.
- We offer relocation assistance to new employees.
- You are passionate about optimizing the performance and reliability of large compute fleets.
- You thrive in dynamic environments and are eager to solve complex infrastructure challenges.