The company is looking for an experienced software engineer to build and maintain large-scale computation platforms, focusing on backend systems that efficiently orchestrate workloads, route requests, and manage resources while ensuring reliability and scalability with minimal operational load.
Requirements
- Deep experience building distributed compute platforms, preferably with Python
- Strong foundation in managing both cloud and bare-metal infrastructure
- Solid understanding of K8s and of running CI/CD on it
- Deep expertise in backend systems that orchestrate workloads and route requests efficiently while respecting capacity and resource constraints
- Strong understanding of foundational cloud infrastructure and Linux provisioning and management tools
- Ability to achieve reliability and scale with minimal operational load
Responsibilities
- Develop and maintain our core Python platform, which handles request routing, orchestration of AI workloads, GPU server capacity management, observability, authentication, rate limiting, and more
- Develop and maintain our infrastructure layer, where we use Terraform, Ansible, and provider APIs to manage our fleet of GPU workers
- Own K8s, FluxCD, Nomad, Prometheus, Thanos, Grafana, Loki, distributed networking storage, and other technologies that underpin our platform
- Create the vision and lay the foundation for where our infrastructure should go over the next one, two, and five years
Other
- Excellent communication skills
- Self-starter who executes quickly, takes ownership, and constantly seeks improvement
- We offer visa sponsorship and will help you relocate to San Francisco.