CoreWeave is looking to architect, build, and operate the next generation of reliable, secure, and massively scalable infrastructure for its GPU-driven data centers, which power advanced AI and large-scale computing workloads.
Requirements
- Expertise in Go and proven experience building REST/gRPC APIs for mission-critical platforms.
- Strong background in architecting and scaling cloud-native Kubernetes infrastructure and distributed services.
- Hands-on experience with observability stacks (Prometheus, Grafana, PromQL), CI/CD pipelines, and operating large fleets of GPU servers.
- Track record of leading incident response, postmortems, and driving robust service reliability.
- Working knowledge of Kafka, ClickHouse, and CockroachDB (CRDB).
- Familiarity with DMTF standards, Redfish APIs, and GPU server hardware.
Responsibilities
- Provide technical leadership in designing, architecting, and operating large-scale infrastructure services for GPU servers, with a focus on security, reliability, and scalability.
- Build and enhance infrastructure services and automation, including inventory management systems and lifecycle management solutions using open source technologies.
- Drive strategic direction for infrastructure automation, lifecycle management, and service orchestration, making MetalDev core services more scalable and resilient.
- Define best practices for API development (REST/gRPC), distributed databases, and Kubernetes orchestration, and mentor engineers in applying them.
- Partner with hardware, software, and operations teams to align infrastructure work with business priorities.
- Contribute to open source communities (e.g., Go, Redfish) through collaboration and technical thought leadership.
- Lead and improve CI/CD pipelines for hardware compliance, firmware management, and data systems.
Additional Qualifications
- 8+ years of software engineering experience with a strong focus on infrastructure, cloud engineering, and distributed databases, particularly within large-scale data center and cloud environments.
- Proven success in mentoring engineers, leading technical projects, and influencing engineering strategy across teams.
- Experience contributing to and collaborating with open source communities.
- Skilled in applying a data-driven approach to reliability, optimization, and continuous improvement.
- Excellent communicator able to work effectively with both technical and non-technical stakeholders.