CoreWeave is seeking a Sr. Infrastructure Engineer to develop, deploy, and monitor services that manage its bare-metal infrastructure.
Requirements
- Understanding of cloud platforms (e.g., Kubernetes, AWS, GCP) and basic knowledge of cloud infrastructure
- Familiarity with incident management practices and frameworks (e.g., ITIL, SRE best practices)
- Proficiency with Go
- Prior experience with Prometheus / Grafana
- Previous experience deploying containerized applications using Kubernetes
- Experience with Redfish-based projects
Responsibilities
- Assist in incident response by helping to identify and resolve service disruptions quickly
- Monitor system performance and health using tools like Prometheus and Grafana
- Implement automation and process improvements to enhance efficiency and reduce manual intervention in incident detection and recovery
- Collaborate with engineers across teams to improve platform reliability, resilience, and disaster recovery
- Create CI/CD pipelines
- Ensure smooth operation of all aspects of the server hardware lifecycle, from provisioning to end-of-life
- Build dashboards and alerts that make operational troubleshooting efficient
Other
- 4 years of experience in cloud operations, site reliability engineering (SRE), or related technical roles
- Excellent documentation skills and attention to detail
- Strong analytical and problem-solving abilities
- Experience serving on an on-call rotation supporting production services
- Applicants must have work authorization that does not require sponsorship from the company now or in the future