CoreWeave is developing an Observability product suite to give its customers insights into their AI/ML workloads. This involves designing, building, and serving these insights at scale, using AI technologies to enable rapid troubleshooting for customers' AI workloads.
Requirements
- Prior experience in building telemetry solutions, such as logging, metrics and tracing, is a plus.
- Practical knowledge of agentic tools and systems is a plus.
- Understanding of cloud computing infrastructure using Kubernetes is a plus.
Responsibilities
- Define and drive CoreWeave’s Observability insights roadmap and strategy.
- Lead and grow a high-performing team of software engineers and managers.
- Use AI technologies to build solutions that offer insights to customers for rapid troubleshooting of their AI workloads.
- Champion initiatives to improve reliability, durability, and self-healing capabilities of Observability metrics, and assume operational responsibilities.
- Develop operational review practices to assess performance against targets and iterate on those targets.
- Mentor and guide engineering teams on best practices in product engineering, fostering a customer-focused approach to systems design and technical excellence.
Other
- 7+ years of experience in infrastructure or cloud systems.
- 3+ years in engineering leadership roles, including hiring, scaling, and mentoring teams.
- Proven track record of building and managing product insights using AI tools and techniques.
- Strong communication and interpersonal skills, with the ability to convey storage engineering strategies and practices to technical and non-technical audiences.
- Prior experience managing through a manager.