CoreWeave is seeking Software Engineers to build, maintain, and optimize highly scalable, reliable, and secure systems for their AI Hyperscaler platform. The Observability team specifically needs to deploy and maintain critical infrastructure including logging, tracing, and metrics platforms, as well as the pipelines that feed them, to ensure the performance and resilience of their accelerated computing solutions.
Requirements
- Proficiency in at least one programming or scripting language (e.g., Python, Go).
- Experience working in Kubernetes, containerization, and microservices architectures.
- Experience being on call, including triaging production issues and escalating them when appropriate.
- History of consuming observability systems at scale.
- Experience running a production observability database or tool (e.g., ClickHouse, Elastic, Loki, VictoriaMetrics, Prometheus, Thanos, OpenTelemetry, and/or Grafana).
- Familiarity with infrastructure-as-code tools like Terraform.
- Hands-on experience using data-streaming systems for observability pipelines.
Responsibilities
- Design, build, and maintain logging, tracing, and/or metrics platforms with moderate supervision.
- Develop and refine monitoring and alerting to enhance system reliability.
- Assist engineers across CoreWeave in developing effective usage patterns for Observability systems.
- Manage production and pre-production clusters, building tools to enable development teams to follow best practices.
Other
- 2-5 years of experience in Software Engineering, Site Reliability Engineering, DevOps, or a related field.
- Excellent problem-solving, analytical, and communication skills.
- Exposure to modern testing frameworks and progressive deployment strategies.
- The base salary range for this role is $109,000 to $145,000.
- This position requires access to export controlled information.