Lambda is looking to build and operate large-scale monitoring systems for its AI cloud product suite, deploying observability solutions across the stack to keep offerings reliable and instantly detect issues in high-performance AI clusters.
Requirements
- Experience with a wide variety of modern open-source observability software.
- Strong background in software engineering and the SDLC.
- Extensive experience with site reliability engineering and ability to champion improved SRE practices.
- Experience building a high-performance team through deliberate hiring, upskilling, performance management, and expectation setting.
- Experience with Kubernetes and with designing scalable distributed systems.
Responsibilities
- We build and operate mission-critical platforms for metrics, logs, and traces based on both open-source software and systems developed in-house.
- We design observability solutions for large-scale AI clusters running the latest GPU, networking, and storage technologies.
- We engage across the company to promote best practices, help teams adopt our platforms, and enable applications that require observability data.
- Work with the engineering team to drive strategy for Lambda's internal and customer-facing observability solutions.
- Improve observability of AI infrastructure and develop new monitoring solutions as new products are introduced.
- Lead the team in the continued development of our existing metrics solutions based on the Prometheus and OpenTelemetry ecosystems.
- Lead the team in delivering new logging and tracing solutions based on ClickHouse.
Other
- Note: This position requires presence in our San Francisco or Seattle office location 4 days per week; Lambda’s designated work from home day is currently Tuesday.
- 10+ years of experience in observability systems or platform engineering with at least 3 years in a management or lead role.
- Demonstrated experience leading a team of engineers and SREs on complex, cross-functional projects in a fast-paced startup environment.
- Strong project management skills, leading planning, project execution, and delivery of team outcomes on schedule.
- Experience driving cross-functional engineering management initiatives (event coordination, strategic planning, and coordination of large projects).