Lambda, The Superintelligence Cloud, builds Gigawatt-scale AI Factories for Training and Inference. Lambda’s mission is to make compute as ubiquitous as electricity and give every person access to artificial intelligence. One person, one GPU. The HPC Deployments team is responsible for deploying cutting-edge NVIDIA GPU clusters on time, at scale, and with 100% quality and correctness.
Requirements
- Extensive experience in HPC or large-scale infrastructure, including at least 3 years in a leadership or management role.
- Excellent problem-solving and troubleshooting skills.
- Comfort leading and mentoring HPC engineers on cluster deployments as needed.
- Experience building a high-performance team through deliberate hiring, upskilling, planned skills redundancy, performance management, and expectation setting.
- Experience with Linux systems administration, automation, and scripting/coding.
- Experience with containerization technologies (Docker, Kubernetes)
- Experience working with the technologies that underpin our cloud business (GPU acceleration, virtualization, and cloud computing)
Responsibilities
- Lead and grow a distributed top-talent team of HPC engineers responsible for the configuration, validation, and deployment of large-scale GPU clusters.
- Identify opportunities for efficiency improvements in the tools, processes, and automation that the team relies on day to day.
- Stay current on the latest HPC/AI technologies and best practices.
- Participate in qualifying new technologies for use in our production deployments.
- Work cross-functionally with teams in the organization to deliver projects and deployments on time, ensuring alignment across stakeholders.
- Ensure stakeholders have clear visibility into deployment progress, risks, and outcomes.
- Drive outcomes by managing staff allocations, project priorities, deadlines, and deliverables.
- Conduct regular one-on-one meetings, provide constructive feedback, and support career development for team members.
Other
- This position requires presence in our San Francisco/San Jose or Seattle office location 4 days per week; Lambda’s designated work from home day is currently Tuesday.