Anthropic is seeking talented and experienced Infrastructure Engineers to support the development, scaling, and maintenance of cutting-edge AI systems, in service of our mission to create safe and reliable AI that benefits humanity.
Requirements
- Strong proficiency in at least one programming language (e.g., Python, Rust, Go, Java)
- Deep knowledge of modern cloud infrastructure, including Kubernetes, Infrastructure as Code, AWS, and GCP
- Expertise in security and privacy best practices
- Experience with machine learning infrastructure such as GPUs, TPUs, or Trainium, as well as supporting networking infrastructure such as NCCL
- Low-level systems experience, such as Linux kernel tuning and eBPF
- Technical expertise: the ability to quickly understand systems design tradeoffs and keep track of rapidly evolving software systems
Responsibilities
- Lead the build-out of industry-leading AI clusters (thousands to hundreds of thousands of machines), partnering closely with cloud service providers on cluster build-out and required features
- Consult with stakeholders to deeply understand their infrastructure, data, and compute needs, and identify potential solutions to support frontier research and product development
- Set technical strategy and oversee the development of high-scale, reliable infrastructure systems
- Mentor top technical talent
- Design processes (e.g., postmortem review, incident response, on-call rotations) that help the team operate effectively and never fail the same way twice
Other
- At least a Bachelor's degree in a related field or equivalent experience
- Location-based hybrid policy: Currently, we expect all staff to be in one of our offices at least 25% of the time
- Visa sponsorship: We do sponsor visas! However, we aren't able to successfully sponsor visas for every role and every candidate
- Excellent communication skills to build consensus with stakeholders, both internally and externally
- Passion for supporting internal partners, such as research teams, and for understanding their needs