Anthropic needs an ML Infrastructure Engineering Manager to lead a team that builds and scales the systems powering its AI safety and trust mechanisms, ensuring its AI models operate safely and reliably at scale.
Requirements
- 4+ years of management experience leading technical teams focused on ML infrastructure, platform engineering, or distributed systems
- 8+ years of hands-on experience building production ML infrastructure, ideally in safety-critical domains like fraud detection, content moderation, or risk assessment
- Deep technical knowledge of ML serving platforms, feature stores, data pipelines, and distributed systems architecture
- Knowledge of modern ML frameworks, cloud platforms, and container orchestration in production environments
- Experience implementing automated testing, deployment, and monitoring systems for ML models in production
- Experience managing teams working on real-time, high-throughput systems with strict latency and reliability requirements
- Experience with compliance and security requirements for safety-critical applications
Responsibilities
- Set team vision and roadmap for ML infrastructure that powers Anthropic's safety and trust systems, ensuring scalability, reliability, and performance at production scale
- Lead a team of ML infrastructure and software engineers to build robust platforms supporting real-time safety evaluations, feature stores, model serving, and data pipelines
- Partner with Safeguards, Security, Research, and Product teams to identify infrastructure requirements and translate complex safety research into scalable production systems
- Drive technical strategy for ML infrastructure architecture, making key decisions about technology choices, system design, and platform evolution
- Maintain deep technical expertise in ML infrastructure, distributed systems, and safety-critical applications to provide technical leadership and guidance
- Collaborate across teams to ensure infrastructure supports rapid experimentation while maintaining production reliability and safety standards
- Champion engineering best practices including automated testing, deployment pipelines, monitoring, and incident response for safety-critical systems
Other
- Demonstrated ability to lead and manage high-performing technical teams through periods of rapid growth and scaling challenges
- Excellent communication skills, including the ability to translate complex technical concepts for various audiences, from individual contributors to executive leadership
- Strong project management skills, with the ability to balance multiple priorities and coordinate across cross-functional teams
- Experience managing teams that bridge research and production, with a track record of productionizing experimental systems
- Demonstrated passion for ensuring the responsible development and deployment of AI systems
- We require at least a Bachelor's degree in a related field or equivalent experience.
- Currently, we expect all staff to be in one of our offices at least 25% of the time. However, some roles may require more time in our offices.
- We do sponsor visas! However, we aren't able to successfully sponsor visas for every role and every candidate. But if we make you an offer, we will make every reasonable effort to get you a visa, and we retain an immigration lawyer to help with this.