Software Engineer, AI Reliability Engineering

Anthropic

$320,000 - $485,000

Aug 18, 2025

San Francisco, CA, US

Anthropic is seeking an engineer to define and achieve reliability metrics for all of its internal and external products and services, significantly improving the reliability of those services and using AI models to reengineer the way they work.

Requirements

Have extensive experience with distributed systems observability and monitoring at scale
Understand the unique challenges of operating AI infrastructure, including model serving, batch inference, and training pipelines
Have proven experience implementing and maintaining SLO/SLA frameworks for business-critical services
Are comfortable working with both traditional metrics (latency, availability) and AI-specific metrics (model performance, training convergence)
Have experience with chaos engineering and systematic resilience testing
Can effectively bridge the gap between ML engineers and infrastructure teams
Have experience operating large-scale model training infrastructure or serving infrastructure (>1000 GPUs)

Responsibilities

Develop appropriate Service Level Objectives for large language model serving and training systems, balancing availability/latency with development velocity.
Design and implement monitoring systems that cover availability, latency, and other salient metrics.
Assist in the design and implementation of high-availability language model serving infrastructure capable of handling the needs of millions of external customers and high-traffic internal workloads.
Develop and manage automated failover and recovery systems for model serving deployments across multiple regions and cloud providers.
Lead incident response for critical AI services, ensuring rapid recovery and systematic improvements from each incident.
Build and maintain cost optimization systems for large-scale AI infrastructure, focusing on accelerator (GPU/TPU/Trainium) utilization and efficiency.

Other

We require at least a Bachelor's degree in a related field or equivalent experience.
Currently, we expect all staff to be in one of our offices at least 25% of the time.
We do sponsor visas!
We greatly value excellent communication skills.