Job Board

Get Jobs Tailored to Your Resume

Filtr uses AI to scan 1000+ jobs and find postings that perfectly match your resume.

Software Engineer, AI Reliability Engineering

Anthropic

$320,000 - $485,000
Aug 18, 2025
San Francisco, CA, US

Anthropic is seeking a Software Engineer to define and achieve reliability metrics for all of its internal and external products and services, while significantly improving the reliability of those services and using AI models to reengineer the way they work.

Requirements

  • Have extensive experience with distributed systems observability and monitoring at scale
  • Understand the unique challenges of operating AI infrastructure, including model serving, batch inference, and training pipelines
  • Have proven experience implementing and maintaining SLO/SLA frameworks for business-critical services
  • Are comfortable working with both traditional metrics (latency, availability) and AI-specific metrics (model performance, training convergence)
  • Have experience with chaos engineering and systematic resilience testing
  • Can effectively bridge the gap between ML engineers and infrastructure teams
  • Have experience operating large-scale model training infrastructure or serving infrastructure (>1000 GPUs)

Responsibilities

  • Develop appropriate Service Level Objectives for large language model serving and training systems, balancing availability/latency with development velocity.
  • Design and implement monitoring systems including availability, latency and other salient metrics.
  • Assist in the design and implementation of high-availability language model serving infrastructure capable of handling the needs of millions of external customers and high-traffic internal workloads.
  • Develop and manage automated failover and recovery systems for model serving deployments across multiple regions and cloud providers.
  • Lead incident response for critical AI services, ensuring rapid recovery and systematic improvements from each incident.
  • Build and maintain cost optimization systems for large-scale AI infrastructure, focusing on accelerator (GPU/TPU/Trainium) utilization and efficiency.

Other

  • We require at least a Bachelor's degree in a related field or equivalent experience.
  • Currently, we expect all staff to be in one of our offices at least 25% of the time.
  • We do sponsor visas!
  • We greatly value excellent communication skills.