Staff Software Engineer, Reliability

Metropolis

$180,000 - $200,000

Dec 16, 2025

New York, NY, US

Metropolis is looking to ensure system reliability at scale for its mission-critical mobility infrastructure, which handles real-time payment processing, customer authentication, and parking facility operations and requires 99.9%+ uptime across all services.

Requirements

8+ years of backend software engineering experience with deep focus on distributed systems and platform infrastructure
Expert-level Java proficiency with deep understanding of JVM performance, concurrency, and ecosystem tooling
Production experience with microservices architecture, container orchestration (Kubernetes), and cloud platforms (AWS)
Strong systems thinking with proven ability to design and implement large-scale, high-availability distributed systems that handle significant load
Observability expertise including hands-on production experience with metrics, logging, tracing, and alerting systems in high-load environments
Database and data systems knowledge including relational databases, event streaming (Kafka, SQS), caching strategies, and data consistency patterns
Experience with AI-powered development tools such as Claude Code, GitHub Copilot, or similar agentic coding tools for enhanced productivity

Responsibilities

Own the overall reliability posture for the Metropolis platform, establishing practices, metrics, and systems that ensure 99.9%+ uptime across all services
Design and implement automatic failover mechanisms for critical external dependencies (Twilio for SMS/voice, Stripe for payments) with circuit breakers, retry policies, and degraded mode operations
Architect and build active-passive or active-active regional deployment strategies with database replication, automated failover, and DNS-based traffic routing, including disaster recovery planning and testing
Establish comprehensive monitoring using Datadog for APM, logs, and metrics correlation; implement synthetic monitoring, SLO-based alerting, on-call rotation, and escalation policies; build service health dashboards that show customer impact
Own the incident management process, including workflows, tooling, post-mortem culture, runbook automation, and initiatives that reduce mean time to recovery (MTTR) from detection to resolution
Drive adoption of resilience patterns across all services including health checks, graceful degradation, feature flags, rate limiting, backpressure mechanisms, and chaos engineering practices
Build and maintain local mirrors for critical dependencies (Maven/NPM/Docker registries) with artifact caching, dependency pinning, and vulnerability scanning to prevent build failures from upstream outages

Other

Excellent technical communication, with the ability to design and document complex systems, lead technical discussions, and collaborate across multiple teams
Ability to work on-site at least four days a week, fostering organic interactions that spark creativity and connection
Bachelor's, Master's, or Ph.D. degree in Computer Science or related field (not explicitly mentioned but implied)
8+ years of experience (as stated under Requirements)
Must be authorized to work in the United States (not explicitly mentioned but implied)