Staff Software Engineer, Reliability

Metropolis

$180,000 - $200,000
Dec 16, 2025
New York, NY, US

Metropolis is hiring a Staff Software Engineer to ensure reliability at scale for its mission-critical mobility infrastructure, which handles real-time payment processing, customer authentication, and parking facility operations and requires 99.9%+ uptime across all services.

Requirements

  • 8+ years of backend software engineering experience with a deep focus on distributed systems and platform infrastructure
  • Expert-level Java proficiency with a deep understanding of JVM performance, concurrency, and ecosystem tooling
  • Production experience with microservices architecture, container orchestration (Kubernetes), and cloud platforms (AWS)
  • Strong systems thinking with proven ability to design and implement large-scale, high-availability distributed systems that handle significant load
  • Observability expertise including hands-on production experience with metrics, logging, tracing, and alerting systems in high-load environments
  • Database and data systems knowledge including relational databases, event streaming (Kafka, SQS), caching strategies, and data consistency patterns
  • Experience with AI-powered development tools such as Claude Code, GitHub Copilot, or similar agentic coding tools for enhanced productivity

Responsibilities

  • Own the overall reliability posture for the Metropolis platform, establishing practices, metrics, and systems that ensure 99.9%+ uptime across all services
  • Design and implement automatic failover mechanisms for critical external dependencies (Twilio for SMS/voice, Stripe for payments) with circuit breakers, retry policies, and degraded mode operations
  • Architect and build active-passive or active-active regional deployment strategies with database replication, automated failover, and DNS-based traffic routing, including disaster recovery planning and testing
  • Establish comprehensive monitoring using Datadog for APM, logs, and metrics correlation; implement synthetic monitoring, SLO-based alerting, on-call rotation, and escalation policies; build service health dashboards that show customer impact
  • Own the incident management process, including workflows, tooling, post-mortem culture, and runbook automation, and lead initiatives to reduce mean time to recovery (MTTR) from detection to resolution
  • Drive adoption of resilience patterns across all services including health checks, graceful degradation, feature flags, rate limiting, backpressure mechanisms, and chaos engineering practices
  • Build and maintain local mirrors for critical dependencies (Maven/NPM/Docker registries) with artifact caching, dependency pinning, and vulnerability scanning to prevent build failures from upstream outages

Other

  • Excellent technical communication, with the ability to design and document complex systems, lead technical discussions, and collaborate across multiple teams
  • Ability to work on-site at least four days a week, fostering organic interactions that spark creativity and connection
  • Bachelor's, Master's, or Ph.D. degree in Computer Science or a related field (implied; not stated explicitly in the posting)
  • Must be authorized to work in the United States (implied; not stated explicitly in the posting)