Metropolis needs to ensure system reliability at scale for its mission-critical mobility infrastructure, which handles real-time payment processing, customer authentication, and parking facility operations, and requires 99.9%+ uptime.
Requirements
- 8+ years of backend software engineering experience with deep focus on distributed systems and platform infrastructure
- Expert-level Java proficiency with deep understanding of JVM performance, concurrency, and ecosystem tooling
- Production experience with microservices architecture, container orchestration (Kubernetes), and cloud platforms (AWS)
- Strong systems thinking with proven ability to design and implement large-scale, high-availability distributed systems that handle significant load
- Observability expertise including hands-on production experience with metrics, logging, tracing, and alerting systems in high-load environments
- Database and data systems knowledge including relational databases, event streaming and message queuing (Kafka, SQS), caching strategies, and data consistency patterns
- Experience with AI-powered development tools such as Claude Code, GitHub Copilot, or similar agentic coding tools for enhanced productivity
Responsibilities
- Own the overall reliability posture for the Metropolis platform, establishing practices, metrics, and systems that ensure 99.9%+ uptime across all services
- Design and implement automatic failover mechanisms for critical external dependencies (Twilio for SMS/voice, Stripe for payments) with circuit breakers, retry policies, and degraded mode operations
- Architect and build active-passive or active-active regional deployment strategies with database replication, automated failover, and DNS-based traffic routing, including disaster recovery planning and testing
- Establish comprehensive monitoring using Datadog for APM, logs, and metrics correlation; implement synthetic monitoring, SLO-based alerting, on-call rotation, and escalation policies; build service health dashboards that show customer impact
- Own the incident management process, including workflows, tooling, post-mortem culture, runbook automation, and initiatives to reduce mean time to recovery (MTTR) from detection to resolution
- Drive adoption of resilience patterns across all services including health checks, graceful degradation, feature flags, rate limiting, backpressure mechanisms, and chaos engineering practices
- Build and maintain local mirrors for critical dependencies (Maven/NPM/Docker registries) with artifact caching, dependency pinning, and vulnerability scanning to prevent build failures from upstream outages
Other
- Excellent technical communication with ability to design and document complex systems, lead technical discussions, and collaborate across multiple teams
- Ability to work on-site at least four days a week, fostering organic interactions that spark creativity and connection
- Bachelor's, Master's, or Ph.D. degree in Computer Science or a related field (implied rather than explicitly stated)
- Must be authorized to work in the United States
- Must be willing to be assessed by an automated employment decision tool (AEDT) used to evaluate candidacy for employment