Box needs to enhance the availability, reliability, and resilience of its systems to drive customer experience and operational excellence.
Requirements
- Experience coding in higher-level languages (e.g., Java, Scala, Go, Python)
- Experience designing complex systems and frameworks using proven system design principles, such as NALSD (Non-Abstract Large System Design) methodologies
- Experience troubleshooting issues across distributed Linux environments, with comfort tracing problems across applications, systems, and networks
- Proficiency with modern cloud technologies such as GCP, AWS, and Kubernetes
- Experience with service observability practices and tools (e.g., Prometheus, OpenTelemetry, SignalFx, or similar)
- Comfortable learning new software, frameworks, and APIs quickly and effectively
- Familiarity with PHP, JavaScript, or Node.js (bonus)
Responsibilities
- Build software, frameworks, and tools required for reliable operations of Box's services across multiple cloud environments
- Manage the stability and operation of several of Box's most critical production applications through application reviews, capacity planning, and performance tuning
- Develop automation, frameworks, and tools that improve platform reliability, resilience, and availability
- Participate in product design reviews and architectural discussions to ensure reliability is considered early in the development lifecycle of products and services
- Participate in a team on-call rotation
- Improve our observability, both as a developer and maintainer of systems and frameworks and as a mentor to our product development teams
- Work with modern cloud-native technologies including container orchestration (Kubernetes, Docker), service mesh solutions (Istio, Linkerd), and cloud platforms (AWS, GCP)
Other
- 5+ years of working experience designing, developing, and operating large-scale, customer-facing products or services
- A strong interest in solving challenging problems using innovative and data-driven approaches
- An SRE-centric mindset — you build and manage systems with reliability, scalability, availability, and security as core principles
- A natural collaborator who inspires others, mentors junior engineers, and drives technical excellence
- Work from your assigned office at least 2 days per week, with a focus on Tuesdays and Thursdays