Our Site Reliability Engineering (SRE) team works to solve scaling, reliability, and efficiency challenges in large-scale distributed systems.
Requirements
- Deep hands-on expertise in at least one of the following areas: databases (SQL/NoSQL), Kubernetes or other container orchestration, or Big Data processing and storage systems (streaming and batch)
- Strong knowledge of system architecture, distributed systems, and performance bottlenecks
- Proven track record of driving automation, tooling, and process improvements that enhance reliability and efficiency
- Experience in cost optimization and performance tuning at scale, backed by data-driven decision-making
- Thought leadership in adopting new technologies, improving operational practices, and influencing system design
Responsibilities
- Design, develop, and operate large-scale cloud infrastructure and distributed systems
- Collaborate with cross-functional teams (e.g., Advertising, Machine Learning, E-commerce, and Core Infra) to drive system reliability, performance, and scalability
- Lead initiatives to automate operations, eliminate toil, and improve overall system efficiency
- Troubleshoot complex production issues, perform root-cause analysis, and drive long-term reliability improvements
- Promote best practices in system design, observability, performance optimization, and cost efficiency
- Communicate complex technical concepts effectively to both technical and non-technical stakeholders
Other
- 5+ years of experience in Site Reliability Engineering, Software Development, or related fields
- Excellent communication and collaboration skills, with experience working across engineering, product, and data science teams
- Ability to influence without formal authority
- Strong problem-solving ability
- Clear communication