Roku is seeking to improve the scalability and reliability of its Big Data and Analytics tech stacks on Cloud infrastructure.
Requirements
- Experience with public cloud infrastructure such as Amazon Web Services (AWS), Google Cloud Platform (GCP), or Microsoft Azure; GCP is preferred
- Experience with at least three of the following technologies/tools: Big Data/Hadoop, Kafka, Spark, Airflow, Presto, Druid, OpenSearch, HAProxy, or Hive
- Experience with Kubernetes and Docker
- Experience with Terraform
- Strong background in Linux/Unix
- Experience with shell scripting or equivalent programming skills in Python
- Experience with monitoring and alerting tools such as Datadog and PagerDuty, and participating in on-call rotations
Responsibilities
- Develop best practices for cloud infrastructure provisioning and disaster recovery, and guide developers on their adoption
- Scale Big Data and distributed systems
- Collaborate on system architecture with developers for optimal scaling, resource utilization, fault tolerance, reliability, and availability
- Conduct low-level systems debugging, performance measurement, and optimization on large production clusters and low-latency services
- Create scripts and automation that can react quickly to infrastructure issues and take corrective actions
- Participate in architecture discussions, influence product roadmap, and take ownership and responsibility over new projects
- Collaborate and communicate with a geographically distributed team
Other
- Bachelor’s degree, or equivalent work experience
- 8+ years of experience in DevOps or Site Reliability Engineering