At FanDuel, our data platforms power mission-critical products used by millions of customers every day. To meet the scale, speed, and complexity of our business, our data services must be resilient, performant, and highly reliable, even in the face of failures or large-scale disruption.
Requirements
- Strong knowledge of AWS services (EC2, S3, RDS, Lambda, CloudWatch, etc.) and infrastructure-as-code (Terraform, CloudFormation, Ansible).
- Hands-on experience with data platform technologies (Databricks, Redshift, Airflow, dbt, Spark, Kafka).
- Proficiency in at least one programming language (Python, Go, Java, or similar).
- Experience with monitoring, observability, and log analysis tools (Prometheus, Grafana, DataDog, ELK, Splunk).
- Solid understanding of CI/CD pipelines, version control (Git), and automation practices.
- Proven ability to design and implement business continuity and disaster recovery solutions.
- Strong incident management skills, including root cause analysis and post-mortem processes.
Responsibilities
- Design and maintain highly available, scalable data platform infrastructure on AWS (Databricks, Redshift, Airflow).
- Collaborate with engineering teams to optimize performance across clusters, pipelines, and workflows.
- Implement infrastructure-as-code (Terraform, CloudFormation, CDK) to automate deployment and recovery.
- Establish and monitor SLOs/SLIs, building observability and automation for reliability at scale.
- Lead business continuity and disaster recovery planning, including cross-region failover and backup strategies.
- Drive incident management end-to-end: response, escalation, post-mortems, and root cause analysis.
- Champion continuous improvement, applying learnings from incidents and performance data to strengthen resilience.
Other
- 5+ years of experience in Site Reliability Engineering, DevOps, Infrastructure, or Data Platform Engineering.
- Excellent communication and documentation skills, with the ability to influence and advocate for best practices across teams.
- Demonstrated track record of improving system reliability, performance, and efficiency at scale.
- Ability to work in a team environment and collaborate with engineering and operations teams.
- FanDuel is committed to equal employment opportunity regardless of race, color, ethnicity, ancestry, religion, creed, sex, national origin, sexual orientation, age, citizenship status, marital status, disability, gender identity, gender expression, veteran status, or any other characteristic protected by state, local, or federal law.