Runway is looking for an Engineering Manager (Backend) to lead the team responsible for its machine learning platform. The role focuses on building the infrastructure for ML at scale, keeping training jobs running smoothly, enabling model evaluation and exploration, and scaling production inference.
Requirements
- 5+ years building distributed systems, data pipelines, and infrastructure at scale.
- Experience managing engineering teams of 3-8 people.
- Experience with cloud platforms (AWS/GCP), container orchestration (Kubernetes/ECS), and operating services at scale.
- You've built reliable systems that handle large data volumes and complex workloads.
- Experience building comprehensive monitoring and alerting.
- You know what metrics matter and how to surface the right information to different teams.
- Comfortable working directly with researchers and data scientists.
Responsibilities
- Build the platform infrastructure for ML at scale.
- Keep training jobs running smoothly.
- Enable model evaluation and exploration.
- Scale production inference.
- Lead the platform engineering team that powers Runway's machine learning pipeline, from data processing through model training to production inference.
- Build monitoring, alerting, and automation around critical multi-day training runs on hundreds of GPUs.
- Maintain the platform that lets researchers inspect training data, visualize outputs, and evaluate model checkpoints.
Other
- Lead the team responsible for Runway's machine learning platform.
- Manage our current team of 5 as it grows.
- Work closely with our Research and Machine Learning teams to build out our data processing, training, and eval systems.
- Collaborative mindset.
- Humility and open-mindedness.