Liftoff is looking to improve the reliability and performance of its machine learning systems at scale.
Requirements
- Deep expertise in Python and/or Go
- Fluency with ML libraries (e.g., TensorFlow, PyTorch)
- Experience with cloud infrastructure (e.g., AWS)
- Experience with ML monitoring tools (e.g., Prometheus, Grafana)
- Experience with big data engines such as Trino and Spark is a big plus
- Solid core CS fundamentals (data structures, algorithms, system architecture)
- Experience with ML systems for training Transformer models and CTR prediction models
Responsibilities
- Lead the design and evolution of large-scale ML infrastructure
- Define and implement end-to-end monitoring, alerting, and performance tracking for ML models and data pipelines
- Partner with data scientists and platform teams to standardize and scale model deployment, versioning, and A/B experimentation frameworks
- Lead and participate in incident response efforts, conducting root cause analysis and implementing corrective actions to prevent recurrence
- Identify systemic inefficiencies and opportunities for automation or simplification, and drive cross-functional efforts to improve system performance and developer productivity
- Drive adoption of best practices in software and ML engineering, including code quality, risk-driven testing, and explainable, maintainable systems
- Act as a mentor and multiplier, helping other engineers level up in ML systems, reliability, and architectural thinking
Other
- BS in Computer Science with 8+ years of professional experience; or MS in Computer Science with 6+ years of professional experience; or PhD with 3+ years of professional experience
- Proven ability to drive large technical initiatives and lead projects spanning multiple teams
- Strong problem-solving skills and the ability to work collaboratively across teams
- Ability to lead across team and role boundaries to effect large-scale change in culture and systems
- Travel expectations: attend in-person team gatherings at least once per quarter