At WHOOP, the mission is to unlock human performance and healthspan. This role supports that mission by building durable machine learning platforms that enable teams to develop, deploy, and operate models safely and at scale.
Requirements
- Strong programming skills in Python, with experience building distributed systems and REST/gRPC APIs.
- Deep knowledge of cloud-native services and infrastructure-as-code (e.g., AWS CDK, Terraform, CloudFormation).
- Hands-on experience with model deployment platforms such as AWS SageMaker, Vertex AI, or Kubernetes-based serving stacks.
- Proficiency in ML lifecycle tools (MLflow, Weights & Biases, BentoML) and containerization strategies (Docker, Kubernetes).
- Understanding of data engineering and ingestion pipelines, with the ability to interface with data lakes, feature stores, and streaming systems.
- Experience building scalable ML infrastructure in cloud environments (e.g., AWS).
- Knowledge of MLOps infrastructure (e.g., MLflow, feature store, experiment tracking, model registry).
Responsibilities
- Architect, build, own, and operate scalable ML infrastructure in cloud environments (e.g., AWS), optimizing for speed, observability, cost, and reproducibility.
- Create, support, and maintain core MLOps infrastructure (e.g., MLflow, feature store, experiment tracking, model registry), ensuring reliability, scalability, and long-term sustainability.
- Develop, evolve, and operate MLOps platforms and frameworks that standardize model deployment, versioning, drift detection, and lifecycle management at scale.
- Implement and continuously maintain end-to-end CI/CD pipelines for ML models using orchestration tools (e.g., Prefect, Airflow, Argo Workflows), ensuring robust testing, reproducibility, and traceability.
- Build, manage, and maintain both real-time and batch inference infrastructure, supporting diverse use cases from physiological analytics to personalized feedback loops for WHOOP members.
- Design, implement, and own automated observability tooling (e.g., for model latency, data drift, accuracy degradation), integrating metrics, logging, and alerting with existing platforms.
- Leverage AI-powered tools and automation to reduce operational overhead, enhance developer productivity, and accelerate model release cycles.
Other
- Bachelor’s or Master’s Degree in Computer Science, Engineering, or a related field; or equivalent practical experience.
- 5+ years of experience in software engineering with a focus on ML infrastructure, cloud platforms, or MLOps.
- Proven ability to work cross-functionally with Data Science, Data Platform, and Software Engineering teams, influencing decisions and driving alignment.
- Passion for applying AI and automation to solve real-world problems and improve operational workflows.
- Must be prepared to relocate if necessary to work out of the Boston, MA office.