The company is looking for someone to lead the design, development, and operation of production-grade machine learning infrastructure at scale. The role involves architecting robust pipelines, deploying and monitoring ML models, and ensuring reliability, reproducibility, and governance across the company's AI/ML ecosystem.
Requirements
- Strong programming skills (Python, Go, or Java) with deep expertise in building production systems
- Experience with cloud platforms (AWS, GCP, Azure) and container orchestration (Kubernetes, Docker)
- Proven experience in ML infrastructure: model serving (TensorFlow Serving, TorchServe, Triton) and workflow orchestration or ML platform tooling (Airflow, Kubeflow, MLflow, Ray, Vertex AI, SageMaker)
- Hands-on experience with CI/CD pipelines, infrastructure-as-code (Terraform, Helm), and monitoring/observability tools (Prometheus, Grafana, ELK/EFK stack)
- Strong knowledge of data pipelines, feature stores, and streaming systems (Kafka, Spark, Flink)
- Understanding of model monitoring, drift detection, retraining pipelines, and governance frameworks
Responsibilities
- Lead MLOps architecture: Design and implement scalable ML platforms, CI/CD pipelines, and deployment workflows across cloud and hybrid environments
- Operationalize ML models: Build automated systems for training, testing, deployment, monitoring, and rollback of ML models in production
- Ensure reliability and governance: Implement model versioning, reproducibility, auditing, and compliance best practices
- Drive observability & monitoring: Develop real-time monitoring, alerting, and logging solutions for ML services, covering performance, drift detection, and system health
- Champion automation & efficiency: Reduce friction between data science, engineering, and operations by introducing best practices for infrastructure-as-code, container orchestration, and continuous delivery
- Collaborate cross-functionally: Partner with ML engineers, data scientists, security teams, and product engineering to deliver robust, production-ready AI systems
- Lead innovation in MLOps: Evaluate and introduce new tools, frameworks, and practices that elevate the scalability, reliability, and security of ML operations
Other
- In office 3 days a week; this is not a remote role
- Ability to influence cross-functional stakeholders, define best practices, and mentor engineers at all levels
- Passion for operational excellence, scalability, and security of ML systems in mission-critical environments