Senior MLOps Engineer

MasterControl

$200,000 - $250,000

Oct 27, 2025

Remote, US

MasterControl is building an internal AI Platform to power intelligent, scalable, and compliant AI systems in regulated industries, and needs an MLOps Engineer to automate, monitor, and scale machine learning workloads.

Requirements

Strong expertise in Kubernetes, container orchestration, and cloud-native architecture (AWS preferred), particularly for GPU workloads.
Proficiency in infrastructure-as-code (Terraform, Helm, Kustomize) and cloud platforms (AWS preferred).
Familiarity with artifact tracking, experiment management, and model registries (e.g., MLflow, W&B, SageMaker Experiments).
Strong Python engineering skills and experience debugging ML workflows at scale.
Experience deploying and scaling inference workloads using modern ML frameworks.
Deep understanding of CI/CD systems and their role in ML production.
Working knowledge of monitoring and alerting systems for ML workloads.

Responsibilities

Design and maintain infrastructure for training, evaluating, and deploying machine learning models at scale.
Manage GPU orchestration on Kubernetes (EKS), including node autoscaling, bin-packing, taints/tolerations, and cost-aware scheduling strategies (e.g., spot/preemptible GPUs).
Build and optimize CI/CD pipelines for ML code, data versioning, and model artifacts using tools like GitHub Actions, Argo Workflows, and Terraform.
Manage and optimize containerized ML workloads on Kubernetes (EKS), including node autoscaling, GPU orchestration, and runtime scheduling.
Develop and maintain observability for model and pipeline health (e.g., using Prometheus, Grafana, OpenTelemetry).
Collaborate with Data Scientists and ML Engineers to productionize notebooks, pipelines, and models.
Work with security and compliance teams to implement best practices for model serving and data access.

Other

5+ years of experience in MLOps, infrastructure, or platform engineering.
Experience setting up and scaling training and fine-tuning pipelines for ML models in production environments.
Hands-on experience with training frameworks such as PyTorch Lightning, Hugging Face Accelerate, or DeepSpeed.
A strong sense of ownership and commitment to quality, security, and operational excellence.
Applicants must be currently authorized to work in the United States on a full-time basis.