Sr ML Ops Engineer

Disney Entertainment & ESPN Technology

$152,100 - $203,900

Sep 18, 2025

Nicasio, CA, USA

Skywalker Sound Development Group is seeking a Sr ML Ops Engineer to build and maintain the infrastructure powering their machine learning and AI frameworks, enabling seamless workflows for model training, retraining, and deployment for transformative audio solutions.

Requirements

Expertise in building and maintaining CI/CD pipelines for machine learning applications.
Strong proficiency with containerization (Docker) and orchestration tools (Kubernetes).
Proficiency in deploying machine learning models using frameworks such as TensorFlow Serving, TorchServe, or custom APIs.
Deep understanding of cloud infrastructure and services (AWS, GCP, or Azure) for ML workloads, including GPU and TPU utilization.
Experience managing large-scale distributed training workflows and optimizing resource allocation.
Familiarity with tools like MLflow, DVC, Weights & Biases, or similar for data and model tracking and versioning.
Strong scripting and programming skills in Python, Bash, or Go.

Responsibilities

Develop, deploy, and maintain scalable infrastructure for machine learning model training, retraining, and inference.
Design and optimize CI/CD pipelines specifically tailored for machine learning workflows, ensuring efficient delivery from research to production.
Implement robust monitoring and logging systems to track model performance and identify potential issues in production environments.
Manage compute resources (cloud and on-premises) to enable large-scale distributed training and inference tasks.
Containerize machine learning models and applications using Docker and deploy them via Kubernetes or equivalent orchestration systems.
Automate deployment workflows for serving ML models using frameworks such as TorchServe, TensorFlow Serving, and FastAPI.
Implement model versioning, rollback strategies, and governance for maintaining production stability.

Other

This role is considered Hybrid, which means the employee will work 2-3 days onsite at our Nicasio, CA office and occasionally from home.
5+ years of experience in DevOps, Site Reliability Engineering, or a related role, with at least 2 years focused on ML Ops.
Solid understanding of security best practices for machine learning systems and sensitive data handling.
Experience with data orchestration tools such as DataChain and Weights & Biases for managing ML workflows.
Hands-on experience with automated hyperparameter tuning and optimization frameworks.