Sr ML Ops Engineer

Disney Entertainment & ESPN Technology

$152,100 - $203,900
Sep 18, 2025
Nicasio, CA, USA

Skywalker Sound Development Group is seeking a Sr ML Ops Engineer to build and maintain the infrastructure powering their machine learning and AI frameworks, enabling seamless workflows for model training, retraining, and deployment for transformative audio solutions.

Requirements

  • Expertise in building and maintaining CI/CD pipelines for machine learning applications.
  • Strong proficiency with containerization (Docker) and orchestration tools (Kubernetes).
  • Proficiency in deploying machine learning models using frameworks such as TensorFlow Serving, TorchServe, or custom APIs.
  • Deep understanding of cloud infrastructure and services (AWS, GCP, or Azure) for ML workloads, including GPU and TPU utilization.
  • Experience managing large-scale distributed training workflows and optimizing resource allocation.
  • Familiarity with tools like MLflow, DVC, Weights & Biases, or similar for data and model tracking and versioning.
  • Strong scripting and programming skills in Python, Bash, or Go.

Responsibilities

  • Develop, deploy, and maintain scalable infrastructure for machine learning model training, retraining, and inference.
  • Design and optimize CI/CD pipelines specifically tailored for machine learning workflows, ensuring efficient delivery from research to production.
  • Implement robust monitoring and logging systems to track model performance and identify potential issues in production environments.
  • Manage compute resources (cloud and on-premises) to enable large-scale distributed training and inference tasks.
  • Containerize machine learning models and applications using Docker and deploy them via Kubernetes or equivalent orchestration systems.
  • Automate deployment workflows for serving ML models using frameworks such as TorchServe, TensorFlow Serving, and FastAPI.
  • Implement model versioning, rollback strategies, and governance for maintaining production stability.

Other

  • This role is considered Hybrid, which means the employee will work 2-3 days onsite at our Nicasio, CA office and occasionally from home.
  • 5+ years of experience in DevOps, Site Reliability Engineering, or a related role, with at least 2 years focused on ML Ops.
  • Solid understanding of security best practices for machine learning systems and sensitive data handling.
  • Experience with data orchestration tools such as DataChain or Weights & Biases for managing ML workflows.
  • Hands-on experience with automated hyperparameter tuning and optimization frameworks.