Skywalker Sound Development Group is seeking a Sr. ML Ops Engineer to build and maintain the infrastructure powering its machine learning and AI frameworks, enabling seamless model training, retraining, and deployment workflows for transformative audio solutions.
Requirements
- Expertise in building and maintaining CI/CD pipelines for machine learning applications.
- Strong proficiency with containerization (Docker) and orchestration tools (Kubernetes).
- Proficiency in deploying machine learning models using frameworks such as TensorFlow Serving, TorchServe, or custom APIs.
- Deep understanding of cloud infrastructure and services (AWS, GCP, or Azure) for ML workloads, including GPU and TPU utilization.
- Experience managing large-scale distributed training workflows and optimizing resource allocation.
- Familiarity with tools like MLflow, DVC, Weights & Biases, or similar for data and model tracking and versioning.
- Strong scripting and programming skills in Python, Bash, or Go.
Responsibilities
- Develop, deploy, and maintain scalable infrastructure for machine learning model training, retraining, and inference.
- Design and optimize CI/CD pipelines specifically tailored for machine learning workflows, ensuring efficient delivery from research to production.
- Implement robust monitoring and logging systems to track model performance and identify potential issues in production environments.
- Manage compute resources (cloud and on-premises) to enable large-scale distributed training and inference tasks.
- Containerize machine learning models and applications using Docker and deploy them via Kubernetes or equivalent orchestration systems.
- Automate deployment workflows for serving ML models using frameworks such as TorchServe, TensorFlow Serving, and FastAPI.
- Implement model versioning, rollback strategies, and governance for maintaining production stability.
Other
- This role is considered Hybrid, which means the employee will work 2-3 days onsite at our Nicasio, CA office and occasionally from home.
- 5+ years of experience in DevOps, Site Reliability Engineering, or a related role, with at least 2 years focusing on ML Ops.
- Solid understanding of security best practices for machine learning systems and sensitive data handling.
- Experience with data orchestration tools like DataChain, Weights & Biases, etc., for managing ML workflows.
- Hands-on experience with automated hyperparameter tuning and optimization frameworks.