Staff MLOps Engineer - RL Infrastructure

Apptronik

Salary not specified

Oct 17, 2025

Austin, TX, US

Apptronik is building robots for the real world to improve human quality of life and to help solve the ever-increasing labor shortage problem. We are seeking an experienced MLOps Engineer to own and maintain our cutting-edge reinforcement learning (RL) training infrastructure.

Requirements

Strong software engineering fundamentals (testing, code review, documentation, git) and proven experience in a backend or infrastructure role.
Professional experience managing cloud infrastructure on a major cloud platform (e.g., GCP, AWS, Azure).
Hands-on experience with infrastructure-as-code tools (e.g., Terraform, Ansible, CloudFormation).
Familiarity with ML frameworks (PyTorch, TensorFlow) and understanding of model training workflows.
Proficiency with containerization and orchestration technologies (e.g., Kubernetes, Docker).
Understanding of distributed computing concepts and cluster management for compute-intensive workloads.
Solid understanding of Python and experience with scripting for automation and tooling.

Responsibilities

Design, Deploy, and Maintain Infrastructure: Manage and scale our RL training clusters on major cloud platforms (e.g., GCP, AWS, Azure) using infrastructure-as-code principles.
Orchestration and Deployment: Utilize container orchestration tools (e.g., Kubernetes, Docker Swarm) to manage the deployment and scaling of our applications and clusters.
Job Scheduling and Execution: Implement and manage tooling for submitting and monitoring large-scale distributed training jobs using modern distributed computing frameworks (e.g., Ray, Slurm).
Database and Storage Management: Oversee our cloud-native database solutions, ensuring efficient storage and retrieval of large datasets, including images.
Developer Tools: Create SDKs, documentation, and CLI/GUI tooling that make it easy for researchers to launch experiments, visualize results, and debug issues without infrastructure expertise.
System Optimization: Implement robust monitoring, logging, and alerting to ensure the reliability, performance, and health of the training infrastructure.
CI/CD and Automation: Develop and maintain CI/CD pipelines for automated testing, data processing, benchmarking, and model experimentation.

Other

Cross-functional Collaboration: Work closely with AI researchers and robotics engineers to understand pain points, optimize training workflows, and develop solutions that accelerate development cycles.
Bachelor's or Master's degree in Computer Science, Engineering, or a related technical field.
Minimum of 4 years of professional, full-time experience building and maintaining reliable, scalable systems.
Exposure to ML/data engineering infrastructure.
Experience building tools or platforms used by other developers and researchers.