Apptronik is building robots for the real world to improve human quality of life and to help solve the ever-increasing labor shortage problem. We are seeking an experienced MLOps Engineer to own and maintain our cutting-edge reinforcement learning (RL) training infrastructure.
Requirements
- Strong software engineering fundamentals (testing, code review, documentation, git) and proven experience in a backend or infrastructure role.
- Professional experience managing cloud infrastructure on a major cloud platform (e.g., GCP, AWS, Azure).
- Hands-on experience with infrastructure-as-code tools (e.g., Terraform, Ansible, CloudFormation).
- Familiarity with ML frameworks (PyTorch, TensorFlow) and understanding of model training workflows.
- Proficiency with containerization and orchestration technologies (e.g., Docker, Kubernetes).
- Understanding of distributed computing concepts and cluster management for compute-intensive workloads.
- Solid understanding of Python and experience with scripting for automation and tooling.
Responsibilities
- Design, Deploy, and Maintain Infrastructure: Manage and scale our RL training clusters on major cloud platforms (e.g., GCP, AWS, Azure) using infrastructure-as-code principles.
- Orchestration and Deployment: Utilize container orchestration tools (e.g., Kubernetes, Docker Swarm) to manage the deployment and scaling of our applications and clusters.
- Job Scheduling and Execution: Implement and manage tooling for submitting and monitoring large-scale distributed training jobs using modern distributed computing frameworks (e.g., Ray, Slurm).
- Database and Storage Management: Oversee our cloud-native database solutions, ensuring efficient storage and retrieval of large datasets, including images.
- Developer Tools: Create SDKs, documentation, and CLI/GUI tooling that make it easy for researchers to launch experiments, visualize results, and debug issues without infrastructure expertise.
- System Optimization: Implement robust monitoring, logging, and alerting to ensure the reliability, performance, and health of the training infrastructure.
- CI/CD and Automation: Develop and maintain CI/CD pipelines for automated testing, data processing, benchmarking, and model experimentation.
- Cross-functional Collaboration: Work closely with AI researchers and robotics engineers to understand pain points, optimize training workflows, and develop solutions that accelerate development cycles.
Other
- Bachelor's or Master's degree in Computer Science, Engineering, or a related technical field.
- Minimum of 4 years of professional, full-time experience building and maintaining reliable, scalable systems.
- Exposure to ML/data engineering infrastructure.
- Experience building tools or platforms used by other developers and researchers.