Hedra is looking for an ML Engineer to manage and optimize the computational infrastructure used to train and deploy its machine learning models, specifically its 3DVAE and video diffusion models. The infrastructure must handle large video datasets and the resource-intensive tasks associated with training large generative models.
Requirements
- 3+ years of experience with high-performance computing systems.
- Experience with cloud computing platforms such as Amazon Web Services, Google Cloud, or Microsoft Azure, essential for managing large-scale ML workloads.
- Values sound engineering processes, including version control and CI/CD.
- Knowledge of containerization and orchestration technologies such as Docker and Kubernetes, required for deployments at scale.
- Understanding of distributed training techniques and of how to scale models across multi-node clusters to meet video generation needs.
Responsibilities
- Design, implement, and maintain scalable computing solutions for training and deploying ML models, ensuring infrastructure can handle large video datasets.
- Manage and optimize the performance of our computing clusters or cloud instances (for example, on AWS or Google Cloud) to support distributed training.
- Ensure that our infrastructure can handle the resource-intensive tasks associated with training large generative models.
- Monitor system performance and implement improvements to maximize efficiency and utilization, using tools such as Airflow for workflow orchestration.
- Collaborate with research teams to understand their computational needs and provide appropriate solutions that facilitate seamless model deployment.
Other
- The ideal candidate has diverse experience managing ML workloads at scale and will support our 3DVAE and video diffusion models.
- We encourage you to apply even if you don't meet every requirement — we value curiosity, creativity, and the drive to solve hard problems.
- Bachelor’s degree in Computer Science, Information Technology, or a related field, with a focus on system administration.
- Strong problem-solving and communication skills, given the need to collaborate with diverse teams.
- Our team is fully in-person in SF/NY with a shared love for whiteboard problem-solving.