Frontier AI Robotics is seeking a Machine Learning Systems Engineer to build and optimize distributed training infrastructure for large-scale machine learning models, particularly deep learning and transformer-based architectures, that power state-of-the-art AI research and applications.
Requirements
- Deep understanding of LLM algorithms and deep learning frameworks such as PyTorch.
- Mathematics and Statistics: Strong understanding of linear algebra, calculus, probability, and statistics.
- 3+ years of experience across the full software development life cycle, including coding standards, code reviews, source control management, build processes, testing, and operations.
Responsibilities
- Design, build, and optimize machine learning infrastructure for large-scale training and inference.
- Apply PyTorch, Python, and C++ skills to engineer modular, scalable ML systems.
- Evaluate and implement parallelism techniques such as data, tensor, model, and pipeline parallelism.
- Monitor and optimize GPU memory and throughput for training large models efficiently.
- Collaborate cross-functionally with research and data infrastructure teams to integrate new models and features.
Other
- 3+ years of non-internship professional software development experience
- 2+ years of non-internship experience in the design or architecture (design patterns, reliability, and scaling) of new and existing systems
- Experience programming in at least one programming language
- Work safely and cooperatively with other employees, supervisors, and staff
- Adhere to standards of excellence under stressful conditions