The company is looking to develop and optimize software infrastructure and tools for training cutting-edge AI models.
Requirements
Proficiency in Python, C++, or a similar language, and in at least one deep learning library such as PyTorch, TensorFlow, or JAX.
Strong background in distributed computing, parallel processing, and handling and preprocessing large-scale datasets.
Deep understanding of state-of-the-art machine learning techniques and models.
Experience with cloud-based training environments (e.g., AWS, Google Cloud, or Azure).
Experience in developing and maintaining software tooling and infrastructure for machine learning.
Deep understanding and practical experience with software engineering principles, including algorithms, data structures, and system design.
Experience with continuous integration and automated testing frameworks.
Responsibilities
Develop and maintain robust, scalable, and distributed training pipelines (data preprocessing, training orchestration, and model evaluation) and frameworks for large-scale AI models.
Optimize training processes for performance and resource utilization, ensuring scalability and reliability.
Collaborate with researchers and machine learning engineers to integrate state-of-the-art algorithms and techniques into training pipelines.
Monitor and analyze training runs, identifying bottlenecks and proposing solutions to improve efficiency and performance.
Ensure the robustness and reliability of the training infrastructure, including automated testing and continuous integration.
Other
BS, MS, or higher degree in Computer Science, Robotics, Engineering, or a related field, or equivalent practical experience.