Waymo's ML Infrastructure team is looking to improve the runtime efficiency of input data pipelines for large-scale training workloads, specifically for the Perception and Planning models in its autonomous driving software.
Requirements
- Proficient in distributed systems design with an understanding of ML data pipeline optimization.
- Experience with ML frameworks, including TensorFlow and JAX.
- Hands-on experience with libraries like Grain or tf.data service.
- Solid programming skills in Python and C++.
- Practical familiarity with profiling tools to uncover performance bottlenecks.
- Familiarity with distributed dataflow frameworks like ML Pathways.
Responsibilities
- Design and improve distributed input data pipelines for large-scale ML training workloads.
- Collaborate with researchers and ML engineers to resolve bottlenecks in data pipeline performance.
- Improve the runtime goodput of ML training workloads by optimizing input data processing systems and ensuring scalability and reliability across distributed environments.
- Implement and maintain advanced ML infrastructure tools, including ML Pathways, Grain, JAX, and TensorFlow.
- Evaluate and integrate modern technologies to enhance the performance and scalability of ML systems.
- Promote best practices for distributed systems architecture and contribute to technical leadership within the team.
Other
- B.S. in Computer Science or Math, or 8+ years of equivalent real-world experience.
- M.S. in Computer Science or Math.
- LI-Hybrid