Google is seeking to solve the critical issue of scaling Machine Learning (ML) workloads to counteract the limitations of Moore's Law. The ML Supercomputers team aims to deliver easy-to-use and maintainable software for the reliable scale-out and scale-up of accelerators, specifically targeting massive-scale ML applications.
Requirements
- Bachelor's degree or equivalent practical experience.
- 8 years of experience in software development.
- 5 years of experience testing and launching software products.
- 5 years of experience building and developing large-scale infrastructure, distributed systems or networks, or experience with compute technologies, storage, or hardware architecture.
- 3 years of experience with software design and architecture.
- Master’s degree or PhD in Engineering, Computer Science, or a related technical field.
- 8 years of experience with data structures/algorithms.
Responsibilities
- Design and maintain supercomputer software across different layers of the software stack (e.g., network routing rules built into Tensor Processing Units (TPUs), control software running on specialized machines, distributed software running on Google’s internal and cloud infrastructure).
- Control, monitor, build, deploy, qualify, and service supercomputing systems.
- Provide technical leadership to help formulate and drive software development plans.
- Identify commonalities between different supercomputer generations and accelerator types, and create well-abstracted, flexible software.
- Help identify dependencies in cross-functional teams and drive common execution with a focus on development velocity and quality.
Other
- 3 years of experience in a technical leadership role leading project teams and setting technical direction.
- 3 years of experience working in a complex, matrixed organization involving cross-functional or cross-business projects.
- Knowledge of common ML algorithms and how they map to software and hardware operations.