Google's AI Hypercomputer Infrastructure organization needs to bring new Tensor Processing Unit (TPU) and Graphics Processing Unit (GPU) generations to the Google Cloud platform and enable an end-to-end AI/ML compute experience for Google Cloud users. This involves qualifying the complete Cloud TPU and Cloud GPU Hypercomputer stack, building and delivering telemetry infrastructure, optimizing stability and performance for AI/ML workloads, and performing model benchmarking and optimizations during the New Product Introduction (NPI) process.
Requirements
- 8 years of experience in software development.
- 7 years of experience leading technical project strategy, ML design, and optimizing ML infrastructure (e.g., model deployment, model evaluation, data processing, debugging, fine-tuning).
- 5 years of experience with design and architecture, and with testing and launching software products.
- 5 years of experience with one or more of the following: speech/audio (e.g., technology duplicating and responding to the human voice), reinforcement learning (e.g., sequential decision making), ML infrastructure, or specialization in another ML field.
- 8 years of experience with data structures/algorithms.
- 5 years of experience in a technical leadership role leading project teams and setting technical direction.
- 3 years of experience working in a complex, matrixed organization involving cross-functional or cross-business projects.
Responsibilities
- Build and lead strategic technical alignment with major organizations across Google (PIE, GCE, GKE, GCS, etc.) to accelerate the velocity of bringing machine learning hardware to Google Cloud with dynamic and engaged Time to Market (TTM) goals.
- Use your extensive expertise in distributed systems and machine learning to develop and execute multi-year plans for validating the end-to-end stack for TPU and GPU products.
- Ensure that our products deliver performance and stability to make our AI/ML customers successful and meet the demands of the growing business.
- Develop strong collaborative relationships with cross-functional teams across organizational boundaries to deliver a high-performing end-to-end stack whose components are built by those teams.
- Provide technical leadership and mentorship to the team as an overall Über Tech Lead (UTL).
- Build and deliver the telemetry infrastructure for both the GPU/TPU fleet and the critical AI/ML workloads.
- Optimize the stability and performance of high-priority AI/ML workloads across the stack.
Other
- Note: By applying to this position you will have an opportunity to share your preferred working location from the following: Kirkland, WA, USA; Seattle, WA, USA; San Francisco, CA, USA; Sunnyvale, CA, USA.
- Bachelor’s degree or equivalent practical experience.
- Master’s degree or PhD in Engineering, Computer Science, or a related technical field.
- Google is proud to be an equal opportunity workplace and is an affirmative action employer. We are committed to equal employment opportunity regardless of race, color, ancestry, religion, sex, national origin, sexual orientation, age, citizenship, marital status, disability, gender identity or Veteran status.
- If you have a disability or special need that requires accommodation, please let us know by completing our Accommodations for Applicants form.