Pinterest is seeking a Staff Software Engineer, Capacity Engineering to manage and optimize its ML infrastructure, with efficiency as a strategic priority for the company.
Requirements
- Deep understanding of GPU architectures, PyTorch, etc.
- Deep understanding of the supporting parts of the ML software stack, such as scheduling, data, and storage
- Hands-on experience with shared platforms like Kubernetes
- Strong technical and performance engineering skills to collaborate with stakeholders on complex and ambiguous technical challenges
- Experience building and managing highly available distributed applications at scale
- Proficiency in software development languages such as Java, Python and C++
- Understanding of ML models, kernels, and optimization opportunities
- Hands-on experience with large, cloud-native multi-tenant platforms at Internet scale
- Experience with AWS or similar cloud environments
- Deep understanding of infrastructure capacity and performance
Responsibilities
- Manage the ML hardware capacity that powers the models running at Pinterest
- Improve the efficiency of ML Infrastructure at Pinterest
- Build, develop, and mature profiling and optimization capabilities for ML Infrastructure at Pinterest scale
- Collaborate with ML Platform, Infrastructure Engineering and SRE teams in their mission to deliver highly available, resilient, secure and efficient ML foundations for Pinterest’s tech stack
Other
- This role requires in-office, in-person collaboration 1-2 times per quarter, so it can be based anywhere in the country.
- This position is not eligible for relocation assistance.
- Excellent skills in communicating complex technical issues
- US-based applicants only