xAI needs to optimize its network performance and availability to support its AI training and inference workloads.
Requirements
- Deep understanding of congestion control on Ethernet; InfiniBand experience is a bonus
- Deep understanding of AI training and inference workloads and how they operate on the network
- Expertise in creating a portfolio of metrics for performance and operations
- Experience using Python to automate repetitive tasks and streamline daily work
- Deep experience with RoCEv2
- Ability to use and debug NCCL and potentially contribute commits to the library
- Experience designing and operating large-scale networks, including 5 years in the Ethernet AI/HPC space
Responsibilities
- Develop at hyperscale while optimizing performance and availability
- Build metric dashboards and tweak configurations to ensure no performance is left on the table
- Help design the next iteration of our back-end and front-end networks
- Spend most days deep inside NCCL
- Participate in a team on-call rotation and help on other scaling and maintenance efforts
- Contribute to deployment and operations frameworks to remove repetitive tasks
- Optimize the fleet for training and inference traffic
Other
- 10 years designing and operating large-scale networks
- 5 years in the Ethernet AI/HPC space
- Significant travel expected to Memphis, Tennessee, for data center buildouts and to the head office in Palo Alto for team collaboration
- Strong communication skills
- Strong work ethic and prioritization skills
- Ability to concisely and accurately share knowledge with teammates