AI/HPC Network Development Engineer - xAI Networking

xAI

Salary not specified
Aug 20, 2025
Memphis, TN, US

xAI needs to optimize its network performance and availability to support its AI training and inference workloads.

Requirements

  • Deep understanding of congestion control on Ethernet; InfiniBand experience is an added bonus
  • Deep understanding of AI training and inference workloads and how they operate on the network
  • Expertise in creating a portfolio of metrics for performance and operations
  • Experience using Python to automate repetitive tasks and streamline daily work
  • Deep experience in RoCEv2
  • Ability to use and debug NCCL and potentially commit to the library
  • Experience designing and operating large-scale networks, including 5 years in the Ethernet AI/HPC space

Responsibilities

  • Develop at hyperscale while optimizing performance and availability
  • Build metric dashboards and tweak configurations to ensure no performance is left on the table
  • Help design the next iteration of our back-end and front-end networks
  • Spend most days deep inside NCCL
  • Participate in a team on-call rotation and help on other scaling and maintenance efforts
  • Contribute to deployment and operations frameworks to remove repetitive tasks
  • Optimize the fleet for training and inference traffic

Other

  • 10 years designing and operating large-scale networks
  • 5 years in the Ethernet AI/HPC space
  • Significant travel expected to Memphis, Tennessee for data center buildouts and to the head office in Palo Alto for team collaboration
  • Strong communication skills
  • Strong work ethic and prioritization skills
  • Ability to concisely and accurately share knowledge with teammates