xAI is looking to build cutting-edge software, services, and frameworks that empower its Network Development Engineers to manage and automate network operations for its large-scale GPU supercomputing network fabrics, ultimately supporting the company's mission of accelerating human scientific discovery through AI.
Requirements
- Python
- Go
- TCP/IP
- BGP
- RDMA
- Experience implementing Infrastructure as Code (IaC) best practices, enhancing deployment pipelines, and ensuring robust, secure service delivery across our production environments
- Expert knowledge and a proven history of designing scalable, reliable software from the ground up that can build and orchestrate tens of thousands of network devices at high speed
Responsibilities
- Build cutting-edge software, services, and frameworks to empower our Network Development Engineers
- Tackle all facets of network management: metric collection, configuration, zero-touch provisioning, monitoring, and auto-remediation
- Drive automation-first solutions for xAI’s production and ancillary networks
- Develop extensible tools
- Streamline complex processes
- Ensure rock-solid reliability to support xAI’s mission of accelerating human scientific discovery through AI
- Build software and tools with extensive metrics coverage for some of the world’s largest GPU supercomputing network fabrics, which are used for AI training and for serving customer inference queries
Other
- All employees are expected to be hands-on and to contribute directly to the company’s mission.
- Leadership is given to those who show initiative and consistently deliver excellence.
- A strong work ethic and strong prioritization skills are important.
- All engineers are expected to have strong communication skills and to share knowledge with their teammates concisely and accurately.