Together AI is building the AI Acceleration Cloud, an end-to-end platform for the full generative AI lifecycle, and is looking for a Senior AI Infrastructure Engineer to play a key role in developing this next-generation AI cloud platform.
Requirements
- Proficiency in at least one backend programming language (Golang desired)
- Experience building and operating high-performance and/or globally distributed microservice architectures across one or more cloud providers (AWS, Azure, GCP)
- Deep experience with Kubernetes internals, such as implementing non-trivial Kubernetes operators, device/storage/network plugins, or custom schedulers, or contributing patches to these components or to Kubernetes itself
- Deep experience with VMs/hypervisors, such as QEMU/KVM, cloud-hypervisor, VFIO, virtio, PCIe passthrough, KubeVirt, SR-IOV
- Deep experience with data center networking technologies and solutions, such as VLAN, VXLAN, VPN, VPC, OVS/OVN
- Experience with Cluster API or similar
- Experience working on high-performance compute, networking, and/or storage
Responsibilities
- Design, build, and maintain performant, secure, and highly available backend services/operators that run in our data centers and automate hardware management
- Design and build out the IaaS software layer for a new GB200 data center with thousands of GPUs
- Work on a global multi-exabyte high-performance object store, serving massive datasets for pretraining
- Build advanced observability stacks for our customers with automated node lifecycle management for fault-tolerant distributed pretraining
- Perform architecture and research work for decentralized AI workloads
- Work on the core, open-source Together AI platform
- Create services, tools, and developer documentation
Other
- Excellent communication skills – able to write clear design docs and work effectively with both technical and non-technical team members
- 5+ years of professional software development experience
- Strong fundamental software development skills
- Strong systems knowledge and troubleshooting abilities
- Ability to work effectively in a remote work environment