Red Hat is advancing its open-source AI initiatives by building scalable inference infrastructure and optimizing large language model (LLM) deployments, shaping how enterprises deploy AI worldwide.
Requirements
- Proficiency in Python and systems programming languages such as Go, Rust, or C++
- Deep understanding of computer architecture, parallel processing, and distributed computing concepts
- Experience with the Kubernetes ecosystem, including custom APIs, operators, and the Gateway API Inference Extension
- Knowledge of cloud-native technologies like Istio, Cilium, Envoy, and CNI
- Experience with tensor math libraries such as PyTorch
- Familiarity with high-performance networking technologies such as RDMA, RoCE, InfiniBand, and UCX
- Expertise in GPU performance tuning and kernel optimization for deep neural networks
Responsibilities
- Develop and maintain distributed inference infrastructure leveraging Kubernetes custom APIs, operators, and the Gateway API Inference Extension for scalable LLM deployment
- Design and implement systems components in Go and/or Rust that integrate with vLLM and manage inference workloads (see the first sketch after this list)
- Create KV-cache aware routing and scoring algorithms to optimize memory utilization and request distribution across large-scale deployments (see the second sketch after this list)
- Enhance resource utilization, fault tolerance, and stability of the inference stack
- Contribute to the design, development, and testing of inference optimization algorithms
- Conduct code reviews with a focus on quality, performance, and maintainability
- Mentor and guide fellow engineers, fostering a culture of continuous learning and innovation
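
To give a concrete sense of the first kind of component, here is a minimal sketch in Go of a client that probes a vLLM replica's /health endpoint before routing traffic to it and then issues a request against its OpenAI-compatible /v1/completions API. The endpoint URL and model name are assumptions for illustration only; a production component would add retries, streaming, and metrics.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"time"
)

// completionRequest mirrors the minimal fields of an OpenAI-style
// completion request accepted by a vLLM server.
type completionRequest struct {
	Model     string `json:"model"`
	Prompt    string `json:"prompt"`
	MaxTokens int    `json:"max_tokens"`
}

// completionResponse captures only the fields this sketch reads back.
type completionResponse struct {
	Choices []struct {
		Text string `json:"text"`
	} `json:"choices"`
}

func main() {
	base := "http://localhost:8000" // assumed vLLM endpoint; adjust per deployment
	client := &http.Client{Timeout: 30 * time.Second}

	// Probe readiness before sending work to this replica.
	resp, err := client.Get(base + "/health")
	if err != nil {
		fmt.Println("health check failed:", err)
		return
	}
	resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		fmt.Println("replica not ready, status:", resp.StatusCode)
		return
	}

	body, _ := json.Marshal(completionRequest{
		Model:     "meta-llama/Llama-3.1-8B-Instruct", // hypothetical model name
		Prompt:    "Say hello in one short sentence.",
		MaxTokens: 32,
	})
	resp, err = client.Post(base+"/v1/completions", "application/json", bytes.NewReader(body))
	if err != nil {
		fmt.Println("request failed:", err)
		return
	}
	defer resp.Body.Close()

	var out completionResponse
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		fmt.Println("decode failed:", err)
		return
	}
	if len(out.Choices) > 0 {
		fmt.Println(out.Choices[0].Text)
	}
}
```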
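
The second sketch illustrates one possible KV-cache aware scoring heuristic: replicas are ranked by how much of an incoming prompt's prefix they already hold in KV cache and by their remaining cache capacity. The Replica type, the block-hash representation, and the 0.7/0.3 weights are hypothetical; a real implementation would plug into the endpoint picker of the Gateway API Inference Extension and consume actual vLLM cache telemetry.

```go
package main

import (
	"fmt"
	"sort"
)

// Replica holds hypothetical per-endpoint state a router might track:
// which prompt-prefix blocks are resident in the replica's KV cache,
// and how much KV-cache capacity is still free (0.0 to 1.0).
type Replica struct {
	Name         string
	CachedBlocks map[uint64]bool // hashes of prefix blocks held in KV cache
	FreeKVCache  float64         // fraction of KV-cache memory still free
}

// score favors replicas that already hold more of the request's prefix
// (less recomputation) while penalizing replicas with little free
// KV-cache memory. The 0.7/0.3 weights are illustrative, not tuned.
func score(r Replica, promptBlocks []uint64) float64 {
	if len(promptBlocks) == 0 {
		return 0.3 * r.FreeKVCache
	}
	hits := 0
	for _, b := range promptBlocks {
		if r.CachedBlocks[b] {
			hits++
		}
	}
	overlap := float64(hits) / float64(len(promptBlocks))
	return 0.7*overlap + 0.3*r.FreeKVCache
}

// pick returns the replicas sorted best-first for the given prompt blocks.
func pick(replicas []Replica, promptBlocks []uint64) []Replica {
	ranked := append([]Replica(nil), replicas...)
	sort.Slice(ranked, func(i, j int) bool {
		return score(ranked[i], promptBlocks) > score(ranked[j], promptBlocks)
	})
	return ranked
}

func main() {
	replicas := []Replica{
		{Name: "pod-a", CachedBlocks: map[uint64]bool{1: true, 2: true}, FreeKVCache: 0.2},
		{Name: "pod-b", CachedBlocks: map[uint64]bool{1: true}, FreeKVCache: 0.8},
	}
	// Block hashes for the incoming prompt's prefix (hypothetical values).
	prompt := []uint64{1, 2, 3}
	for _, r := range pick(replicas, prompt) {
		fmt.Printf("%s: score %.2f\n", r.Name, score(r, prompt))
	}
}
```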
Other
- Strong communication skills, with the ability to collaborate across technical and non-technical teams
- BS or MS in computer science, computer engineering, or a related field; a PhD in an ML-related domain is a plus