NVIDIA is looking to scale up its AI Infrastructure by hiring experienced software engineers with Kubernetes experience to help build and deploy leading infrastructure solutions for AI-based applications.
Requirements
- Significant software engineering experience with Kubernetes, including cluster operations, operator development, node health monitoring, and GPU resource scheduling.
- Software development experience with Kubernetes APIs and frameworks, not just experience operating a cluster.
- Proficiency in a systems programming language (Go, Python) and a solid understanding of data structures and algorithms.
- Technical competency in managing and automating large-scale distributed systems independent of cloud providers.
- Advanced hands-on experience with and deep understanding of cluster management systems (Kubernetes, Slurm, Bright Cluster Manager).
- Proven operational excellence in maintaining reliable and performant AI infrastructure.
Responsibilities
- Working on custom software for scheduling GPU resources on Kubernetes.
- Implementing monitoring and health management capabilities that enable industry-leading reliability, availability, and scalability of GPU assets.
- Harnessing multiple data streams, ranging from GPU hardware diagnostics to cluster and network telemetry.
- Working with teams across NVIDIA to ensure production AI clusters run reliably and consistently with maximum performance.
- Evaluating system failures and improving services based on a well-defined incident management process.
Other
- You are highly motivated with strong communication skills, can work successfully with multi-functional teams, principals, and architects, and can coordinate effectively across organizational boundaries and geographies.
- 5+ years in a similar role and experience with large-scale production systems.
- You possess a BS in Computer Science, Engineering, Physics, Mathematics, or a comparable degree, or equivalent experience.
- Creative and autonomous.
- Applications for this job will be accepted at least until August 25, 2025.