NVIDIA is looking to accelerate the next era of machine learning innovation by ensuring delivery of functional, reliable, secure, and performance-optimal GPU clusters to internal researchers.
Requirements
- Experience in software development lifecycle on Linux-based platforms
- Strong coding skills in languages such as Python, C++ or Rust
- Experience with Docker, Kubernetes, GitLab CI, automated deployments
- Experience with AIOps or agentic AI, including applying it successfully in a production environment
- Experience running Slurm or custom scheduling frameworks in production ML environments
- Familiarity with GPU computing, Linux systems internals, and performance tuning at scale
- Proficiency with full-stack development: relational data modeling, database optimization, REST API semantics, JavaScript, CSS, and providing APIs as a service
Responsibilities
- Propose and implement engineering solutions to ensure delivery of functional, reliable, secure, and performance-optimal GPU clusters to internal researchers
- Design, develop, and maintain engineering solutions that address the pain points of validating, monitoring, and operating GPU clusters at scale
- Research traditional AIOps and emerging agentic AI, and apply them to further reduce operational toil
- Participate in on-call support for the systems and platforms built and owned by the team
- Work with coworkers across the AI Platform organization to understand the pain points of validating, monitoring, and operating GPU clusters at scale
- Empower scientists and engineers to train, fine-tune, and deploy the most advanced ML models on some of the world’s most powerful GPU systems
- Enable internal researchers to focus on training and development by reducing operational disruption and overhead
Other
- BS/MS in Computer Science, Engineering, or equivalent experience
- 8+ years of experience in software/platform engineering, including 3+ years in ML infrastructure or distributed systems
- Passion for building developer-centric platforms with great UX and strong operational reliability
- Ability to work in a diverse environment
- Commitment to fostering a diverse work environment