Senior Software Engineer - AI Research Clusters

NVIDIA

$184,000 - $356,500
Dec 2, 2025
Santa Clara, CA, US

NVIDIA is looking to accelerate the next era of machine learning innovation by ensuring delivery of functional, reliable, secure, and performance-optimal GPU clusters to internal researchers.

Requirements

  • Experience in software development lifecycle on Linux-based platforms
  • Strong coding skills in languages such as Python, C++ or Rust
  • Experience with Docker, Kubernetes, GitLab CI, automated deployments
  • Experience applying AIOps or Agentic AI successfully in production environments
  • Experience running Slurm or custom scheduling frameworks in production ML environments
  • Familiarity with GPU computing, Linux systems internals, and performance tuning at scale
  • Proficiency with full-stack development: relational data modeling, DB optimization, REST API semantics, JavaScript, CSS, and providing APIs as a service

Responsibilities

  • Propose and implement engineering solutions to ensure delivery of functional, reliable, secure, and performance-optimal GPU clusters to internal researchers
  • Design, develop, and maintain engineering solutions to solve the pain points of validating, monitoring, and operating GPU clusters at scale
  • Research traditional AIOps and emerging Agentic AI, and leverage both to further reduce operational toil
  • Participate in on-call support for the systems and platforms built and owned by the team
  • Work with coworkers across the AI Platform organization to understand the pain points of validating, monitoring, and operating GPU clusters at scale
  • Empower scientists and engineers to train, fine-tune, and deploy the most advanced ML models on some of the world’s most powerful GPU systems
  • Enable internal researchers to focus on training and development by reducing operational disruption and overhead

Other

  • BS/MS in Computer Science, Engineering, or equivalent experience
  • 8+ years in software/platform engineering, including 3+ years in ML infrastructure or distributed systems
  • Passion for building developer-centric platforms with great UX and strong operational reliability
  • Ability to work in a diverse environment
  • Commitment to fostering a diverse work environment