Software Engineer, AI Resiliency - New College Grad 2025

NVIDIA

$104,000 - $189,750
May 9, 2025
Santa Clara, CA, US

Develop AI software resiliency for the most powerful AI supercomputers in the world, driving cluster downtime toward zero.

Requirements

  • Proficiency in C++ and Python
  • Strong understanding of distributed systems concepts, parallel programming, and fault tolerance in large-scale computing environments
  • Familiarity with AI frameworks such as PyTorch, JAX/XLA, TensorFlow, or similar
  • Experience with debugging and profiling tools (e.g., gdb, perf, valgrind, NVIDIA Nsight)
  • Hands-on experience with CUDA, NCCL, or MPI for GPU-accelerated computing
  • Knowledge of checkpointing strategies, error mitigation, or fault-tolerant computing in AI training
  • Experience working with large-scale AI clusters, HPC environments, or cloud-based AI workloads

Responsibilities

  • Develop AI Software Resiliency Features
  • Hands-On Coding & Optimization
  • Fault Tolerance & Debugging
  • Collaborate Across Teams
  • Testing & Automation
  • Support Production Deployments

Other

  • Pursuing or recently completed a Bachelor’s, Master’s, or PhD in Computer Science, Electrical Engineering, or a related field, or equivalent experience
  • Excellent problem-solving skills and ability to work in a fast-paced, highly collaborative environment