Software Engineer, AI Resiliency - New College Grad 2025

$104,000 - $189,750

May 9, 2025

Santa Clara, CA, US

Develop AI software resiliency for the most powerful AI supercomputers in the world, driving cluster downtime toward zero.

Proficiency in C++ and Python
Strong understanding of distributed systems concepts, parallel programming, and fault tolerance in large-scale computing environments
Familiarity with AI frameworks such as PyTorch, JAX/XLA, TensorFlow, or similar
Experience with debugging and profiling tools (e.g., gdb, perf, valgrind, NVIDIA Nsight)
Hands-on experience with CUDA, NCCL, or MPI for GPU-accelerated computing
Knowledge of checkpointing strategies, error mitigation, or fault-tolerant computing in AI training
Experience working with large-scale AI clusters, HPC environments, or cloud-based AI workloads

Pursuing or recently completed a Bachelor’s, Master’s, or PhD in Computer Science, Electrical Engineering, or a related field, or equivalent experience
Excellent problem-solving skills and ability to work in a fast-paced, highly collaborative environment