NVIDIA DGX Cloud engineering has a mission to ensure our customers receive timely and quality-assured releases. We are seeking a Performance Engineer proficient in performance and scalability testing who can identify limitations across the Kubernetes (K8s) and application stack using industry-standard tools and telemetry.
Requirements
- 5+ years in software engineering with a strong track record in performance or scalability of high-scale distributed systems
- Deep familiarity with performance profiling tools and tracing systems
- Ability to identify performance issues, root-cause problems, and propose solutions
- Experience optimizing performance across one or more layers of the stack (e.g., database, networking, storage, application runtime, GC tuning, Golang internals, GPU utilization)
- Experience contributing to observability, benchmarking, or performance-focused infrastructure at scale
- Strong understanding of OS internals, scheduling, memory management, and IO patterns
- Proficiency with container-based infrastructure (Docker, Kubernetes, Helm)
Responsibilities
- Analyze and optimize performance across application, middleware, runtime, and infrastructure layers—networking, storage, GPU utilization, and beyond
- Develop tooling and metrics that provide deep observability into system performance
- Collaborate closely with infra, platform, runtime, and product teams to identify key performance goals and drive systemic improvements
- Lead investigations into high-impact performance regressions or scalability issues in production
- Influence architecture and design decisions to prioritize latency, throughput, and efficiency at scale
- Drive performance testing strategies and help define SLAs/SLOs around latency and throughput for critical systems
Other
- If you excel in problem-solving, can think creatively on your feet, and enjoy working in a distributed team setting, we would love to have you join us!
- Demonstrated success navigating ambiguity and aligning stakeholders around performance goals
- Demonstrated ability to handle sophisticated technical environments while meeting or exceeding all security, reliability, scalability, and availability metrics
- Strong, proven knowledge of modern architectures at scale
- If you're a creative and autonomous engineer with a real passion for technology, we want to hear from you.