As the role of GPUs in HPC, Cloud, and enterprise environments expands, NVIDIA is looking to develop and operate enterprise GPU infrastructure management systems across clouds, supporting NVIDIA products on both bare-metal and virtualized platforms.
Requirements
- Strong Kubernetes and SRE background
- Deep understanding of all aspects of the software development lifecycle and the ability to execute across them
- Experience with OpenAPI and Kubernetes Custom Resource Definitions
- Open-Source contributions to the Cloud-Native community and an understanding of AI and LLM principles
- Strong experience with GitHub/GitLab CI/CD pipelines and application configuration
- Strong knowledge of container technologies, orchestration frameworks and observability systems
- Exposure to GPU programming with CUDA, familiarity with Kubernetes internals, and experience developing Kubernetes operators
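The OpenAPI and CRD requirements above go together in practice: a CustomResourceDefinition declares a new Kubernetes API type whose validation schema is expressed in OpenAPI v3. A minimal sketch, using a hypothetical `GPUPool` resource (the group, kind, and fields are illustrative, not an actual NVIDIA API):

```yaml
# Hypothetical CRD declaring a namespaced GPUPool resource whose
# spec is validated by an embedded OpenAPI v3 schema.
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: gpupools.example.com
spec:
  group: example.com
  names:
    kind: GPUPool
    plural: gpupools
    singular: gpupool
  scope: Namespaced
  versions:
    - name: v1
      served: true
      storage: true
      schema:
        openAPIV3Schema:
          type: object
          properties:
            spec:
              type: object
              properties:
                gpuType:
                  type: string
                replicas:
                  type: integer
```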
Responsibilities
- Operate, design, and build infrastructure management systems, Kubernetes operators, and end-to-end HPC integration solutions that combine GPUs with the rest of the datacenter software management ecosystem
- Enable GPU provisioning and life-cycle management with state-of-the-art Cloud-Native open-source ecosystem solutions, including Kubernetes, Docker, Prometheus, Terraform, and Crossplane
- Develop, maintain and/or operate robust, scalable Go programs in a Kubernetes environment
- Develop the next-generation multi-cloud infrastructure management systems to support GenAI
- Support internal and external users through bug fixes, documentation, and feature improvements
- Maintain high-quality products through robust test coverage and Day 2 capabilities
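Several of the responsibilities above center on Kubernetes operators written in Go. At its core, an operator is a reconcile loop that compares desired state (the spec) against observed state (the status) and computes the action needed to converge them. A minimal, stdlib-only sketch of that pattern, using a hypothetical GPU node pool (the types and function are illustrative, not a real controller-runtime API):

```go
package main

import "fmt"

// PoolSpec is the desired state of a hypothetical GPU node pool.
type PoolSpec struct{ Replicas int }

// PoolStatus is the observed state of that pool.
type PoolStatus struct{ Ready int }

// reconcile returns the action needed to converge observed state
// toward desired state -- the core idea behind an operator's
// Reconcile function.
func reconcile(spec PoolSpec, status PoolStatus) string {
	switch {
	case status.Ready < spec.Replicas:
		return fmt.Sprintf("scale up by %d", spec.Replicas-status.Ready)
	case status.Ready > spec.Replicas:
		return fmt.Sprintf("scale down by %d", status.Ready-spec.Replicas)
	default:
		return "in sync"
	}
}

func main() {
	// Two GPUs ready, four desired: the loop decides to scale up.
	fmt.Println(reconcile(PoolSpec{Replicas: 4}, PoolStatus{Ready: 2}))
}
```

A real operator wraps this decision in a controller that watches the cluster and re-runs reconciliation on every change, but the spec-versus-status comparison above is the essential shape.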
Other
- Business-level English and outstanding written and verbal communication skills
- Strong motivation and commitment to learn new skills
- Ability to manage time in a fast-paced, heavily multitasked environment
- If you're creative and autonomous, we want to hear from you!
- NVIDIA is committed to fostering a diverse work environment and proud to be an equal opportunity employer.