OCI is driving the development of next-generation hyperscale GPU data centers built on Nvidia and AMD GPUs. OCI powers popular AI services such as OpenAI on its GPU compute servers.
Requirements
- GPU device drivers and the runtime libraries (CUDA and ROCm)
- GPU architectural concepts such as UVM and host-to-device/device-to-host interactions, including the ability to quantify performance issues in those interactions
- C programming
- Python or other scripting language used in AI/GPU environments
- Nvidia and AMD GPU architectures and their driver stacks
- The entire boot process, including touch points with the BIOS and BMC subsystems
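The "quantify performance issues" requirement above often reduces to converting raw transfer timings into effective bandwidth and comparing against the link's practical peak. A minimal sketch of that arithmetic (plain Python; the example numbers are hypothetical — real timings would come from tools such as CUDA event timers or profilers):

```python
def effective_bandwidth_gib_s(num_bytes: int, elapsed_ms: float) -> float:
    """Convert a transfer size and elapsed time into effective bandwidth (GiB/s)."""
    return (num_bytes / (1 << 30)) / (elapsed_ms / 1000.0)

# Hypothetical example: a 1 GiB host-to-device copy measured at 80 ms
# yields 12.5 GiB/s -- well below a PCIe Gen4 x16 link's practical peak,
# which would flag something like pageable (unpinned) host memory to investigate.
print(round(effective_bandwidth_gib_s(1 << 30, 80.0), 2))
```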
Responsibilities
- Build and debug issues in the GPU drivers and the Linux kernel components that interact with the GPU stack, including functional and performance issues seen when running GPU AI/ML/inference workloads
- Use standard performance and stress tools such as DCGM and the NCCL and RCCL test suites
- Debug and diagnose system issues reported via RAS events surfaced by the GPU BMC and other monitoring agents
- Debug issues seen during new product bring-up and in data centers running customer workloads, and drive those issues with GPU vendors to resolution
- Take vendor SW drops and build customized drivers against Oracle Linux and Ubuntu distributions; unit-test functionality and run GPU workloads to validate performance using standard benchmarks
- Engage with cross-functional teams such as the HW and FW teams to debug issues
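As a sketch of what the tooling bullets above look like in day-to-day use (the commands are the standard DCGM and nccl-tests/rccl-tests entry points; the specific flag values are illustrative, not a prescribed procedure):

```shell
# Run DCGM's built-in diagnostics at the "medium" run level on all GPUs
dcgmi diag -r 2

# Exercise collective bandwidth with the nccl-tests suite:
# sweep all-reduce message sizes from 8 B to 128 MiB, doubling each step,
# across 8 GPUs on this node (rccl-tests uses the same flags on AMD)
./build/all_reduce_perf -b 8 -e 128M -f 2 -g 8
```

Bus bandwidth numbers well below the interconnect's expected peak, or diag failures, are the usual starting point for the driver- and firmware-level debugging described above.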
Other
- Participate in a periodic on-call rotation to handle OCI data center escalations
- Strong technical and communication skills