NVIDIA is looking to build and scale AI software that powers breakthroughs in drug discovery and biological sciences.
Requirements
- Python and PyTorch expertise.
- CI/CD and automation experience: GitHub Actions, YAML workflows, runners, authentication, caching, artifact stores, and release pipelines.
- Distributed training fundamentals: DDP/FSDP, NCCL, mixed precision, data/pipeline/tensor parallelism.
- MLOps for AI: Linux, bash, containers (Docker/NGC), SLURM and/or Kubernetes.
- Systems intuition for compute efficiency: kernel optimization, IO/data pipelines, and performance tradeoffs.
Responsibilities
- Own testing-at-scale and reliability pipelines: Build hermetic, reproducible test matrices across GPU SKUs, multi-node scale, and scientific parameters relevant to the biology space. Create integration and performance test harnesses for large models.
- Productize AI algorithms: Ship LLMs and geometric deep learning models into production-quality services and SDKs; ensure observability, reproducibility, and model/package versioning.
- Develop and deploy distributed learning systems and tools to synchronize and debug workloads on thousands of GPUs.
- Collaborate across teams: Partner with applied research, AI infrastructure, and full‑stack teams; contribute to and upstream improvements across the open‑source ecosystem.
- Be hands‑on: Dive into whatever is needed—infra, glue code, tests, or docs—to unblock the team and ship.
Other
- 3+ years of relevant experience.
- BS/MS in CS, EE, Math, Physics, or equivalent experience.
- Recognized for ownership and technical leadership, with excellent communication and a bias for action.
- Experience working in mixed applied science and engineering teams, and familiarity with the balance between production-grade rigor and agility needed to keep making progress.
- A natural interest in the biological and physical sciences and a desire to keep learning as you go.