NVIDIA is looking to build the machine learning brain that keeps its global DGX Cloud healthy, efficient and ready for the next wave of AI breakthroughs by turning billions of telemetry signals into predictive insight.
Requirements
- 8+ years of experience applying Machine Learning to operational systems.
- Proven track record of building and deploying Machine Learning models in production environments.
- Experience with time series analysis and optimization algorithms.
- Familiarity with distributed systems and cloud platforms such as AWS and Kubernetes.
- Strong software engineering skills and proficiency in Python.
- Experience with machine learning frameworks such as TensorFlow, PyTorch, or similar.
- Experience solving capacity planning problems.
Responsibilities
- Pioneer and develop innovative machine learning algorithms and models that propel our AI products.
- Build production models for anomaly detection, predictive maintenance and usage optimization.
- Develop tools that surface real-time telemetry, efficiency metrics and long-term trends.
- Develop forecasting and simulation models for global-scale planning.
- Analyze complex datasets to determine the best approach for model training and optimization.
- Translate findings into clear engineering actions with infrastructure, operations and product teams.
- Participate in cross-functional projects to integrate machine learning capabilities into various NVIDIA products.
Other
- Master's degree or PhD in Mathematics, Statistics, Machine Learning or related quantitative field (or equivalent experience).
- Effective verbal and written communication and technical presentation skills.
- A track record of delivering high-impact projects in a fast-paced environment.
- Deep understanding of GPU performance metrics.
- Familiarity with Prometheus and PromQL.