NVIDIA’s Infrastructure organization is seeking an engineer to architect and implement distributed observability systems for data centers that enable EDA workflows.
Requirements
- Experience developing large-scale distributed observability systems
- Experience with observability and data platforms such as Apache Spark, Elastic/OpenSearch, Grafana, Prometheus, and similar open-source tools
- Python programming experience, including working with APIs
- Ability to collaborate with data scientists, researchers, and engineering teams to identify high-value data for collection and analysis
- Experience with turning raw data into actionable reports
- Experience with infrastructure software, production application development, software release and support methodologies, and DevOps
Responsibilities
- Collaborate with HW and SW engineering teams to deliver observability solutions that meet their needs in EDA clusters
- Develop, test, and deploy data collectors, pipelines, and visualization and retrieval services
- Define data collection and retention policies to balance network bandwidth, system load, and storage capacity costs with data analysis requirements
- Work in a diverse team to provide operational and strategic data that empowers our engineers and researchers to improve performance, productivity, and efficiency
- Continuously improve quality, workloads, and processes through better observability
Other
- MS (preferred) or BS in Computer Science, Electrical Engineering, or a related field, or equivalent experience
- 8+ years of proven experience
- Excellent planning and interpersonal skills
- Flexibility and adaptability in a dynamic environment with changing requirements
- Passion for improving the productivity of others