NVIDIA is looking to grow and build teams with the most thoughtful people in the world. The company is seeking a visionary Senior Engineering Manager to lead the Data Center Telemetry team. This role is responsible for driving the architecture, development, and deployment of telemetry solutions at scale for next-generation AI supercomputing platforms.
Requirements
- Strong knowledge of DMTF/PLDM for OOB telemetry collection, time series databases (e.g., InfluxDB, Prometheus) and REST APIs (Redfish).
- Deep understanding of server and firmware architecture and optimization for low-latency APIs.
- Proven track record of delivering scalable server products and telemetry solutions.
- Experience with SCM (Git, Perforce) and project management tools (Jira).
- Hands-on experience with x86/ARM system architecture and coding (C/C++, Python).
- Familiarity with Confidential Compute and notification systems.
- Demonstrated ability to analyze algorithms for time/space complexity and system resource requirements.
Responsibilities
- Own the end-to-end architecture and delivery for telemetry solutions, including fleet health monitoring, fault remediation, and data visualization at scale.
- Own the OOB telemetry solution and validate telemetry data from each underlying device.
- Recruit, develop, and motivate a high-performing engineering team focused on platform telemetry, RAS, and observability.
- Continuously improve software development processes for optimal productivity and quality.
- Work across teams to ensure seamless integration of telemetry solutions with platform firmware, server architecture, and data center management.
- Drive product life cycles with QA teams, ensuring robust testing, productization, and delivery.
Other
- 12+ years of overall relevant experience and 5 years of managing systems/platform software teams, ideally in server RAS, firmware, telemetry, or data center solutions.
- BS, MS, or PhD in EE/CS or related field (or equivalent experience).
- Excellent written and oral communication skills, strong work ethic, and commitment to teamwork.
- You are a self-starter who loves finding creative solutions to complicated problems and is hands-on with coding.
- Experience building and scaling telemetry collection and analysis engines.