Meta is seeking a Production Systems Engineer to join its Hardware Design and Release to Production - Sustaining (HDRTP) team. The role helps ensure the smooth operation of servers and data centers globally, focusing on hardware efficiency, performance, and reliability for AI platforms and large language models.
Requirements
- Experience with troubleshooting and data tooling, including data analysis, analytical modeling, and visualization
- Knowledge of server architecture and components across Compute/Storage/AI Systems/Networking
- Experience integrating lab tools for automated workflows
- Proficiency in SQL, Python, or C/C++ (data structures, algorithms, and object-oriented programming)
- Experience with Linux systems and server systems management
- Experience with some of the following modules/domains: PCIe, Networking, Flash, Memory, CPU, GPU, DRAM (DDR4/5 or HBM)
Responsibilities
- Drive innovation in hardware efficiency by applying expertise in hardware utilization and performance, and translating insights into actionable strategies for hardware, power, performance and data center optimization
- Contribute to industry-leading research in hardware characterization and fleet/DC efficiency studies across AI platforms, leveraging data-driven and machine learning analytical techniques
- Conduct in-depth, hardware-parameter-based research and comparative analyses using advanced data analytics and machine learning techniques for failure analysis and diagnosis in production
- Interface with internal hardware and software engineers and operations teams to understand system architectures and failure modes
- Proactively design experiments, data analyses, and data visualizations to detect and diagnose hardware health issues, focusing on systemic solutions
- Collaborate on evolving AI platforms, silicon products, and thermal and cooling solutions to support the growth of large language models, with a focus on optimizing performance, scalability, and efficiency
- Develop data frameworks and uncover insights that explain the relationships among hardware, data center parameters, and server failures
Other
- Bachelor's degree in Computer Science, Computer Engineering, relevant technical field, or equivalent practical experience
- 6+ years of hands-on software/firmware/hardware engineering experience building systems/products for the IT industry
- Master’s degree or PhD in Computer Engineering, Electrical Engineering, or related field