Bloomberg is seeking an engineer to join their hardware management team to manage and support thousands of servers, including the entire AI stack, ensuring peak performance and reliability of HPC/AI clusters.
Requirements
- 4+ years of experience with Kubernetes environments (deployments, storage, services, jobs, ingress, egress, etc.)
- Hands-on management of GPU-based systems, including kernel and driver management, and developing software tooling to automate provisioning and maintenance of these systems
- Design, implement, and maintain system software that enables communication between GPUs, CPUs, and storage in scale-out AI and HPC systems
- Oversee the ongoing monitoring, support, and maintenance of our HPC/AI clusters, ensuring peak performance and reliability
- Drive system upgrades, customization, and seamless integration with software developers, network operations, and data center teams
- Manage and maintain a diverse range of computer systems and application software, ensuring they meet the highest standards of functionality and efficiency
- Develop and maintain expertise in low-latency/high-bandwidth, interconnected infrastructure (including InfiniBand, Ethernet, RDMA/RoCE, and others)
Responsibilities
- Design, build, and maintain highly reliable, scalable, and efficient infrastructure platforms that support our engineering teams and business needs
- Participate in system design discussions and contribute to architectural decisions
- Ensure code quality through standard methodologies, code reviews, and alignment to clean code principles
- Be able to produce clear and consumable documentation for a wide audience
Other
- Communicate effectively across diverse teams
- Be willing to participate in on-call rotations as arranged
- Be a self-starter, manage priorities, and work independently
- Stay up to date with the latest infrastructure technologies and industry-standard processes, and evaluate their potential impact on existing and future solutions
- Hold yourself to high standards