DigitalOcean is looking to revolutionize cloud computing by enhancing its Bare Metal GPU product. The company aims to provide security and operational best practices for its infrastructure servers, develop self-service capabilities for customers through reliable API offerings, and improve the overall product experience by addressing hardware and software performance issues.
Requirements
- Proven ability to orchestrate bare metal Linux systems at scale, including building automation for firmware updates, BIOS configuration management, and PXE environment configuration.
- Deep Linux systems experience, including low-level troubleshooting, developing and applying configuration management, security best practices, and monitoring and alerting.
- Strong automation mindset. Expert knowledge of one or more orchestration tools such as MAAS, Salt, Chef, Ansible, or Puppet.
- Hands-on experience with NVIDIA or AMD High Performance Computing (HPC) clustered environments.
- Experience performing automated, wide-scale testing with NCCL or other frameworks.
- Network engineering experience with VyOS platforms.
Responsibilities
- Contribute to a rapidly growing Bare Metal GPU product within DigitalOcean by applying security and operational best practices to a fleet of infrastructure servers across multiple regions.
- Help design and implement further self-service capabilities for our customers by providing reliable and predictable API capabilities for upstack service teams.
- Engage in support escalations when necessary. Capture trends and lead internal projects to improve the overall product experience.
- Continuously test our hardware platforms to identify performance regressions related to firmware, software or hardware issues.
Other
- Strong communication skills. Your job will involve writing detailed documentation for others to pick up or leading knowledge sharing sessions with operations teams.
- This is a remote role.