DigitalOcean is looking to simplify LLM hosting, serving, and optimization for millions of users by building a new product that brings its famed DigitalOcean Simplicity to the world of LLM inference services.
Requirements
- 10+ years of experience in software engineering, which should include 2+ years building AI/ML technologies (ideally related to LLM hosting and inference).
- Enduring interest in distributed systems design, AI/ML, and implementation at scale in the cloud.
- Deep expertise in cloud computing platforms and modern AI/ML technologies.
- Experience with modern LLMs, ideally related to hosting, serving, and optimizing such models.
- Experience with one or more inference engines, such as vLLM, SGLang, or Modular Max, would be a bonus.
- Experience researching, evaluating, and building with open source technologies.
- Proficiency in programming languages commonly used in cloud development, such as Python and Go.
Responsibilities
- Design and implement an inference platform for serving large language models optimized for the various GPU platforms they will be run on.
- Develop and shepherd complex AI and cloud engineering projects through the entire product development lifecycle (PDLC): ideation, product definition, experimentation, prototyping, development, testing, release, and operations.
- Optimize runtime and infrastructure layers of the inference stack for best model performance.
- Build native cross-platform inference support across NVIDIA and AMD GPUs for a variety of model architectures.
- Contribute to open source inference engines to make them perform better on the DigitalOcean cloud.
- Build tooling and observability to monitor system health, and build auto-tuning capabilities.
- Build benchmarking frameworks that measure model-serving performance and guide system and infrastructure tuning efforts.
Other
- A strong sense of ownership and a drive to figure out and resolve any issues preventing you and your team from delivering value to your customers.
- An appreciation for process and for developing cross-disciplinary collaboration between engineering, operations, support, and product groups.
- Familiarity with end-to-end quality best practices and their implementation.
- Experience coordinating with partner teams across time zones and geographies.
- Experience with infrastructure as code (IaC) tools like Terraform or Ansible.