DoorDash is building a reliable on-demand logistics engine, and this role drives the next generation of its inference platform to power real-time predictions across millions of requests per second.
Requirements
- 8+ years of engineering experience, including building or operating large-scale, high-QPS ML serving systems.
- Deep familiarity with ML inference and serving ecosystems.
- Knowledge of how to leverage and extend open-source frameworks and evaluate vendor solutions pragmatically.
- GPU serving expertise: experience with frameworks such as NVIDIA Triton, TensorRT-LLM, ONNX Runtime, or vLLM, including hands-on use of KV caching, batching, and memory-efficient inference (see the sketch after this list).
- Familiarity with deep learning frameworks (PyTorch, TensorFlow) and large language models (LLMs) such as GPT-OSS or BERT.
- Hands-on experience with Kubernetes/EKS, microservice architectures, and large-scale orchestration for inference workloads.
- Cloud experience (AWS, GCP, Azure) with a focus on scaling strategies, observability, and cost optimization.
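For context on the hands-on serving work these requirements describe, here is a minimal sketch of batched LLM inference with vLLM, which handles continuous batching and paged KV caching internally. The prompts, model name, and sampling settings are illustrative assumptions, not a description of DoorDash's actual stack.

```python
# Minimal sketch: batched generation with vLLM (illustrative model and prompts).
# vLLM applies continuous batching and paged KV caching internally, so the
# caller only submits a list of prompts and sampling settings.
from vllm import LLM, SamplingParams

prompts = [
    "Estimate the delivery ETA for this order context.",
    "Summarize recent reviews for this merchant.",
]
sampling_params = SamplingParams(temperature=0.0, max_tokens=64)

llm = LLM(model="facebook/opt-125m")  # small illustrative model, not a production choice

for output in llm.generate(prompts, sampling_params):
    print(output.outputs[0].text)
```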
Responsibilities
- Scale richer models at low latency by designing serving systems that handle large, complex models while balancing cost, throughput, and strict latency SLOs.
- Bring modern inference optimizations into production by operationalizing advances from the ML serving ecosystem to deliver better user experience, latency, and cost efficiency across the fleet.
- Enable platform-wide impact by building abstractions and primitives that let serving improvements apply broadly across many workloads, rather than point solutions for individual models.
- Leverage and contribute to OSS by applying the best of the open-source serving ecosystem and vendor solutions, and contributing improvements back where it helps the community.
- Drive cost & reliability by designing autoscaling and scheduling across heterogeneous hardware (GPU/TPU/CPU), with strong isolation, observability, and tail-latency control (see the sketch after this list).
- Collaborate broadly by partnering with ML engineers, infra teams, external vendors, and open-source communities to ensure the serving stack evolves with the needs of the business.
- Raise the engineering bar by establishing metrics & processes that improve developer velocity, system reliability, and long-term maintainability.
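To make the observability and tail-latency points above concrete, below is a minimal sketch of instrumenting an inference path with prometheus_client so p99 latency can be tracked; the metric name, bucket boundaries, and predict stub are assumptions for illustration, not the actual platform.

```python
# Minimal sketch: exposing inference latency as a Prometheus histogram so
# tail latency (e.g., p99) can be queried and alerted on.
import time

from prometheus_client import Histogram, start_http_server

# Metric name and buckets are illustrative assumptions.
INFERENCE_LATENCY = Histogram(
    "inference_latency_seconds",
    "End-to-end model inference latency",
    buckets=(0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0),
)

def predict(features):
    # Stand-in for a real model call; timing covers the full request path.
    with INFERENCE_LATENCY.time():
        time.sleep(0.01)  # placeholder for model execution
        return {"score": 0.5}

if __name__ == "__main__":
    start_http_server(8000)  # exposes /metrics for Prometheus to scrape
    while True:
        predict({"order_id": 123})
```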
Other
- Lead by example, collaborating effectively, mentoring peers, and setting a high bar for craftsmanship.
- Care deeply about reliability, performance, observability, and security in production systems.
- Balance hands-on execution with long-term platform thinking, making sound trade-offs.