DoorDash is looking to enhance its Machine Learning Platform to support high-volume, GPU-accelerated training in a fast-evolving environment.
Requirements
Hands-On ML Platform/Infra Experience – You’re familiar with modern machine learning stacks (e.g., PyTorch, LightGBM, TensorFlow) and have built or maintained large-scale training environments.
Strong CS Fundamentals – You excel at crafting solutions that handle scale, complexity, and reliability challenges.
GPU Acceleration – Experience with GPU-enabled training and its associated performance optimizations.
MLOps Tooling – Familiarity with orchestration and tracking frameworks such as Metaflow, MLflow, Dagster, or Airflow.
Large-Scale Data Processing – Knowledge of Spark, Hadoop, or other distributed data processing technologies.
Monitoring & Observability – Proficiency with metrics and alerting solutions (e.g., Prometheus, Grafana).
Cloud Platforms – Experience with AWS or GCP for scalable compute, container orchestration, and cost management.
Responsibilities
Drive Key Training Initiatives – Own and deliver significant sub-projects that enhance our platform’s performance, reliability, and ease of use.
Architect & Implement Scalable Solutions – Design resilient pipelines for distributed model training (e.g., PyTorch, LightGBM) on Kubernetes, optimizing for both short-term speed and long-term maintainability.
Set a High Bar for Quality & Reliability – Lead by example with clean, high-performance code, thorough design reviews, and a focus on observability, incident mitigation, and continuous improvement.
Mentor & Influence – Help level up peers by sharing knowledge, driving best practices, and contributing to a supportive team culture that values empathy and technical excellence.
Collaborate with Cross-Functional Teams – Work with ML engineers, data scientists, and product stakeholders to refine requirements, set realistic milestones, and ensure smooth delivery.
Other
6+ years of industry experience in software engineering, with a deep understanding of distributed systems and data-intensive ML pipelines in production.
Proven Project Ownership – You can break down complex initiatives, estimate accurately, and deliver major projects with minimal oversight.
Collaboration & Communication – You’re adept at partnering across functions, setting expectations, and ensuring alignment among diverse stakeholders.
Thrive on Continuous Improvement – You proactively identify gaps, reduce technical debt, and optimize resource usage, balancing cost and performance.
Benefits include a 401(k) plan with employer matching, 16 weeks of paid parental leave, wellness benefits, a commuter benefits match, paid time off, and paid sick leave in compliance with applicable laws.