At eBay, we are building the next-generation AI platform to power experiences for millions of users worldwide. Our AI Platform (AIP) provides the scalable, secure, and efficient foundation for deploying and optimizing advanced machine learning and large language model (LLM) workloads at production scale.
Requirements
- Proven expertise with cloud-native technologies (AWS, GCP, Azure) and Kubernetes-based deployments.
- Hands-on experience running ML training and inference with Ray (ray.io), e.g., Ray Train/Tune for distributed training and Ray Serve for production inference, covering autoscaling, fault tolerance, observability, and multi-tenant operations.
- Deep understanding of networking, security, authentication, and identity management in distributed/cloud environments.
- Hands-on experience with observability stacks (Prometheus, Grafana, OpenTelemetry, etc.).
- Strong coding skills in Go and/or Python; familiarity with other systems-level languages is a plus.
- Knowledge of Linux internals, containers, and storage systems.
- Experience optimizing for GPU/accelerator integration (NVIDIA, AMD, TPU, etc.) is highly desirable.
Responsibilities
- Design and scale services to orchestrate AI/ML clusters across cloud and on-prem environments, supporting VM and Kubernetes-based deployments, including Ray (ray.io) clusters for distributed training and online inference.
- Develop and optimize intelligent scheduling and resource management systems for heterogeneous compute clusters (CPU, GPU, accelerators).
- Integrate Ray Train/Tune for large-scale distributed training workflows and Ray Serve for low-latency, autoscaled inference; build platform hooks for observability, canary/A-B rollouts, and fault tolerance.
- Build features to improve reliability, performance, observability, and cost-efficiency of AI workloads at scale.
- Enhance the control plane to support secure multi-tenancy and enterprise-grade governance.
- Implement systems for container management, dependency resolution, and large-scale model distribution.
- Provide production support and work closely with field teams to resolve infrastructure issues.
Other
- 8-10 years of experience building and maintaining infrastructure for highly available, scalable, and performant distributed systems.
- Collaborate with ML researchers, applied scientists, and distributed systems engineers to drive platform innovation.
- Bachelor’s or Master’s degree in Computer Science, Engineering, or related field (or equivalent experience).
- #LI-Hybrid
- The base pay range for this position is expected to be $132,000 - $222,100.