Pony.ai is hiring a (Senior) Kubernetes Engineer to design, operate, and optimize Kubernetes clusters across hybrid cloud environments. These clusters support diverse workloads, including large-scale model training and low-latency inference services.
Requirements
- 3+ years of hands-on experience managing Kubernetes clusters in production (EKS/GKE/AKS and/or bare-metal).
- Strong Linux systems background and distributed systems fundamentals (scheduling, reliability, scaling).
- Proven experience with hybrid cloud environments (AWS, GCP, Azure, and on-prem).
- Expertise in containerization (Docker) and Infrastructure-as-Code tools (Terraform, Helm, Ansible, or similar).
- Experience developing and maintaining Kubernetes platform features (operators, CRDs, APIs).
- Solid knowledge of Kubernetes networking (CNI, ingress, service discovery), storage, and compute integrations.
- Strong understanding of security best practices (RBAC, network policies, secrets).
Responsibilities
- Design, operate, and optimize Kubernetes clusters across hybrid cloud environments (public cloud and on-prem datacenter).
- Support diverse workloads including large-scale model training and low-latency inference services.
- Develop, maintain, and extend Kubernetes platform features (operators, CRDs, APIs) to automate and productize internal use cases.
- Own cluster lifecycle management including upgrades, patching, configuration, and governance.
- Define and enforce best practices for service deployments, security policies, and operational guidelines.
- Contribute to observability and SRE practices to ensure reliability at scale (SLOs, incident reviews, metrics-driven improvements).
- Collaborate with storage, compute, and networking teams (CNI, ingress, service discovery) to enhance automation, availability, and performance.
Other
- Bachelor’s degree in Computer Science, Engineering, or a related field, or equivalent experience.
- Effective communication skills and the ability to work cross-functionally in a fast-paced environment.
- Provide technical mentorship, write documentation, and handle on-call support for cluster-related incidents.