ByteDance is looking to build and maintain a robust, scalable infrastructure that powers its artificial intelligence (AI) and machine learning (ML) initiatives and enables the next generation of AI-driven products and services.
Requirements
- Understanding of infrastructure or systems engineering, particularly as applied to ML/AI infrastructure.
- Strong programming skills in Python, C++, Go, or Rust for systems development and automation.
- Expertise in at least one of the following fields, to define and design the next-generation AI infrastructure: Infrastructure Design & Architecture, Performance Optimization, Distributed Systems & Scalability, Data Pipeline & Workflow Engineering.
- Experience with Kubernetes, VM frameworks, or unikernels.
- Familiarity with ML compilers, GPU/TPU scheduling, NCCL/RDMA networking, data preprocessing, and training/inference frameworks.
- Experience with Spark/Beam/Dask/Flume for ETL and data ingestion.
- Experience with Airflow, Kubeflow, or Metaflow for experiment management and workflow orchestration.
Responsibilities
- Lead end-to-end design of scalable, reliable AI infrastructure (AI accelerators, compute clusters, storage, networking) for training and serving large ML workloads.
- Define and implement service-oriented, containerized architectures (Kubernetes, VM frameworks, unikernels) optimized for ML performance and security.
- Profile and optimize every layer of the ML stack: ML compilers, GPU/TPU scheduling, NCCL/RDMA networking, data preprocessing, and training/inference frameworks.
- Develop low-overhead telemetry and benchmarking frameworks to identify and eliminate bottlenecks in distributed training and serving.
- Build and operate large-scale deployment and orchestration systems that auto-scale across multiple data centers (on-premises and cloud).
- Champion fault-tolerance, high availability, and cost-efficiency through smart resource management and workload placement.
- Architect and implement robust ETL and data ingestion pipelines (Spark/Beam/Dask/Flume) tailored for petabyte-scale ML datasets.
Other
- Expected graduation in 2026 with a PhD in Computer Science, Engineering, or a related technical field.
- Excellent communicator able to bridge research and production teams.
- Strong problem-solving aptitude and a drive to push the state of the art in ML infrastructure.
- State your availability (start date and end date) clearly in your resume.