Design, build, deploy, and maintain robust, scalable infrastructure that powers cutting-edge artificial intelligence (AI) and machine learning (ML) initiatives.
Requirements
- Strong programming skills in Python, C++, Go, or Rust for systems development and automation.
- Ability to design end-to-end systems that balance performance, reliability, security, and cost.
- Hands-on experience with ML training frameworks (PyTorch, TensorFlow, JAX) at scale.
- Knowledge of hardware-level optimization: CUDA, ROCm, kernel bypass, FPGA/ASIC integration.
- Experience with heterogeneous computing for AI, big data, and HPC workloads.
- Open-source contributions or patents in the ML systems space.
Responsibilities
- Lead end-to-end design of scalable, reliable AI infrastructure (AI accelerators, compute clusters, storage, networking) for training and serving large ML workloads.
- Define and implement service-oriented, containerized architectures (Kubernetes, VM frameworks, unikernels) optimized for ML performance and security.
- Profile and optimize every layer of the ML stack: ML compilers, GPU/TPU scheduling, NCCL/RDMA networking, data preprocessing, and training/inference frameworks.
- Develop low-overhead telemetry and benchmarking frameworks to identify and eliminate bottlenecks in distributed training and serving.
- Build and operate large-scale deployment and orchestration systems that auto-scale across multiple data centers (on-premises and cloud).
- Architect and implement robust ETL and data ingestion pipelines (Spark/Beam/Dask/Flume) tailored for petabyte-scale ML datasets.
- Integrate experiment management and workflow orchestration tools (Airflow, Kubeflow, Metaflow) to streamline the path from research to production.
Other
- Master's degree (PhD preferred) in Computer Science, Engineering, or a related technical field.
- 5+ years in infrastructure or systems engineering roles, with at least 2 years focused on ML/AI infrastructure.
- Excellent communicator able to bridge research and production teams.
- Strong problem-solving aptitude and a drive to push the state of the art in ML infrastructure.
- Publications in top-tier ML or systems conferences such as MLSys, ICML, ICLR, KDD, or NeurIPS (preferred).