
Research Scientist Intern - AI Infrastructure

ByteDance

Salary not specified
Sep 27, 2025
San Jose, CA, USA

ByteDance is building and maintaining robust, scalable infrastructure to power its cutting-edge artificial intelligence (AI) and machine learning (ML) initiatives, enabling the next generation of AI-driven products and services.

Requirements

  • Background in infrastructure- or systems-engineering-focused roles, with exposure to ML/AI infrastructure.
  • Strong programming skills in Python, C++, Go, or Rust for systems development and automation.
  • Expertise in at least one of the following areas to define and design next-generation AI infrastructure: Infrastructure Design & Architecture, Performance Optimization, Distributed Systems & Scalability, Data Pipeline & Workflow Engineering.
  • Experience with service-oriented, containerized architectures (Kubernetes, VM frameworks, unikernels).
  • Experience with ML Compiler, GPU/TPU scheduling, NCCL/RDMA networking, data preprocessing, and training/inference frameworks.
  • Experience with large-scale deployment and orchestration systems.
  • Experience with ETL and data ingestion pipelines (Spark/Beam/Dask/Flume).

Responsibilities

  • Lead end-to-end design of scalable, reliable AI infrastructure (AI accelerators, compute clusters, storage, networking) for training and serving large ML workloads.
  • Define and implement service-oriented, containerized architectures (Kubernetes, VM frameworks, unikernels) optimized for ML performance and security.
  • Profile and optimize every layer of the ML stack—ML Compiler, GPU/TPU scheduling, NCCL/RDMA networking, data preprocessing, and training/inference frameworks.
  • Develop low-overhead telemetry and benchmarking frameworks to identify and eliminate bottlenecks in distributed training and serving.
  • Build and operate large-scale deployment and orchestration systems that auto-scale across multiple data centers (on-premises and cloud).
  • Champion fault-tolerance, high availability, and cost-efficiency through smart resource management and workload placement.
  • Architect and implement robust ETL and data ingestion pipelines (Spark/Beam/Dask/Flume) tailored for petabyte-scale ML datasets.

Other

  • Graduating in 2026 with a PhD in Computer Science, Engineering, or a related technical field.
  • Excellent communicator able to bridge research and production teams.
  • Strong problem-solving aptitude and a drive to push the state of the art in ML infrastructure.
  • State your availability clearly in your resume (start date and end date).
  • Applications are reviewed on a rolling basis; we encourage you to apply early.