
Research Scientist Intern - AI Infrastructure

ByteDance

Salary not specified
Sep 27, 2025
San Jose, CA, USA

ByteDance is building and maintaining robust, scalable infrastructure to power its cutting-edge artificial intelligence (AI) and machine learning (ML) initiatives, enabling the next generation of AI-driven products and services.

Requirements

  • Background in infrastructure- or systems-engineering-focused roles, with exposure to ML/AI infrastructure.
  • Strong programming skills in Python, C++, Go, or Rust for systems development and automation.
  • Expertise in at least one of the following areas to define and design next-generation AI infrastructure: Infrastructure Design & Architecture, Performance Optimization, Distributed Systems & Scalability, Data Pipeline & Workflow Engineering.
  • Experience with service-oriented, containerized architectures (Kubernetes, VM frameworks, unikernels).
  • Experience with ML Compiler, GPU/TPU scheduling, NCCL/RDMA networking, data preprocessing, and training/inference frameworks.
  • Experience with large-scale deployment and orchestration systems.
  • Experience with ETL and data ingestion pipelines (Spark/Beam/Dask/Flume).

Responsibilities

  • Lead end-to-end design of scalable, reliable AI infrastructure (AI accelerators, compute clusters, storage, networking) for training and serving large ML workloads.
  • Define and implement service-oriented, containerized architectures (Kubernetes, VM frameworks, unikernels) optimized for ML performance and security.
  • Profile and optimize every layer of the ML stack—ML Compiler, GPU/TPU scheduling, NCCL/RDMA networking, data preprocessing, and training/inference frameworks.
  • Develop low-overhead telemetry and benchmarking frameworks to identify and eliminate bottlenecks in distributed training and serving.
  • Build and operate large-scale deployment and orchestration systems that auto-scale across multiple data centers (on-premises and cloud).
  • Champion fault-tolerance, high availability, and cost-efficiency through smart resource management and workload placement.
  • Architect and implement robust ETL and data ingestion pipelines (Spark/Beam/Dask/Flume) tailored for petabyte-scale ML datasets.

Other

  • Graduating in 2026 with a PhD in Computer Science, Engineering, or a related technical field.
  • Excellent communicator able to bridge research and production teams.
  • Strong problem-solving aptitude and a drive to push the state of the art in ML infrastructure.
  • State your availability clearly in your resume (start date and end date).
  • Applications are reviewed on a rolling basis; we encourage you to apply early.