
Software Engineer, AI Compute Infrastructure

HeyGen

Salary not specified
Dec 2, 2025
Los Angeles, CA, US • Palo Alto, CA, US • San Francisco, CA, US • Remote, US

HeyGen aims to make visual storytelling accessible by building technology that equips more people to reach, captivate, and inspire audiences. The company is seeking a Software Engineer to build and scale the foundational compute infrastructure behind its AI models, addressing the cost and scaling challenges of video creation.

Requirements

  • 5+ years of full-time industry experience in large-scale MLOps, AI infrastructure, or HPC systems.
  • Experience with data frameworks and standards like Ray, Apache Spark, and LanceDB.
  • Strong proficiency in Python and a high-performance language such as C++ for developing core infrastructure components.
  • Deep understanding and hands-on experience with modern orchestration and distributed computing frameworks such as Kubernetes and Ray.
  • Experience with core ML frameworks such as PyTorch, TensorFlow, or JAX.
  • Prior experience building infrastructure specifically for Generative AI models (e.g., diffusion models, GANs, or large language models) where cost and latency are critical.
  • Expertise in GPU acceleration and deep familiarity with low-level compute programming, including CUDA, NCCL, or similar technologies for efficient inter-GPU communication (see the sketch after this list).
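
A minimal sketch of the multi-GPU training pattern these requirements point at: PyTorch DistributedDataParallel over the NCCL backend. The model, batch shapes, and hyperparameters are placeholders for illustration, not HeyGen's actual stack.

    # Launch with: torchrun --nproc_per_node=<num_gpus> train_sketch.py
    import os
    import torch
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel as DDP

    def main():
        # torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE for each process.
        dist.init_process_group(backend="nccl")
        local_rank = int(os.environ["LOCAL_RANK"])
        torch.cuda.set_device(local_rank)

        model = torch.nn.Linear(1024, 1024).cuda(local_rank)   # stand-in model
        model = DDP(model, device_ids=[local_rank])
        optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

        for _ in range(100):
            x = torch.randn(32, 1024, device=local_rank)        # stand-in batch
            loss = model(x).pow(2).mean()
            loss.backward()        # gradients are all-reduced across GPUs via NCCL here
            optimizer.step()
            optimizer.zero_grad()

        dist.destroy_process_group()

    if __name__ == "__main__":
        main()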

Responsibilities

  • Optimize GPU Utilization: Design and implement mechanisms to aggressively optimize GPU and cluster utilization across thousands of devices for inference, training, data processing, and large-scale deployment of our state-of-the-art video generation models.
  • Develop Large-Scale AI Job Framework: Build highly scalable, reliable frameworks for launching and managing massive, heterogeneous compute jobs, including multi-modal high-volume data ingestion/processing, distributed model training, and continuous evaluation/benchmarking (a sketch of this fan-out pattern follows the list).
  • Enhance Observability: Develop world-class observability, tracing, and visualization tools for our compute cluster to ensure reliability and diagnose performance bottlenecks (e.g., memory, bandwidth, communication).
  • Accelerate Pipelines: Collaborate closely with AI researchers and AI engineers to integrate innovative acceleration techniques (e.g., custom CUDA kernels, distributed training libraries) into production-ready, scalable training and inference pipelines.
  • Infrastructure Management: Champion the adoption and optimization of modern cloud and container technologies (Kubernetes, Ray) for elastic, cost-efficient scaling of our distributed systems.
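
A minimal sketch, assuming Ray, of fanning GPU inference work out across a cluster, the kind of heterogeneous-job pattern the responsibilities above describe. The worker-pool size, shard paths, and model-loading step are hypothetical placeholders.

    import ray

    ray.init()  # connect to an existing Ray cluster, or start a local one

    @ray.remote(num_gpus=1)  # Ray schedules each actor onto one free GPU in the cluster
    class InferenceWorker:
        def __init__(self):
            # Placeholder for loading a video-generation model onto this worker's GPU.
            self.gpu_ids = ray.get_gpu_ids()

        def process(self, shard_path: str) -> str:
            # Placeholder: read the shard, run inference, write outputs elsewhere.
            return f"{shard_path} handled on GPU {self.gpu_ids}"

    workers = [InferenceWorker.remote() for _ in range(4)]   # hypothetical pool size
    shards = [f"s3://example-bucket/shard-{i:05d}.parquet"   # hypothetical shard paths
              for i in range(16)]
    futures = [workers[i % len(workers)].process.remote(p) for i, p in enumerate(shards)]
    for line in ray.get(futures):
        print(line)

Kubernetes would typically sit underneath a setup like this (for example via KubeRay), providing the elastic node pool that the Ray cluster scales over.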

Other

  • Bachelor's degree in Computer Science, Engineering, or a related field, or equivalent practical experience.
  • Demonstrated Tech Lead experience, driving projects from conceptual design through to production deployment across cross-functional teams.
  • Proven background in building and operating large-scale data infrastructure (e.g., Ray, Apache Spark) to manage petabytes of multi-modal data (video, audio, text).
  • Dynamic and inclusive work environment.
  • Opportunities for professional growth and advancement.