Software Engineer - Distributed Training Infrastructure

Clockwork Systems

Salary not specified
Aug 13, 2025
Palo Alto, CA, US

Clockwork.io is seeking a Software Engineer to improve the performance, scalability, and resilience of large-scale distributed training infrastructure built on the PyTorch ecosystem. The role addresses the challenges of managing time, reliability, and performance in the distributed systems that power modern AI.

Requirements

  • Deep experience with PyTorch and torch.distributed (c10d)
  • Hands-on experience with at least one of: Megatron-LM, DeepSpeed, or FairScale
  • Proficiency in Python and Linux shell scripting
  • Experience with multi-node GPU clusters using Slurm, Kubernetes, or similar
  • Strong understanding of NCCL, collective communication, and GPU topology
  • Familiarity with containerized training environments (Docker, Singularity)
  • Experience scaling LLM training across 8+ GPUs and multiple nodes

Responsibilities

  • Develop and support distributed PyTorch training jobs using torch.distributed (c10d)
  • Integrate and maintain frameworks like Megatron-LM, DeepSpeed, and related LLM training stacks
  • Diagnose and resolve distributed training issues (e.g., NCCL hangs, OOM, checkpoint corruption)
  • Optimize performance across communication, I/O, and memory bottlenecks
  • Implement fault tolerance, checkpointing, and recovery mechanisms for long-running jobs
  • Write tooling and scripts to streamline training workflows and experiment management
  • Collaborate with ML engineers to ensure compatibility with orchestration and container environments (e.g., Slurm, Kubernetes)

Other

  • All qualified applicants will receive consideration for employment without regard to race, color, ancestry, religion, age, sex, sexual orientation, gender identity, national origin, or protected veteran status and will not be discriminated against on the basis of disability.