Senior Software Engineer - Model Training

Baseten

Salary not specified
Aug 29, 2025
San Francisco, CA, US • New York, NY, US

Baseten is looking for a Senior Software Engineer – Model Training to build the infrastructure for large-scale training and fine-tuning of foundation models, optimizing GPU utilization and creating scalable pipelines to make AI accessible across all products.

Requirements

  • Hands-on expertise in distributed training frameworks (FSDP, DDP, ZeRO, or similar) and ML frameworks (PyTorch, Transformers, Lightning, TRL)
  • Strong understanding of GPU/accelerator performance optimization and scaling techniques
  • Experience designing and operating large-scale systems in production (cloud-native preferred)
  • Experience building APIs, SDKs, or developer tools for ML workflows
  • Familiarity with cluster management and scheduling (Kubernetes, Ray, Slurm, etc.)
  • Knowledge of parameter-efficient fine-tuning methods (LoRA, QLoRA) and evaluation pipelines
  • Contributions to open-source distributed training or ML infra projects

Responsibilities

  • Design, build, and maintain distributed training infrastructure for large-scale foundation models
  • Implement scalable pipelines for fine-tuning and training across heterogeneous GPU/accelerator clusters
  • Optimize training performance through techniques like FSDP, DDP, ZeRO, and mixed-precision training
  • Contribute to frameworks and tooling that make training workflows efficient, reproducible, and developer-friendly
  • Collaborate with cross-functional teams (Product, Forward Deployed Engineering, Inference Infra) to ensure training systems meet real-world requirements
  • Research and apply emerging techniques in training efficiency and model adaptation, and productionize them in the Baseten platform
  • Participate in code reviews, system design discussions, and technical deep dives to maintain a high engineering bar

Qualifications

  • Bachelor’s degree in Computer Science, Engineering, or related field, or equivalent experience
  • 4+ years of experience in software engineering with a focus on ML infrastructure, distributed systems, or ML platform engineering
  • Excellent problem-solving and communication skills, with the ability to work across infrastructure and ML boundaries

Benefits

  • The opportunity to join a rapidly growing startup in one of the most exciting engineering fields of our era
  • An inclusive and supportive work culture that fosters learning and growth