
ML Systems Engineer, Infrastructure & Cloud

Basis Research Institute

Salary not specified
Nov 23, 2025
New York, NY, US • Cambridge, MA, US

Basis wants to enable researchers to iterate quickly on complex models while managing computational resources efficiently, by ensuring that training and evaluation infrastructure is fast, reliable, and scalable.

Requirements

  • Have demonstrated expertise in ML systems engineering. Examples include managing distributed training jobs across hundreds of GPUs, debugging and fixing numerical instabilities in large-scale training, building infrastructure for reproducible ML experiments, and optimizing training throughput and resource utilization.
  • Possess deep knowledge of distributed training frameworks including PyTorch/JAX distributed strategies (DDP, FSDP, ZeRO), gradient accumulation, mixed precision training, and checkpoint/recovery systems.
  • Have strong cloud administration skills including AWS/GCP/Azure services, infrastructure as code (Terraform), Kubernetes orchestration, cost optimization, security best practices, and compliance requirements.
  • Understand the full ML stack from hardware (GPUs, interconnects, storage) through frameworks (PyTorch, JAX) to high-level training loops and evaluation pipelines.
  • Be skilled at debugging complex failures across the stack—GPU/NCCL issues, data loading bottlenecks, memory leaks, gradient explosions, and convergence problems.
  • Value documentation and knowledge sharing. You maintain comprehensive logs of issues encountered, solutions found, and lessons learned, building institutional knowledge.
  • Work autonomously while coordinating closely with researchers. You can anticipate infrastructure needs, prevent problems before they occur, and respond quickly when issues arise.

Responsibilities

  • Own distributed training infrastructure including job launchers, checkpointing systems, recovery mechanisms, and monitoring that ensures experiments run reliably at scale.
  • Debug and resolve training failures by diagnosing issues across GPUs, networking, numerics, and data pipelines, maintaining detailed logs of problems and solutions.
  • Profile and optimize training performance by identifying bottlenecks in data loading, gradient computation, communication overhead, and implementing solutions that improve step time.
  • Manage cloud infrastructure and costs including capacity planning, spot instance strategies, storage optimization, and building tools that give researchers visibility into resource usage.
  • Implement security and compliance measures including access controls, data encryption, audit logging, and ensuring infrastructure meets requirements for handling sensitive data.
  • Build evaluation and benchmarking infrastructure that enables consistent, reproducible measurement of model performance across different conditions and datasets.
  • Develop monitoring and alerting systems that detect anomalies in training metrics, resource utilization, or system health, enabling rapid response to issues.

Other

  • We are looking for engineers who combine deep understanding of ML systems with operational excellence.
  • We seek individuals who aspire to build robust ML infrastructure, maintain “logbook culture” for documenting issues and solutions, and treat operational excellence as a first-class concern.
  • In-person Policy: We are in the office four days a week. Be prepared to attend multi-day Basis-wide in-person events.
  • Location: New York City or Cambridge, MA.
  • Collaborate with researchers to understand requirements, suggest infrastructure solutions, and ensure systems support rather than constrain research goals.