Design and build core infrastructure from scratch to handle petabytes of radio spectrum data for a defense tech company.
Requirements
- 2-7 years of experience in MLOps, building platforms that track, package, and ship models in production.
- Experience designing and implementing data infrastructure from scratch (databases, cloud storage, cloud compute).
- Strong experience with AWS, including networking, S3, SageMaker, RDS, ECS, Lambda, and related infrastructure-as-code tools (AWS CDK).
- Experience managing a production-grade Python codebase used by other people.
- Ability to process larger-than-memory data inexpensively on a single machine, without setting up a cluster (a minimal sketch of this pattern follows this list).
- Experience with ML experiment tracking, model versioning, and artifact deployment (e.g., MLflow).
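
A minimal sketch of the out-of-core processing pattern referenced above, using DuckDB over Parquet files; the bucket path, column names, and thresholds are hypothetical and not part of this posting, and other single-node tools (e.g., Polars lazy scans) would serve equally well:

```python
import duckdb

# In-process analytical database; no cluster or external service required.
con = duckdb.connect()

# DuckDB streams Parquet row groups and can spill intermediate state to disk,
# so the full dataset never has to fit in RAM. Reading directly from S3 assumes
# the httpfs extension and AWS credentials are available.
result = con.execute(
    """
    SELECT capture_id, avg(power_dbm) AS mean_power_dbm
    FROM read_parquet('s3://example-bucket/spectrum/*.parquet')
    WHERE center_freq_hz BETWEEN 2.4e9 AND 2.5e9
    GROUP BY capture_id
    """
).df()
print(result.head())
```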
Responsibilities
- Scale distributed data storage and write Python APIs to make loading massive datasets (e.g., 30 GB) feel instantaneous.
- Set up orchestration for model training on GPU clusters, including versioning and artifact deployment.
- Explore creative ways to combine relational and vector-based search queries so researchers can quickly discover the most relevant data (a sketch of one such hybrid query follows this list).
- Define large areas of the engineering roadmap and collaborate extensively with researchers.
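
As one possible shape of the hybrid relational/vector search mentioned above, the sketch below pairs a SQL predicate with a pgvector nearest-neighbor ranking in Postgres; the table, columns, and connection string are hypothetical, and other stacks (e.g., OpenSearch or DuckDB's vector extension) could be used instead:

```python
import psycopg2

# Combined relational filter + vector similarity ranking (pgvector's <-> is L2 distance).
HYBRID_QUERY = """
    SELECT id, capture_time, center_freq_hz,
           embedding <-> %s::vector AS distance
    FROM spectrum_segments
    WHERE center_freq_hz BETWEEN %s AND %s      -- relational predicate
    ORDER BY embedding <-> %s::vector           -- vector similarity
    LIMIT 20;
"""

def find_similar_segments(conn, query_embedding, freq_lo, freq_hi):
    """Return the 20 stored segments closest to query_embedding within a frequency band."""
    vec = "[" + ",".join(str(x) for x in query_embedding) + "]"  # pgvector text format
    with conn.cursor() as cur:
        cur.execute(HYBRID_QUERY, (vec, freq_lo, freq_hi, vec))
        return cur.fetchall()

if __name__ == "__main__":
    conn = psycopg2.connect("dbname=spectrum user=researcher")  # hypothetical DSN
    for row in find_similar_segments(conn, [0.1] * 768, 2.4e9, 2.5e9):
        print(row)
```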
Other
- Full-time employment
- 5 days/week in person (non-negotiable)
- High-growth and high-ownership culture
- U.S. citizenship is required
- Relocation support is offered for strong candidates moving to NYC.