Building and scaling foundational cloud, data, and AI infrastructure to power machine learning workloads across the organization.
Requirements
- Strong expertise in cloud infrastructure (AWS or GCP) and distributed computing
- Experience with Kubernetes, container orchestration, and infrastructure as code (Terraform, Pulumi)
- Proficiency in programming languages; experience with Python and Go is a plus
- Experience writing ETL pipelines, preferably with Spark or BigQuery
- Experience with ML infrastructure, including model training, batch and online inference, and monitoring
- Strong knowledge of networking, storage, and security in large-scale systems
- Familiarity with workflow orchestration tools (e.g., Dagster, Airflow) and model-serving frameworks (e.g., Ray Serve, vLLM)
Responsibilities
- Designing and optimizing high-performance training, inference, and data processing systems
- Ensuring reliability, scalability, and efficiency of AI infrastructure
- Providing robust compute, model serving, monitoring, and orchestration frameworks to drive innovation and operational excellence
- Leading complex infrastructure projects from design to production
- Designing high-availability, fault-tolerant systems for AI/ML workloads
- Optimizing performance and cost efficiency of AI workloads in cloud and on-prem environments
- Working on developer tooling, platform engineering, or ML infrastructure to ensure AI teams can build and deploy efficiently
Other
- Proven track record of leading complex infrastructure projects from design to production
- Comfortable working on ambiguous and evolving projects, quickly identifying key challenges and driving solutions
- Passionate about building scalable, efficient, and cost-effective AI infrastructure that drives meaningful, real-world impact
- Strong product mindset, ensuring infrastructure is reliable, scalable, and built around user needs
- Collaborative and mentoring mindset, helping teammates grow while upholding high engineering standards
Benefits
- Remote work, with quarterly trips to Sao Paulo to build relationships with coworkers
- Top-tier medical, dental, and vision insurance
- 20 days of time off, 14 company holidays, and a great culture that emphasizes work-life balance
- Life Insurance and AD&D
- Extended maternity and paternity leave
- 401(k)
- Savings plans: Health Savings Account and Flexible Spending Account