Together AI is seeking an engineer to design and build scalable machine learning systems that power its accelerated AI initiatives.
Requirements
- Strong programming skills in one or more of Python, Go, Rust, or C/C++.
- Excellent understanding of low-level operating system concepts, including multi-threading, memory management, networking, and storage, as well as performance and scale.
- Experience with cloud computing platforms (AWS, GCP, Azure, etc.) and large-scale infrastructure.
- Experience with Kubernetes (Preferred)
- Experience with PyTorch (Preferred)
- 3+ years of experience in building large-scale, fault-tolerant, high-performance distributed systems.
Responsibilities
- Design and build large-scale, distributed machine learning systems that are fault-tolerant and high-performance.
- Develop and optimize distributed processing frameworks and storage systems.
- Implement robust monitoring and logging systems to ensure the health and performance of our ML systems.
- Conduct architecture and design reviews to ensure best practices in system design.
- Collaborate with researchers, engineers, and product managers to integrate ML systems into our infrastructure.
Other
- Strong problem-solving skills and ability to work in a fast-paced environment.
- The US base salary range for this full-time position is $160,000 - $230,000, plus equity and benefits.
- Benefits include startup equity, health insurance, and other competitive benefits.
- Equal Opportunity Employer