Together AI is building the AI Acceleration Cloud, an end-to-end platform for the full generative AI lifecycle, combining the fastest LLM inference engine with state-of-the-art AI cloud infrastructure. The AI Infrastructure team is looking for an engineer to design, implement, and maintain robust distributed storage solutions and comprehensive observability platforms that power this generative AI platform.
Requirements
- 5+ years of demonstrated experience building large-scale, fault-tolerant distributed systems and API microservices
- Experience designing, analyzing, and improving the efficiency, scalability, and stability of system resources
- Demonstrated experience building and operating high-performance and/or globally distributed microservice architectures across one or more cloud providers (AWS, Azure, GCP)
Responsibilities
- Identify, design, and develop foundational backend services that power Together’s cloud platform
- Analyze and improve the robustness and scalability of existing distributed systems, APIs, databases, and infrastructure
- Partner with product teams to understand functional requirements and deliver solutions that meet business needs
- Write clear, well-tested, and maintainable software and infrastructure-as-code (IaC) for both new and existing systems
- Conduct design and code reviews, create developer documentation, and develop testing strategies for robustness and fault tolerance
- Participate in an on-call rotation to address critical incidents when necessary
Other
- Excellent communication skills: able to write clear design docs and work effectively with both technical and non-technical team members