Scaling and hardening the infrastructure that powers some of the most widely used AI systems in the world, ensuring systems are highly reliable, observable, performant, and secure.
Requirements
- Have a deep understanding of distributed systems principles and a proven track record in building and operating scalable and reliable systems.
- Have a keen eye for performance and optimization. You know how to squeeze the most performance out of complex, globally distributed systems.
- Have experience operating orchestration systems such as Kubernetes at scale and building abstractions over cloud platforms.
- Are comfortable working in Linux environments, and with tools like Kubernetes, Terraform, CI/CD pipelines, and modern observability stacks.
- Have strong proficiency in cloud infrastructure (such as AWS, GCP, or Azure) and IaC tools such as Terraform, along with proficiency in programming or scripting languages.
- Have experience with containerization technologies and container orchestration platforms such as Kubernetes.
- Have experience with observability tools such as Datadog, Prometheus, Grafana, Splunk, and the ELK stack.
Responsibilities
- Design, build, and operate reliable and performant systems used across engineering.
- Identify and fix performance bottlenecks and inefficiencies, ensuring our infrastructure can scale to the next order of magnitude.
- Dig deep to resolve complex issues.
- Continuously improve automation to reduce manual work.
- Improve internal tooling and our developer experience.
- Contribute to incident response, postmortems, and the development of best practices around system reliability and scalability.
Other
- 4+ years of relevant industry experience, with 2+ years leading large-scale, complex projects or teams as an engineer or tech lead.
- A passion for distributed systems at scale with a focus on reliability, scalability, security, and continuous improvement.
- Proven experience as a reliability engineer, production engineer, or a similar role in a fast-paced, rapidly scaling company.
- Experience collaborating with cross-functional teams to ensure that reliability and scalability are considered in the design and development of new features and services.
- A humble attitude, an eagerness to help your colleagues, and a desire to do whatever it takes to make the team succeed.