The AI Infrastructure Specialist builds, manages, and optimizes the infrastructure behind our entire AI development lifecycle. The role ensures a stable, efficient, and scalable platform on which data scientists and engineers can build, train, and deploy models.
Requirements
- Strong hands-on experience with cloud infrastructure (Azure preferred, AWS/GCP acceptable).
- Proficiency with Infrastructure as Code (IaC) tools like Terraform or Ansible.
- Expertise in containerization (Docker) and orchestration (Kubernetes).
- Experience building and managing CI/CD pipelines using tools like Jenkins, GitLab CI, or Azure DevOps.
- Solid understanding of networking, security principles, and database management (e.g., PostgreSQL).
- Scripting proficiency in Python or Bash.
Responsibilities
- Design, build, and maintain the CI/CD and MLOps pipelines for our AI products.
- Manage and support the infrastructure for model training and inference, using cloud services and containerization technologies (e.g., Kubernetes, Docker).
- Automate the deployment, monitoring, and scaling of our AI services and applications in Dev, QA, and Production environments.
- Implement robust monitoring and alerting systems to ensure the health, performance, and reliability of our AI infrastructure.
- Collaborate closely with pipeline engineers and data scientists to troubleshoot and resolve infrastructure-related issues.
- Manage infrastructure configurations, security protocols, and access controls to ensure compliance with enterprise standards.
Experience
- 5+ years of experience in a DevOps, SRE, or Infrastructure Engineering role, with a focus on supporting AI/ML workloads.