DataRobot is looking for a Service Reliability Engineer (SRE) to keep its production environment running continuously and stably, troubleshoot issues, and automate operational tasks to improve the reliability and availability of its AI platform for customers.
Requirements
- Linux/UNIX (Ubuntu, RedHat, or similar)
- Kubernetes
- Terraform or CloudFormation
- Ansible
- MongoDB, RabbitMQ, Postgres, Redis
- ELK, ClickHouse, Grafana, etc.
- AWS, GCP, or Azure
Responsibilities
- Troubleshoot, debug, evaluate, and resolve alarms
- Perform systems management
- Perform software deployments and migrations
- Automate routine operational tasks
- Design and modify DataRobot tools and practices to provide observability and seamless scaling
- Proactively prevent failures
- Ensure the continuous operation of the production environment through maintenance activities and prompt reaction to alerts
Other
- Bachelor's degree in CS or MIS, or equivalent experience
- Solid communication skills