The company is looking for an engineer to ensure reliable, rapid deployment, robust monitoring, and resilient operation of large-scale systems.
Requirements
- Expert-level proficiency in architecting, developing, and troubleshooting large-scale systems.
- Highly skilled in one or more programming languages (e.g., Python, Golang).
- In-depth knowledge of data structures, Linux system internals (e.g., filesystems, system calls), and Linux system administration.
- Extensive experience with CI/CD pipelines and infrastructure-as-code tools (e.g., Terraform, Ansible).
- Well-versed in AWS services (e.g., ECS, S3, ALB, VPC).
- Strong expertise in containerization and orchestration technologies, particularly Kubernetes.
- Proven track record of building production-quality cloud infrastructure for large-scale systems with effective monitoring and resilient operations.
Responsibilities
- Design and implement comprehensive monitoring systems using tools like Prometheus and Grafana.
- Optimize Linux systems to achieve top-tier performance, reliability, and security.
- Own configuration management processes and contribute to product feature development.
- Automate and enhance continuous integration and testing workflows to drive scalability.
- Manage and maintain critical infrastructure for seamless operations.
- Dive deep into data to identify root causes of issues and collaborate with engineers to develop solutions.
- Participate in on-call rotations to ensure smooth system operations.
Other
- Ability to shape and influence engineering best practices and deployment processes.
- Ability to thrive in fast-paced, startup-like environments.
- Demonstrated success in taking projects from inception to launch.
- Bachelor’s degree in Computer Science or Electrical Engineering (Master’s degree preferred).
- 5+ years of experience in SRE, Production Engineering, or DevOps roles.