DAT is seeking an experienced Site Reliability Engineer to help grow its SRE practices, contribute to technical initiatives, enhance skills, and achieve critical reliability goals while scaling the platform.
Requirements
- At least 1 year of software engineering experience (JavaScript, Python, Go, Java/Kotlin, C++, etc.).
- Experience with modern observability tools (Datadog preferred).
- Experience with cloud platforms (preferably AWS).
- Proven experience assisting in modernizing legacy code and infrastructure.
- Understanding of cloud infrastructure, automation, and best practices for reliability.
- Experience with our tools (Kubernetes, ArgoCD, Terraform, GitHub Actions) is a plus.
Responsibilities
- Contribute to the design, implementation, and maintenance of scalable and reliable systems.
- Identify and troubleshoot complex issues across distributed systems, ensuring minimal downtime and optimal performance.
- Advocate for and implement SRE best practices, including automation, monitoring, and incident response, to enhance system resilience.
- Participate in capacity planning and performance tuning to proactively address potential bottlenecks and support future growth.
- Leverage new AI tools to assist with coding and observability tasks.
- Assist with and respond to critical engineering incidents.
- Provide technical guidance and best practices for the use of cloud infrastructure and tooling.
Other
- Strong collaboration and problem-solving abilities, especially within SRE or Platform Engineering/Infrastructure teams.
- A total of 2 to 4+ years of industry experience.
- Demonstrated success in contributing to large technical initiatives and acting as a driving force to complete those initiatives.
- Ability to work closely with peer teams, platform/software architects, and management to drive key reliability improvements.
- Willingness to share your expertise among team members and others within the engineering organization.