The partner company is looking to solve challenges in observability, cloud optimization, and CI/CD pipeline performance within a modern container-based environment. The goal is to improve visibility, reliability, and sustainability across large-scale OpenShift deployments while driving operational excellence and automating processes.
Requirements
- 5+ years’ experience with Linux and with Groovy, Python, or Go for infrastructure automation, using tools such as Ansible, containers, GitLab, Jenkins, and JSON.
- 5+ years of coding experience, including code reviews and production support.
- 4+ years’ experience with OpenShift, Kubernetes, observability, Prometheus, Grafana, Honeycomb, distributed tracing, SLO/SLI design, and Tekton-based CI/CD.
- Experience optimizing multi-tenant OpenShift clusters and implementing HPA/VPA models.
- Strong background in automating dashboards, alerts, and monitoring workflows.
- 3+ years’ experience with Scrum, Atlassian tools (JIRA, Confluence), release engineering, productivity analysis, PagerDuty-based incident management, and GitOps with ArgoCD.
- Strong debugging skills across multiple software layers and distributed systems.
Responsibilities
- Architect and implement observability solutions for large-scale release pipelines using Prometheus, Grafana, Honeycomb, SignalFx, and OpenTelemetry.
- Develop and optimize resource utilization for OpenShift workloads through HPA, VPA, and tuned Requests/Limits.
- Build energy consumption monitoring and reporting using Power Monitoring Operator and Kepler to support sustainability metrics.
- Integrate FinOps reporting using Cost Management Operator, including cost allocation strategies and tagging frameworks.
- Define and implement SLIs, SLOs, and KPIs to enhance pipeline reliability and service availability.
- Write infrastructure code in Groovy, Python, and Go to automate provisioning, configuration, and lifecycle management with Ansible.
- Enhance Tekton CI/CD observability, including build/test performance visibility and deep tracing for debugging.
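As a rough illustration of the resource-tuning responsibility above, a minimal sketch of an HPA paired with tuned Requests/Limits might look like the following. The workload name, namespace, replica bounds, and utilization threshold are hypothetical values chosen for illustration, not prescribed settings.

```yaml
# Container resource tuning on the target Deployment (fragment):
# requests size the pod for scheduling; limits cap burst usage.
#   resources:
#     requests:
#       cpu: "500m"
#       memory: 512Mi
#     limits:
#       cpu: "1"
#       memory: 1Gi

# HPA scaling the same (hypothetical) Deployment on average CPU utilization.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: release-pipeline-worker   # hypothetical workload name
  namespace: ci                   # hypothetical namespace
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: release-pipeline-worker
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70  # scale out when avg CPU exceeds 70% of requests
```

Because HPA utilization targets are computed against container requests, tuning requests and autoscaling thresholds together is what keeps multi-tenant clusters efficiently packed.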
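The SLI/SLO responsibility could, for example, be expressed as Prometheus recording and alerting rules over pipeline outcomes. This is a minimal sketch: the metric name `tekton_pipelineruns_total` and its `status` label are assumptions (not actual Tekton-exported metric names), and the 99% success target is illustrative.

```yaml
groups:
  - name: pipeline-slo
    rules:
      # SLI: 30-day rolling success ratio of pipeline runs,
      # assuming a counter tekton_pipelineruns_total with a "status" label.
      - record: pipeline:success_ratio:rate30d
        expr: |
          sum(rate(tekton_pipelineruns_total{status="success"}[30d]))
          /
          sum(rate(tekton_pipelineruns_total[30d]))
      # SLO alert: fire when the ratio stays below the 99% target for 1h.
      - alert: PipelineSLOViolation
        expr: pipeline:success_ratio:rate30d < 0.99
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "Pipeline success ratio is below the 99% SLO target"
```

Recording the SLI under a stable name lets dashboards, alerts, and error-budget reports all query the same precomputed series.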
Other
- Master’s degree in Computer Science or a related field, plus 8 years of relevant experience.
- Expertise with Red Hat build/release tooling (Brew, Pungi, rhpkg, Errata Tool) and other enterprise build and release systems.
- 2+ years’ experience mentoring engineers and conducting technical interviews, plus experience with sustainability tracking using Kepler and implementing tagging for cloud cost analysis.
- Experience managing large-scale projects including data center migrations and service continuity planning.
- Flexible remote-work setup within the United States.