BNY is hiring a Site Reliability Engineer for its Wealth Services Platform to drive reliability and performance, automate infrastructure, and lead incident management.
Requirements
- Strong expertise in cloud infrastructure (Azure, AWS, or GCP), containerization (Docker, Kubernetes), and Infrastructure as Code (Terraform, Helm).
- Proficiency in observability and monitoring tools such as Prometheus, Grafana, AppDynamics, Datadog, and Splunk, plus experience with incident response and on-call support.
- Solid programming and scripting skills in languages like Python, Go, or Java, with a focus on automation, tooling, and system integration.
- Deep understanding of SRE principles, including SLAs, SLOs, error budgets, postmortems, and reliability-focused system design.
Responsibilities
- Drive reliability and performance by defining SLOs/SLIs, improving observability, and proactively identifying and addressing system bottlenecks across cloud environments.
- Automate infrastructure and operations using Terraform, Kubernetes, and CI/CD tools to eliminate toil and enable scalable, fault-tolerant deployments.
- Collaborate cross-functionally with product, infrastructure, and DevOps teams to reduce incidents, build resilient services, and ensure architectural clarity.
- Lead incident management by participating in on-call rotations, conducting postmortems, and implementing automated recovery to minimize downtime.
- Build and maintain monitoring systems with tools like Prometheus, Grafana, AppDynamics, and Splunk to support real-time alerting and root cause analysis.
- Develop platform tooling and pipelines for container orchestration, third-party integrations, and cloud-native operations to improve system efficiency and reliability.
- Mentor engineers and champion SRE best practices, embedding a reliability-first culture and ensuring technical excellence across engineering teams.
Other
- Strong collaboration and communication skills, with experience working in Agile environments and partnering with cross-functional engineering, product, and operations teams.