Mastercard is looking to ensure that its platform is stable and healthy by fostering developer run ownership and empowering developers to build resilient products. The Business Operations (Biz Ops) team is seeking a Business Operations Site Reliability Engineer (SRE) to support services before they go live, manage production and change activities, and mitigate risks.
Requirements
- Coding or scripting exposure.
- Experience with algorithms, data structures, scripting, pipeline management, and software design.
- Coding experience in one or more of the following: C++, Java, Python, Go.
- Background in cloud-native tooling and orchestration technologies (Kubernetes preferred).
- Experience with monitoring tools such as Splunk and Dynatrace.
- Experience with industry-standard CI/CD tools such as Git/BitBucket, Jenkins, Maven, Artifactory, Groovy, and Chef.
- Experience developing and maintaining cloud solutions on Azure, GCP, or AWS in accordance with best practices.
Responsibilities
- Serve as the primary contact responsible for the overall application health, performance, and capacity.
- Support services before they go live through activities such as system design consulting, capacity planning, and launch reviews.
- Partner with the development and product team of a new application to establish the right monitoring and alerting strategy and create the framework to achieve zero downtime during deployment.
- Serve as the primary contact responsible for ensuring application scalability, performance, and resilience.
- Practice sustainable incident response and blameless post-mortems while taking a holistic approach to problem solving and optimizing time to recover.
- Automate data-driven alerts to proactively escalate issues.
- Engage in and improve the whole lifecycle of services, from inception and design through deployment, operation, and refinement.
Other
- Appetite for change and pushing the boundaries of what can be done with automation.
- Systematic problem-solving approach, coupled with strong communication skills and a sense of ownership and drive.
- Willingness and ability to learn, take on challenging opportunities, and work as a member of a matrix-based, diverse, and geographically distributed project team.
- Ability to balance doing things right with fixing things quickly. Flexible and pragmatic, while working towards improving the long-term health of the system.
- Comfortable collaborating with cross-functional teams to ensure that expected system behaviour is understood and monitoring exists to detect anomalies.