The IT Operations team at S&P Dow Jones Indices (S&P DJI) ensures the high availability of the production IT systems that underpin S&P DJI's index platforms and applications. This role focuses on designing, implementing, and managing end-to-end observability using Datadog and related tools to maintain and improve service availability, respond to incidents, and enhance support processes.
Requirements
- Proven expertise in Datadog APM, DBM, logging, and infrastructure monitoring.
- Strong programming skills in Java and Python.
- Hands-on experience with AWS, including operational management of core services.
- Experience with CI/CD pipelines and container orchestration technologies.
- Familiarity with ITSM tools (ServiceNow, PagerDuty).
- Understanding of observability best practices, log correlation, and distributed tracing.
- Datadog certifications (APM, Logs, Fundamentals).
Responsibilities
- Design, implement, and manage end-to-end observability using Datadog APM, DBM, log pipelines, synthetic monitoring, and AI-driven alerting.
- Maintain production monitoring, respond to incidents, and lead root cause analysis using Datadog, Splunk, and ELK.
- Enhance automation and testing frameworks using Java, Spring Boot, Selenium, Cucumber, Playwright, and Jenkins.
- Operate AWS services including EC2, ECS, RDS, S3, DynamoDB, and Secrets Manager.
- Contribute to CI/CD practices and the use of containerization technologies.
- Integrate monitoring with PagerDuty and ServiceNow for incident workflows.
- Participate in post-incident reviews, disaster recovery testing, and SRE process improvements.
Other
- At least 4 years of experience in SRE, DevOps, or platform engineering roles.
- Bachelor's degree in Computer Science or a similar field of study.
- Excellent troubleshooting, documentation, and communication skills.
- Exposure to other monitoring tools like Splunk, Dynatrace, or ELK.
- Knowledge of Agile/Scrum practices and experience collaborating with globally distributed teams.