Bloomberg's Core Communications (CC) products, including IB (Instant Bloomberg) and MSG (Message), handle billions of financial messages daily. The SRE team needs to ensure the reliability, stability, observability, and scalability of these critical systems, which form the backbone of financial dialogue.
Requirements
- 4+ years of experience in software engineering, and experience working on an SRE team
- Proficiency in Python and proven experience with C++
- Strong understanding of distributed systems and system reliability
- Familiarity with SLOs, SLIs, and SLAs, and how to relate system performance back to client impact
- Hands-on experience with monitoring and alerting tools (e.g., Grafana, Splunk, distributed tracing)
- Experience with chaos engineering, failure injection, or resilience testing frameworks
- Exposure to capacity planning and scaling analysis
Responsibilities
- Define and promote reliability-focused standards and best practices across observability, alerting, incident response, and provisioning
- Build and maintain troubleshooting tools leveraging distributed tracing and health signals to accelerate root cause analysis
- Partner with Product teams to define and measure meaningful SLOs aligned with user experience
- Lead initiatives to identify and mitigate reliability risks across CC systems, spanning performance, capacity, and resiliency
- Collaborate with developers to embed reliability into the software development lifecycle, from design through deployment
- Help build a culture of reliability by advocating for failure-aware design and sharing best practices across teams
- Develop automation to reduce manual operational effort and support scalable, safe growth of our infrastructure
Other
- Strong collaboration and communication skills
- A degree in Computer Science, Engineering, or equivalent practical experience
- An interest in treating security as part of reliability
- Contributions to open source or involvement in SRE communities
- Awareness of industry compliance frameworks (e.g., DORA, SOC 2) and how they relate to system reliability