Goldman Sachs' Procmon Platform needs to deliver a highly scalable and reliable ecosystem for scheduling business-critical jobs across various functions. The platform manages tens of millions of jobs daily and supports large-scale systems such as job scheduling, event streaming, log shipping, data warehouses, and security infrastructure.
Requirements
- Strong Linux fundamentals and system administration skills
- Good networking fundamentals (familiarity with TCP/IP, IP routing, firewalls, secure tunneling protocols)
- Experience working with distributed computing systems and Cloud computing environments
- Proficiency in at least one programming language; the team uses a mix of Go, Python and Erlang
Responsibilities
- Own technical operations for systems that manage hundreds of thousands of compute cores
- Build observability for new deployments to ensure robustness from day one, and for mature deployments to identify and implement improvements
- Troubleshoot and resolve issues with block devices, file descriptors, and packet loss
- Lead real-time outage investigations and present postmortems to senior management
- Define SLIs and SLOs and partner with development teams to ensure systems are sufficiently well designed and instrumented
- Partner with our development team throughout development and operations
- Plan and manage deployments and migrations (including end-of-life programs)
Other
- Excellent problem-solving and automation skills
- Able to operate effectively in a mission-critical, highly regulated financial services environment
- Plan and implement robust business continuity and security programs
- Provide regional coverage for the Procmon platform and participate in on-call support