JPMorgan Chase's Chief Data & Analytics Office (CDAO) is tasked with advancing the firm's data and analytics capabilities, ensuring data quality and security, and leveraging data for insights and decision-making. This role contributes to developing and implementing AI/ML solutions that create new products, boost productivity, and enhance risk management.
Requirements
- Extensive experience with AWS Databricks platform administration and engineering support is a MUST.
- Strong understanding of SRE principles, including SLIs, SLOs, error budgets, and incident management.
- Experience with monitoring tools, automation frameworks, and CI/CD pipelines.
- Proficiency in Python and/or Java application development, including automated unit testing.
- Experience with Terraform development and an understanding of Terraform Enterprise.
- Experience in delivering system design, application development, testing, and operational stability.
- Knowledge of Big Data distributed compute frameworks such as Spark, Glue, and MapReduce.
Responsibilities
- Designs, implements, and maintains a managed AWS Databricks platform, and provides engineering and operational support for the platform to SRE and app teams.
- Performs platform design, set-up and configuration, workspace administration, and resource monitoring, and provides engineering support to data engineering, Data Science/ML, and application/integration teams.
- Leads evaluation sessions with external vendors, startups, and internal teams to drive outcomes-oriented probing of architectural designs, technical credentials, and applicability for use within existing systems and information architecture.
- Drives continuous improvement in system observability, alerting, and capacity planning.
- Collaborates with engineering and data teams to optimize infrastructure and deployment processes, focusing on automation and operational excellence.
- Executes creative software solutions, design, development, and technical troubleshooting, with the ability to think beyond routine or conventional approaches to build solutions or break down technical problems.
- Develops secure, high-quality production code, and reviews and debugs code written by others.
Other
- Formal training or certification in software engineering concepts and 5+ years of applied experience.
- Identifies opportunities to eliminate or automate remediation of recurring issues to improve overall operational stability of software applications and systems.
- Adds to team culture of diversity, opportunity, and respect.
- Implements Site Reliability Engineering (SRE) best practices to ensure reliability, scalability, and performance of data platforms.
- Develops and maintains incident response procedures, including root cause analysis and postmortem documentation.