LPL Financial is seeking to drive the traceability and performance of business-critical transactions across multiple systems, ensuring system resilience and enhancing the advisor experience.
Requirements
5+ years in observability, SRE, or related roles with a focus on transaction monitoring and tracing
Hands-on experience with tools like Dynatrace, ELK, Datadog, Splunk, OpenTelemetry, Jaeger, or equivalent
Expertise in monitoring critical transactions in cloud environments (AWS, Azure, or Google Cloud)
Strong understanding of microservices architecture, APIs, and distributed systems
Proficiency in scripting or programming languages (e.g., Python, Go, Java) for automation and integration
Certifications: Dynatrace Associate or Professional Certification
Experience with OpenTelemetry and other observability standards
Responsibilities
End-to-End Observability: Design and implement observability frameworks for end-to-end transaction traceability across microservices, APIs, databases, and third-party integrations. Leverage tools like Dynatrace, OpenTelemetry, ELK, and Grafana to trace transactions and visualize dependencies. Build actionable dashboards and alerts to provide real-time insights into transaction health and performance.
Performance Optimization: Monitor transaction latency, throughput, and error rates to identify bottlenecks and optimize performance. Use distributed tracing and telemetry data to analyze and resolve issues impacting transaction flows. Work with application and database teams to fine-tune configurations for better transaction efficiency.
Collaboration & Governance: Partner with application teams, architects, and business stakeholders to define transaction observability and resiliency requirements. Develop and enforce standards for transaction monitoring and tracing across teams and environments. Provide training and guidance to teams on implementing best practices for observability and resiliency.
Critical Transaction Resiliency: Identify and prioritize business-critical transaction flows across distributed systems. Develop strategies to ensure high availability and resilience for critical transactions. Implement failover mechanisms, redundancy strategies, and fault-tolerant designs for transaction paths. Collaborate with Site Reliability Engineering (SRE) and DevOps teams to conduct chaos engineering exercises to test resiliency.
Define and monitor Service Level Objectives (SLOs) and Service Level Indicators (SLIs) for critical transaction paths.
Documentation & Reporting: Maintain comprehensive documentation of transaction flows, dependencies, and observability configurations. Provide regular reports on transaction health, performance trends, and resiliency improvements to leadership. Develop playbooks for handling transaction-related incidents and outages.
Success Metric: Achieve a 30% reduction in MTTD and MTTR within the first year of operation, demonstrating the effectiveness of SRE capabilities, observability, and self-healing.
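For illustration only, the 30% MTTD/MTTR target above could be tracked with a calculation along these lines. This is a minimal sketch, not part of the role definition. The `Incident` record, the helper names, and the sample numbers are all hypothetical. It also assumes MTTD is measured from incident start to detection and MTTR from incident start to resolution; teams define these intervals differently.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass
class Incident:
    """Hypothetical incident record; field names are assumptions."""
    started: datetime
    detected: datetime
    resolved: datetime


def incident(detect_min: float, resolve_min: float) -> Incident:
    """Build a sample incident from minute offsets after an arbitrary start time."""
    start = datetime(2024, 1, 1, 0, 0)
    return Incident(
        started=start,
        detected=start + timedelta(minutes=detect_min),
        resolved=start + timedelta(minutes=resolve_min),
    )


def mttd_minutes(incidents: list[Incident]) -> float:
    """Mean minutes from incident start to detection (assumed definition)."""
    return mean([(i.detected - i.started).total_seconds() / 60 for i in incidents])


def mttr_minutes(incidents: list[Incident]) -> float:
    """Mean minutes from incident start to resolution (assumed definition)."""
    return mean([(i.resolved - i.started).total_seconds() / 60 for i in incidents])


def reduction_pct(baseline: float, current: float) -> float:
    """Percentage reduction of `current` relative to `baseline`."""
    return (baseline - current) / baseline * 100


def meets_target(baseline: float, current: float, target_pct: float = 30.0) -> bool:
    """True when the reduction reaches the target (30% by default)."""
    return reduction_pct(baseline, current) >= target_pct


# Made-up sample data: a baseline period and a later period.
baseline_incidents = [incident(10, 60), incident(20, 100)]
current_incidents = [incident(8, 50), incident(12, 60)]
```

With this sample data, the baseline MTTD is 15 minutes and the later MTTD is 10 minutes, so `meets_target(mttd_minutes(baseline_incidents), mttd_minutes(current_incidents))` returns `True`. MTTR drops from 80 to 55 minutes, a 31.25% reduction, which also meets the target.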
Other
Strong collaborator who can deliver a world-class client experience
Ability to thrive in a fast-paced environment
Client-focused and team-oriented
Ability to execute in a way that encourages creativity and continuous improvement