Apple-scale, AI-enabled observability for Search, AIML Infrastructure, and Apple Intelligence products.
Requirements
- 7+ years of software engineering experience and a strong background in computer science: distributed systems, algorithms and data structures, APIs, and highly scalable, reliable systems and microservices
- Strong coding skills in Go, JavaScript, Java, and Python
- Demonstrated experience in designing and building large-scale enterprise observability solutions for data collection and storage, visualization, and incident management
- Demonstrated experience in building visualization solutions and features, with an in-depth understanding of cloud-native visualization frameworks such as Grafana and Datadog
- Experience with observability collection solutions using time-series metrics, distributed traces, logs, and profiles, with a deep understanding of cloud-native technologies such as OpenTelemetry, Prometheus, and Jaeger
- Demonstrated proficiency in AWS services such as EKS, in native Kubernetes, and in storage (such as S3), networking, database, and observability/monitoring services
- Experience in building microservices using public cloud infrastructure
Responsibilities
- Design and build cloud-native solutions that empower observability for Search, AIML Infrastructure, and Apple Intelligence products
- Design, develop, and deploy cutting-edge observability solutions for our AIML products and infrastructure
- Provide technical guidance, leverage AI pipelines, and mentor the team to deliver best-in-class solutions
- Design and build large-scale enterprise observability solutions for data collection and storage, visualization, and incident management
- Build visualization solutions and features on cloud-native visualization frameworks such as Grafana and Datadog
Other
- Excellent verbal and written communication skills and strong problem-solving skills
- Excellent interpersonal skills for collaborating across teams and with stakeholders and open-source collaborators
- Proven experience in delivering well-architected, reliable, highly scalable cloud-native distributed systems for data management, observability, or analytics services
- Experience building large-scale incident management, alert management, and notification systems
- Experience using generative AI (LLMs) and ML models for AI compute and model observability