Tealium is seeking a Senior AI Observability Engineer to lead the observability strategy for its AI/ML systems and AI-powered features, ensuring visibility, reliability, performance, and responsible usage of AI models across its products and internal platforms.
Requirements
- Deep experience in instrumenting AI pipelines (e.g., LLMs, recommender systems, ML APIs) for observability, including drift detection and cost tracking.
- Familiarity with prompt engineering, embeddings, vector DBs (Neptune), and RAG-style architectures.
- Hands-on experience with OpenTelemetry, Datadog, Sumo Logic, Prometheus, or similar.
- Experience integrating observability into AI platforms such as Bedrock, Neptune, LangChain, LlamaIndex, Hugging Face, and SageMaker.
- Proficiency with Python, Go, or similar languages used in backend and ML infrastructure.
- Familiarity with AWS services (especially those relevant to AI: SageMaker, Bedrock, Lambda, DynamoDB, etc.).
- Experience deploying and observing third-party LLM APIs (OpenAI, Anthropic Claude, Amazon Q).
Responsibilities
- Lead end-to-end observability design for AI/ML features in production and internal usage (e.g., RAG, Copilots, LLM-enhanced customer experiences).
- Instrument AI features in Tealium products (e.g., ML-powered segmentation, decisioning, or predictions) for latency, accuracy, drift, usage, and cost.
- Implement monitoring and cost tracking for third-party AI services (OpenAI, Anthropic Claude, Amazon Q, etc.), including rate limiting, quota management, and failover strategies.
- Build telemetry pipelines to track LLM request/response metrics, prompt engineering observability, token usage, hallucination detection, and failover.
- Collaborate with data science and product teams to define and automate quality SLIs/SLOs for models.
- Integrate AI-aware tracing (e.g., OpenTelemetry + LangChain/LLM traces) into the broader observability stack.
- Automate validation pipelines to ensure AI features are robust across environments.
Other
- 6+ years in Site Reliability Engineering, Observability Engineering, or ML Ops with a focus on production-grade AI/ML systems.
- Strong background in Infrastructure-as-Code (Terraform, ArgoCD) and CI/CD tooling (Jenkins, GitHub Actions).
- Understanding of Kubernetes and container orchestration.
- Experience with FinOps/cost optimization for AI workloads.
- Strong understanding of ethical AI practices and responsible telemetry instrumentation, along with data privacy and compliance experience.