Sr. AI Observability Engineer (Remote)

Tealium

$165,000 - $200,000

Oct 6, 2025

Remote, US

Tealium is seeking a Senior AI Observability Engineer to lead the observability strategy for its AI/ML systems and AI-powered features, ensuring visibility, reliability, performance, and responsible usage of AI models across its products and internal platforms.

Requirements

Deep experience in instrumenting AI pipelines (e.g., LLMs, recommender systems, ML APIs) for observability, including drift detection and cost tracking.
Familiarity with prompt engineering, embeddings, vector DBs (Neptune), and RAG-style architectures.
Hands-on experience with OpenTelemetry, Datadog, Sumo Logic, Prometheus, or similar.
Experience integrating observability into AI platforms (e.g., Bedrock, Neptune, LangChain, LlamaIndex, Hugging Face, SageMaker).
Proficiency with Python, Go, or similar languages used in backend and ML infrastructure.
Familiarity with AWS services (especially those relevant to AI: SageMaker, Bedrock, Lambda, DynamoDB, etc.).
Experience deploying and observing third-party LLM APIs (OpenAI, Claude, Amazon Q).

Responsibilities

Lead end-to-end observability design for AI/ML features in production and internal usage (e.g., RAG, Copilots, LLM-enhanced customer experiences).
Instrument AI features in Tealium products (e.g., ML-powered segmentation, decisioning, or predictions) for latency, accuracy, drift, usage, and cost.
Implement monitoring and cost tracking for third-party AI services (OpenAI, Anthropic Claude, Amazon Q, etc.), including rate limiting, quota management, and failover strategies.
Build telemetry pipelines to track LLM request/response metrics, prompt engineering observability, token usage, hallucination detection, and failover.
Collaborate with data science and product teams to define and automate quality SLIs/SLOs for models.
Implement AI-aware tracing (e.g., OpenTelemetry + LangChain/LLM traces) into the broader observability stack.
Automate validation pipelines to ensure AI features are robust across environments.

Other

6+ years in Site Reliability Engineering, Observability Engineering, or ML Ops with a focus on production-grade AI/ML systems.
Strong background in Infrastructure-as-Code (Terraform, ArgoCD) and CI/CD tooling (Jenkins, GitHub Actions).
Understanding of Kubernetes and container orchestration.
Experience with FinOps/cost optimization for AI workloads.
Strong understanding of ethical AI practices and responsible telemetry instrumentation, plus experience with data privacy and compliance.