Sr. AI Observability Engineer (Remote)

Tealium

$165,000 - $200,000
Oct 6, 2025
Remote, US

Tealium is seeking a Senior AI Observability Engineer to lead the observability strategy for its AI/ML systems and AI-powered features, ensuring visibility, reliability, performance, and responsible use of AI models across its products and internal platforms.

Requirements

  • Deep experience in instrumenting AI pipelines (e.g., LLMs, recommender systems, ML APIs) for observability, including drift detection and cost tracking.
  • Familiarity with prompt engineering, embeddings, vector DBs (Neptune), and RAG-style architectures.
  • Hands-on experience with OpenTelemetry, Datadog, Sumo Logic, Prometheus, or similar.
  • Experience integrating observability into AI platforms such as Bedrock, Neptune, LangChain, LlamaIndex, Hugging Face, and SageMaker.
  • Proficiency with Python, Go, or similar languages used in backend and ML infrastructure.
  • Familiarity with AWS services (especially those relevant to AI: SageMaker, Bedrock, Lambda, DynamoDB, etc.).
  • Experience deploying and observing third-party LLM APIs (OpenAI, Claude, Amazon Q).

Responsibilities

  • Lead end-to-end observability design for AI/ML features in production and internal usage (e.g., RAG, Copilots, LLM-enhanced customer experiences).
  • Instrument AI features in Tealium products (e.g., ML-powered segmentation, decisioning, or predictions) for latency, accuracy, drift, usage, and cost.
  • Implement monitoring and cost tracking for third-party AI services (OpenAI, Anthropic Claude, Amazon Q, etc.), including rate limiting, quota management, and failover strategies.
  • Build telemetry pipelines to track LLM request/response metrics, prompt engineering observability, token usage, hallucination detection, and failover.
  • Collaborate with data science and product teams to define and automate quality SLIs/SLOs for models.
  • Implement AI-aware tracing (e.g., OpenTelemetry + LangChain/LLM traces) into the broader observability stack.
  • Automate validation pipelines to ensure AI features are robust across environments.

Other

  • 6+ years in Site Reliability Engineering, Observability Engineering, or ML Ops with a focus on production-grade AI/ML systems.
  • Strong background in Infrastructure-as-Code (Terraform, ArgoCD) and CI/CD tooling (Jenkins, GitHub Actions).
  • Understanding of Kubernetes and container orchestration.
  • Experience with FinOps/cost optimization for AI workloads.
  • Strong understanding of ethical AI practices and responsible telemetry instrumentation, plus experience with data privacy and compliance.