Job Board

Get Jobs Tailored to Your Resume

Filtr uses AI to scan 1,000+ jobs and find postings that closely match your resume.


Principal ML Engineer

Red Hat

$189,600 - $312,730
Oct 11, 2025
Raleigh, NC, US

Red Hat's OpenShift AI team is building infrastructure that ensures large language models and AI agents are safe, reliable, and aligned with human values, with the goal of democratizing trustworthy AI infrastructure and transforming how organizations develop, deploy, and monitor machine learning models.

Requirements

  • 10+ years of ML engineering experience, with 3+ years focused on large-scale evaluation of transformer-based LLMs and/or agentic systems.
  • Proven experience building evaluation platforms or frameworks that operate across training, deployment, and post-deployment contexts.
  • Deep expertise in designing and implementing LLM evaluation metrics (factuality, hallucination detection, grounding, toxicity, robustness).
  • Strong background in scalable platform engineering, including APIs, pipelines, and integrations used by multiple product teams.
  • Demonstrated ability to bridge research and engineering, operationalizing safety and alignment techniques into production evaluation systems.
  • Proficiency in Python, PyTorch, Hugging Face, and modern MLOps and deployment environments.
  • Experience with multi-agent evaluation frameworks and graph-based metrics for agent interactions.

Responsibilities

  • Architect and lead development of large-scale evaluation platforms for LLMs and agents, enabling automated, reproducible, and extensible assessment of accuracy, reliability, safety, and performance across diverse domains.
  • Define organizational standards and metrics for LLM/agent evaluation, covering hallucination detection, factuality, bias, robustness, interpretability, and alignment drift.
  • Build platform components and APIs that allow product teams to integrate evaluation seamlessly into training, fine-tuning, deployment, and continuous monitoring workflows.
  • Design automated pipelines and benchmarks for adversarial testing, red-teaming, and stress testing of LLMs and retrieval-augmented generation (RAG) systems.
  • Lead initiatives in multi-dimensional evaluation, including safety (toxicity, bias, harmful outputs), grounding (retrieval correctness, source attribution), and agent behaviors (tool use, planning, trustworthiness).
  • Collaborate with cross-functional stakeholders (safety, product, research, infrastructure) to translate abstract evaluation goals into measurable, system-level frameworks.
  • Advance interpretability and observability, developing tools that allow teams to understand, debug, and explain LLM behaviors in production.

Other

  • Track record of technical leadership, including mentoring, architecture design, and defining org-wide practices.
  • Ability to influence technical roadmaps and industry direction, representing the team's evaluation-first approach in external forums and publications.
  • Contributions to AI safety or evaluation research in industry or academia.
  • Familiarity with adversarial testing methodologies and automated red-teaming.
  • Knowledge of interpretability and transparency methods for LLMs.