
ML Engineer

Wiraa

Salary not specified
Oct 16, 2025
Remote, US

Red Hat is seeking an ML Engineer to advance the safety, reliability, and ethical alignment of large language models (LLMs) and AI agents through the development of large-scale evaluation platforms. The initiative aims to ensure AI systems are trustworthy and aligned with human values, supporting Red Hat's mission in enterprise open source software and responsible AI innovation.

Requirements

  • 10+ years of experience in machine learning engineering, with at least 3 years focused on large-scale evaluation of transformer-based LLMs and/or agentic systems.
  • Proven experience in building evaluation platforms or frameworks that operate across training, deployment, and post-deployment environments.
  • Deep expertise in designing and implementing evaluation metrics such as factuality, hallucination detection, grounding, toxicity, and robustness (a toy grounding-style metric is sketched after this list).
  • Strong background in scalable platform engineering, including development of APIs, pipelines, and integrations used by multiple product teams.
  • Demonstrated ability to operationalize safety and alignment techniques into production evaluation systems.
  • Proficiency in Python, PyTorch, Hugging Face, and modern ML operations and deployment environments.
  • Experience in technical leadership, mentoring, architecture design, and establishing organization-wide best practices.
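
As a rough illustration of the metric work named in the list above, here is a minimal sketch of a grounding-style check that scores how much of a generated answer is lexically supported by its retrieved context. Everything in it (the grounding_score function, the stop-word list, the example strings) is a hypothetical illustration; a production evaluation platform would rely on learned judges such as NLI or LLM-as-judge models rather than token overlap.

```python
# Toy grounding check: what fraction of the answer's content tokens are
# lexically supported by the retrieved context? Illustration only; real
# evaluation platforms typically use NLI models or LLM-as-judge scoring.
import re
from dataclasses import dataclass

_STOPWORDS = {"the", "a", "an", "of", "to", "and", "is", "are", "in", "on", "for"}

def _content_tokens(text: str) -> set[str]:
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return {t for t in tokens if t not in _STOPWORDS}

@dataclass
class GroundingResult:
    score: float           # fraction of answer tokens found in the context
    unsupported: set[str]  # answer tokens with no lexical support

def grounding_score(answer: str, context: str) -> GroundingResult:
    """Score how well `answer` is lexically supported by `context` (0.0-1.0)."""
    answer_tokens = _content_tokens(answer)
    context_tokens = _content_tokens(context)
    if not answer_tokens:
        return GroundingResult(score=1.0, unsupported=set())
    unsupported = answer_tokens - context_tokens
    score = 1.0 - len(unsupported) / len(answer_tokens)
    return GroundingResult(score=score, unsupported=unsupported)

if __name__ == "__main__":
    ctx = "Red Hat Enterprise Linux 9 was released in May 2022."
    ans = "RHEL 9 was released in May 2022."
    result = grounding_score(ans, ctx)
    print(f"grounding={result.score:.2f}, unsupported={result.unsupported}")
```

The same shape (a record in, a bounded score plus diagnostics out) extends to factuality, toxicity, and robustness scorers.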

Responsibilities

  • Architect and lead the development of large-scale evaluation platforms for LLMs and AI agents, enabling comprehensive assessment of accuracy, safety, and performance.
  • Define organizational standards and metrics for evaluation, including hallucination detection, factuality, bias, robustness, interpretability, and alignment drift.
  • Develop platform components and APIs that facilitate seamless integration of evaluation processes into training, fine-tuning, deployment, and continuous monitoring workflows (a minimal harness sketch follows this list).
  • Design automated pipelines and benchmarks for adversarial testing, red-teaming, and stress testing of LLMs and retrieval-augmented generation (RAG) systems.
  • Lead initiatives in multi-dimensional evaluation, focusing on safety, grounding, and agent behavior metrics.
  • Collaborate with cross-functional stakeholders to translate abstract evaluation goals into practical, system-level frameworks.
  • Advance interpretability and observability tools to enable teams to understand, debug, and explain LLM behaviors in production environments.
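
To make the platform-component responsibilities above more concrete, the following is a hedged sketch of a tiny evaluation harness that applies a registry of metric callables to a batch of (prompt, answer, context) records and reports mean scores. The names (EvalRecord, run_suite) and the toy metrics are assumptions for illustration only, not part of any Red Hat, Wiraa, or Hugging Face API.

```python
# Hypothetical sketch of an evaluation-harness component: run a registry of
# metric callables over (prompt, answer, context) records and aggregate the
# mean score per metric. Names and structure are illustrative assumptions.
from dataclasses import dataclass
from statistics import mean
from typing import Callable

@dataclass
class EvalRecord:
    prompt: str
    answer: str   # model output under evaluation
    context: str  # retrieved passages for RAG-style grounding checks

# Each metric maps a record to a score in [0.0, 1.0].
Metric = Callable[[EvalRecord], float]

def run_suite(records: list[EvalRecord], metrics: dict[str, Metric]) -> dict[str, float]:
    """Apply every registered metric to every record and report mean scores."""
    return {
        name: mean(metric(rec) for rec in records)
        for name, metric in metrics.items()
    }

if __name__ == "__main__":
    # Trivial example metrics, stand-ins for factuality/toxicity/robustness scorers.
    metrics: dict[str, Metric] = {
        "non_empty": lambda r: 1.0 if r.answer.strip() else 0.0,
        "answer_uses_context": lambda r: 1.0 if any(
            w in r.context.lower() for w in r.answer.lower().split()
        ) else 0.0,
    }
    records = [
        EvalRecord(prompt="Where is Red Hat headquartered?",
                   answer="Raleigh, North Carolina.",
                   context="Red Hat is headquartered in Raleigh, North Carolina."),
    ]
    print(run_suite(records, metrics))
```

In a real platform, each metric entry would wrap a model-backed scorer, and the aggregated report would feed dashboards and regression gates during training, deployment, and continuous monitoring.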

Other

  • Advanced degree in Machine Learning, Computer Science, or related fields with a focus on evaluation, safety, or interpretability is preferred.
  • Mentor engineers, promote best practices, and drive the adoption of evaluation-driven development methodologies.
  • Represent the team’s evaluation-first approach in external forums, publications, and industry conferences, influencing the future direction of AI safety and evaluation standards.
  • Collaborate with cross-functional teams, including safety, research, product, and infrastructure.
  • Contribute to open-source projects that democratize trustworthy AI infrastructure.