
Senior AI Evaluation Scientist

Steampunk, Inc.

$140,000 - $190,000
Nov 22, 2025
McLean, VA, United States of America

Steampunk is seeking a Senior AI Evaluation Scientist to design and lead rigorous evaluation programs for predictive and generative AI systems across its enterprise and client engagements, ensuring AI solutions are accurate, reliable, safe, and aligned with mission outcomes.

Requirements

  • 8+ years of experience evaluating machine learning, NLP, or generative AI systems, including strong familiarity with LLMs and retrieval-based architectures.
  • Deep understanding of evaluation metrics, statistical testing, dataset construction, experimental design, and model validation methodologies.
  • Hands-on experience with Python and libraries such as PyTorch, Hugging Face, LangChain, scikit-learn, and evaluation tooling (LLM-as-a-judge, rubric-based evaluators, or custom harnesses).
  • Demonstrated experience designing automated evaluation pipelines and integrating them into CI/CD or LLMOps workflows.
  • Strong understanding of AI governance, responsible AI principles, bias detection, fairness metrics, and risk identification.
  • Experience working with structured and unstructured datasets across multiple modalities (text, tabular, documents).
  • Familiarity with vector databases, RAG architectures, and multi-step LLM workflows.

Responsibilities

  • Lead the design and implementation of comprehensive evaluation frameworks for generative and predictive AI models, including accuracy, robustness, relevance, trustworthiness, fairness, hallucination rates, and safety.
  • Develop and maintain automated evaluation pipelines that continuously audit model outputs, monitor quality drift, and validate alignment with mission-specific constraints.
  • Create custom benchmark datasets, challenge sets, and adversarial evaluation strategies tailored to client domains and regulatory requirements.
  • Conduct in-depth error analysis, model behavior studies, and sensitivity assessments to inform iterative improvements in prompts, retrieval systems, models, and orchestration frameworks.
  • Partner with AI Product Engineers, LLMOps Engineers, and Data Scientists to drive model improvements through structured experimentation, A/B testing, and scientifically grounded evaluation cycles.
  • Advise teams on measurement methodologies, statistical significance, and best practices for Trustworthy AI evaluation in alignment with NIST AI RMF, MLSecOps, and agency governance requirements.
  • Document evaluation results, risks, and findings for technical and non-technical audiences, including engineering teams, leadership, and government clients.

Other

  • Ability to hold a position of public trust with the U.S. government.
  • Bachelor’s, Master’s, or Ph.D. in Computer Science, Statistics, Machine Learning, Cognitive Science, Human-Computer Interaction, or a related field.
  • Excellent analytical, written, and verbal communication skills, with the ability to translate evaluation insights into clear technical recommendations.
  • Proven ability to collaborate with cross-functional engineering and product teams while independently driving evaluation strategy.
  • Experience working in agile or iterative development environments and documenting scientific processes clearly.