
Research Engineer, Agentic AI Evals

hud (YC W25)

Salary not specified
Sep 12, 2025
San Francisco, CA, US

People don't actually know if AI agents are working. To make AI agents work in the real world, HUD needs detailed evals for a huge range of tasks.

Requirements

  • Proficiency in Python, Docker, and Linux environments
  • React experience for frontend development
  • Production-level software development experience preferred
  • Strong technical aptitude and demonstrated problem-solving ability
  • Hands-on experience with LLM evaluation frameworks and methodologies
  • Contributions to evaluation harnesses (EleutherAI's lm-evaluation-harness, Inspect, or similar)
  • Experience building custom evaluation pipelines or datasets

Responsibilities

  • Build out environments for HUD's computer-use agent (CUA) evaluation datasets, including evals for safety red-teaming, general business tasks, and long-horizon agentic tasks
  • Deliver custom CUA datasets and evaluation pipelines requested by clients
  • Contribute to improving the HUD evaluation harness, depending on your interests, skills, and current organizational priorities

Other

  • Experience at early-stage technology startups, with the ability to work independently in fast-paced environments
  • Strong communication skills for remote collaboration across time zones
  • Familiarity with current AI tools and LLM capabilities
  • Understanding of safety and alignment considerations in AI systems
  • Evidence of rapid learning and adaptability in technical environments (e.g. programming competitions)