People don't actually know whether AI agents are working. To make AI agents work in the real world, HUD needs detailed evals across a huge range of tasks.
Requirements
- Proficiency in Python, Docker, and Linux environments
- React experience for frontend development
- Production-level software development experience preferred
- Strong technical aptitude and demonstrated problem-solving ability
- Hands-on experience with LLM evaluation frameworks and methodologies
- Contributed to evaluation harnesses (e.g. EleutherAI's lm-evaluation-harness, Inspect, or similar)
- Built custom evaluation pipelines or datasets
Responsibilities
- Build out environments for HUD's computer-use agent (CUA) evaluation datasets, including evals for safety red-teaming, general business tasks, long-horizon agentic tasks, etc.
- Deliver custom CUA datasets and evaluation pipelines requested by clients
- Contribute to improving the HUD evaluation harness, depending on your interests, skills, and current organizational priorities
Other
- Experience at early-stage startups, with the ability to work independently in a fast-paced environment
- Strong communication skills for remote collaboration across time zones
- Familiarity with current AI tools and LLM capabilities
- Understanding of safety and alignment considerations in AI systems
- Evidence of rapid learning and adaptability in technical environments (e.g. programming competitions)