Anthropic's mission is to create reliable, interpretable, and steerable AI systems that are safe and beneficial for users and society. We are seeking a Research Engineer/Scientist to join the Model Welfare program, which works to better understand, evaluate, and address concerns about the potential welfare and moral status of AI systems.
Requirements
- Significant applied software, ML, or research engineering experience
- Experience contributing to empirical AI research projects and/or technical AI safety research
- Ability to reliably turn abstract theories into creative, tractable research hypotheses and experiments
- Familiarity with machine learning, NLP, AI safety, interpretability, and/or LLM psychology and behavior
- Experience with moral philosophy, cognitive science, neuroscience, or related fields (not required but a plus)
- Strong technical research engineering skills
- Ability to move fast and iterate rather than pursue long, drawn-out projects
Responsibilities
- Investigate and improve the reliability of introspective self-reports from models
- Collaborate with Interpretability to explore potentially welfare-relevant features and circuits
- Improve and expand our welfare assessments for future frontier models
- Evaluate the presence of potentially welfare-relevant capabilities and characteristics as a function of model scale
- Develop strategies for making high-trust/verifiable commitments to models
- Explore possible interventions and deploy them into production (e.g., allowing models to end harmful or distressing interactions)
- Run technical research projects to investigate model characteristics of plausible relevance to welfare, consciousness, or related properties
Other
- At least a Bachelor's degree in a related field or equivalent experience
- Ability to work in the San Francisco office at least 25% of the time
- Strong communication skills
- Ability to work collaboratively with other teams
- Strong project management skills (a plus)