Anthropic is working to build reliable, interpretable, and steerable AI systems. The ML Systems Engineer is responsible for the critical algorithms and infrastructure that researchers depend on to train models like Claude, work that enables breakthroughs in AI capabilities and safety.
Requirements
- High-performance, large-scale distributed systems
- Large-scale LLM training
- Python
- Implementing LLM finetuning algorithms, such as RLHF
- Making changes to our finetuning systems so they work on new model architectures
- Building instrumentation to detect and eliminate Python GIL contention in our training code
- Diagnosing why training runs have started slowing down after some number of steps, and fixing it
Responsibilities
- Implement and improve advanced techniques to create ever more capable, reliable, and steerable AI
- Own the critical algorithms and infrastructure that our researchers depend on to train models
- Focus obsessively on improving the performance, robustness, and usability of these systems
- Build, maintain, and improve the algorithms and systems that our researchers use to train models
- Improve the speed, reliability, and ease of use of these systems
- Profile our reinforcement learning pipeline to find opportunities for improvement
- Build a system that regularly launches training jobs in a test environment so that we can quickly detect problems in the training pipeline
You may be a good fit if you:
- Have 4+ years of software engineering experience
- Like working on systems and tools that make other people more productive
- Are results-oriented, with a bias towards flexibility and impact
- Pick up slack, even if it goes outside your job description
- Enjoy pair programming (we love to pair!)