Anthropic is looking to develop and optimize the encodings and tokenization systems used throughout our Pretraining and Finetuning workflows, enabling more efficient and effective training of our AI systems while ensuring they remain reliable, interpretable, and steerable.
Requirements
- Have significant software engineering experience and demonstrated machine learning expertise
- Have experience with machine learning systems, data pipelines, or ML infrastructure
- Are proficient in Python and familiar with modern ML development practices
- Have strong analytical skills and can evaluate the impact of engineering changes on research outcomes
- Have worked with machine learning data processing pipelines
- Have built or optimized data encodings for ML applications
- Have implemented or worked with BPE, WordPiece, or other tokenization algorithms (see the illustrative sketch below)
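For a flavor of the algorithms involved, here is a minimal sketch of a single BPE merge step in the style of Sennrich et al. The function names and the toy corpus are hypothetical illustrations, not Anthropic's implementation:

```python
from collections import Counter

def get_pair_counts(vocab: dict[tuple[str, ...], int]) -> Counter:
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in vocab.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair: tuple[str, str], vocab: dict) -> dict:
    """Replace every occurrence of `pair` with its concatenation."""
    merged = {}
    for symbols, freq in vocab.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Toy corpus: words pre-split into characters, with counts.
vocab = {("l", "o", "w"): 5, ("l", "o", "w", "e", "r"): 2, ("n", "e", "w"): 3}
best = get_pair_counts(vocab).most_common(1)[0][0]  # most frequent pair
vocab = merge_pair(best, vocab)                     # e.g. ("l", "o") -> "lo"
print(best, vocab)
```

A full training loop simply repeats this step until the vocabulary reaches a target size; production systems differ mainly in byte-level handling and performance engineering.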
Responsibilities
- Design, develop, and maintain tokenization systems used across Pretraining and Finetuning workflows
- Optimize encoding techniques to improve model training efficiency and performance
- Collaborate closely with research teams to understand their evolving needs around data representation
- Build infrastructure that enables researchers to experiment with novel tokenization approaches
- Implement systems for monitoring and debugging tokenization-related issues in the model training pipeline
- Create robust testing frameworks to validate tokenization systems across diverse languages and data types (e.g., round-trip checks like the sketch after this list)
- Identify and address bottlenecks in data processing pipelines related to tokenization
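As one example of the kind of invariant such a testing framework might check, here is a minimal round-trip sketch. `ByteTokenizer` and the sample strings are hypothetical stand-ins, not Anthropic's tokenizer:

```python
class ByteTokenizer:
    """Trivial byte-level stand-in for a real tokenizer (hypothetical)."""

    def encode(self, text: str) -> list[int]:
        return list(text.encode("utf-8"))

    def decode(self, ids: list[int]) -> str:
        return bytes(ids).decode("utf-8")


def round_trip_failures(tokenizer, texts: list[str]) -> list[str]:
    """Return inputs that do not survive encode -> decode unchanged."""
    return [t for t in texts if tokenizer.decode(tokenizer.encode(t)) != t]


# Samples spanning scripts, emoji, and whitespace edge cases.
samples = ["hello world", "naïve café", "日本語のテキスト", "🙂 emoji", "tab\tand\nnewline"]
assert round_trip_failures(ByteTokenizer(), samples) == []
```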
Other
- Are comfortable navigating ambiguity and developing solutions in rapidly evolving research environments
- Can work independently while maintaining strong collaboration with cross-functional teams
- Are results-oriented, with a bias towards flexibility and impact
- Pick up slack, even if it goes outside your job description
- Enjoy pair programming (we love to pair!)