Companies want to train their own large models on their own data, but the current industry standard of training on a random sample is inefficient and can harm model quality. DatologyAI aims to translate research into tools that enable enterprise customers to identify the right data for training, resulting in better models at lower cost.
Requirements
- Strong programming skills in Python and familiarity with frameworks such as PyTorch, TensorFlow, or JAX
- Solid understanding of data structures, algorithms, and ML fundamentals
Responsibilities
- Build and improve components of our ML training and data curation pipelines
- Prototype and evaluate algorithms that identify informative data samples at scale
- Work with researchers to bring new data selection and model evaluation techniques into production
- Contribute to reliable and efficient distributed ML systems
- Learn how to take an idea from research to real-world deployment
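To give a flavor of the data-curation work described above, here is a minimal, hypothetical sketch (not DatologyAI's actual method) of one common selection heuristic: ranking training examples by per-example loss and keeping the hardest ones. The function name and toy losses are illustrative assumptions.

```python
# Illustrative sketch only: loss-based data selection keeps the k
# examples a model currently finds hardest (highest per-example loss).
# This is a generic heuristic, not DatologyAI's production algorithm.

def select_top_k_by_loss(losses, k):
    """Return the indices of the k examples with the highest loss."""
    ranked = sorted(range(len(losses)), key=lambda i: losses[i], reverse=True)
    return sorted(ranked[:k])

# Toy per-example losses, as if from one forward pass over a batch
losses = [0.2, 1.5, 0.05, 0.9, 2.1]
print(select_top_k_by_loss(losses, 3))  # -> [1, 3, 4]
```

In practice the same idea runs over billions of examples with distributed scoring and streaming top-k selection rather than an in-memory sort.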
Other
- Pursuing a BS, MS, or PhD in Computer Science, Electrical Engineering, or a related field
- Curious about large-scale training systems, data curation, and the infrastructure behind AI models
- Eager to learn from experienced engineers and contribute to production-quality code
- Collaborative, detail-oriented, and driven by curiosity
- This role is based in Redwood City, CA. We are in the office four days a week.