An AI research organization is seeking a machine learning engineer to design, evaluate, and curate machine learning tasks, datasets, and evaluation workflows that support the training and benchmarking of advanced AI models, specifically large language models (LLMs).
Requirements
- Minimum of 2 years of applied experience in machine learning.
- Strong proficiency in Python and modern ML frameworks (PyTorch or TensorFlow).
- Solid understanding of ML fundamentals, model evaluation, and optimization.
- Experience with ML pipelines, experiment tracking, and cloud environments.
- Experience creating ML benchmarks, evaluations, or challenge problems.
- Background in generative models, LLMs, or multimodal learning.
- Familiarity with MLOps tools (e.g., MLflow, Weights & Biases, Docker).
Responsibilities
- Design and frame machine learning tasks to evaluate and improve LLM capabilities.
- Build, train, and evaluate ML models across NLP, classification, prediction, and generative tasks.
- Conduct experimentation, performance analysis, and iterative improvement.
- Perform feature engineering, data preprocessing, and robustness testing.
- Implement evaluation metrics, benchmarking workflows, and bias analyses.
- Fine-tune and evaluate transformer-based models where applicable.
- Maintain clear documentation of datasets, experiments, and modeling decisions.
Other
- Technical degree in Computer Science, Engineering, Statistics, Mathematics, or a related field.
- Professional working proficiency in written and spoken English.
- Fully remote and asynchronous collaboration.
- Hourly contract engagement, approximately 30–40 hours per week, with flexible scheduling.