Amazon is looking to support the development of industry-leading multimodal large language models (LLMs) by advancing the state of the art through novel algorithms and modeling techniques. The team leverages heterogeneous data sources and large-scale computing resources to accelerate the development of multimodal LLMs and generative AI.
Requirements
- Experience programming with at least one software programming language
- PhD or Master's degree in machine learning or a related field
- 2+ years of full software development life cycle, including coding standards, code reviews, source control management, build processes, testing, and operations experience
- Hands-on experience and expertise in training foundation models/LLMs, and/or in low-level optimization of ML training workflows, CUDA kernels, and network I/O
Responsibilities
- Pre-train and post-train multimodal LLMs
- Scale model training across very large GPU and AWS Trainium clusters
- Optimize training workflows using distributed training/parallelism techniques
- Optimize low-level details of the training stack, including CUDA kernels, communication collectives, and network I/O
- Utilize, build on, and extend industry-leading frameworks (NeMo, Megatron Core, PyTorch, JAX, vLLM, TRT, etc.)
- Work with other team members to investigate design approaches, prototype new technologies and scientific techniques, and evaluate technical feasibility
- Deliver results independently in a self-organizing Agile environment while continually embracing and adapting to new scientific advances
Other
- a passion for new opportunities
- a track record of success in delivering new features and products
- a commitment to teamwork, hustle, and strong communication skills (with both business and technical partners)
- experience delivering high-quality technology products/services in a hyper-growth environment where priorities shift quickly
- the ability to work safely and cooperatively with other employees, supervisors, and staff