Gen AI Engineer developing massive LLMs (billions of parameters) and taking them from testing to production in a high-impact environment
Requirements
- Deep expertise in distributed training frameworks (DeepSpeed, Megatron-LM, PyTorch FSDP, Mesh TensorFlow, JAX/TPU)
- Proficiency with parallelism strategies (data, tensor, pipeline) and mixed-precision training; see the FSDP sketch after this list
- Experience with large-scale cloud or HPC environments (AWS, Azure, GCP, Slurm, Kubernetes, Ray)
- Strong skills in Python, CUDA, and performance optimization
- Experience with LLM fine-tuning (RLHF, LoRA, PEFT); see the LoRA sketch after this list
- Familiarity with tokenizer development and multilingual pretraining
- Knowledge of scaling laws and model evaluation frameworks for massive LLMs; a worked budgeting example follows this list
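The kind of work the first two bullets describe looks roughly like the following minimal sketch: sharded data parallelism with bf16 mixed precision via PyTorch FSDP. The model, dimensions, and launch setup are illustrative assumptions, not this team's actual stack.

```python
# Minimal FSDP + mixed-precision sketch; launch with torchrun (one process per GPU).
# Model and shapes are placeholders for illustration only.
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, MixedPrecision

def main():
    dist.init_process_group("nccl")
    torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

    model = torch.nn.TransformerEncoderLayer(d_model=4096, nhead=32).cuda()
    model = FSDP(
        model,
        mixed_precision=MixedPrecision(
            param_dtype=torch.bfloat16,   # compute in bf16
            reduce_dtype=torch.bfloat16,  # gradient reduction in bf16
            buffer_dtype=torch.bfloat16,
        ),
    )
    optim = torch.optim.AdamW(model.parameters(), lr=1e-4)  # create after wrapping

    x = torch.randn(8, 512, 4096, device="cuda")
    loss = model(x).float().mean()  # toy loss for illustration
    loss.backward()
    optim.step()

if __name__ == "__main__":
    main()
```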
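For the fine-tuning bullet, a minimal LoRA setup with the Hugging Face PEFT library might look like this; the base model and hyperparameters are illustrative assumptions.

```python
# LoRA adapter setup via Hugging Face PEFT; "gpt2" stands in for a large LLM.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("gpt2")
config = LoraConfig(
    r=8,                         # rank of the low-rank update matrices
    lora_alpha=16,               # scaling factor applied to the update
    target_modules=["c_attn"],   # attention projection to adapt (GPT-2 naming)
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, config)
model.print_trainable_parameters()  # typically well under 1% of total parameters
```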
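For the scaling-laws bullet, a worked back-of-the-envelope budget, assuming the common Chinchilla-style heuristics (~20 training tokens per parameter, and C ≈ 6·N·D training FLOPs); the numbers are illustrative, not a prescription.

```python
# Compute-optimal budgeting from scaling-law rules of thumb.
N = 70e9       # model parameters
D = 20 * N     # compute-optimal training tokens (Chinchilla heuristic)
C = 6 * N * D  # approximate training FLOPs
print(f"tokens: {D:.2e}, FLOPs: {C:.2e}")
# tokens: 1.40e+12, FLOPs: 5.88e+23
```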
Responsibilities
- Architect and implement large-scale training pipelines for LLMs with billions of parameters
- Optimize distributed training performance across thousands of GPUs/TPUs
- Collaborate with research scientists to translate experimental results into production-grade training runs
- Manage and preprocess petabyte-scale datasets for pretraining; see the streaming sketch after this list
- Implement state-of-the-art techniques in scaling laws, model parallelism, and memory optimization
- Conduct rigorous benchmarking, profiling, and performance tuning; see the profiler sketch after this list
- Contribute to the Client's research in LLM architecture, training stability, and efficiency
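For the dataset bullet, corpora at this scale are typically tokenized in streaming fashion rather than downloaded whole. A minimal sketch using Hugging Face datasets in streaming mode; the dataset name and tokenizer are illustrative assumptions.

```python
# Streaming tokenization of a web-scale corpus; nothing is fully materialized on disk.
from datasets import load_dataset
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
stream = load_dataset("allenai/c4", "en", split="train", streaming=True)

def tokenize(batch):
    return tok(batch["text"], truncation=True, max_length=2048)

tokenized = stream.map(tokenize, batched=True)
for i, example in enumerate(tokenized):
    if i >= 2:  # peek at a few records; real jobs shard the stream across workers
        break
    print(len(example["input_ids"]))
```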
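For the benchmarking bullet, a minimal kernel-level profiling pass with torch.profiler; the model and input shapes are placeholders.

```python
# Profile a forward/backward pass and rank ops by GPU time to find
# fusion, precision, and layout opportunities.
import torch
from torch.profiler import profile, ProfilerActivity

model = torch.nn.TransformerEncoderLayer(d_model=1024, nhead=16).cuda()
x = torch.randn(8, 128, 1024, device="cuda")

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    model(x).sum().backward()

print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
```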
Other
- Advanced degree (PhD or Master's) in Computer Science, Machine Learning, or a related field from a top global university
- United States Employment Opportunities Only
- E-Verify is used to confirm an individual's eligibility to work in the United States
- Strong publication record in top-tier ML/AI venues (NeurIPS, ICML, ICLR, ACL, etc.) preferred
- Years of hands-on experience with large-scale deep learning model training