The GEICO AI ML Infra team is seeking an exceptional Senior ML Platform Engineer to build and scale our machine learning infrastructure, with a focus on Large Language Models (LLMs) and AI applications.
Requirements
- Proficient in Python; strong skills in Go, Rust, or Java preferred
- Proven experience working with open source LLMs (Llama 2/3, Qwen, Mistral, Gemma, Code Llama, etc.)
- Proficient in Kubernetes, including custom operators, Helm charts, and GPU scheduling
- Deep expertise in Azure services (AKS, Azure ML, Container Registry, Storage, Networking)
- Experience implementing and operating feature stores (Chronon, Feast, Tecton, Azure ML Feature Store, or custom solutions)
- Hands-on experience with inference optimization using vLLM, TensorRT-LLM, Triton Inference Server, or similar
- Advanced experience with Azure DevOps, GitHub Actions, Jenkins, or similar CI/CD platforms
Responsibilities
- Design and implement scalable infrastructure for training, fine-tuning, and serving open source LLMs (Llama, Mistral, Gemma, etc.)
- Architect and manage Kubernetes clusters for ML workloads, including GPU scheduling, autoscaling, and resource optimization
- Design, implement, and maintain feature stores for ML model training and inference pipelines
- Build and optimize LLM inference systems using frameworks like vLLM, TensorRT-LLM, and custom serving solutions
- Ensure 99.9%+ uptime for ML platforms through robust monitoring, alerting, and incident response procedures
- Design and implement ML platforms using DataRobot, Azure Machine Learning, Azure Kubernetes Service (AKS), and Azure Container Instances
- Develop and maintain infrastructure using Terraform, ARM templates, and Azure DevOps
Other
- Excellent verbal and written communication skills, with a proven ability to work both independently and in a team environment
- Mentor junior engineers and data scientists on platform best practices, infrastructure design, and ML operations
- Lead comprehensive code reviews focusing on scalability, reliability, security, and maintainability
- Work closely with data scientists to understand requirements and optimize workflows for model development and deployment
- At this time, GEICO will not sponsor a new applicant for employment authorization for this position.