Lilly is seeking an AI HPC Platform Engineer to accelerate the next era of AI and HPC innovation by enabling and supporting leading-edge AI/ML workloads using NVIDIA’s Run:ai platform and traditional HPC infrastructure.
Requirements
- Hands-on experience in HPC and AI platforms, including in-depth knowledge of accelerators (e.g., GPU), HPC schedulers (e.g., Altair Grid Engine, Slurm), Kubernetes platforms, and container technologies (Docker, Apptainer).
- 6+ years of demonstrated experience in AI/ML and HPC workloads, infrastructure, and cluster architectures.
- Expertise in Linux system and HPC administration, including experience with platform observability (e.g., alerting, logging, and metrics).
- Knowledge of Run:ai core concepts, including roles, departments, projects, workloads, quotas, GPU fractions, and preemptible vs. non-preemptible jobs.
- Experience writing, building, and running containers, plus an understanding of container registry management and the use of NGC images.
- Experience with machine learning frameworks such as PyTorch, Keras, and TensorFlow.
- Strong programming and scripting skills in languages such as Python or Bash.
Responsibilities
- You will drive the engineering and operations work to design, build, and maintain scalable AI HPC platforms, and collaborate on infrastructure for training and inference on large-scale, distributed GPU clusters.
- You will play a crucial role in boosting productivity for our Advanced Intelligence teams by advancing our AI and HPC infrastructure and experiences.
- Collaborate with researchers and scientists to optimize performance and streamline workflows.
- Leverage tooling and automation for ML workflow orchestration, resource scheduling, data access, and reproducibility.
- Evolve and operate public cloud and on-premises environments with a focus on availability and performance for AI and HPC workloads.
- Define and monitor infrastructure metrics as well as ML-specific metrics, such as model efficiency, resource utilization, and job success rates.
Other
- You will bring high learning agility and platform engineering skills to enable the Lilly Technology strategy, identifying opportunities to accelerate our AI journey.
- You will advance initiatives to enable critical business projects.
- You will have opportunities to apply agile ways of working, and you should be willing to become an expert in deploying AI and HPC solutions.
- You will learn about new technologies in AI and HPC.
- You will bring a passion for continual learning and for staying informed about new technologies, infrastructure trends, and approaches in the AI/ML field.