Siemens Energy, Inc. is looking to build and optimize the end-to-end systems, data pipelines, and processes for training foundation models for power grid applications, enabling the rapid development and deployment of transformational AI solutions.
Requirements
- 5 or more years of experience as a Data & AI (Artificial Intelligence) Engineer or Machine Learning Engineer, focusing on building and optimizing infrastructure for large-scale machine learning systems.
- Deep practical expertise with AI frameworks (PyTorch, JAX, PyTorch Lightning, etc.), large-scale multi-node GPU training, and optimization strategies for large foundation models on distributed compute infrastructure.
- Excellent problem-solving, debugging, and performance optimization skills, with a data-driven approach to identifying and resolving technical challenges.
- Strong communication and teamwork skills, and experience with MLOps best practices for model tracking, evaluation, and deployment.
- Public GitHub profile with a track record of open-source contributions to data engineering or deep learning infrastructure projects.
- Experience writing CUDA/Triton/CUTLASS kernels.
- Proficiency with performance monitoring and profiling tools for distributed training and data pipelines.
Responsibilities
- Designing, building, and optimizing all aspects of large-scale training and fine-tuning, from data loading to inference, to maximize Model FLOPs Utilization (MFU) on large compute clusters.
- Working closely and proactively with research scientists to translate models and algorithms into high-performance, production-ready code, integrating and testing the latest advancements.
- Relentlessly profiling and resolving training performance bottlenecks, optimizing the entire training stack for speed and efficiency.
- Contributing to the evaluation and selection of hardware, software, and cloud services for the AI infrastructure platform.
- Using MLOps frameworks (MLflow, Weights & Biases, etc.) to apply best practices across the model lifecycle, ensuring reproducibility, reliability, and continuous improvement.
- Creating thorough documentation for infrastructure and training procedures, staying updated on advancements in training strategies, and driving improvements in workflows and infrastructure.
Other
- Master's degree or higher in Computer Science, Engineering, or a related technical field.
- Candidates with more or less experience may be considered for a higher or lower level, respectively.
- High-agency individual who demonstrates initiative, problem-solving, and a commitment to delivering robust and scalable solutions for rapid prototyping and turnaround.
- Supportive work culture