Senior AI Engineer - Foundation Model Training & Infrastructure for Power Grids (US)

Siemens Energy

Salary not specified

Sep 19, 2025

Remote, US

Siemens Energy, Inc. is looking to build and optimize the end-to-end systems, data pipelines, and training processes for foundation models in power grid applications, enabling the rapid development and deployment of transformational AI solutions.

Requirements

5 or more years as a Data & AI (Artificial Intelligence) Engineer or Machine Learning Engineer, focusing on building and optimizing infrastructure for large-scale machine learning systems.
Deep practical expertise with AI frameworks (PyTorch, JAX, PyTorch Lightning, etc.), large-scale multi-node GPU training, and optimization strategies for large foundation models on distributed compute infrastructure.
Excellent problem-solving, debugging, and performance optimization skills, with a data-driven approach to identifying and resolving technical challenges.
Strong communication and teamwork skills, and experience with MLOps best practices for model tracking, evaluation, and deployment.
Public GitHub profile with a track record of open-source contributions to data engineering or deep learning infrastructure projects.
Experience writing CUDA/Triton/CUTLASS kernels.
Proficiency with performance monitoring and profiling tools for distributed training and data pipelines.

Responsibilities

Designing, building, and optimizing all aspects of large-scale training and fine-tuning, from data loading to inference, to maximize Model FLOPs Utilization (MFU) on large compute clusters.
Working closely and proactively with research scientists to translate models and algorithms into high-performance, production-ready code, integrating and testing the latest advancements.
Relentlessly profiling and resolving training performance bottlenecks, optimizing the entire training stack for speed and efficiency.
Contributing to the technology evaluations and selection of hardware, software, and cloud services for the AI infrastructure platform.
Using MLOps frameworks (MLflow, W&B, etc.) to apply best practices across the model lifecycle, ensuring reproducibility, reliability, and continuous improvement.
Creating thorough documentation for infrastructure and training procedures, staying updated on advancements in training strategies, and driving improvements in workflows and infrastructure.

Other

Master's degree or higher in Computer Science, Engineering, or a related technical field.
Candidates with more or less experience may be considered for a higher or lower level, respectively.
High-agency individual demonstrating initiative, problem-solving, and a commitment to delivering robust and scalable solutions for rapid prototyping and turnaround.
Supportive work culture