DataDirect Networks (DDN) is seeking to optimize training, inference, and Retrieval-Augmented Generation (RAG) pipelines for high-performance AI applications.
Requirements
Proven expertise in building and scaling AI/ML pipelines
Strong understanding of machine learning frameworks and libraries (TensorFlow, PyTorch, NVIDIA NeMo, vLLM, TensorRT-LLM)
Experience in deploying open-source vector databases at scale
Solid understanding of cloud infrastructure (AWS, GCP, Azure) and distributed computing
Proficiency with containerization tools (Docker, Kubernetes) and infrastructure as code
Implementation-level understanding of ML frameworks, data loaders and data formats
Experience with scaling RAG pipelines and integrating them with generative AI models
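To make the last two requirements concrete, here is a minimal, self-contained sketch of the retrieval step in a RAG pipeline: documents are ranked by cosine similarity against a query embedding, and the top hits are folded into a prompt for a generative model. The `ToyVectorStore` and `build_prompt` names are illustrative stand-ins, not DDN or product APIs; in practice the store would be an open-source vector database (e.g., Milvus or Qdrant) and the embeddings would come from a real encoder.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

class ToyVectorStore:
    """In-memory stand-in for a vector database."""
    def __init__(self):
        self.items = []  # list of (embedding, document) pairs

    def add(self, embedding, document):
        self.items.append((embedding, document))

    def top_k(self, query, k=2):
        """Return the k documents whose embeddings are closest to the query."""
        ranked = sorted(self.items, key=lambda it: cosine(query, it[0]), reverse=True)
        return [doc for _, doc in ranked[:k]]

def build_prompt(question, store, embed, k=2):
    """Retrieve context for the question and assemble an augmented prompt."""
    context = store.top_k(embed(question), k)
    return "Context:\n" + "\n".join(context) + f"\nQuestion: {question}"
```

Scaling this pattern means replacing the linear scan in `top_k` with an approximate-nearest-neighbor index and batching the embedding calls, which is where the "at scale" part of the requirement lives.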
Responsibilities
Design and implement the integration of data ingestion and streaming pipelines with open-source tools such as Ray Data, Mosaic Streaming, tf.data, and the PyTorch DataLoader
Design optimizations for training (e.g., asynchronous checkpointing) and for inference (e.g., KV caching and LoRAX)
Guide the integration of MLflow with DDN’s Infinia product for comprehensive experiment tracking, model versioning, and deployment
Drive the implementation and scaling of RAG pipelines to enhance generative model performance
Stay abreast of the latest developments in AIOps, AI frameworks, optimization, and accelerated execution
Identify and implement solutions to optimize training and inference pipeline performance, runtime, and resource utilization on Infinia
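Asynchronous checkpointing, named among the responsibilities above as a training optimization, can be sketched as follows: the state snapshot is taken synchronously (so the training loop sees a consistent copy), while the slow storage write is offloaded to a background thread. This is a simplified illustration of the pattern, not a production implementation; real systems (e.g., PyTorch distributed checkpointing) additionally handle sharding and device-to-host transfers.

```python
import copy
import pickle
import threading

def async_checkpoint(state, path):
    """Snapshot `state` now, persist it in the background.

    The deepcopy happens on the calling (training) thread so later
    mutations to `state` cannot corrupt the checkpoint; only the
    storage I/O is moved off the critical path.
    """
    snapshot = copy.deepcopy(state)

    def _write():
        with open(path, "wb") as f:
            pickle.dump(snapshot, f)

    t = threading.Thread(target=_write)
    t.start()
    return t  # caller may join() before the next checkpoint to bound in-flight writes
```

The design choice worth noting: the copy is synchronous and the write is not. Skipping the copy would be faster but would race with the optimizer step; bounding in-flight writes via `join()` keeps memory use predictable.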
Other
Bachelor’s or Master’s degree in Computer Science, Data Science, Machine Learning, or related fields
5+ years of experience in machine learning operations (MLOps) or related roles
Excellent problem-solving and troubleshooting skills, with attention to detail and performance optimization
Strong communication and collaboration skills
Participation in an on-call rotation to provide after-hours support as needed