Sr Machine Learning Engineer - Infinia AI Performance

DataDirect Networks (DDN)

Salary not specified
Aug 21, 2025
Boston, MA, USA • Raleigh, NC, USA • Denver, CO, USA • Tucson, AZ, USA

DataDirect Networks (DDN) is seeking a Senior Machine Learning Engineer to optimize training, inference, and Retrieval-Augmented Generation (RAG) pipelines for high-performance AI applications.

Requirements

  • Proven expertise in building and scaling AI/ML pipelines
  • Strong understanding of machine learning frameworks and libraries (TensorFlow, PyTorch, NVIDIA NeMo, vLLM, TensorRT-LLM)
  • Experience in deploying open-source vector databases at scale
  • Solid understanding of cloud infrastructure (AWS, GCP, Azure) and distributed computing
  • Proficiency with containerization tools (Docker, Kubernetes) and infrastructure as code
  • Implementation-level understanding of ML frameworks, data loaders, and data formats
  • Experience with scaling RAG pipelines and integrating them with generative AI models

Responsibilities

  • Design and implement the integration of data ingestion and streaming pipelines with open-source tools such as Ray Data, Mosaic Streaming, tf.data, and the PyTorch DataLoader
  • Design optimizations for training, such as asynchronous checkpointing, and for inference, such as KV caching and LoRAX
  • Guide the integration of MLflow with DDN’s Infinia product for comprehensive experiment tracking, model versioning, and deployment
  • Drive the implementation and scaling of Retrieval-Augmented Generation (RAG) pipelines to enhance generative model performance
  • Stay abreast of the latest developments in AIOps, AI frameworks, optimization, and accelerated execution
  • Identify and implement solutions to optimize training and inference pipeline performance, runtime, and resource utilization on Infinia

Other

  • Bachelor’s or Master’s degree in Computer Science, Data Science, Machine Learning, or related fields
  • 5+ years of experience in machine learning operations (MLOps) or related roles
  • Excellent problem-solving and troubleshooting skills, with attention to detail and performance optimization
  • Strong communication and collaboration skills
  • Participation in an on-call rotation to provide after-hours support as needed