OpenAI's Workload team designs and implements dataset infrastructure for next-generation LLM training stacks. This requires standardized interfaces, pipelines that scale across thousands of GPUs, and proactive testing for performance bottlenecks so that model training stays efficient and reliable.
Requirements
- Have strong engineering fundamentals with experience in distributed systems, data pipelines, or infrastructure.
- Have experience building APIs, modular code, and scalable abstractions, and recognize that abstractions ultimately serve their users, so UX is an important part of abstraction design.
- Are comfortable debugging bottlenecks across large fleets of machines.
- Have a background in data-related mathematics, probability, or distributed data theory.
- Have worked with GPU-scale distributed systems or dataset scaling for real-time data.
Responsibilities
- Design and maintain standardized dataset APIs, including for multimodal (MM) data that cannot fit in memory (a streaming sketch follows this list).
- Build proactive testing and scale validation pipelines for dataset loading at GPU scale.
- Collaborate with teammates to integrate datasets seamlessly into training and inference pipelines, ensuring smooth adoption and a great user experience.
- Document and maintain dataset interfaces so they are discoverable, consistent, and easy for other teams to adopt.
- Establish safeguards and validation systems to ensure datasets remain reproducible and unchanged once standardized (see the checksum-manifest sketch after this list).
- Debug and resolve performance bottlenecks in distributed dataset loading, such as straggler ranks slowing global training (a simple straggler probe is sketched after this list).
- Provide visualization and inspection tools to surface errors, bugs, or bottlenecks in datasets.
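To illustrate the kind of dataset API in the first responsibility above, here is a minimal Python sketch of a sharded, streaming interface for multimodal data that cannot fit in memory. The class name ShardedMultimodalDataset, the JSONL shard format, and the rank-strided split are assumptions for illustration, not the team's actual interface.

```python
# Hypothetical sketch of a streaming dataset API for multimodal shards that
# do not fit in memory. Shard paths, record layout, and the decode step are
# assumptions; only one shard is read at a time.
import json
from dataclasses import dataclass
from typing import Iterator, List


@dataclass
class Record:
    text: str          # tokenizable text field
    image_path: str    # pointer to image bytes left on disk, not loaded eagerly


class ShardedMultimodalDataset:
    """Streams records shard by shard so memory use stays bounded."""

    def __init__(self, shard_paths: List[str], rank: int = 0, world_size: int = 1):
        # Each data-parallel rank reads a disjoint, strided subset of shards.
        self.shard_paths = shard_paths[rank::world_size]

    def __iter__(self) -> Iterator[Record]:
        for path in self.shard_paths:
            # Read one JSONL shard at a time; never materialize the full dataset.
            with open(path) as f:
                for line in f:
                    row = json.loads(line)
                    yield Record(text=row["text"], image_path=row["image"])
```

A rank-strided shard split like this keeps every data-parallel worker on a disjoint subset of the data without any cross-rank coordination.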
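For the reproducibility safeguards mentioned above, one common pattern is to fingerprint every shard and compare against a frozen manifest. This is a sketch under that assumption; the manifest layout and the verify_manifest helper are hypothetical.

```python
# Hypothetical reproducibility safeguard: fingerprint every shard and compare
# against a frozen manifest so a standardized dataset cannot drift silently.
import hashlib
import json
from pathlib import Path


def shard_digest(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a shard in chunks so arbitrarily large files stay memory-safe."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def verify_manifest(manifest_path: Path) -> None:
    # Manifest format ({"shards": {"relative/path": "sha256", ...}}) is an assumption.
    manifest = json.loads(manifest_path.read_text())
    for rel_path, expected in manifest["shards"].items():
        actual = shard_digest(manifest_path.parent / rel_path)
        if actual != expected:
            raise RuntimeError(f"dataset drift detected in {rel_path}")
```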
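For straggler debugging, a simple probe is to time batch fetches on each rank and flag ranks whose tail latency sits far above the fleet median. The 3x threshold and the find_stragglers helper below are illustrative assumptions; how per-rank timings are gathered (e.g., via an all-gather) is left out.

```python
# Hypothetical straggler probe: measure per-rank batch fetch latency locally,
# then flag ranks whose p99 is well above the fleet-wide median.
import statistics
import time
from typing import Dict, Iterable, List


def time_batches(loader: Iterable, num_batches: int) -> List[float]:
    """Record wall-clock time per batch fetch on the local rank."""
    timings = []
    it = iter(loader)
    for _ in range(num_batches):
        start = time.perf_counter()
        try:
            next(it)
        except StopIteration:
            break
        timings.append(time.perf_counter() - start)
    return timings


def find_stragglers(per_rank_p99: Dict[int, float], factor: float = 3.0) -> List[int]:
    """Flag ranks whose p99 fetch latency exceeds factor x the fleet median."""
    median = statistics.median(per_rank_p99.values())
    return [rank for rank, p99 in per_rank_p99.items() if p99 > factor * median]
```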
Other
- Take pride in building infrastructure that “just works,” and find joy in being the guardian of reliability and scale.
- Are collaborative, humble, and excited to own a foundational (if not glamorous) part of the ML stack.