fal is looking for a Staff Technical Lead for Inference & ML Performance to guide a team that builds and optimizes state-of-the-art inference systems for our generative-media infrastructure, pushing the boundaries of model inference performance to deliver seamless creative experiences at unprecedented scale.
Requirements
- Are deeply experienced in ML performance optimization. You've optimized inference for large-scale generative models in production environments.
- Understand the full ML performance stack. From PyTorch, TensorRT, TransformerEngine, and Triton to CUTLASS kernels, you've navigated and optimized them all.
- Know inference inside-out. Expert-level familiarity with advanced inference techniques: quantization, kernel authoring, compilation, model parallelism (tensor, context/sequence, and expert parallelism), distributed serving, and profiling.
- Have experience building inference engines specifically for diffusion and generative-media models.
- Have a track record of industry-leading performance improvements (papers, open-source contributions, benchmarks).
Responsibilities
- Set technical direction. Guide your team (kernels, applied performance, ML compilers, and distributed inference) to build high-performance inference solutions.
- Lead hands-on as an IC. Personally contribute critical inference performance optimizations.
- Collaborate closely with research & applied ML teams. Influence model inference strategies and deployment techniques.
- Drive advanced performance optimizations. Implement model parallelism, kernel optimization, and compiler strategies.
- Mentor and scale your team. Coach and expand your team of performance-focused engineers.
Other
- Lead from the front. You're a respected IC who enjoys getting hands-on with the toughest problems, demonstrating excellence to inspire your team.
- Thrive in cross-functional collaboration. Comfortable interfacing closely with applied ML teams, researchers, and stakeholders.
- Bring leadership experience in scaling technical teams.