Cornelis Networks is looking for an AI Performance Engineer to optimize training and multi-node inference across next-gen networking silicon and systems for AI and HPC datacenters.
Requirements
- Proven ability to set up, run, and analyze AI benchmarks; deep intuition for message passing, collectives, scaling efficiency, and bottleneck hunting for both training and low-latency serving.
- Hands-on experience with distributed training beyond a single GPU (DP/TP/PP, ZeRO, FSDP, sharded optimizers) and with distributed inference architectures (replicated vs. sharded, tensor/KV parallelism, MoE).
- Practical experience across AI stacks & comms: PyTorch, DeepSpeed, Megatron-LM, PyTorch Lightning; RCCL/NCCL, MPI/Horovod; Triton Inference Server, vLLM, TensorRT-LLM, Ray Serve, KServe.
- Comfortable with compilers (GCC/LLVM/Intel oneAPI) and MPI stacks; Python and shell power user.
- Familiarity with network architectures (Omni-Path/OPA, InfiniBand, Ethernet with RDMA/RoCE) and Linux systems at the performance-tuning level, including NIC offloads, CQ moderation, pacing, and ECN/RED.
- Hands-on profiling & tracing of GPU/comm paths (Nsight Systems, Nsight Compute, ROCm tools/rocprof/roctracer/omnitrace, VTune, perf, PCP, eBPF).
- Experience with NeMo, DeepSpeed, Megatron-LM, FSDP, and collective ops analysis (AllReduce/AllGather/ReduceScatter/Broadcast).
Responsibilities
- Own end-to-end performance for distributed AI workloads (training + multi-node inference) across multi-node clusters and diverse fabrics (Omni-Path, Ethernet, InfiniBand).
- Benchmark, characterize, and tune open-source and industry workloads (e.g., Llama, Mixtral, diffusion models, BERT/T5, MLPerf) on current and future compute, storage, and network hardware, including vLLM/TensorRT-LLM/Triton serving paths.
- Design and optimize distributed serving topologies (sharded/replicated, tensor/pipeline parallelism, MoE expert placement), continuous/adaptive batching, KV-cache sharding/offload (CPU/NVMe) and prefix caching, and token streaming under tight p99/p999 SLOs.
- Optimize inference paths: validate RDMA/GPUDirect RDMA, congestion control, and collective vs. point-to-point communication tradeoffs in serving.
- Design experiment plans to isolate scaling bottlenecks (collectives, kernel hot spots, I/O, memory, topology) and deliver clear, actionable deltas with latency-SLO dashboards and queuing analysis.
- Build crisp proof points that compare Cornelis Omni-Path to competing interconnects; translate data into narratives for sales/marketing and lighthouse customers, including cost-per-token and tokens/sec-per-watt for serving.
- Instrument and visualize performance (Nsight Systems, ROCm/Omnitrace, VTune, perf, eBPF, RCCL/NCCL tracing, app timers) plus serving telemetry (Prometheus/Grafana, OpenTelemetry traces, concurrency/queue depth).
Other
- Excellent written and verbal communication; able to turn measurements into persuasive, SLO-driven narratives for inference.
- This is a remote position for employees residing within the United States.
- We encourage applications from all qualified candidates and will accommodate applicants' needs under applicable laws throughout all stages of the recruitment and selection process.