Cornelis Networks is looking for an AI Performance Engineer to optimize training and multi-node inference across next-gen networking silicon and systems for AI and HPC datacenters.
Requirements
- Proven ability to set up, run, and analyze AI benchmarks; deep intuition for message passing, collectives, scaling efficiency, and bottleneck hunting for both training and low-latency serving.
- Hands-on experience with distributed training beyond a single GPU (DP/TP/PP, ZeRO, FSDP, sharded optimizers) and with distributed inference architectures (replicated vs. sharded, tensor/KV parallelism, MoE).
- Practical experience across AI stacks & comms: PyTorch, DeepSpeed, Megatron-LM, PyTorch Lightning; RCCL/NCCL, MPI/Horovod; Triton Inference Server, vLLM, TensorRT-LLM, Ray Serve, KServe.
- Comfortable with compilers (GCC/LLVM/Intel oneAPI) and MPI stacks; Python and shell power user.
- Familiarity with network architectures (Omni-Path/OPA, InfiniBand, Ethernet with RDMA/RoCE) and Linux systems at the performance-tuning level, including NIC offloads, CQ moderation, pacing, and ECN/RED.
- Hands-on profiling & tracing of GPU/comm paths (Nsight Systems, Nsight Compute, ROCm tools/rocprof/roctracer/omnitrace, VTune, perf, PCP, eBPF).
- Experience with NeMo, DeepSpeed, Megatron-LM, FSDP, and collective ops analysis (AllReduce/AllGather/ReduceScatter/Broadcast).
Responsibilities
- Own end-to-end performance for distributed AI workloads (training + multi-node inference) across multi-node clusters and diverse fabrics (Omni-Path, Ethernet, InfiniBand).
- Benchmark, characterize, and tune open-source and industry workloads (e.g., Llama, Mixtral, diffusion models, BERT/T5, MLPerf) on current and future compute, storage, and network hardware, including vLLM/TensorRT-LLM/Triton serving paths.
- Design and optimize distributed serving topologies (sharded/replicated, tensor/pipeline parallelism, MoE expert placement), continuous/adaptive batching, KV-cache sharding/offload (CPU/NVMe) and prefix caching, and token streaming under tight p99/p999 SLOs.
- Optimize inference paths: validate RDMA/GPUDirect RDMA, congestion control, and collective vs. point-to-point communication tradeoffs in serving.
- Design experiment plans to isolate scaling bottlenecks (collectives, kernel hot spots, I/O, memory, topology) and deliver clear, actionable deltas with latency-SLO dashboards and queuing analysis.
- Build crisp proof points that compare Cornelis Omni-Path to competing interconnects; translate data into narratives for sales/marketing and lighthouse customers, including cost-per-token and tokens/sec-per-watt for serving.
- Instrument and visualize performance (Nsight Systems, ROCm/Omnitrace, VTune, perf, eBPF, RCCL/NCCL tracing, app timers) plus serving telemetry (Prometheus/Grafana, OpenTelemetry traces, concurrency/queue depth).
Other
- Excellent written and verbal communication; able to turn measurements into persuasive, SLO-driven narratives for inference.
- This is a remote position for employees residing within the United States.
- We encourage applications from all qualified candidates and will accommodate applicants' needs under applicable laws throughout all stages of the recruitment and selection process.