Design, build, and optimize the high-performance networking infrastructure powering AI/ML operations in Toronto, managing InfiniBand and ultra-high-speed Ethernet fabrics that connect NVIDIA H100 and A100 GPUs, over 20PB of Ceph storage, and hundreds of servers.
Requirements
- Hands-on experience with high-speed networking (100Gb+ Ethernet and InfiniBand)
- Hands-on experience with network security (firewalls, ACLs, network segmentation)
- Experience with InfiniBand fabrics and related technologies, including RDMA, RoCE, and IPoIB
- Strong understanding of L2/L3 networking protocols (TCP/IP, BGP, OSPF, VLANs)
- Experience optimizing networks for GPU-to-GPU communication
- Experience with network automation tools
- Familiarity with network monitoring and observability tools (Prometheus, Grafana)
Responsibilities
- Configure and maintain InfiniBand and high-speed Ethernet fabrics
- Optimize network performance for RDMA and GPU-to-GPU communication
- Manage network switches (Mellanox, NVIDIA, Micas Networks)
- Troubleshoot network bottlenecks and latency issues
- Plan and execute network upgrades and expansions
- Implement network security (firewalls, VLANs, ACLs)
- Monitor infrastructure
Other
- 4+ years of network engineering experience in production environments
- Knowledge of HPC network topologies
- Strong troubleshooting and problem-solving skills
- Experience in data center environments or AI/ML infrastructure
If you're a natural problem-solver with a passion for continuous learning, we'd love to hear from you.