Compute is a commodity that both startups and compute providers struggle with. SF Compute aims to solve this by building a venue where compute contracts are traded in real time and by bringing traders into the supply chain.
Requirements
- Strong software engineering background, with experience building fault-tolerant distributed systems
- Comfortable with Linux internals, debugging, and performance optimization
- Exposure to GPU/HPC clusters
- Networking literacy: familiar with eBGP, VXLAN, RoCEv2, and InfiniBand, plus an understanding of how to design software systems that dynamically leverage these fabrics
- Strong automation, scripting, and documentation skills
- Go or Rust experience (3+ years)
- Deep knowledge of HPC fabrics (InfiniBand, Ultra Ethernet, RoCEv2)
Responsibilities
- Design and operate orchestration frameworks to manage tens of thousands of GPUs across Kubernetes, virtualization, and bare metal
- Develop automation frameworks for large-scale provisioning, monitoring, and fault tolerance
- Build distributed systems that can withstand node or cluster-wide failures
- Architect software-defined networking solutions that integrate with underlay switches and support scalable designs
- Collaborate with networking specialists to ensure fabric resilience, low latency, and scalability, leveraging routing protocols like BGP where needed
- Integrate high-performance distributed storage with compute and networking layers
Other
- Bachelor's, Master's, or Ph.D. degree in Computer Science or related field
- Visa sponsorship: yes, we sponsor visas and work permits
- Retirement matching: we match 401(k) plans up to 4%
- Medical, dental & vision: we offer competitive medical, dental, and vision insurance for employees and dependents, and we cover 100% of premiums
- Unlimited paid time off as well as 10+ observed holidays