OCI (Oracle Cloud Infrastructure) AI Infrastructure is looking to build a cutting-edge, ultra-high-performance GPU platform to support AI/ML/HPC workloads, and is seeking a software engineer to help design and develop fundamental architectural changes for GPU delivery, health monitoring, triage automation, and diagnostic services.
Requirements
- Deep understanding of operating systems, computer networks, and high-performance applications
- 4+ years’ experience delivering and operating large-scale production systems (1000+ server instances)
- Proficient in at least one programming language (Java, Python, C, C++, Go, or shell scripting)
- Strong background in Linux systems
- Familiarity with system-level architecture, data synchronization, fault tolerance, and state management
- General enterprise storage, networking, or computing experience
- Good understanding of databases and SQL (MySQL) and caching technologies (Redis, Memcached, etc.)
Responsibilities
- Designing, implementing, and delivering software and firmware for managing GPU-based AI servers
- Working closely with partner teams to deliver high-quality software to manage, triage, and repair GPU systems
- Working closely with product teams to debug and resolve customer issues
Other
- BS or MS degree in Computer Science or a relevant technical field involving coding, or equivalent practical experience
- Adaptable Engineers: Self-motivated individuals who learn quickly
- Collaborative Spirit: Comfortable working in a collaborative, agile environment and eager to learn
- Certain US customer or client-facing roles may be required to comply with applicable requirements, such as immunization and occupational health mandates
- Paid time off: Flexible Vacation is provided to all eligible employees assigned to a salaried (non-overtime eligible) position