The Global Infrastructure Engineering AI & HPC team at Accenture helps strategic and mission-critical clients reinvent their infrastructure for the next era of digital solutions powered by AI and High-Performance Computing (HPC).
Requirements
- Minimum 4 years of hands-on experience designing, deploying, and managing HPC and AI infrastructure across on-premises, cloud, and hybrid environments
- Minimum 4 years of experience with accelerated computing architectures (GPUs, XPUs, DPUs), high-performance fabrics (InfiniBand, Ethernet), SONiC, networking, and modern storage/data platforms
- Minimum 4 years of experience with cluster management and orchestration (e.g., Slurm, Run:ai, Kubernetes, Docker), real-time performance monitoring, and observability frameworks
- Minimum 4 years of experience with cloud and virtualization platforms (e.g., AWS, Azure, GCP, VMware, Nutanix) and expertise in automation and optimization using scripting (Python, AI tools) with foundational Infrastructure-as-Code tools such as Terraform and Ansible
- Minimum 4 years of experience implementing MLOps and DevSecOps frameworks to enable secure, automated, and reproducible workflows
- Experience managing the deployment of 1,000+ GPU clusters for HPC and AI workloads with various infrastructure services enabled
- Experience with GPU computing libraries and accelerators (e.g., NVIDIA CUDA, Dynamo, AMD ROCm)
Responsibilities
- Design and implement HPC and AI infrastructure solutions, aligning system architecture and deployment roadmaps to industry-specific performance and scalability needs
- Deploy, configure, and manage XPU-based clusters (CPU/GPU/accelerators) using schedulers, VM/K8s orchestration platforms, Slurm, and containerized platforms in scalable designs to provide Metal as a Service (MaaS), GPUaaS, AIaaS, and other offerings
- Optimize cluster performance, scalability, energy, and cost efficiency across on-premises, cloud, and hybrid environments
- Integrate AI and HPC platforms with existing IT systems, data pipelines, and security frameworks
- Monitor, troubleshoot, and tune infrastructure to ensure high availability, low-latency networking, and workload resiliency
- Develop and maintain documentation including architecture diagrams, configuration baselines, and operational runbooks
- Provide technical guidance and support to users, enabling efficient execution of HPC/AI workloads, large-scale models, and simulations
Other
- Travel may be required for this role, with the amount of travel varying from 25% to 100% depending on business need and client requirements
- Bachelor's degree or equivalent work experience (minimum 12 years)
- Applicants for employment in the US must have work authorization that does not now or in the future require sponsorship of a visa for employment authorization in the United States
- Candidates who are currently employed by a client of Accenture or an affiliated Accenture business may not be eligible for consideration
- Job candidates will not be obligated to disclose sealed or expunged records of conviction or arrest as part of the hiring process