Infrastructure Engineer for Advanced AI and HPC ...
Accenture - Des Moines, IA
Job Description
We Are:
The Global Infrastructure Engineering AI & HPC team is at the forefront of creating the foundation for innovative advancements in AI and High-Performance Computing (HPC). Our expert team combines vast technical knowledge across cloud, on-premises, and hybrid environments to build and maintain advanced infrastructure capable of supporting demanding high-performance workloads. We empower our clients to achieve exceptional performance, efficiency, and creativity through groundbreaking solutions that span the project lifecycle, from strategic planning and architecture to implementation and ongoing management, driving modernization initiatives across the infrastructure landscape. By collaborating with a wide array of partners, we leverage emerging technologies to promote growth and revolutionize industries. In this dynamic landscape, we are leading enterprises to harness AI and HPC for transformative innovations that enhance infrastructure capabilities.

Key Responsibilities:
Design and implement scalable and robust infrastructure solutions for HPC and AI, meeting industry standards for performance.
Deploy, configure, and manage clusters utilizing XPU (CPU/GPU/accelerators) technologies through orchestration platforms, schedulers, and containerized services to enable Metal as a Service (MaaS), GPU as a Service (GPUaaS), and AI as a Service (AIaaS).
Optimize clusters for performance, scalability, energy efficiency, and cost-effectiveness across on-premises, cloud, and hybrid environments.
Integrate AI and HPC platforms seamlessly with existing IT systems, data pipelines, and security measures.
Monitor, troubleshoot, and enhance infrastructure to guarantee high availability, low-latency networking, and dependable workloads.
Create and keep updated comprehensive documentation including architecture diagrams, configuration guidelines, and operational manuals.
Offer technical support and expertise to users, improving the execution of HPC/AI tasks, simulations, and large models.

Travel may be required for this role, ranging from 25% to 100% depending on business needs and client requirements.

Required Skills and Qualifications:
A minimum of 4 years of hands-on experience in the design, deployment, and management of HPC and AI infrastructure across various environments, including working with hyperscalers, neocloud, large enterprises, and critical sectors such as Financial Services, Life Sciences, Manufacturing, and Retail.
At least 4 years of expertise with accelerated computing architectures (GPUs, XPUs, DPUs), high-performance networking (InfiniBand, Ethernet), and modern storage/data platforms (e.g., NVMe-oF, Lustre, GPFS, BeeGFS, VAST, DDN, Weka) necessary for effective solution development.
A minimum of 4 years in cluster management and orchestration using technologies such as Slurm, Run:ai, Kubernetes, Docker, alongside real-time performance monitoring frameworks.
A minimum of 4 years of experience with cloud and virtualization platforms (e.g., AWS, Azure, GCP, VMware, Nutanix), with strong capabilities in automation and optimization through scripting (Python, AI tools) and foundational Infrastructure-as-Code tools like Terraform and Ansible.
At least 4 years of experience in implementing MLOps and DevSecOps frameworks to create secure, automated, and reproducible workflows.
A Bachelor's degree or equivalent experience (minimum of 12 years). Candidates with an Associate's Degree must have at least 6 years of relevant work experience.

Preferred Skills and Qualifications:
Experience managing large-scale deployments of GPU clusters (1,000+ GPUs) for demanding HPC and AI workloads.
Familiarity with GPU computing libraries and accelerators (e.g., NVIDIA CUDA, Dynamo, AMD ROCm).
Understanding of AI and HPC Networking concepts (e.g., RoCE, InfiniBand, multi-planar/multi-rail designs).
Proficiency in Machine Learning and AI frameworks (e.g., TensorFlow, PyTorch, JAX), with experience using Jupyter notebooks and Google Colab environments.
Experience employing optimization techniques for managing HPC & AI workloads effectively.
Familiarity with DevOps strategies and tools (e.g., Ansible, Terraform) for automating infrastructure management processes.
Industry-specific certifications related to NVIDIA infrastructure, public cloud services, or Data Science are advantageous.

Compensation at Accenture varies based on location, role, skill set, and experience level. We accept applications on an ongoing basis with no set deadline for submission. For more information on benefits and accommodation options, please check Accenture's resources.

Accenture is committed to equal employment opportunities and values diversity in the workforce. All employment decisions are made without discrimination. We promote innovation, competitiveness, and creativity fueled by our diverse team.
Created: 2026-03-04