Senior Principal Software Engineer - AI Infrastructure
Oracle - Washington, DC
Job Description
Join our GPU Availability and Monitoring team in the Compute Organization, where you'll play a pivotal role in shaping the architecture for GPU delivery, health monitoring, triage automation, and diagnostic services. Your contributions will be vital for managing distributed AI, ML, and HPC workloads across thousands of GPUs connected over high-performance network fabrics such as RoCE and InfiniBand.
We are seeking a highly skilled distributed systems engineer to design scalable and optimized Monitoring and Repair solutions for AI infrastructure components, including the GPU control plane and data plane. You will lead and guide the team in tackling ambiguous challenges and developing effective solutions. You will also collaborate with cross-functional teams to improve our AI infrastructure, customer experience, and workload performance.
Responsibilities
Architect and design solutions to optimize Monitoring and Repair for GPU, CPU, Network, and Storage components, aiming to enhance customer experience and workload performance. Create a
Created: 2026-03-04