Staff Software Engineer - Site Reliability

Intuit Inc. - San Diego, CA

Job Description

Overview

Come join the Identity Team as a Site Reliability / DevOps Engineer (System Engineering). Identity is at the heart of all offerings across Intuit and is foundational to Intuit's strategic transformation. It is one of Intuit's most critical services, powering around 500 applications and services and enabling Intuit's three strategic big bets. Identity capabilities position Intuit at the center of the financial ecosystem and enable fluid exchange of identity, profile, and data across an ecosystem of financial institutions. Identity's technical stack is a cloud-native, microservices-based architecture running entirely on Kubernetes and AWS.

Identity is accountable for the authentication, authorization, and identity lifecycle management domains, and is delivered through platform capabilities and pluggable experiences across mobile and web. Platform SREs operate at the intersection of Software Engineering and Infrastructure Engineering to build and operate large-scale systems that are secure, fault-tolerant, performant, highly available, affordable, and scalable.

The Identity Systems Engineering team uses industry best practices, tools, and principles from software engineering, architecture, and security to solve operational challenges.

Responsibilities

- Act as the technical subject matter expert who evaluates and evangelizes forward-looking processes, tools, technologies, and architecture that help deliver high-quality, secure software faster and more efficiently while meeting availability, scale, and performance requirements in an AWS public cloud and Kubernetes environment.
- Actively evolve the system and infrastructure target state, working with a cross-functional team from Architecture, Product Management, and Production Operations.
- Be a part of the roadmap and strategy for the Operational Excellence, Resiliency, and Cost Optimization charters for Identity platform capabilities.
- Design and develop self-recovery mechanisms and tools for massive-scale platforms to enable faster, automatic recovery.
- Design and develop observability components for massive-scale platforms to detect issues quickly and isolate the problem as part of fast recovery.
- Contribute to cost and capacity management for platform components, uncovering cost-saving opportunities and developing automation to enforce them.
- Build self-service tools that enable platform consumers to troubleshoot and triage issues in a scalable manner.
- Contribute to FMEA (Failure Mode and Effects Analysis) and Chaos Engineering for critical platform components, identifying resiliency gaps and preparing the team for faster recovery from production incidents.
- Continuously evolve development practices and operational maturity through structured root cause analysis and monitoring.
- Drive and own Root Cause Analysis (RCA) for specific applications.
- Troubleshoot complex issues and manage stakeholders' expectations during incidents.
- Participate in 12/7 on-call rotations.
- Support and coach other engineers through pair programming and peer code review, helping to ensure that all engineers are growing and part of a community.
- Be a role model to engineers and inspire a high technical bar for the team.

Qualifications

- BS/MS in computer science, engineering, or equivalent work experience.
- 10+ years of experience developing and operating complex distributed software systems in an enterprise cloud-native environment (AWS preferred).
- Strong AWS development and deployment knowledge; GCP a plus.
- Demonstrated experience operating high-scale, high-availability services in the cloud.
- Demonstrated experience designing highly resilient services and building recovery mechanisms.
- Experience using AI to solve complex operational and auto-healing problems.
- Experience developing infrastructure as code (Terraform/CDK preferred) and CI/CD pipelines using Jenkins, CircleCI, Cloud Builder, Docker, Kubernetes, and ECS.
- Coding in Python, Java, Go, or similar languages, combined with strong operational skills.
- Experience with monitoring and alerting tools such as Splunk, Wavefront, and Grafana Mimir.
- Ability to handle a fast-paced environment with iterative project turnarounds on mission-critical systems.
- Ability to collaborate across a wide range of roles and experience levels.
- Strong communication skills.
- Solid Linux/Unix skills.

Created: 2025-09-18
