Staff Software Engineer - Site Reliability
Intuit - San Diego, CA
Job Description
Overview
Come join the Identity Team as a Site Reliability / DevOps Engineer (System Engineering). Identity is at the heart of all offerings across Intuit and is foundational to Intuit's strategic transformation. It is one of Intuit's most critical services, powering close to 500 applications and services and enabling Intuit’s 3 strategic big bets. Identity capabilities position Intuit at the center of the financial ecosystem and enable fluid exchange of identity, profile, and data across an ecosystem of financial institutions. Identity's technical stack is a cloud-native, microservices-based architecture operating fully on Kubernetes and the AWS cloud.

What you'll bring
- BS/MS in computer science, engineering, or equivalent work experience
- 10+ years of experience developing and operating complex distributed software systems in an enterprise cloud-native environment (AWS preferred)
- Strong AWS development and deployment knowledge; GCP a plus
- Demonstrated experience operating high-scale, high-availability services in the cloud
- Demonstrated experience designing highly resilient services and building recovery mechanisms
- Experience using AI to solve complex operational and auto-healing problems
- Experience developing infrastructure as code (Terraform/CDK preferred) and CI/CD pipelines using Jenkins, CircleCI, Cloud Builder, Docker, Kubernetes, ECS
- Coding in Python, Java, Go, or similar languages, combined with strong operational skills
- Monitoring and alerting tools such as Splunk, Wavefront, Grafana Mimir
- Ability to handle a fast-paced environment with iterative project turnarounds on mission-critical systems
- Ability to collaborate across a wide range of roles and experience levels
- Strong communication skills
- Solid Linux/Unix skills

How you will lead
- Act as the technical subject matter expert who evaluates and evangelizes forward-looking processes, tools, technologies, and architecture to help deliver high-quality, secure software faster and more efficiently while meeting availability, scale, and performance requirements in an AWS public cloud and Kubernetes environment.
- Actively evolve the system and infrastructure target state, working with a cross-functional team from Architecture, Product Management, and Production Operations.
- Be a part of the roadmap and strategy for the Operational Excellence, Resiliency, and Cost Optimization charters for Identity platform capabilities.
- Design and develop self-recovery mechanisms and tools for massive-scale platforms to enable faster, automatic recovery.
- Design and develop observability components for massive-scale platforms to detect issues quickly and isolate the problem as part of fast recovery.
- Contribute to cost and capacity management for platform components, uncovering cost-saving opportunities and developing automation to enforce them.
- Build self-service tools that enable platform consumers to troubleshoot and triage issues in a scalable manner.
- Contribute to FMEA (Failure Mode and Effects Analysis) and Chaos Engineering for critical platform components, identifying resiliency gaps and preparing the team for faster recovery from production incidents.
- Continuously evolve development practices and operational maturity through structured root cause analysis and monitoring. Drive and own Root Cause Analysis (RCA) for specific applications.
- Troubleshoot complex issues and manage stakeholders' expectations during incidents.
- Participate in 12/7 on-call rotations.
- Support and coach other engineers through pair programming and peer code review, helping to ensure that all engineers are growing and part of a community.
- Be a role model to engineers and inspire a high technical bar for the team

Seniority level: Mid-Senior level
Employment type: Full-time
Job function: Engineering and Information Technology
Industries: Software Development
Created: 2025-09-21