Director, Site Reliability Engineering
GoTo Foods - Atlanta, GA
Job Description
Job Summary
We are seeking a Director of Site Reliability Engineering (SRE) to lead and evolve our reliability, observability, and automation initiatives across cloud-native, multi-tenant systems. This role is critical in driving uptime, performance, and efficiency for our production and customer-facing environments while fostering a culture of continuous improvement and operational excellence.

Essential Functions
- Evolve a high-performing SRE team into a strategic, forward-leaning engineering force focused on innovation, automation, and measurable business impact
- Define and drive an advanced SRE roadmap centered on self-healing systems, adaptive scaling, and platform resilience
- Advance existing SLAs, SLOs, and SLIs into predictive, business-aligned reliability models; formalize executive-level SLO reporting
- Lead efforts to evolve observability into a proactive, AI/ML-driven capability for anomaly detection, early warning, and service health forecasting
- Strengthen incident response by integrating intelligent automation, enhancing runbooks, and refining on-call strategies for faster mitigation
- Expand chaos engineering and resilience testing practices across critical systems; institutionalize capacity stress testing and failover validation
- Refine CI/CD pipelines to support safe, high-frequency deployments with zero-touch rollback and dynamic environment provisioning
- Institutionalize Infrastructure as Code (IaC) patterns to drive repeatable, auditable infrastructure operations at scale
- Optimize FinOps practices with actionable insights into cost-versus-performance trade-offs and service-level ROI
- Drive deeper integration between SRE, Security, and Compliance for faster detection, triage, and resolution of security incidents
- Balance system reliability and deployment velocity by analyzing error rates and stability indicators
- Conduct blameless postmortems (BPMs) for Priority 1 incidents
- Provide go-live leadership for high-stakes brand launches and system expansions on the NextGen platform
- Partner with architecture and product teams to embed observability, scalability, and cost awareness into solution design
- Modernize disaster recovery operations to meet aggressive RTO/RPO targets with fully automated failover mechanisms
- Resolve existing technical debt and avoid creating new technical debt
- Oversee vendor performance, contract renewals, and third-party compliance across tooling and infrastructure partnerships
- Ensure that quarterly contractor audits, identity governance, and system access reviews are thorough and timely
- Cultivate a culture of continuous learning, experimentation, and innovation through coaching, advanced training, and stretch assignments
- Develop a continuous improvement framework based on agile retrospectives, SLIs, and service reviews
- Elevate the team's visibility and influence across the organization by aligning technical outcomes with business value

Education
- Bachelor's degree in Information Systems or a related discipline (required)

Work Experience
- Minimum of 10 years of experience in software development or information technology
- Minimum of 5 years working with cloud-native solutions, preferably with Azure
- Minimum of 5 years of experience in DevOps and/or Site Reliability Engineering
- Minimum of 4 years of people management (hiring, mentoring, and managing engineering staff)
- Strong knowledge of Infrastructure as Code (IaC)
- Experience with pipeline-based SDLC CI/CD automation
- Experience working on a Scrum team

Skills
- Ability to communicate complex technical concepts to the executive team, business leaders, and franchisees.
- Ability to develop and maintain positive business relationships and foster an environment of mutual respect, understanding, trust, and support.
- Ability to coach employees in a positive manner.
- Ability to facilitate the resolution of differing views.
- Ability to collect information from others without putting them on the defensive.
- Ability to adapt and adjust planned work by analyzing work demands, competing priorities, and tight deadlines, and to identify the most effective and efficient means of accomplishing tasks within the parameters of the organizational structure, processes, systems, and policies.
- Ability to exercise judgment and discretion in matters of significance and of a sensitive nature.
- Excellent organizational, communication, and leadership skills.
- Excellent analytical and problem-solving skills.
- Ability to develop, communicate, and implement strategies and tactics.
- Strong business acumen and a sense of urgency to achieve business results.

Travel Requirement: None
Seniority level: Director
Employment type: Full-time
Job function: Engineering and Information Technology
Industries: Food and Beverage Services
Created: 2025-10-08