Sr. Site Reliability Engineer
Optomi - Charlotte, NC
Job Description
Site Reliability Engineer | Hybrid in Charlotte, NC & Detroit, MI

Optomi, in partnership with a leading enterprise financial organization, is seeking an experienced Site Reliability Engineer to join their team in a high-impact production environment! This role will be responsible for driving reliability, scalability, and performance across complex, distributed systems while partnering closely with engineering, product, and infrastructure teams. The ideal candidate will play a key role in shaping observability strategy, improving system resilience, and leading operational excellence initiatives across mission-critical platforms.

Experience of the right candidate:
- Experience as a Site Reliability Engineer or in a similar production-facing role
- Experience managing SRE or DevOps teams, including observability-focused workloads
- Agile delivery experience owning roadmaps and executing work using Scrum or Kanban methodologies
- Strong background in AWS and cloud architecture, including services such as ASG, Lambda, Fargate, Aurora DB, DynamoDB, and ALB/NLB
- Hands-on experience with CI/CD pipelines (GitLab) and Infrastructure as Code tools (Terraform, Python, Ansible, etc.)
- Experience working across Linux and Windows environments with exposure to Java, Spring/Spring Boot, REST APIs, microservices, shell scripting, Python, PL/SQL, and Oracle databases
- Strong knowledge of observability platforms such as Splunk and Dynatrace
- Experience designing and implementing observability solutions for enterprise-scale applications
- Solid understanding of system administration and DevSecOps principles
- Experience developing SLIs, SLOs, and SLAs in collaboration with product, business, and engineering teams
- Strong communication skills with the ability to present to technical and non-technical stakeholders

Responsibilities of the right candidate:
- Collaborate with cross-functional teams to design, build, and maintain scalable, fault-tolerant systems
- Advocate for reliability best practices throughout the application development lifecycle
- Design and implement monitoring, alerting, and observability solutions to ensure real-time visibility into system health
- Monitor system performance, proactively identify issues, and implement long-term stability improvements
- Develop automation tools and processes to reduce manual operational workload
- Lead incident response efforts, including coordination, triage calls, and post-incident reviews
- Perform capacity planning and resource optimization to support growth and scalability demands
- Continuously evaluate and implement emerging technologies to improve system reliability and efficiency
- Partner with stakeholders to define and maintain SLIs, SLOs, and SLAs
- Ensure adherence to security, reliability, and operational best practices across environments
Created: 2026-05-09