Site Reliability Engineer
Compunnel, Inc. - Richmond, CA
Job Description
The Site Reliability Engineer will be responsible for ensuring the reliability, availability, and performance of applications and services as part of the transition from private to public cloud. This role will involve driving the development of reliability engineering practices, automating processes, and enhancing system resilience to support our digital transformation journey in the competitive digital banking landscape.

Key Responsibilities:
Strategize and lead the transition from private to public cloud with a focus on reliability engineering.
Ensure high availability, performance, and minimal downtime for applications and services.
Lead incident response efforts, including triage, resolution, and post-incident analysis.
Develop and maintain monitoring solutions, alerting mechanisms, and proactive issue detection.
Implement automation tools to streamline routine tasks and ensure seamless deployments and rollbacks.
Collaborate with development and operations teams for capacity planning, performance tuning, and scalability.
Work with security teams to implement best practices and ensure compliance with regulatory requirements.
Manage deployment pipelines, release processes, and configuration management for app services.
Perform data analysis and trend analysis to identify areas for improvement in system reliability and operational efficiency.
Maintain and document operational procedures, troubleshooting guides, and best practices.
Develop and test disaster recovery plans and backup strategies to ensure business continuity.
Collaborate with cross-functional teams to align on reliability goals and incident response processes.
Participate in on-call rotations and provide 24/7 support for critical incidents.

Required Qualifications:
Proven experience in cloud reliability engineering, ideally in a public cloud environment (AWS, Azure).
Strong knowledge of incident response, root cause analysis, and system resilience practices.
Experience with automation tools for scaling infrastructure, deploying updates, and ensuring system reliability.
Familiarity with monitoring tools (Dynatrace, Splunk, etc.) and ability to create dashboards and alerts.
Excellent communication skills to collaborate effectively with cross-functional teams.
Experience with security best practices, vulnerability assessments, and regulatory compliance in cloud environments.
Ability to work in a fast-paced, high-pressure environment while maintaining a positive attitude.
Experience with infrastructure modernization, cloud migrations, or microservices architecture.

Preferred Qualifications:
Experience with OpenTelemetry collectors and monitoring platforms like Prometheus.
Familiarity with DevOps tools and deployment pipelines.
Knowledge of disaster recovery planning and business continuity strategies.
Familiarity with GitHub-based infrastructure management and automation.

Certifications:
Relevant cloud certifications (AWS, Azure, or Google Cloud) are a plus.
Created: 2025-09-17