Lead Infrastructure Engineer
Yochana - Charlotte, NC
Job Description
In this role, you will:
• Lead complex initiatives to develop infrastructure that provides solutions for business applications
• Architect products to use infrastructure platforms in a scalable, reliable manner
• Debug reliability and scalability issues across all stack layers, including the products built on our infrastructure platforms
• Make monitoring and alerting trigger on symptoms rather than on outages
• Have an enthusiastic, go-for-it attitude. When you see something broken, you can't help but fix it
• Have a desire to solve everyday challenges facing software engineers and automate their toil away
• Have an excellent ability to manage multiple tasks and expectations at once
• Participate in various projects intended to continually improve or upgrade the infrastructure
• Evaluate internal and external software solutions that could be leveraged to meet target state architecture goals
• Review and analyze high-impact outages to ensure the proper processes and procedures are in place to avoid problems in the future
• Design, build, deploy, and maintain infrastructure solutions through collaborative efforts with the team and third-party vendors
• Design, code, test, debug, and document programs using Agile development practices
• Make decisions on technical designs and implementation plans, and identify project risks and resource requirements
• Direct the daily risk and control flow of operations, focusing on policies, procedures, and work standards to ensure success
• Recommend courses of action to maintain cost effectiveness and achieve results
• Collaborate and consult with peers, colleagues, and managers to resolve issues and achieve goals
• Interact with customers and vendors
• Lead small to medium cross-organizational transformational efforts in the Platform space
• Provide expertise in Kafka brokers, ZooKeeper, Kafka Connect, Schema Registry, KSQL, REST Proxy, and Kafka Control Center
• Use automation and provisioning tools such as BladeLogic, Ansible, Chef, Jenkins, and GitLab
• Deliver results in less defined and constantly changing environments
• Communicate with a broad and diverse audience, including technology and business leaders, and simplify complex messages for consumption
• As an application support specialist, lead support functions and drive the execution and maturity of multiple application support services, including incident triage, root cause analysis, change evaluation-execution-validation, deployment management, and risk and vulnerability management. Work closely with development and infrastructure partners such as middleware, NAS, database, and network teams
• Partner with CIO technology partners and managers to influence and support innovation and the continued drive toward automation and touchless operational sustainment as a design/architecture construct
• Sustain operations and reduce risk in the ecosystem by aggressively pursuing safety and soundness actions, including but not limited to vulnerability, patching, end-of-life, and resiliency work
• Engage hands-on in all Production environment RunOps and DevOps support activities needed for the platform and applications
• Drive operational management via incident response, communication, and tracking, along with root cause identification and closure
• Manage and coordinate Production change requests and release management
• Provide operational continuity through the development, management, measurement, analysis, and reporting of key service-level metrics as required by management
• Maintain a sustained focus on continuous service improvement and innovation to design, implement, and ensure SLAs, KPIs, and OLAs for critical business processes, applications, and partner interfaces
• Regularly present Production performance, incidents, root causes, preventative actions, and trend analysis to technical and business management teams
• Maintain and update all Production-related documentation (e.g., game plans, run books, procedures, processes)
• Ensure effective Production systems monitoring, alarming, and notification response/maintenance
• Provide general oversight and direction to virtual teams

Required Qualifications, US:
• 5+ years of Technology Infrastructure Engineering and Solutions experience, or equivalent demonstrated through one or a combination of the following: work experience, training, military experience, education
• 5+ years of experience troubleshooting environments across the entire architecture (i.e., applications to infrastructure)
• 3+ years of hands-on Linux administration experience

Desired Qualifications:
• 1+ years of experience in Artificial Intelligence, Natural Language Processing, Machine Learning, Distributed Computing, Chatbot, and Virtual Assistant
• 1+ years of experience supporting and monitoring Apache Flink solutions for real-time data processing
• 1+ years of experience supporting and monitoring service load balancing architectures, including F5 and VMware AVI
• 1+ years of experience with Big Data or Hadoop tools such as Spark, Hive, Kafka, and Map
• Cloud Architect or Engineer Certification (e.g., GCP, Azure, AWS)
• A BS/BA degree or higher in information technology
• Competent working in one or more environments highly integrated with an operating system
• Experience with VMware Pivotal Cloud Foundry (PCF) and Tanzu Application Service (TAS) technologies
• Experience with Docker, OpenShift Container Platform (OCP), Kubernetes, Terraform, or similar IaC technologies
• Experience with MongoDB, Redis, Kafka, Postgres, or similar data technologies
• Experience implementing and administering/managing technical solutions in major, large-scale system implementations
• Strong critical thinking skills to evaluate alternatives and present solutions that are consistent with business objectives and strategy
• Ability to lead projects/initiatives with high risk and complexity
• Ability to manage to production goals/SLAs/SLOs/KPIs, deadlines, and operational metrics
• Ability to manage tasks independently and take ownership of responsibilities
• Ability to learn from mistakes and apply constructive feedback to improve performance
• Ability to adapt to a rapidly changing environment
• Proven leadership abilities, including effective knowledge sharing, conflict resolution, facilitation of open discussions, fairness, and appropriate levels of assertiveness
• Ability to communicate highly complex technical information clearly and articulately for all levels and audiences
• Willingness to learn new technologies/tools and train your peers
• Ability to identify root-cause issues, articulate improvement opportunities, and design approaches/programs/products to improve overall quality assurance
• Strong knowledge of monitoring tools and their application (Glassbox, AppDynamics, Splunk, BigPanda AIOps, etc.)
• Understanding of system performance and how load drives utilization and customer experience
• Experience with Business Continuity Planning and Disaster Recovery, Application Resiliency/Highly Available Architecture, Site Resiliency
• Knowledge and understanding of Conversational Artificial Intelligence, Machine Learning, Deep Learning, Linear Regression, Models
Created: 2026-03-04