SRE Team Lead - Wells Fargo
Maintec Technologies - Charlotte, NC
Job Description - SRE Team Lead

Location: Charlotte, NC, USA (alternate, less preferred locations: NY; St. Louis, MO)

Who are we looking for?

As a Site Reliability Engineer (SRE), you'll help build and maintain a valuable engineering discipline, combining software and systems to develop engineering solutions to operations problems. Our support and software development work focuses on improving and optimizing existing systems, building infrastructure, and reducing work through automation. You'll join a team of curious minds to solve business problems. In this environment, you'll take the lead on relevant tasks as an independent contributor who can resolve day-to-day business challenges. As an SRE, you'll focus on running production applications and systems better, with improvement as an ongoing objective.

Your responsibilities:

- Own Service Level Agreements (SLAs), Service Level Objectives (SLOs), and the associated metrics for the critical Java applications deployed in the cloud
- Provide cloud operations management and cloud services deployments
- Ensure cloud services availability
- Apply strong knowledge of automation and a scripting language (Python)
- Work with the SRE team to maintain the integrity of cloud services deployments
- Take responsibility for CI/CD pipelines across platform and applications
- Perform proactive monitoring, analysis, remediation, and action as needed
- Manage incidents, problems, change management, release management, and analytics on previous incidents and usage patterns
- Manage new development and enhancements, and operationalize the changes
- Manage the on-call staffing plan, roster, allocation of team members, internal and external communication, and reporting
- Plan and implement patching and upgrades
- Analyze system health metrics
- Enforce best practices for security, reliability, resiliency, self-healing, high availability (HA), automation, and quality of service
- Establish and follow SRE principles
- Coordinate and manage operational schedules and priorities
- Provide infrastructure monitoring and reports for all performance metrics

Technical Skills:

- 12+ years of overall experience, with 5+ years in an SRE Technical Manager role handling IaaS, PaaS, and microservices on PCF / Azure
- 4+ years of experience as an SRE engineer in DevOps, DataOps, SecOps, or InfraOps
- 2+ years of experience in Level 1, 1.5, or 2 support / operations with 24x7 support across an onsite/offshore/nearshore model
- Experience managing a large global cloud organization working in multiple locations and time zones
- Brings the best of the industry and carries the organization along in the journey
- Good knowledge of Information Technology Infrastructure Library (ITIL) processes
- Experience managing SLIs, SLOs, toil, error budgets, and metrics
- Experience with cloud reliability standards, observability, security, performance, disaster recovery, and reporting requirements
- Experience identifying manual, repetitive, automatable tasks and automating them
- Experience with IT and cloud security standards and compliance
- Hands-on experience working with Java, PCF, or Azure platforms
- Hands-on experience working with Azure AD
- Hands-on experience in automation and scripting using Python
- Strong expertise in cloud concepts such as Infrastructure as Code, cloud computing, cloud networking, cloud storage and backup, containerization, SSO, SFTP, and SRE
- Experience understanding and implementing SecOps needs
- Experience in the release and deployment of patches across the full scope

Process Skills:

- Sound knowledge of ITIL practices such as change management, incident management, problem management, and release management
- Exceptional communication skills
- Self-starter, ambitious, willing to take on difficult problems
- Collaborative, team-player attitude
- Practical exposure to and knowledge of existing and emerging cloud database technologies
- Has worked in a matrix role, with the ability to work independently with multiple managers in dotted-line hierarchies
- Keep abreast of industry trends, technology innovation, and changing customer requirements to help with the continual service improvement process
- Participate in on-call rotations and be responsible for infrastructure and platform-level escalations
- Work with the DevOps team on infrastructure capacity planning, upgrades, and monitoring
- Participate in daily (standup) production reviews
- Contribute to the design and improvement of the deployment architecture of new and existing applications, based on the principles of reliability, high availability, efficiency, and observability
- Research, learn, adapt, customize, and create tools to improve the observability, resilience, and usability of applications in scope
- Create and maintain SRE-related documentation (solution repository, root cause analysis reports, etc.)

Certification:

- Certification in PCF, Java (mandatory)

Created: 2024-10-19