Senior Site Reliability Engineer (SRE)

Green Threads, LLC - Washington, DC

Job Description

Green Threads, LLC is currently seeking a Senior (Sr.) Site Reliability Engineer (SRE) to support our contract in the Washington Metropolitan Area. As part of our customer's goal to provide the technical leadership, skills, and solutions necessary to support Next Generation efforts, the Sr. SRE will apply their skills and experience to ensure the reliability, availability, and performance of the client's enterprise services in a high-availability environment. This individual will work closely with the development and operations teams to build and maintain a scalable and robust infrastructure that supports the client's mission and goals.

Responsibilities:
- Design and implement highly available and scalable systems, ensuring the reliability and performance of the company's website and applications in a multi-cloud environment.
- Collaborate with cross-functional teams to define and establish service level objectives (SLOs) and service level agreements (SLAs) for critical systems.
- Participate in system design consulting, platform management, and capacity planning.
- Monitor systems and applications, proactively identifying and resolving performance bottlenecks and availability issues.
- Develop and maintain monitoring tools, alerts, and dashboards to provide visibility into system health and performance across systems deployed in AWS, GCP, and Azure.
- Develop and maintain automation scripts, configuration management tools, and infrastructure as code (IaC) templates to automate deployment, scaling, and monitoring tasks across multiple cloud platforms.
- Develop and implement guidelines for provisioning, configuring, and optimizing cloud resources to meet performance, scalability, and cost requirements.
- Conduct post-incident analyses to identify root causes and implement preventive measures to avoid future incidents.
- Gather and analyze metrics from operating systems and applications to assist in performance tuning and fault finding.
- Create and maintain documentation for system architecture, configuration, and troubleshooting procedures.
- Perform capacity planning and resource allocation to ensure optimal system performance and scalability.
- Collaborate with development teams to implement and deploy new features and enhancements, ensuring they meet reliability and performance standards.
- Stay up to date with industry best practices, new technologies, and emerging trends in site reliability engineering across major cloud service providers.

Required Skills and Qualifications:
- Strong knowledge of Linux/Unix and Windows systems and command-line tools.
- Proficiency in scripting languages such as Python, JavaScript, Shell, or Perl.
- Experience with configuration management tools like Ansible, Puppet, or Chef.
- Familiarity with multiple cloud platforms (AWS, Azure, and/or Google Cloud).
- In-depth understanding of and expertise with native cloud tools and solutions.
- Deep understanding of the cloud infrastructure offered by providers such as AWS, Azure, and GCP.
- Understanding of networking principles and protocols (TCP/IP, HTTP, DNS, etc.).
- Knowledge of containerization technologies (Docker, Kubernetes) and orchestration tools.
- Expertise in monitoring and logging tools such as Prometheus, Grafana, ELK stack, or Splunk.
- Strong problem-solving and troubleshooting skills, with the ability to analyze and resolve complex technical issues.
- Excellent communication and collaboration skills to work effectively with cross-functional teams.
- Strong attention to detail and ability to work in a fast-paced, dynamic environment.

Preferred Qualifications:
- Proven experience as a Site Reliability Engineer or in a similar role.
- Solid understanding of software development methodologies and DevOps principles.
- Experience with agile and iterative development processes.
- Certification in relevant technologies or frameworks is a plus (e.g., AWS Certified DevOps Engineer, Certified Kubernetes Administrator).
- Familiarity with continuous integration/continuous deployment (CI/CD) pipelines.
- Experience with source control systems such as Git.
- Knowledge of security best practices and experience implementing security measures in a production environment.
- Ability to work independently and handle multiple projects and priorities simultaneously.
- Strong analytical and problem-solving skills, with a focus on continuous improvement and automation.

Education:
- Bachelor's degree in Computer Science, Engineering, or a related field (or equivalent work experience).

Benefits:
- Competitive Wages
- Health, Dental and Vision Plans
- 401(k) Program with Company Match
- Profit Sharing
- Paid Vacation
- Personal / Sick Pay
- Tuition and Training Reimbursement

Created: 2025-03-09
