Site Reliability Engineer
Tata Consultancy Services - Atlanta, GA
Job Description
Automating work, including infrastructure needs, testing, failover solutions, failure mitigation, and much more.
Debugging complex problems across an entire stack and creating solid solutions.
Developing and building CI/CD processes to improve release cadence.
Using Chaos Engineering to test what you build under real-world conditions.
Triage product or system issues, then debug, track, and resolve them by analyzing the sources of the issues and their impact on hardware, network, or service operations and quality.
Participate in or lead design reviews with peers and stakeholders to decide among available technologies.
Experience with an APM tool such as Dynatrace, New Relic, AppDynamics, or Datadog.
Performance Measurement and Tuning: Knowledge of system performance, testing, and programming; ability to monitor, measure, and optimize system performance and network communication.
Site Reliability Engineering: Knowledge of the theories and methodologies of reliability engineering; ability to design, develop, and support various tools, services, and applications to maintain a reliable site environment.
Support capacity planning, availability, scalability, security, and latency considerations for new infrastructure and service provisioning as appropriate.
Responsible for improvements to end-to-end availability and performance of mission-critical services, and for building automation to prevent problem recurrence.
Strong experience setting SLOs, SLIs, and error budgets and managing reliability for infrastructure and applications.
Partner with other SREs to share best practices and learnings from across the organization.
Scale and optimize existing infrastructure and services sustainably through mechanisms such as automation, and evolve them by improving reliability and efficiency.
Manage end-to-end availability and performance of mission-critical services and build automation to prevent problem recurrence.
Maintain infrastructure and services by measuring and monitoring system metrics to proactively identify operational efficiencies, potential outages, and security threats in Development, UAT, Staging, and Production environments.
Practice sustainable incident response and blameless postmortems.
Develop and maintain solution and operational documentation and designs for all infrastructure and services within the scope of SRE.
Other Skills
AWS SysOps Administrator or AWS DevOps Engineer certification.
Experience with Akamai or a related WAF application is preferred.
Experience with OpenShift and Kubernetes.
Experience setting up synthetic monitors and tracking SLAs.
Experience with airline applications and infrastructure technology is a plus.
Experience developing applications and/or automation running in Red Hat OpenShift is a plus.
Created: 2024-10-15