Site Reliability Engineer
TRUGlobal - Santa Clara, CA
Job Description
Role Summary:
We are looking for a skilled Site Reliability Engineer (SRE) with strong expertise in Python coding to join our dynamic team. As an SRE, you will be responsible for ensuring the reliability, scalability, and performance of our systems while also contributing to automation, monitoring, and incident response. Your Python coding skills will play a pivotal role in building and maintaining tools, scripts, and frameworks to optimize infrastructure operations.

Key Responsibilities:

System Reliability and Performance:
- Monitor, maintain, and improve the reliability and availability of critical systems.
- Develop and implement strategies to enhance system performance and reduce latency.
- Perform capacity planning and optimize resource utilization across platforms.

Automation and Tooling:
- Build and maintain automation tools and scripts using Python to streamline operational workflows.
- Develop CI/CD pipelines and infrastructure-as-code (IaC) solutions.
- Automate incident detection, escalation, and resolution to minimize downtime.

Incident Management and Troubleshooting:
- Respond to and resolve incidents, ensuring minimal impact to system availability.
- Conduct root cause analysis and implement solutions to prevent recurrence.
- Create and maintain playbooks for incident response and recovery.

Monitoring and Observability:
- Design and implement robust monitoring solutions for systems and services.
- Set up metrics, logs, and dashboards to gain visibility into system health.
- Optimize alerting mechanisms to balance early warnings with noise reduction.

Collaboration and Best Practices:
- Partner with development teams to design scalable, resilient systems.
- Advocate for and implement SRE best practices, such as error budgets and SLAs/SLOs.
- Provide technical guidance to engineers on improving code quality and system reliability.

Key Qualifications:

Experience:
- 3+ years of experience in Site Reliability Engineering, DevOps, or related roles.
- Proven expertise in Python scripting and development for infrastructure tasks.

Technical Skills:
- Strong programming skills in Python, with the ability to write clean, modular, and testable code.
- Hands-on experience with Linux/Unix systems, including shell scripting.
- Proficiency in cloud platforms (e.g., AWS, GCP, Azure) and container orchestration (e.g., Kubernetes).
- Experience with infrastructure-as-code and configuration management tools (e.g., Terraform, Ansible, or Chef).
- Knowledge of monitoring tools (e.g., Prometheus, Grafana, DataDog) and log management (e.g., ELK stack).

Problem-Solving Skills:
- Strong analytical and debugging skills to resolve system-level issues.
- Ability to troubleshoot distributed systems and understand complex architectures.

Soft Skills:
- Excellent communication skills and the ability to work collaboratively across teams.
- A mindset focused on continuous improvement and proactive problem resolution.

Preferred Qualifications:
- Familiarity with Go or other programming languages is a plus.
- Knowledge of security best practices for infrastructure and applications.
- Experience with hybrid cloud environments and multi-cloud strategies.
- Background in implementing chaos engineering or resilience testing.
Created: 2025-01-31