Monitoring and Alerting Engineer
Talent Groups - Fort Worth, TX
Job Description
Job Title: Monitoring and Alerting Engineer
Location: Fort Worth, TX
Job Type: Contract (12 months)
Experience Required: 10+ years

About the Role:
We are seeking a highly skilled Monitoring and Alerting Engineer with over 10 years of experience to manage and optimize the monitoring and alerting systems for our IT infrastructure. This role focuses on ensuring the availability, reliability, and performance of critical systems and applications by proactively identifying and addressing potential issues before they impact business operations.

Key Responsibilities:
- System Monitoring: Implement and manage monitoring solutions for the performance, health, and availability of IT systems, networks, and applications.
- Alert Management: Configure and handle alerting systems to ensure timely notifications of any issues.
- Incident Response: Collaborate with support teams to resolve incidents and outages swiftly.
- Root Cause Analysis: Investigate incidents, determine the root cause, and implement corrective actions.
- Optimization: Use data analysis to identify opportunities for system optimization and performance improvements.
- Tool Evaluation: Evaluate, recommend, and integrate monitoring and alerting tools to improve system efficiency.
- Documentation & Reporting: Maintain thorough documentation, including configurations, incident reports, and performance metrics.
- Collaboration: Work closely with internal IT teams and external vendors to ensure seamless operations.

Skills & Qualifications:
- Proficiency with monitoring tools (e.g., Dynatrace, Datadog, CloudWatch, Splunk)
- Strong understanding of IT infrastructure (servers, networks, cloud environments)
- Experience with incident, problem, and change management processes
- Strong troubleshooting and analytical skills
- Effective communication and collaboration with various IT teams
- Familiarity with ITIL best practices and service management frameworks

Performance Expectations:
- Ability to work in a 24/7 environment with on-call support as needed.
- Lead event resolution processes for mission-critical IT and Telecom systems.
- Monitor systems for performance issues and optimization opportunities.
- Participate in major incident response and escalate critical events when necessary.
- Conduct root cause analysis and identify chronic system issues.
- Collaborate with senior management to address critical business-impacting events.

Qualifications:
- Hands-on experience with tools like Dynatrace, AppMon, Zabbix, SCOM, Datadog, CloudWatch, X-Ray, and Splunk.
- Self-motivated and capable of managing critical incidents in a 24/7 environment.
- Experience managing high-priority system outages and interacting with cross-functional teams.
- Availability for after-hours support on a rotational basis.

Preferred Qualifications:
- Bachelor's degree in Computer Science, Information Systems, or related field.
- Expertise in distributed systems, administration, and scripting/programming (Python, Node.js, Ruby, Perl, Bash).
- ServiceNow experience.
- Strong written and communication skills.
Created: 2025-02-22