Monitoring and Logging SME
ARK Solutions, Inc. - Atlanta, GA
Job Description
Role: Monitoring and Logging SME
Location: 100% Remote
Duration: 12 month(s)+

Monitoring and Logging Specialist (Subject Matter Expert)

Note from Hiring Manager: This role is focused mainly on monitoring and logging. We are looking for SME-level experience in monitoring and logging (the logging side is more involved, particularly when handling and investigating incidents), along with automation and scripting.

1. Implement and Manage Monitoring Tools:
   Goal: Deploy and maintain a comprehensive monitoring system (e.g., Nagios, Zabbix, Prometheus) for infrastructure, applications, and network devices.
   Objective: Ensure all critical components (servers, network, cloud services) are monitored with real-time alerts and dashboards.

2. Manage and Optimize Logging Platforms:
   Goal: Implement and optimize a logging solution (e.g., Splunk, ELK Stack, Graylog) to capture and store logs from various sources.
   Objective: Ensure log data is properly indexed, stored, and searchable for troubleshooting and analysis.

3. Develop Automated Alerting and Incident Response:
   Goal: Set up automated alerting rules and integrate them with incident management tools (e.g., PagerDuty, ServiceNow).
   Objective: Ensure the team is notified of incidents promptly, with relevant logs and metrics available for swift troubleshooting.

4. Ensure Compliance with Logging and Monitoring Best Practices:
   Goal: Enforce security, audit, and compliance standards in monitoring/logging solutions (e.g., ensure NIST, HIPAA compliance).
   Objective: Continuously audit logging practices to ensure logs are maintained securely and compliance standards are upheld.

5. Drive Proactive Monitoring and Predictive Analytics:
   Goal: Implement proactive monitoring for system performance and use predictive analytics to identify potential issues before they occur.
   Objective: Reduce downtime and improve system reliability by predicting and resolving bottlenecks or potential failures.

6. Facilitate Cross-Team Collaboration for Incident Resolution:
   Goal: Enable effective collaboration between monitoring, infrastructure, and application teams during incidents by providing centralized monitoring data.
   Objective: Shorten mean time to resolution (MTTR) during outages or incidents by ensuring all teams have access to relevant monitoring and logging information.

7. Train and Guide Team on Monitoring Tools:
   Goal: Train the team in using monitoring tools and ensure they understand how to read logs, set up alerts, and troubleshoot issues.
   Objective: Equip the team to efficiently use and maintain the monitoring and logging systems, reducing dependency (no silos) on specialists for day-to-day operations.

8. Optimize System Resource Utilization:
   Goal: Regularly review and tune the monitoring and logging systems to ensure they are not overusing system resources (e.g., CPU, memory).
   Objective: Ensure the monitoring system itself does not become a bottleneck or contribute to performance degradation.

9. Integrate Monitoring Solutions with Cloud Platforms:
   Goal: Set up monitoring for cloud infrastructure and services (e.g., AWS, Azure) and integrate with existing tools.
   Objective: Ensure seamless monitoring of both on-premises and cloud infrastructure with unified dashboards and alerts.

10. Document Monitoring/Logging Processes and Policies:
   Goal: Maintain detailed documentation of monitoring configurations, incident response protocols, and logging system architecture.
   Objective: Ensure the team can quickly onboard new members and continue operations smoothly in case of staff or system changes.
Created: 2024-10-18