Observability Monitoring Engineer
TEKsystems - Jersey City, NJ
Job Description
Cannot work C2C; must work W2. Candidates must be willing to work onsite 3 days a week. No exceptions.

We are seeking a highly skilled Observability Monitoring Engineer with expert knowledge in Prometheus, Grafana, or Git. This role involves developing and managing telemetry for large-scale datasets, implementing strategies to enhance AI system reliability and performance, and assisting in capacity management.

Key Responsibilities:
- Develop and manage telemetry systems for large-scale datasets.
- Implement monitoring and alerting solutions to ensure system reliability.
- Collect and analyze data to improve AI system performance.
- Automate processes to enhance efficiency and reduce manual intervention.
- Manage and maintain Kubernetes clusters and Docker containers.
- Utilize Prometheus and Grafana for monitoring and visualization.
- Work with DCGM/DCGM Exporter (Nvidia Stack) for telemetry.
- Collaborate with data scientists to support AI/ML platforms.
- Troubleshoot and resolve issues related to telemetry systems.

Primary Skills:
- Telemetry/Observability, Monitoring and Alerting, Data Collection and Analysis, Automation
- Prometheus and Grafana
- JSON/YAML
- Kubernetes and Docker/Container Technologies
- DCGM/DCGM Exporter (Nvidia Stack)
- Solid understanding of telemetry concepts, metrics, logs, and tracing

Benefits:
Eligibility requirements apply to some benefits and may depend on your job classification and length of employment. Benefits are subject to change and may be subject to specific elections, plan, or program terms.
If eligible, the benefits available for this temporary role may include the following:
- Medical, dental & vision
- Critical Illness, Accident, and Hospital
- 401(k) Retirement Plan - Pre-tax and Roth post-tax contributions available
- Life Insurance (Voluntary Life & AD&D for the employee and dependents)
- Short and long-term disability
- Health Spending Account (HSA)
- Transportation benefits
- Employee Assistance Program
- Time Off/Leave (PTO, Vacation or Sick Leave)

About TEKsystems:
We're partners in transformation. We help clients activate ideas and solutions to take advantage of a new world of opportunity. We are a team of 80,000 strong, working with over 6,000 clients, including 80% of the Fortune 500, across North America, Europe and Asia. As an industry leader in Full-Stack Technology Services, Talent Services, and real-world application, we work with progressive leaders to drive change. That's the power of true partnership. TEKsystems is an Allegis Group company.

The company is an equal opportunity employer and will consider all applications without regard to race, sex, age, color, religion, national origin, veteran status, disability, sexual orientation, gender identity, genetic information or any characteristic protected by law.
Created: 2024-11-20