MTS, MLOps
Acceler8 Talent - Jersey City, NJ
Job Description
Who they are: Re-founded in March 2024, the company has assembled a team of kind, innovative, and collaborative professionals dedicated to shaping the future of enterprise AI. They celebrate diverse ideas and approaches while solving some of the most challenging problems in the industry. Their flagship product is an empathetic conversational chatbot built on their 350B+ parameter frontier model and refined through sophisticated fine-tuning, inference, and orchestration techniques. As they scale their solutions, their infrastructure must evolve to meet the rigorous demands of production environments.

About the Role
As a Member of Technical Staff on their MLOps / ML Infrastructure team, you will be at the core of designing, building, and operating the systems that power their machine learning workflows, from model training to production deployment. Your work will be critical in developing control planes and robust tools around ML services, ensuring the platform is scalable, secure, and resilient. They are looking for candidates with production operations experience, strong open-source backgrounds, and hands-on expertise in managing distributed clusters.

This is a good role for you if you:
- Have extensive experience operating ML systems in production environments and building tools to manage them.
- Are highly proficient in managing distributed clusters using Kubernetes (K8s), SLURM, and Ray.
- Possess a strong open-source background, with experience at top-tier companies, and are comfortable leveraging community-driven tools as well as proprietary solutions.
- Are security-aware and knowledgeable about best practices for safeguarding production systems, even though the role is not exclusively security-focused.
- Thrive in dynamic, innovative environments where pushing the boundaries of ML infrastructure is a daily pursuit.

Responsibilities include:
- Designing and implementing scalable ML infrastructure to support end-to-end machine learning workflows, from training and deployment to production operations.
- Building and managing control planes and tools that ensure efficient, secure, and reliable operation of ML services.
- Collaborating closely with ML researchers, data scientists, and engineers to optimize system performance and resource utilization.
- Leveraging distributed computing frameworks (Kubernetes, SLURM, Ray) to orchestrate ML workloads across diverse environments.
- Continuously evaluating and integrating emerging technologies to enhance scalability, efficiency, and security across ML systems.
- Maintaining a strong security posture through the adoption of best practices in infrastructure design and operations.
Created: 2025-02-22