Oracle America, Inc. | Principal Software Developer ...
Oracle America, Inc. - Phoenix, AZ
Job Description
Here at OCI we're building the world's largest AI clusters, and we're the fastest at bringing them to market. The AI Infrastructure organization at OCI is leading this effort by creating a GPU-focused cloud for AI workloads with the latest hardware. The AI workload organization is developing solutions that enable large AI customers to schedule their Kubernetes AI workloads on OCI's GPU cloud with the best performance, efficiency, reliability, and scalability.

This is your chance to be part of the AI revolution, creating systems that allow customers to scale from tens to thousands of GPUs without compromising performance. You will have the opportunity to work with cutting-edge technologies and make a significant impact on our organization's success.

We are looking for a highly skilled distributed systems engineer to optimize Kubernetes schedulers for AI workloads, increasing GPU workload utilization and throughput. In this role, you will ensure top performance for AI workloads scheduled on our platform. You will provide technical leadership to the team, bring clarity to ambiguous problems, and come up with innovative solutions that make it easy for our customers to deploy AI workloads on our GPU infrastructure. You will collaborate with cross-functional teams to enhance the GPU control plane and GPU data plane to deliver an exceptional customer experience.

Career Level - IC4

Responsibilities

Design and develop orchestration solutions that optimize Kubernetes schedulers for AI workloads, increasing GPU workload utilization and throughput and ensuring top performance for AI workloads scheduled on our platform.

Develop a best-in-class AI workload orchestration system for our customers by ensuring that the services and components are well-defined, modularized, secure, reliable, diagnosable, actively monitored, compliant, and reusable.

Collaborate with cross-functional teams, including development, operations, and product management, to understand their requirements and design innovative orchestration solutions.

Mentor junior developers and drive modern software engineering practices such as using data and telemetry to make decisions, well-defined interfaces across components, design reviews, coding standards, code reviews, and comprehensive coverage through unit tests, integration tests, and active production monitoring.

Develop benchmark metrics and automation to drive and track performance and reliability across customer workloads and the lower infrastructure stack.
Created: 2024-11-26