



Building, deploying, and managing an end-to-end ML pipeline in a production environment is difficult, especially for applications such as recommender systems. To operationalize an ML model and deliver business value within an enterprise application, you must not only develop the machine learning algorithms and the model itself, but also collect and prepare data, build the model, and train or retrain it with new data. This requires a continuous process of validating the model, serving inference, and monitoring model performance to ensure that results stay relevant.

In addition to the challenges of pipeline development, you also need to secure and manage the right computing infrastructure to accelerate these steps and meet the quality of service (QoS) guaranteed to customers. And because each step in the pipeline is unique, the computational requirements for data preparation and training can be quite different from those needed to handle many different inference requests. This is a challenge for both development and infrastructure management, commonly referred to as the MLOps challenge.

Google Cloud and NVIDIA are working together to simplify MLOps by integrating the solution elements needed to build, serve, and dynamically scale end-to-end ML pipelines with right-sized GPU acceleration, making those pipelines both powerful and cost effective. You can focus on delivering the highest value to your end customers while maximizing infrastructure utilization and minimizing the operational cost of deploying AI-enabled services.

GKE + MIG = MLOps portability, scalability, productivity

Google Kubernetes Engine (GKE) is a managed environment for deploying, scaling, and managing containerized ML applications on secure Google infrastructure. GKE simplifies cluster creation, load balancing, demand-based autoscaling of compute resources, and more. Most importantly, GKE eliminates the need for users to manage their own workstations, servers, and VMs when building and deploying ML pipelines, so you can focus on the most valuable task: building and training ML models for your business use cases.

Google Kubernetes Engine (GKE) now supports Multi-Instance GPU (MIG), which makes it possible to partition each NVIDIA A100 Tensor Core GPU in the new A2 VM instances into up to seven independent GPU instances, each with its own high-bandwidth memory, cache, and compute cores. With MIG, GKE can provision GPU resources for workloads at a finer granularity, share a single GPU across multi-user and multi-model use cases, and automatically scale up or down based on the changing needs of the ML pipeline.
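To make the "up to seven instances" arithmetic concrete, the following minimal sketch estimates how many physical A100 GPUs a set of workloads would occupy when each workload gets one MIG instance. The profile names and per-GPU counts are an assumption based on NVIDIA's published MIG profiles for the 40 GB A100 and should be checked against current documentation; the helper names gpus_needed and utilization are hypothetical.

```python
import math

# Instances per A100 40GB GPU for each MIG profile.
# Assumption: taken from NVIDIA's published MIG profiles; verify before use.
MIG_INSTANCES_PER_GPU = {
    "1g.5gb": 7,
    "2g.10gb": 3,
    "3g.20gb": 2,
    "7g.40gb": 1,
}


def gpus_needed(num_workloads: int, profile: str) -> int:
    """Physical GPUs required when each workload gets one MIG instance."""
    if num_workloads < 0:
        raise ValueError("num_workloads must be non-negative")
    per_gpu = MIG_INSTANCES_PER_GPU[profile]
    return math.ceil(num_workloads / per_gpu)


def utilization(num_workloads: int, profile: str) -> float:
    """Fraction of provisioned MIG instances that actually host a workload."""
    gpus = gpus_needed(num_workloads, profile)
    if gpus == 0:
        return 0.0
    return num_workloads / (gpus * MIG_INSTANCES_PER_GPU[profile])


if __name__ == "__main__":
    for n in (7, 8, 20):
        print(n, "workloads ->", gpus_needed(n, "1g.5gb"), "GPU(s)")
```

Seven small models fit on one GPU with the smallest profile, while an eighth forces a second GPU that is mostly idle, which is why right-sizing the partition profile matters for cost.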

Figure 2. Multiple AI inference requests served on a single NVIDIA A100 GPU with NVIDIA Triton Inference Server and GKE

For example, GKE can provision multiple A100 MIG instances to handle inference requests for several models at once, each model running on an independent MIG partition within a single A100 GPU, to maximize utilization. When the deployed ML pipeline needs more compute (for example, during a sudden surge in inference requests to the service), GKE can automatically scale out to additional node pools with MIG partitions. In addition, the NVIDIA Collective Communications Library (NCCL) further optimizes multi-GPU, multi-node communication within the GKE cluster to provide high bandwidth, high throughput, and low latency.
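The scale-out behavior described above can be reasoned about with a simple capacity model. The sketch below is not GKE's autoscaling algorithm; it is an illustrative calculation under stated assumptions: each MIG instance sustains a fixed request rate, instances should run at or below a target utilization, and each node exposes seven MIG instances. All function and parameter names are hypothetical.

```python
import math


def desired_mig_instances(requests_per_sec: float,
                          capacity_per_instance: float,
                          target_utilization: float = 0.75) -> int:
    """MIG instances needed so each one stays at or below the target utilization."""
    if capacity_per_instance <= 0 or not 0 < target_utilization <= 1:
        raise ValueError("capacity must be > 0 and target_utilization in (0, 1]")
    effective_capacity = capacity_per_instance * target_utilization
    # Always keep at least one instance so the service stays reachable.
    return max(1, math.ceil(requests_per_sec / effective_capacity))


def nodes_to_add(current_nodes: int,
                 needed_instances: int,
                 instances_per_node: int = 7) -> int:
    """Extra nodes required to host the needed MIG instances (never negative)."""
    needed_nodes = math.ceil(needed_instances / instances_per_node)
    return max(0, needed_nodes - current_nodes)


if __name__ == "__main__":
    # Hypothetical surge: traffic jumps from 300 to 3000 requests per second.
    for rate in (300, 3000):
        instances = desired_mig_instances(rate, capacity_per_instance=100)
        print(rate, "req/s ->", instances, "instances,",
              nodes_to_add(1, instances), "node(s) to add")
```

In this model a tenfold traffic surge moves the service from a fraction of one GPU to several nodes, which is the kind of change a demand-based autoscaler absorbs without manual intervention.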

NVIDIA solution stack for developing and deploying end-to-end machine learning pipelines

To help you develop scalable, GPU-accelerated, end-to-end ML application pipelines that take full advantage of MIG on Google Cloud, NVIDIA provides several application-specific frameworks: NVIDIA Merlin for end-to-end recommender systems, NVIDIA Jarvis for multimodal conversational AI services, and NVIDIA RAPIDS for data analytics pipelines. All NVIDIA-optimized frameworks, SDKs, pre-trained models, and performance-optimized libraries are available from the NGC catalog, the hub for GPU-accelerated software.

Large-scale production deployment of ML pipelines in GKE managed clusters is further simplified by the NVIDIA Triton Inference Server software. With this open-source inference serving software, teams can deploy trained AI models from any framework (TensorFlow, TensorRT, PyTorch, ONNX Runtime, or a custom framework), loaded from local storage or from Google Cloud's managed storage products, on GPU- or CPU-based infrastructure. Triton Inference Server is now available directly on the GCP Marketplace, so you can seamlessly deploy and serve models, monitor performance, and dynamically scale multiple AI inference requests in your MIG-enabled GKE cluster.
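As an illustration of what an inference request to such a server looks like, the sketch below builds the URL and JSON body for a call that follows the KServe v2 HTTP inference protocol, which Triton implements. It only constructs the request and does not send it. The server address, the model name recsys, and the tensor name INPUT0 are hypothetical placeholders, not values from this article.

```python
import json
import math


def build_infer_request(base_url, model_name, input_name, shape, datatype, data):
    """Return (url, body) for a KServe v2 style inference request."""
    # The flattened data must contain exactly as many elements as the shape implies.
    if math.prod(shape) != len(data):
        raise ValueError("data length does not match tensor shape")
    url = f"{base_url.rstrip('/')}/v2/models/{model_name}/infer"
    payload = {
        "inputs": [
            {
                "name": input_name,
                "shape": list(shape),
                "datatype": datatype,
                "data": list(data),
            }
        ]
    }
    return url, json.dumps(payload)


if __name__ == "__main__":
    # Hypothetical endpoint, model, and tensor names for illustration only.
    url, body = build_infer_request(
        "http://localhost:8000", "recsys", "INPUT0",
        shape=(1, 4), datatype="FP32", data=[0.1, 0.2, 0.3, 0.4],
    )
    print(url)
    print(body)
```

The resulting body can be sent as an HTTP POST with any client; because the protocol is the same regardless of which framework produced the model, clients do not need to change when a model is retrained in a different framework.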

Putting it all together

By combining GKE's managed Kubernetes service and the flexibility of A100 MIG with the NVIDIA GPU-optimized solution stack, you can accelerate end-to-end ML pipelines, bring them into production, and address both the development and the infrastructure management challenges of MLOps.

To see these technologies working in practice, check out the GTC21 session “Get a competitive advantage with MLOps: Kubeflow, NVIDIA Merlin and Google Cloud”. Learn how GKE, NVIDIA A100 MIG, and the NVIDIA GPU-optimized solution stack are used to build and deploy end-to-end recommender systems.

