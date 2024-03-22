



When developers are innovating rapidly, security can take a backseat. This is true even for AI/ML workloads, where the stakes are high for organizations looking to protect their valuable models and data.

When you deploy your AI workloads on Google Kubernetes Engine (GKE), you can benefit from the many security tools available on Google Cloud infrastructure. In this blog, we share security insights and hardening techniques for training AI/ML workloads on one framework in particular: Ray.

Ray needs more security

Ray has grown in popularity as a distributed computing framework for AI applications in recent years, and GKE has become a popular place to deploy it thanks to its flexibility and configurable orchestration. You can read more about why we recommend GKE for Ray here.

However, Ray does not have built-in authentication and authorization, so if a request is successfully sent to the Ray cluster head, arbitrary code will be executed on the user's behalf.
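To make the gap concrete, here is a minimal sketch of what a job submission to a Ray head looks like at the HTTP level. It assumes the Ray Jobs REST endpoint (`/api/jobs/`) on the default dashboard port 8265. The host name is hypothetical, and the request is only built, never sent. The point is that nothing in the request identifies or authenticates the caller.

```python
import json
import urllib.request

# Assumption: the Ray head exposes its Jobs API on the default dashboard port 8265.
# The host name below is hypothetical and used only for illustration.
RAY_HEAD = "http://ray-head.example.internal:8265"


def build_job_request(entrypoint: str) -> urllib.request.Request:
    """Build (but do not send) a job submission request for a Ray head."""
    body = json.dumps({"entrypoint": entrypoint}).encode("utf-8")
    return urllib.request.Request(
        url=f"{RAY_HEAD}/api/jobs/",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_job_request("python train.py")
print(req.get_method(), req.full_url)
# No token, cookie, or client identity is attached to the request.
print("carries credentials:", req.has_header("Authorization"))
```

Because the head runs whatever entrypoint it receives, anyone who can reach that port can run code in the cluster. This is why access has to be controlled in front of Ray rather than inside it.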

So how do you secure Ray? Ray's authors say security should be enforced outside of the Ray cluster, but how do you actually harden it? When Ray runs on GKE, you can leverage existing global Google infrastructure components, such as Identity-Aware Proxy (IAP), for more secure, scalable, and reliable Ray deployments.

We've also made progress in the Ray community to create more secure defaults for running Ray on Kubernetes using KubeRay. One area of focus is improving Ray component compliance with the restricted Pod Security Standards profile and adding security best practices, such as running the operator as non-root to prevent privilege escalation.
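As an illustration of what the restricted profile asks of a container, the sketch below builds a `securityContext` using standard Kubernetes fields and checks a container spec against a few of the profile's rules. The helper names and the container image are hypothetical. The check covers only a subset of the profile and only container-level settings, so treat it as a reading aid rather than a replacement for Pod Security admission.

```python
import json
from typing import List


def restricted_security_context() -> dict:
    """Container-level settings commonly needed under the restricted profile."""
    return {
        "runAsNonRoot": True,
        "allowPrivilegeEscalation": False,
        "capabilities": {"drop": ["ALL"]},
        "seccompProfile": {"type": "RuntimeDefault"},
    }


def restricted_violations(container: dict) -> List[str]:
    """Return a list of problems for a container spec (simplified, partial check)."""
    sc = container.get("securityContext") or {}
    problems = []
    if sc.get("runAsNonRoot") is not True:
        problems.append("runAsNonRoot must be true")
    if sc.get("allowPrivilegeEscalation") is not False:
        problems.append("allowPrivilegeEscalation must be false")
    if "ALL" not in (sc.get("capabilities") or {}).get("drop", []):
        problems.append("capabilities.drop must include ALL")
    if (sc.get("seccompProfile") or {}).get("type") not in ("RuntimeDefault", "Localhost"):
        problems.append("seccompProfile.type must be RuntimeDefault or Localhost")
    return problems


# Hypothetical container specs; the image name is a placeholder.
hardened = {
    "name": "ray-head",
    "image": "ray-image-placeholder",
    "securityContext": restricted_security_context(),
}
default = {"name": "ray-head", "image": "ray-image-placeholder"}

print(json.dumps(hardened["securityContext"], indent=2))
print("hardened:", restricted_violations(hardened))
print("default:", restricted_violations(default))
```

The hardened spec passes all four checks, while a container with no `securityContext` fails every one. This is the gap that more secure defaults are meant to close.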

Security separation supports multi-cluster operation

One of the key benefits of running Ray within Kubernetes is that you can run multiple Ray clusters with diverse workloads, managed by multiple teams, within a single Kubernetes cluster. This allows nodes with accelerators to be used by multiple teams, improving resource sharing and utilization. It also saves time: by spinning up Ray on an existing GKE cluster, you can start running your workloads without waiting for VM provisioning.

Security plays a supporting role in achieving the benefits of multi-cluster by isolating Ray clusters using Kubernetes security features. The goal is to avoid accidental denial of service or accidental cross-tenant access. Note that security isolation here is not hard multi-tenancy, and is only sufficient for clusters running trusted code and teams that trust each other with data. If you need more isolation, consider using a separate GKE cluster.
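One way to guard against accidental denial of service between teams is a per-namespace `ResourceQuota`. The sketch below generates a namespace and a quota manifest as JSON, which `kubectl apply -f` accepts. The `ray-<team>` naming scheme, the label, and the limits are assumptions chosen for illustration, and the GPU key assumes NVIDIA accelerators exposed as `nvidia.com/gpu`.

```python
import json


def team_namespace(team: str) -> dict:
    """Namespace that holds one team's Ray cluster (hypothetical naming scheme)."""
    return {
        "apiVersion": "v1",
        "kind": "Namespace",
        "metadata": {"name": f"ray-{team}", "labels": {"team": team}},
    }


def team_quota(team: str, cpu: str, memory: str, gpus: str) -> dict:
    """ResourceQuota capping what one team's Ray pods may request in total."""
    return {
        "apiVersion": "v1",
        "kind": "ResourceQuota",
        "metadata": {"name": "ray-quota", "namespace": f"ray-{team}"},
        "spec": {
            "hard": {
                "requests.cpu": cpu,
                "requests.memory": memory,
                # Assumes GPUs are exposed as the nvidia.com/gpu extended resource.
                "requests.nvidia.com/gpu": gpus,
            }
        },
    }


# Illustrative values only; size these to your own node pools.
manifests = [
    team_namespace("team-a"),
    team_quota("team-a", cpu="64", memory="256Gi", gpus="4"),
]
for manifest in manifests:
    print(json.dumps(manifest, indent=2))
```

With a quota in place, a runaway autoscaling Ray cluster in one namespace cannot request all of the accelerators that other teams depend on.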

The architecture is shown in the following diagram. Different Ray clusters are separated by namespaces within the GKE cluster, so authorized users can call their assigned Ray cluster without accessing other clusters.
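On the Kubernetes API side, this kind of namespace-scoped access could be expressed with RBAC roughly as follows. The sketch emits a `Role` and a `RoleBinding` that let one group manage Ray resources only in its own namespace. The group name, namespace, and resource list are assumptions for illustration. `ray.io` is the API group used by the KubeRay custom resources. This covers Kubernetes API access only, and reaching the Ray head itself still has to be restricted separately.

```python
import json


def ray_role(namespace: str) -> dict:
    """Role allowing management of KubeRay resources within a single namespace."""
    return {
        "apiVersion": "rbac.authorization.k8s.io/v1",
        "kind": "Role",
        "metadata": {"name": "ray-user", "namespace": namespace},
        "rules": [
            {
                "apiGroups": ["ray.io"],
                "resources": ["rayclusters", "rayjobs", "rayservices"],
                "verbs": ["get", "list", "watch", "create", "update", "delete"],
            }
        ],
    }


def ray_role_binding(namespace: str, group: str) -> dict:
    """Bind the Role to one group; the binding is valid only in this namespace."""
    return {
        "apiVersion": "rbac.authorization.k8s.io/v1",
        "kind": "RoleBinding",
        "metadata": {"name": "ray-user-binding", "namespace": namespace},
        "subjects": [
            {"kind": "Group", "name": group, "apiGroup": "rbac.authorization.k8s.io"}
        ],
        "roleRef": {
            "kind": "Role",
            "name": "ray-user",
            "apiGroup": "rbac.authorization.k8s.io",
        },
    }


# Hypothetical namespace and group, used only for illustration.
role = ray_role("ray-team-a")
binding = ray_role_binding("ray-team-a", "team-a-ray-users@example.com")
print(json.dumps([role, binding], indent=2))
```

Because a `Role` and its `RoleBinding` are namespaced objects, members of the bound group get no access to Ray clusters in other teams' namespaces unless a separate binding grants it.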

