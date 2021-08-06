



The release of Databricks on Google Cloud Platform (GCP) was a major milestone toward a truly multi-cloud, integrated platform for data, analytics, and AI. Databricks on GCP is a jointly developed service that lets you store all your data on a simple, open lakehouse platform. It is built on standard containers running on Google Kubernetes Engine (GKE).

When we released Databricks on GCP, the feedback was positive. However, some readers asked deeper questions about Databricks and Kubernetes, so we decided to share our reasons for using GKE, what we learned, and some important implementation details.

Why Google Kubernetes Engine?

Open source software and containers

At Databricks, open source is at our core. That’s why we continue to create and contribute to major open source projects such as Apache Spark, MLflow, Delta Lake, and Delta Sharing. As a company, we also contribute back to the community and use open source software every day.

We have been using containers for many years. For example, in MLflow, users build machine learning (ML) models as Docker images, store them in a container registry, and then deploy and run the models from the registry.

Another example is Databricks notebooks: version-controlled container images simplify support for multiple Spark, Python, and Scala versions. More generally, containers speed up software development iterations and make production systems more stable.

Kubernetes and Hyperscale

We are well aware that container orchestration systems such as Kubernetes bring their own challenges. The core concepts of Kubernetes and its rich capabilities require an experienced and knowledgeable data engineering team.

However, Databricks has grown into a hyperscale environment in just a few years by successfully building on containers and open source software. Customers spin up millions of instances every day, and the platform supports hundreds of thousands of data scientists each month.

Security and simplicity

What matters most to us is bringing new capabilities to data engineers and data scientists faster. When designing Databricks on GCP, the engineering team evaluated the best options for meeting security and scalability requirements. Our goal was to simplify the implementation and spend less effort on low-level infrastructure, dependencies, and instance lifecycles. Kubernetes lets our engineers build on the strong momentum of the open source community for infrastructure logic and security.

GKE and other Google Cloud Services

We critically assessed the trade-off between the operational expertise required and the benefits of running a large upstream Kubernetes environment in production, and ultimately decided against operating a self-managed Kubernetes cluster.

The main reasons for choosing GKE instead are Google’s rapid adoption of new Kubernetes versions and its focus on infrastructure security. GKE comes from Google, the original creator of Kubernetes, and is one of the most advanced managed Kubernetes services on the market.

Databricks is integrated with all major GCP services, such as Google Cloud Storage, Google BigQuery, and Looker, and the implementation itself runs on GKE.

Databricks on Google Kubernetes Engine

Splitting a distributed system into a control plane and a data plane is a well-known design pattern. The control plane manages and serves customer configuration. The data plane, which is often much larger, executes customer requests.

Databricks on GCP follows the same pattern. The control plane, operated by Databricks, creates, manages, and monitors the data plane in the customer’s GCP account. The data plane contains the driver and executor nodes of your Spark clusters.

GKE cluster, namespace, custom resource definition

When a Databricks account administrator launches a new Databricks workspace, the corresponding data plane is created in the customer’s GCP account as a regional GKE cluster in a VPC (see Figure 1). There is a 1:1 relationship between workspaces, GKE clusters, and VPCs. Workspace users never interact with data plane resources directly. Instead, they do so indirectly through the control plane, where Databricks enforces access control and resource isolation among workspace users. Databricks also releases GKE compute resources intelligently, based on customer usage patterns, to save costs.

GKE cluster and node pool

The GKE cluster is bootstrapped with a system node pool dedicated to running trusted, workspace-wide services. When launching a Databricks cluster, the user specifies the number of executor nodes and the machine types of the driver and executor nodes. The cluster manager, which is part of the control plane, creates and maintains a GKE node pool for each of these machine types. Driver and executor nodes often run on different machine types and are therefore served from different node pools.
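The node pool bookkeeping described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the Databricks cluster manager: the `ClusterSpec` type and the `required_node_pools` function are hypothetical names, the machine types are example values, and the rule of one node per driver or executor is taken from the pod placement described later in this article.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class ClusterSpec:
    """What a user specifies when launching a Databricks cluster (hypothetical type)."""
    driver_machine_type: str
    executor_machine_type: str
    num_executors: int


def required_node_pools(specs: list[ClusterSpec]) -> dict[str, int]:
    """Return the number of nodes needed per machine type.

    Assumes one node pool per machine type and one node per driver or executor.
    """
    nodes: Counter = Counter()
    for spec in specs:
        nodes[spec.driver_machine_type] += 1
        nodes[spec.executor_machine_type] += spec.num_executors
    return dict(nodes)


# Two example clusters in the same workspace (values are illustrative only).
specs = [
    ClusterSpec("n1-standard-4", "n1-highmem-8", num_executors=4),
    ClusterSpec("n1-standard-4", "n1-standard-4", num_executors=2),
]
pools = required_node_pools(specs)
print(pools)
```

In this sketch, clusters that use the same machine type draw from the same node pool, which is why the first cluster's driver and all nodes of the second cluster are counted together.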

Namespace

Kubernetes provides namespaces for creating virtual clusters with scoped names (hence the name). Individual Databricks clusters are separated from each other by Kubernetes namespaces within a single GKE cluster, and a single Databricks workspace can contain hundreds of Databricks clusters. For additional security, GCP network policies isolate the networks of Databricks clusters within the same GKE cluster. Nodes in a Databricks cluster can communicate only with other nodes in the same cluster, or use a NAT gateway to reach the Internet or other public GCP services.
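To make the isolation idea concrete, here is a minimal sketch of a standard Kubernetes NetworkPolicy that admits ingress traffic only from pods in the same namespace. It assumes that ordinary `networking.k8s.io/v1` NetworkPolicy objects are the mechanism. The article does not show the actual policies Databricks applies, and the function name, policy name, and namespace name below are hypothetical.

```python
import json


def same_namespace_only_policy(namespace: str) -> dict:
    """Build a NetworkPolicy manifest that only admits traffic from pods in `namespace`."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": "allow-same-namespace", "namespace": namespace},
        "spec": {
            # An empty podSelector selects every pod in the namespace.
            "podSelector": {},
            "policyTypes": ["Ingress"],
            # A bare podSelector under `from` matches pods in the policy's own namespace only.
            "ingress": [{"from": [{"podSelector": {}}]}],
        },
    }


policy = same_namespace_only_policy("cluster-1234")  # hypothetical namespace name
print(json.dumps(policy, indent=2))
```

With one namespace per Databricks cluster, a policy of this shape would keep pods of different clusters from reaching each other even though they share a GKE cluster.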

Custom resource definition

Kubernetes was designed from the ground up to let you customize and extend its API through custom resource definitions (CRDs). For every Databricks cluster in a workspace, we deploy a Databricks runtime (DBR) as a Kubernetes CRD.
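As a rough illustration of how this extension mechanism works in general, the sketch below builds two manifests: a definition that registers a new resource type with the Kubernetes API server, and one resource of that type placed in a cluster's namespace. Everything Databricks-specific here is an assumption made for the example: the API group `example.com`, the kind `DatabricksRuntime`, and the `runtimeVersion` field are hypothetical and are not taken from the actual implementation.

```python
GROUP = "example.com"        # hypothetical API group
KIND = "DatabricksRuntime"   # hypothetical kind
PLURAL = "databricksruntimes"


def crd_manifest() -> dict:
    """A CustomResourceDefinition that teaches the API server a new resource type."""
    return {
        "apiVersion": "apiextensions.k8s.io/v1",
        "kind": "CustomResourceDefinition",
        # Kubernetes requires the CRD name to be "<plural>.<group>".
        "metadata": {"name": f"{PLURAL}.{GROUP}"},
        "spec": {
            "group": GROUP,
            "scope": "Namespaced",
            "names": {"kind": KIND, "plural": PLURAL, "singular": "databricksruntime"},
            "versions": [{
                "name": "v1",
                "served": True,
                "storage": True,
                "schema": {"openAPIV3Schema": {
                    "type": "object",
                    "properties": {"spec": {
                        "type": "object",
                        "properties": {"runtimeVersion": {"type": "string"}},
                    }},
                }},
            }],
        },
    }


def runtime_resource(namespace: str, runtime_version: str) -> dict:
    """One custom resource of the new type, scoped to a cluster's namespace."""
    return {
        "apiVersion": f"{GROUP}/v1",
        "kind": KIND,
        "metadata": {"name": "runtime", "namespace": namespace},
        "spec": {"runtimeVersion": runtime_version},
    }
```

The definition is installed once per GKE cluster, while a resource like the one returned by `runtime_resource` would be created for each Databricks cluster.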

Node pool, pods, sidecars

The Spark driver and executors are deployed as Kubernetes pods, which run on nodes of the corresponding node pool as specified by a Kubernetes pod node selector. Each GKE node is used exclusively by a single driver pod or executor pod. The cluster namespace is configured with Kubernetes memory requests and limits.

On each Kubernetes node, Databricks also runs a few trusted daemon containers alongside the driver or executor container. These daemons are trusted sidecar services that facilitate data access and log collection on the node. A driver or executor container can interact with the daemon containers in the same pod only through a restricted interface.
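The pod layout described in this section could look roughly like the manifest built below. This is a sketch under assumptions, not the real pod specification: the container names and images are placeholders, and memory requests and limits are set on the container for simplicity, although the article says they are configured for the cluster namespace. The `cloud.google.com/gke-nodepool` node label is the one GKE attaches to nodes to identify their node pool.

```python
def executor_pod(namespace: str, node_pool: str, memory: str) -> dict:
    """Build a pod manifest pinned to one node pool, with a sidecar container."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": "executor-0", "namespace": namespace},
        "spec": {
            # Schedule the pod only onto nodes that belong to the given node pool.
            "nodeSelector": {"cloud.google.com/gke-nodepool": node_pool},
            "containers": [
                {
                    "name": "executor",
                    "image": "example.com/spark-executor:latest",  # placeholder image
                    # Equal request and limit reserve a fixed amount of memory.
                    "resources": {
                        "requests": {"memory": memory},
                        "limits": {"memory": memory},
                    },
                },
                {
                    "name": "log-collector",  # hypothetical sidecar daemon
                    "image": "example.com/log-collector:latest",  # placeholder image
                },
            ],
        },
    }


pod = executor_pod("cluster-1234", "pool-n1-highmem-8", "40Gi")  # illustrative values
```

Because the sidecar shares the pod with the executor, the two containers are always scheduled onto the same node.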

Frequently Asked Questions (FAQ)

Q: Can I deploy my own pods on the GKE cluster provided by Databricks?

No. You cannot access the Databricks GKE cluster directly. It is locked down for maximum security and configured for minimal resource usage.

Q: Can I deploy Databricks on my own custom GKE cluster?

Currently this is not supported.

Q: Can I use kubectl to access my Databricks GKE cluster?

The GKE cluster for the data plane runs in the customer’s account, but default access restrictions and firewall settings are in place to prevent unauthorized access.

Q: Is Databricks on GKE faster than Databricks on VMs or on other clouds (for example, in cluster startup time)?

The answer depends on many factors, so we recommend running your own measurements. One benefit of the Databricks multi-cloud offering is that such tests can be run quickly. Our initial tests showed that, for a large number of concurrent workers, cold startup time was faster on GKE than on other cloud offerings. Instances with equivalent local SSDs also ran certain Spark workloads slightly faster than instances on other clouds with similar compute core, memory, and disk specifications.

Q: Why not use one GKE cluster per Databricks cluster?

For efficiency reasons. Databricks clusters are created frequently and some are short-lived (for example, for short-running jobs).

Q: How long does it take to start a cluster of 100 nodes?

Startup time is independent of cluster size because nodes boot in parallel, even for large clusters with more than 100 nodes. We recommend measuring startup time for your own setup and settings.
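The reasoning behind this answer can be illustrated with a simplified model: when nodes boot one after another, total time grows with the node count, whereas when they boot in parallel, total time is set by the slowest node. The function names and the uniform boot time of one time unit per node are assumptions made for the example, not measurements, and the model ignores real-world effects such as quotas or image pulls.

```python
def sequential_startup(boot_times: list[float]) -> float:
    """Total time if nodes were booted one after another."""
    return sum(boot_times)


def parallel_startup(boot_times: list[float]) -> float:
    """Total time if all nodes boot at once: the slowest node decides."""
    return max(boot_times)


small = [1.0] * 10    # 10 nodes, one time unit each (illustrative)
large = [1.0] * 100   # 100 nodes, one time unit each (illustrative)

print(sequential_startup(small), sequential_startup(large))
print(parallel_startup(small), parallel_startup(large))
```

In this model the sequential total grows tenfold between the two clusters, while the parallel total stays the same.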

Q: How can I optimize the way pods are assigned to nodes for cost efficiency? I would like to schedule multiple Spark executor pods on a larger node.

The pods are optimally configured by Databricks for their respective usage (driver node or worker node).

Q: Can I bring my own VPC for the GKE cluster?

If you are interested in this feature, please contact your Databricks account manager to learn about the roadmap.

Q: Is it safe for Databricks to run multiple Databricks clusters within a single GKE cluster?

Databricks clusters are fully isolated from each other using Kubernetes namespaces and GCP network policies. Only Databricks clusters in the same Databricks workspace share a GKE cluster, which reduces cost and speeds up provisioning. If you have multiple workspaces, each runs in its own GKE cluster.

Q: Does GKE add network overhead compared to using only VMs?

Our initial tests on GCP, using the iperf3 benchmark on n1-standard-4 instances in us-west2/1, showed excellent pod-to-pod throughput of more than 9 Gbps. GCP generally provides high-throughput connectivity to the Internet with very low latency.
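If you want to run a comparable measurement yourself, a small wrapper around the iperf3 client might look like the sketch below. It assumes that iperf3 is installed, that an iperf3 server (`iperf3 -s`) is already running on the target, and that the test uses TCP, so the JSON report contains `end.sum_received.bits_per_second`. The function names are hypothetical, and this is not the test harness used for the numbers above.

```python
import json
import subprocess


def received_gbps(report: dict) -> float:
    """Extract receiver-side throughput in Gbps from an iperf3 JSON report (TCP test)."""
    return report["end"]["sum_received"]["bits_per_second"] / 1e9


def measure_gbps(server: str, seconds: int = 10) -> float:
    """Run the iperf3 client against `server` and return the measured throughput in Gbps."""
    result = subprocess.run(
        # -c: client mode, -t: test duration in seconds, -J: JSON output
        ["iperf3", "-c", server, "-t", str(seconds), "-J"],
        capture_output=True,
        text=True,
        check=True,
    )
    return received_gbps(json.loads(result.stdout))
```

Calling `measure_gbps` with the address of the server pod returns a single number that can be compared across instance types or regions.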

Q: Now that Databricks is completely containerized, can I pull the Databricks image and use it myself (such as a local Kubernetes cluster)?

Databricks does not currently support this.

Q: Is GCP Databricks restricted to one AZ in the region? How does node assignment to GKE actually work?

The GKE cluster uses all AZs in the region.

Q: What features does GCP Databricks include?

Check this link for the latest information.

The author would like to thank Silviu Tofan for his valuable feedback and support.

Try Databricks for free on GCP!

Source: https://databricks.com/blog/2021/08/06/how-we-built-databricks-on-google-kubernetes-engine-gke.html
