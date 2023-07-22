



Google recently announced the general availability of the Hive-BigQuery connector, which simplifies integration and migration between Apache Hive and Google BigQuery. The open-source connector is a Hive storage handler that allows Hive to interact with BigQuery’s storage layer.

The connector lets users run queries in Hive with HiveQL, Hive’s SQL-like query language, while reading from and writing to BigQuery. Data engineers can access and query BigQuery datasets without moving the data, and BigQuery users can leverage Hive’s tools, libraries, and frameworks for data processing and analysis. Julien Phalip, Solutions Architect at Google Cloud, writes:

The Hive-BigQuery connector implements the Hive StorageHandler API, allowing you to integrate your Hive workloads with BigQuery and BigLake tables. Hive’s execution engine still handles all compute operations such as aggregations and joins, but the connector manages all interactions with BigQuery’s data layer, whether the underlying data is stored in BigQuery native storage or in a Cloud Storage bucket via a BigLake connection.

Apache Hive, a popular distributed data warehouse option built on Hadoop, allows users to query large datasets. BigQuery, a serverless data warehouse on Google Cloud, provides scalable queries over large datasets. The open-source connector uses Hive metadata to represent tables stored in BigQuery, ensuring data consistency and reliability.
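As a rough illustration of how a Hive table definition can stand in for a BigQuery table, the sketch below assembles the HiveQL `CREATE TABLE` statement in Python. The storage handler class name and the `bq.table` property are assumptions based on the connector's README and should be checked against the release in use; the project, dataset, table, and column names are hypothetical.

```python
# Sketch: assemble a HiveQL DDL statement that maps a Hive table to a BigQuery table.
# ASSUMPTION: the handler class and the 'bq.table' property follow the connector's
# README; verify both against the connector version you deploy.
STORAGE_HANDLER = "com.google.cloud.hive.bigquery.connector.BigQueryStorageHandler"


def build_create_table(hive_table, columns, bq_table):
    """Return HiveQL declaring `hive_table` as backed by `bq_table`.

    columns:  list of (name, hive_type) pairs
    bq_table: 'project.dataset.table' (hypothetical identifiers below)
    """
    column_sql = ", ".join(f"{name} {hive_type}" for name, hive_type in columns)
    return (
        f"CREATE TABLE {hive_table} ({column_sql})\n"
        f"STORED BY '{STORAGE_HANDLER}'\n"
        f"TBLPROPERTIES ('bq.table'='{bq_table}');"
    )


ddl = build_create_table(
    "word_counts",
    [("word", "STRING"), ("word_count", "BIGINT")],
    "myproject.mydataset.word_counts",
)
print(ddl)
```

The printed statement would be run from Beeline or the Hive CLI on a cluster where the connector is installed; later `SELECT` or `INSERT` statements against `word_counts` are then ordinary HiveQL.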

The connector supports queries using the MapReduce and Tez execution engines, creating and deleting BigQuery tables from Hive, and joining BigQuery and BigLake tables with Hive tables. It also supports fast reads from BigQuery tables using Storage Read API streams and Apache Arrow format.

Source: Google Data Analytics Blog

According to the cloud provider, the Hive-BigQuery connector can help companies in scenarios such as ensuring operational continuity during migration, using BigQuery for a subset of their data warehouse needs, and maintaining a full open-source software stack.

With the BigQuery Migration Service, Google already offers the BigQuery Batch SQL Translator and the Interactive SQL Translator, which convert Hive queries to BigQuery’s own ANSI-compliant SQL syntax. Phalip explains:

The new Hive-BigQuery connector gives you one additional option: you can keep your original queries in HiveQL and continue to run them with the Hive execution engine on your cluster, while those queries access the data that has been migrated to BigQuery and BigLake tables.

This is not the first open-source connector Google has released to reduce data transformations and enable analysis of diverse datasets. The Cloud Storage Connector implements the Hadoop Compatible File System (HCFS) API for storing and accessing data files in Cloud Storage. The Apache Spark SQL connector for BigQuery, on the other hand, implements the Spark SQL data source API for reading BigQuery tables into DataFrames in Spark and writing DataFrames back to BigQuery.

The Hive-BigQuery connector supports Dataproc 2.0 and 2.1. Google has outlined some limitations regarding partitioning: the Hive PARTITIONED BY clause is not supported because Hive and BigQuery partition data differently. However, developers can still use the time-unit column and ingestion-time partitioning options supported by BigQuery.
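One way to picture the BigQuery-side alternative: rather than a `PARTITIONED BY` clause, partitioning is requested through table properties. The Python sketch below renders such a `TBLPROPERTIES` clause for time-unit column partitioning. The property names `bq.time.partition.field` and `bq.time.partition.type` are assumptions based on the connector's README and should be verified before use; the table and column names are hypothetical.

```python
# Sketch: render a TBLPROPERTIES clause requesting BigQuery time-unit column
# partitioning for a connector-backed Hive table.
# ASSUMPTION: the 'bq.*' property keys follow the connector's README; verify them.
VALID_UNITS = {"HOUR", "DAY", "MONTH", "YEAR"}  # BigQuery time-unit granularities


def partition_properties(bq_table, field, unit="DAY"):
    """Return a TBLPROPERTIES clause partitioning `bq_table` on `field`."""
    if unit not in VALID_UNITS:
        raise ValueError(f"unsupported partition unit: {unit}")
    props = {
        "bq.table": bq_table,
        "bq.time.partition.field": field,
        "bq.time.partition.type": unit,
    }
    body = ",\n".join(f"  '{key}'='{value}'" for key, value in props.items())
    return f"TBLPROPERTIES (\n{body}\n)"


clause = partition_properties("myproject.mydataset.events", "event_ts", "MONTH")
print(clause)
```

The clause would take the place of the plain `TBLPROPERTIES` section in a `CREATE TABLE ... STORED BY` statement; the partitioning column itself is declared as a regular column of the Hive table.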

The connector is available on GitHub.

