Date: 2021-08-13



Google Sheets is a very popular office tool at Spotify. One benefit of BigQuery is that it can work directly with data stored in Google Sheets. Connected Sheets, a new BigQuery data connector, lets you access, analyze, visualize, and share billions of rows of data from your spreadsheet. This makes it much easier for data scientists and engineers to collaborate with non-technical users. However, BigQuery behaves in some unexpected ways with Google Sheets that are not intuitive at first. This article walks through the main use cases of Google Sheets in BigQuery, with the goal of avoiding confusion in a Python pipeline environment.

The phrase "create table" can be misleading. A more accurate term here would be "create table connection": BigQuery establishes a live connection to the sheet rather than importing the data. The BigQuery console uses "create table", though, so this article does too.

Here’s an overview of what you need to do to create a table from Google Sheets:

1. In the BigQuery console, choose "Create table" and select "Drive" as the source. Copy the sheet's URI and paste it into the next box, then choose "Google Sheets" as the file format.
2. Enter the table's metadata: project, dataset, and table name.
3. Enter the schema manually. We recommend this because schema auto-detection does not always work correctly.
4. Set the number of header rows to skip: 0 if there is no header, n if there are n header rows.

Figure: Creating a table from a Google Sheets spreadsheet.
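The same table definition can also be written as a DDL statement instead of being clicked together in the console. The sketch below only builds the statement text. It assumes BigQuery's `CREATE EXTERNAL TABLE` syntax with the `GOOGLE_SHEETS` format; the function name, table name, columns, and sheet URL are placeholders, not values from this article.

```python
def sheets_table_ddl(table, columns, sheet_uri, header_rows=0):
    """Build a CREATE EXTERNAL TABLE statement for a Google Sheets source.

    table       -- "project.dataset.table" (placeholder name)
    columns     -- list of (column_name, bigquery_type) pairs, entered manually
    sheet_uri   -- URL of the spreadsheet
    header_rows -- number of header rows to skip (0 if there is no header)
    """
    schema = ",\n  ".join(f"{name} {col_type}" for name, col_type in columns)
    return (
        f"CREATE EXTERNAL TABLE `{table}` (\n"
        f"  {schema}\n"
        ")\n"
        "OPTIONS (\n"
        "  format = 'GOOGLE_SHEETS',\n"
        f"  uris = ['{sheet_uri}'],\n"
        f"  skip_leading_rows = {int(header_rows)}\n"
        ")"
    )


ddl = sheets_table_ddl(
    "YOUR_PROJECT.YOUR_DATASET.incident_tracker",
    [("incident_date", "DATE"), ("num_of_tables", "INT64")],
    "https://docs.google.com/spreadsheets/d/YOUR_SHEET_ID",
    header_rows=1,
)
print(ddl)
```

Spelling out the columns here mirrors the advice above to define the schema by hand rather than rely on auto-detection.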

Once the table is created, you can query it in BigQuery. A common mistake, however, is to assume the data is now stored in BigQuery as a separate copy. It is not. BigQuery only creates a live connection to Google Sheets, which means:

- If you apply a filter in Google Sheets, BigQuery can no longer see the whole dataset. Filters take effect immediately and change how the data appears in BigQuery. So whenever BigQuery needs access to all the data, use filter views in Google Sheets instead, which only affect your own view.
- You cannot edit the schema of the existing BigQuery table. If the schema changes, for example a column is renamed or its type changes, drop the current BigQuery table and go through the table creation process again.

If you run the job with a service account, be sure to share the Google Sheet with that service account as an editor.

The most important step in setting up a Google Sheet to be read as a BigQuery table is to extend the scopes of the BigQuery client in the Python BigQuery API.

import pandas as pd
from google.cloud import bigquery

# Add the Drive and Sheets scopes to the client's default scopes
scopes = (
    'https://www.googleapis.com/auth/drive',
    'https://www.googleapis.com/auth/drive.file',
    'https://www.googleapis.com/auth/spreadsheets',
)
bigquery.Client.SCOPE += scopes
client = bigquery.Client()
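Because `SCOPE` is a plain tuple on the client class, `+=` appends every time that line runs, so re-running a notebook cell or executing the setup twice adds the same scopes again. A minimal sketch of an order-preserving merge follows. `merge_scopes` is a hypothetical helper, and the default scope shown is an assumption about what the client ships with.

```python
def merge_scopes(existing, extra):
    """Return `existing` followed by any scopes in `extra` not already present."""
    merged = list(existing)
    for scope in extra:
        if scope not in merged:
            merged.append(scope)
    return tuple(merged)


# Assumed default; check bigquery.Client.SCOPE in your installed version.
default_scopes = ("https://www.googleapis.com/auth/cloud-platform",)
sheet_scopes = (
    "https://www.googleapis.com/auth/drive",
    "https://www.googleapis.com/auth/drive.file",
    "https://www.googleapis.com/auth/spreadsheets",
)

once = merge_scopes(default_scopes, sheet_scopes)
twice = merge_scopes(once, sheet_scopes)  # merging again changes nothing
print(len(once), len(twice))
```

In a pipeline this would be used as `bigquery.Client.SCOPE = merge_scopes(bigquery.Client.SCOPE, scopes)` before the client is created.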

The following line of code then loads the query results into a pandas dataframe.

gs = client.query(
    """SELECT * FROM `YOUR_PROJECT.YOUR_DATASET.{table}`""".format(table=table_name)
).result().to_dataframe()
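Since the table name is interpolated straight into the SQL text, it can be worth checking it before building the query. The sketch below is one way to do that. `sheet_query` is a hypothetical helper, and the allowed-character pattern is a deliberately conservative assumption, stricter than what BigQuery itself permits.

```python
import re

# Conservative assumption: letters, digits, and underscores only.
_TABLE_NAME = re.compile(r"[A-Za-z0-9_]+")


def sheet_query(table_name, project="YOUR_PROJECT", dataset="YOUR_DATASET"):
    """Build the SELECT statement, rejecting names that could alter the query."""
    if not _TABLE_NAME.fullmatch(table_name):
        raise ValueError(f"unexpected table name: {table_name!r}")
    return f"SELECT * FROM `{project}.{dataset}.{table_name}`"


print(sheet_query("incident_tracker"))
```

The returned string can be passed to `client.query(...)` in place of the formatted literal.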

There are several Python packages that let you write to Google Sheets. This article uses pygsheets as an example. If you don't plan to run the job in a pipeline, the package is straightforward to use. If you do, you need a workaround: for security reasons, the service account key file cannot be stored on the cloud server, so the normal approach will not work. Let's look at examples of both scenarios.

Suppose you plan to use a service account key for this job. Here is a brief introduction to service accounts.

"A service account is an account associated with an email address. The account is authenticated with a public/private key pair, so it is more secure than other options as long as the private key stays private." (pygsheets)

To create a service account, follow these steps:

1. Go to the "Credentials" tab and choose "Create credentials" > "Service account key". Then select App Engine default as the service account, select JSON as the key type, and click "Create".

2. Click the download button to download the client_secret[…].json file. Remember where you saved it as a local JSON file.

Now that the service account key file is ready, let’s import the required packages.

import requests
import json
import pygsheets
import styx_secrets
from google.oauth2 import service_account
from google.cloud import bigquery
import pandas as pd
import numpy as np

Authorization is simple.

gc = pygsheets.authorize(client_secret='path/to/client_secret[…].json')

If the job runs in a pipeline, however, the workaround is to pass the key as a dictionary, which makes the process a bit more involved.

First, you need to encrypt your service account key. Spotify has a package called styx-secrets for this purpose. Note that the decrypted value must be decoded as UTF-8 for pygsheets to recognize the key.

import styx_secrets

key1 = styx_secrets.decrypt("FIRST_HALF_OF_ENCRYPTED_KEY_STRING").decode("utf-8")
key2 = styx_secrets.decrypt("SECOND_HALF_OF_ENCRYPTED_KEY_STRING").decode("utf-8")

The original key string was too long for styx-secrets to accept, so I had to split the key into two strings.

Then you can build the key as a dictionary. It is basically a copy of the JSON key file, except that the actual private key is replaced with the decrypted strings (note that the key needs to be reformatted after decryption).

sc_key = {
    "type": "service_account",
    "project_id": "YOUR_PROJECT",
    "private_key_id": "YOUR_PRIVATE_KEY_ID",
    # Rejoin the two halves and turn literal "\n" sequences into real newlines
    "private_key": (
        "-----BEGIN PRIVATE KEY-----" + key1 + key2 + "-----END PRIVATE KEY-----\n"
    ).replace("\\n", "\n"),
    "client_email": "YOUR_SERVICE_ACCOUNT_EMAIL",
    "client_id": "SERVICE_ACCOUNT_CLIENT_ID",
    "auth_uri": "https://accounts.google.com/o/oauth2/auth",
    "token_uri": "https://oauth2.googleapis.com/token",
    "auth_provider_x509_cert_url": "https://www.googleapis.com/oauth2/v1/certs",
    "client_x509_cert_url": "https://www.googleapis.com/robot/v1/metadata/x509/xxxxxxxx",
}

# The keyword argument is `scopes` (plural)
credentials = service_account.Credentials.from_service_account_info(
    sc_key,
    scopes=[
        'https://www.googleapis.com/auth/spreadsheets',
        'https://www.googleapis.com/auth/drive',
        'https://www.googleapis.com/auth/drive.file',
    ],
)
gc = pygsheets.authorize(custom_credentials=credentials)
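The `replace` call in the dictionary is easy to misread. In the JSON key file, line breaks inside `private_key` are written as the two characters backslash and `n`. If the key body is copied as raw text, those stay literal after decryption and have to be turned back into real newlines. Here is a toy sketch with a made-up key body (not a real key), assuming the split happens at an arbitrary midpoint:

```python
# Made-up key body as it appears in the raw JSON text: "\\n" is backslash + n.
raw_body = "\\nMIIEvAIBADANBgkq\\nhkiG9w0BAQEFAASC\\n"

# Split into two halves, as was needed before encrypting each part.
middle = len(raw_body) // 2
key1, key2 = raw_body[:middle], raw_body[middle:]

# Reassemble, then convert the literal backslash-n pairs into real newlines.
private_key = (
    "-----BEGIN PRIVATE KEY-----" + key1 + key2 + "-----END PRIVATE KEY-----\n"
).replace("\\n", "\n")

print(private_key)
```

Because the halves are concatenated before `replace` runs, it does not matter if the split happens to fall in the middle of a backslash-n pair.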

Now that authorization is complete, the fun can begin. Here is an example of writing to Google Sheets.

Suppose you have a Google Sheet that tracks pipeline incidents so your team can easily analyze the data.

# Open the Google Sheet
sh = gc.open('Data Incident Tracker')

# Select the first sheet
wks = sh[0]

# Import the previous incidents
track_prev = client.query(
    """SELECT * FROM `YOUR_PROJECT.YOUR_DATASET.{table}`""".format(table=table_name)
).result().to_dataframe()

# Current incident
track_now = [{
    'incident_date': date_param,
    'num_of_tables': len(missing_list),
    'table_list': ', '.join(missing_list),
    'alert': True,
    'root_cause': root_cause,
}]

# Create a new dataframe with every incident
tracker_df = pd.concat([track_prev, pd.DataFrame(track_now)])

# Update the first sheet with tracker_df
wks.set_dataframe(tracker_df.replace(np.nan, '', regex=True), 'A1')

Finally, note that if the dataframe contains None values, they are uploaded to Google Sheets as the string NaN. To work around this, use replace(np.nan, '', regex=True) to replace them with empty strings. The cells are then uploaded to Google Sheets as truly empty cells.

This is the last article in the Python series on Google BigQuery. I hope you found it helpful. If you have any questions, feel free to reach out or leave a comment. Cheers!
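To see what that replacement does without touching a real sheet, here is a pandas-free sketch of the same idea applied to plain rows. `blank_missing` is a hypothetical helper, not part of pygsheets or pandas; it treats both `None` and float NaN as missing.

```python
import math


def blank_missing(rows):
    """Return a copy of `rows` with None and NaN cell values replaced by ''."""
    def clean(value):
        if value is None:
            return ""
        if isinstance(value, float) and math.isnan(value):
            return ""
        return value

    return [[clean(value) for value in row] for row in rows]


rows = [
    ["2021-08-13", 2, "table_a, table_b", True, None],
    ["2021-08-14", 1, "table_c", True, float("nan")],
]
print(blank_missing(rows))
```

Values that are present, including `True` and numeric zero, pass through unchanged; only the missing cells become empty strings.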

