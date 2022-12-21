



Today, data volumes grow every second (this is essentially what "big data" means). Sources such as emails, transaction data, log files, text files, and audio files are growing exponentially across e-commerce websites (Amazon, Flipkart, Myntra, etc.), healthcare, supply chain, and logistics (shipping, FedEx). This growth makes storage complex, and much of the data is unstructured, which traditional storage handles poorly. Data warehouses address these problems and challenges.

A data warehouse essentially aggregates data from disparate sources into a central, consistent data repository to generate business insights and support effective decision making.

This article shows how to create a data warehouse using BigQuery, a serverless, fully managed, multi-cloud data warehouse offered as a Platform as a Service (PaaS) by Google Cloud Platform.

The goal is to aggregate data from disparate pipelines and sources into a centralized, consistent data repository with a defined, fixed schema to generate business insights and aid decision-making.

Image source Databricks

In the diagram above, data from multiple applications and sources is loaded at specified intervals, and the collected data is formatted to fit the data warehouse's defined, fixed schema. Organizations use this processed data for a variety of purposes, including reporting, analysis, and data mining.
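As a rough illustration of "formatting collected data into a fixed schema", the sketch below normalizes records from two sources into one record type. The column names, the ERP export layout, and the web log line format are all assumptions made up for this example; they do not come from any real system.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class SaleRecord:
    """Fixed warehouse schema (hypothetical columns chosen for illustration)."""
    source: str
    order_id: str
    amount_usd: float
    order_date: date


def from_erp(row: dict) -> SaleRecord:
    # Assumed ERP export: integer id, amount in cents, ISO-formatted date.
    return SaleRecord(
        source="erp",
        order_id=str(row["id"]),
        amount_usd=row["amount_cents"] / 100,
        order_date=date.fromisoformat(row["date"]),
    )


def from_web_log(line: str) -> SaleRecord:
    # Assumed log line format: "order_id|amount|DD/MM/YYYY".
    order_id, amount, raw_date = line.strip().split("|")
    day, month, year = (int(part) for part in raw_date.split("/"))
    return SaleRecord("web", order_id, float(amount), date(year, month, day))


def load_batch(erp_rows, web_lines):
    """Normalize one batch from both sources into the single fixed schema."""
    records = [from_erp(row) for row in erp_rows]
    records += [from_web_log(line) for line in web_lines]
    # Sort by date so downstream reporting sees a consistent order.
    return sorted(records, key=lambda record: record.order_date)
```

Each source gets its own small adapter, so adding a third source means adding one function without touching the schema.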

Data sources can come from many different places (healthcare, ERP, transaction processing systems, supply chains, etc.) and in many different formats (structured, unstructured, semi-structured, etc.).

Now that you know what a data warehouse is and how it works, note that there are several cloud-based data warehouse tools, such as Google BigQuery, Amazon Redshift, Azure Synapse, and Teradata.

This article focuses on BigQuery.

BigQuery is a serverless, fully managed, multi-cloud data warehouse offered as a Platform as a Service (PaaS) by Google Cloud Platform.

With federated queries, you can access data directly from external sources and analyze it in BigQuery without importing it. BigQuery also provides a SQL query engine and has built-in machine learning capabilities.
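In BigQuery SQL, a federated query against an external database is written with the EXTERNAL_QUERY function, which takes a connection ID and the query to run in the external database. The helper below is a minimal sketch that only builds the statement text; the function name, the connection ID, and the inner query are hypothetical, and running the result still requires a connection configured in your project.

```python
def federated_query(connection_id: str, external_sql: str) -> str:
    """Build a BigQuery statement that runs `external_sql` in an external database.

    The inner query is passed as a triple-double-quoted string literal, so it
    may contain single quotes. Inputs that would need escaping are rejected
    rather than escaped, to keep the sketch simple.
    """
    if '"' in connection_id or "\\" in connection_id:
        raise ValueError("connection_id must not contain quotes or backslashes")
    if '"""' in external_sql or external_sql.endswith('"') or "\\" in external_sql:
        raise ValueError("external_sql needs escaping that this sketch does not handle")
    return (
        "SELECT *\n"
        f'FROM EXTERNAL_QUERY("{connection_id}", """{external_sql}""")'
    )


# Hypothetical usage: the connection ID and table name are placeholders.
# print(federated_query("my-project.us.orders-db",
#                       "SELECT id FROM orders WHERE status = 'open'"))
```

The generated text can be pasted into the BigQuery console or passed to a client library; the external rows are read at query time rather than copied into BigQuery storage first.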

It provides prescriptive and descriptive analytics, data storage, centralized data management, and computing resources. It fully supports ACID transactions.

Image Source Google Cloud

The BigQuery architecture helps customers develop data warehouses in a cost-effective manner without worrying about infrastructure, security, database operations, or system engineering.

Image source Google

The BigQuery architecture is built on several low-level Google infrastructure technologies: Borg, Dremel, Colossus, and Jupiter.

Colossus: Colossus is Google's distributed file system. It handles data recovery, replication, and distributed management so that stored data is protected against loss.

Dremel: Dremel is a large multi-tenant compute cluster that turns SQL queries into execution trees and runs them.

Jupiter: Storage and compute resources communicate over the Jupiter network.

Borg: Borg is Google's cluster management system. It orchestrates the services within BigQuery by automating their configuration, management, and coordination.

Image Source Google Cloud
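To make the execution-tree idea behind Dremel a little more concrete, here is a toy sketch of a grouped count evaluated bottom-up: leaf nodes scan their own shard of the data and produce partial results, and intermediate nodes merge those partials until a single result remains. This is only an illustration of the general technique, not Dremel's actual implementation; the function names, the "artist" column, and the fan-in value are assumptions for the example.

```python
from collections import Counter


def leaf_scan(rows):
    """Leaf node: scan one data shard and return a partial COUNT(*) per artist."""
    return Counter(row["artist"] for row in rows)


def mixer(partials):
    """Intermediate or root node: merge the partial counts from child nodes."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


def run_tree(shards, fan_in=2):
    """Evaluate SELECT artist, COUNT(*) ... GROUP BY artist as a tree."""
    if fan_in < 2:
        raise ValueError("fan_in must be at least 2")
    # Level 0: one leaf per shard, each working independently.
    level = [leaf_scan(shard) for shard in shards]
    # Higher levels: merge groups of `fan_in` children until one node is left.
    while len(level) > 1:
        level = [mixer(level[i:i + fan_in]) for i in range(0, len(level), fan_in)]
    return dict(level[0]) if level else {}
```

Because every leaf only touches its own shard, the scan work can run in parallel, and only small partial aggregates travel up the tree.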

Suppose you run an e-commerce company that sells K-pop merchandise and stores data from two different sources (a native store and an online store). Data is also scraped from websites, by type, for analysis.

We are going to develop a data warehouse in BigQuery that aggregates data from all these sources, performs some operations, and queries the data to gain valuable insights and make effective decisions.

Create a project on GCP.

Google Cloud Platform

Create a dataset in BigQuery.

Resources -> BigQuery -> ProjectName -> Create dataset

Create a dataset for the created project
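The same step can also be scripted instead of clicked through in the console. The sketch below assumes the google-cloud-bigquery Python client, whose Client.create_dataset method accepts a "project.dataset" string and an exists_ok flag; the helper name and the project and dataset names in the usage comment are hypothetical.

```python
def ensure_dataset(client, project_id: str, dataset_name: str):
    """Create `project_id.dataset_name` if it does not exist yet.

    `client` is expected to be a google.cloud.bigquery.Client. It is passed in
    so that this helper has no import-time dependency on the library.
    """
    dataset_id = f"{project_id}.{dataset_name}"
    # exists_ok=True makes the call safe to repeat if the dataset already exists.
    return client.create_dataset(dataset_id, exists_ok=True)


# Hypothetical usage (requires the google-cloud-bigquery package and credentials):
# from google.cloud import bigquery
# client = bigquery.Client(project="my-kpop-project")
# ensure_dataset(client, "my-kpop-project", "merch")
```

Passing the client in keeps the helper easy to exercise with a stand-in object before pointing it at a real project.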

Download the dataset from Kaggle and load it into the dataset you just created.
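If you prefer to load the downloaded file from a script rather than through the console upload, something like the following should work with the google-cloud-bigquery Python client. It is a sketch under assumptions: the file is a CSV with a header row, schema auto-detection is acceptable, and the helper name, file path, and table ID are placeholders.

```python
def load_csv(client, csv_path: str, table_id: str) -> int:
    """Load a local CSV file into `table_id` ("project.dataset.table").

    Returns the number of rows in the destination table after the load.
    """
    # Imported here so the module can be read without the package installed.
    from google.cloud import bigquery

    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,  # assumes the first line is a header row
        autodetect=True,      # let BigQuery infer column names and types
    )
    with open(csv_path, "rb") as source_file:
        job = client.load_table_from_file(source_file, table_id, job_config=job_config)
    job.result()  # block until the load job finishes (raises on failure)
    return client.get_table(table_id).num_rows


# Hypothetical usage:
# from google.cloud import bigquery
# client = bigquery.Client(project="my-kpop-project")
# print(load_csv(client, "kpop_merch.csv", "my-kpop-project.merch.products"))
```

Auto-detection is convenient for a first load; for a warehouse with a fixed schema you would normally pass an explicit schema instead.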

The imported table and its schema after uploading the table to the dataset

Analyze the tables imported into the dataset.

Display the first 1000 rows of the table

Display individual artists and their respective quantile products

Display items that are not sold out while inventory counts exist
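The query text behind these results is not shown, so the sketch below is a guess at what such queries might look like. The table name merch and the columns artist, item, sold_out, and inventory_count are hypothetical, and the per-artist query is approximated by a simple product count. To keep the example runnable without a cloud account, the queries are executed against an in-memory SQLite table as a local stand-in; in BigQuery the table would be referenced as dataset.table and sold_out is assumed to be a BOOL column.

```python
import sqlite3

# Hypothetical table and column names; the real schema comes from the Kaggle file.
QUERIES = {
    "first_rows": "SELECT * FROM merch LIMIT 1000",
    "products_per_artist": (
        "SELECT artist, COUNT(*) AS product_count "
        "FROM merch GROUP BY artist ORDER BY artist"
    ),
    "in_stock": (
        "SELECT item FROM merch "
        "WHERE NOT sold_out AND inventory_count > 0 ORDER BY item"
    ),
}


def run_locally(rows):
    """Run the three queries against an in-memory SQLite copy of `rows`.

    Each row is (artist, item, sold_out, inventory_count), with sold_out
    stored as 0 or 1 because SQLite has no separate boolean type.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE merch ("
        "artist TEXT, item TEXT, sold_out INTEGER, inventory_count INTEGER)"
    )
    conn.executemany("INSERT INTO merch VALUES (?, ?, ?, ?)", rows)
    results = {name: conn.execute(sql).fetchall() for name, sql in QUERIES.items()}
    conn.close()
    return results
```

Trying the queries on a handful of local rows first is a cheap way to check the logic before running them against the full table.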

