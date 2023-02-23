



Cloud Run lets you run your code directly as a web service on Google’s infrastructure. You can configure Pub/Sub to send all messages as HTTP requests using push subscriptions to the Cloud Run service’s HTTPS endpoint. When the request comes in, the code does the work and calls the BigQuery Storage Write API to insert the data into the BigQuery table. You can use any programming language and framework with Cloud Run.

As of February 2022, push subscriptions are the recommended way to integrate Pub/Sub with Cloud Run. A push subscription automatically retries a request if it fails, and you can set up a dead-letter topic to receive messages that have failed all delivery attempts. For more information, see Handling message failures.
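As a rough illustration of the push flow described above, the following Python sketch decodes the JSON envelope that Pub/Sub sends in a push request and maps the outcome to an HTTP status code. The `write_row` callable is a hypothetical stand-in for the BigQuery write and is not part of any library. Returning 400 for malformed messages is one possible choice, not the only one. Pub/Sub treats a non-success status as a failed delivery and retries it, so a message that keeps failing would eventually reach the dead-letter topic, if one is configured.

```python
import base64
import json


def decode_push_envelope(body: bytes) -> dict:
    """Extract and JSON-decode the payload of a Pub/Sub push request body."""
    envelope = json.loads(body)
    # Pub/Sub wraps the published bytes, base64-encoded, in message.data.
    data = envelope["message"]["data"]
    return json.loads(base64.b64decode(data).decode("utf-8"))


def handle_push(body: bytes, write_row) -> int:
    """Return the HTTP status code for a single push delivery.

    write_row is a hypothetical callable that stores one row; in a real
    service it would wrap the call that writes to BigQuery.
    """
    try:
        row = decode_push_envelope(body)
    except (KeyError, TypeError, ValueError):
        # Malformed envelope or payload: report a client error.
        return 400
    try:
        write_row(row)
    except Exception:
        # Write failed: a non-success status asks Pub/Sub to redeliver.
        return 500
    # A success status acknowledges the message.
    return 204
```

In a web framework, the request handler would pass the raw request body to `handle_push` and return the resulting status code with an empty response body.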

There may be moments when no data is sent to the pipeline. In this case, Cloud Run automatically scales the instance count to zero. Conversely, it can scale up to 1,000 container instances to handle peak loads. If cost is a concern, you can set a maximum number of instances.

Cloud Run makes it easy to evolve your data schema. Established tools such as Liquibase can be used to define and manage data schema migrations. Learn more about using Liquibase with BigQuery.

For increased security, set the ingress policy of your Cloud Run service to internal so that it can be reached only from Pub/Sub (and other internal services), create a service account for the subscription, and grant only that service account access to the Cloud Run service. For more information, see the documentation on setting up push subscriptions securely.

Consider using Cloud Run as the processing component of your pipeline if:

You can process messages individually without grouping and aggregating them.

You prefer a general-purpose programming model over specialized SDKs.

You already use Cloud Run to serve your web applications, and you value simplicity and consistency in your solution architecture.

Tip: The Storage Write API uses gRPC streaming instead of REST over HTTP, so it’s more efficient than the old insertAll method.

Approach 3: Advanced processing and aggregation of messages using Dataflow

Cloud Dataflow, a fully managed service for running Apache Beam pipelines on Google Cloud, has long been the foundation for building streaming pipelines on Google Cloud. It works well for pipelines that aggregate groups of data to reduce data volume, and for pipelines that have multiple processing steps. Cloud Dataflow has a UI that makes it easy to troubleshoot issues with multi-step pipelines.

In data streams, grouping is done using windowing. A window function groups an unbounded collection by timestamp. Multiple windowing strategies are available, including tumbling, hopping, and session windows. For more information, see the data streaming documentation.
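To make the three windowing strategies concrete, here is a minimal plain-Python sketch of how timestamps map to windows. It does not use Apache Beam or Dataflow. The function names and the convention of integer timestamps in seconds are assumptions made for illustration only.

```python
def tumbling_window(ts: int, size: int) -> tuple:
    """Return the single fixed-size window [start, end) that contains ts."""
    start = ts - (ts % size)
    return (start, start + size)


def hopping_windows(ts: int, size: int, period: int) -> list:
    """Return every window of length `size`, started every `period`
    seconds, that contains ts. Windows overlap when period < size."""
    windows = []
    start = ts - (ts % period)  # latest window start at or before ts
    while start > ts - size:    # window [start, start + size) still covers ts
        windows.append((start, start + size))
        start -= period
    return sorted(windows)


def session_windows(timestamps: list, gap: int) -> list:
    """Group timestamps into sessions; a pause of at least `gap`
    seconds between consecutive events starts a new session."""
    sessions = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][-1] < gap:
            sessions[-1].append(ts)
        else:
            sessions.append([ts])
    return sessions
```

For example, with 60-second windows an event at second 125 falls into the tumbling window (120, 180). With a 30-second period it falls into two hopping windows, (90, 150) and (120, 180), so it contributes to two aggregates.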

Cloud Dataflow can also be used for AI/ML workloads and suits users who want to use TensorFlow to preprocess data, train models, and make predictions. See the list of tutorials that integrate Dataflow into end-to-end machine learning workflows.

Cloud Dataflow is built for large-scale data processing. Spotify notably uses it to compute its yearly personalized Wrapped playlists. Read the insightful blog post about the 2020 Wrapped pipeline on the Spotify Engineering blog.

Dataflow can autoscale its clusters both vertically and horizontally. You can even use GPU-powered instances in your clusters. Cloud Dataflow adds workers to the cluster to meet demand and removes them when they are no longer needed.

Tip: To reduce costs, cap the maximum number of workers in your cluster and set up billing alerts.

Which approach should you choose?

