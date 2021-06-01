



Apache Beam is a unified programming model that supports both batch and stream processing.

In the 21st century, most companies rely on scalable platforms and on the datafication of their services or products to remain competitive in the market. In addition, the proliferation of data from different sources, with varying volume, velocity, and variety, calls for new data strategies. Hence the need for data pipelines, either to integrate data from all the different sources into a common destination for rapid analysis, or to process and stream data between connected applications and systems.

As a result, organizations have begun to deploy either batch or streaming pipelines based on their business needs.

Pause: Before moving on, let's take a quick trip down memory lane. What is the difference between batch and streaming pipelines?

Batch processing:

Used on bounded datasets. A set of data is collected over a period of time and processed in a single shot. The focus is on throughput rather than latency. Use cases: finding loyal customers of a bank, measuring the difference in sales after a discount, etc.

Stream processing:

Used on unbounded datasets. Data is sent to the processing engine as soon as it is generated. The focus is on latency rather than throughput. Use cases: stock market sentiment analysis, real-time detection of fraudulent transactions, IoT devices, etc.
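To make the contrast concrete, here is a minimal sketch in plain Python (it does not use Beam). The function names `batch_total` and `streaming_totals` and the sample sales figures are made up for illustration. The batch function waits for the whole bounded dataset and produces one result; the streaming function emits an updated result for every element as it arrives.

```python
from typing import Iterable, Iterator, Sequence


def batch_total(sales: Sequence[float]) -> float:
    """Batch: the whole bounded dataset is available, so process it in one shot."""
    return sum(sales)


def streaming_totals(sales: Iterable[float]) -> Iterator[float]:
    """Streaming: emit an updated running total as soon as each element arrives."""
    running = 0.0
    for amount in sales:
        running += amount
        yield running


# Hypothetical sales amounts, used only for this illustration.
collected = [10.0, 20.0, 5.0]

print(batch_total(collected))                   # one result for the whole dataset
print(list(streaming_totals(iter(collected))))  # one result per incoming element
```

The batch version optimizes for processing everything efficiently at once, while the streaming version keeps the delay between an event and its result as small as possible.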

Resume: To set up such a data pipeline, organizations had to adopt different programming models such as Hadoop, Spark, and Flink, each with its own abstractions and APIs for processing batch and streaming data. For example, Spark uses RDDs / DataFrames for batch processing but a separate streaming API (DStreams) for stream processing. As a result, two different pipelines have to be maintained, each tied to its own execution engine. This not only adds to the overall maintenance overhead but also locks organizations into the associated execution engine.

To alleviate these challenges, Google incubated the Dataflow model, which can be applied to both bounded and unbounded datasets, and donated its SDK to the Apache Software Foundation.

Since then, a community of contributors has grown it into Apache Beam: a unified programming model that offers easy-to-use data parallelism for both streaming and batch workflows and, most importantly, platform independence (portability through support for multiple runners), which eliminates API lock-in.

Apache Beam is a unified programming model for both batch and stream processing, with an abstraction layer that lets a pipeline be written in any supported language (Java, Python, Go, etc.) and executed on any supported execution framework, such as Google Cloud Dataflow, Spark, or Flink.
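The idea of writing the logic once and applying it to both bounded and unbounded data can be sketched in plain Python. This is not Beam's API: `build_pipeline`, `keep_large`, `apply_discount`, and `endless_orders` are hypothetical names chosen for illustration, and amounts are integers to keep the arithmetic exact.

```python
import itertools
from typing import Callable, Iterable, Iterator


def build_pipeline(*steps: Callable[[Iterable], Iterator]) -> Callable[[Iterable], Iterator]:
    """Chain steps into one pipeline that accepts any iterable, bounded or not."""
    def run(source: Iterable) -> Iterator:
        data = source
        for step in steps:
            data = step(data)  # each step lazily wraps the previous one
        return data
    return run


def keep_large(amounts: Iterable[int]) -> Iterator[int]:
    """Keep only orders of at least 100 (hypothetical threshold)."""
    return (a for a in amounts if a >= 100)


def apply_discount(amounts: Iterable[int]) -> Iterator[int]:
    """Apply a 10% discount using integer arithmetic."""
    return (a * 9 // 10 for a in amounts)


# The pipeline is defined once...
pipeline = build_pipeline(keep_large, apply_discount)

# ...and applied to a bounded dataset (batch)...
print(list(pipeline([50, 100, 200])))


def endless_orders() -> Iterator[int]:
    """An unbounded source: 60, 120, 180, ... forever."""
    for i in itertools.count(1):
        yield i * 60


# ...and to an unbounded source (stream), reading only the first three results.
print(list(itertools.islice(pipeline(endless_orders()), 3)))
```

Because every step is a lazy generator, the same pipeline object never needs to know whether its input ends. In Beam, that distinction is handled by the model and the runner rather than by separate APIs.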

Apache Beam architecture (figure by the author)

Programming language SDK: Choose Java, Python, or Go to write the pipeline.

Beam / Runner API: Converts the pipeline into a language-agnostic standard representation that the execution engine can consume.

Fn API: Provides language-specific SDK workers that act as an RPC interface for the UDFs embedded in the pipeline as function specifications.

Runner: The selected runner executes the pipeline on the underlying resources, so choosing the right runner is key to efficient execution.
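The layering described above can be imitated with a toy model in plain Python. This is only a sketch of the idea and not how Beam is implemented: `PIPELINE_SPEC`, `UDFS`, `EagerRunner`, and `LazyRunner` are hypothetical names. The pipeline is described as plain data that refers to user-defined functions by name, and two interchangeable runners execute the same description with different strategies.

```python
from typing import Callable, Dict, Iterable, List

# A language-neutral pipeline description: plain data, no executable code.
PIPELINE_SPEC: List[Dict[str, str]] = [
    {"op": "filter", "fn": "is_even"},
    {"op": "map", "fn": "square"},
]

# User-defined functions, looked up by name. In this toy model the registry
# stands in for the language-specific worker that executes UDFs.
UDFS: Dict[str, Callable] = {
    "is_even": lambda x: x % 2 == 0,
    "square": lambda x: x * x,
}


class EagerRunner:
    """Executes the spec by materializing a full list after every step."""

    def run(self, spec: List[Dict[str, str]], data: Iterable[int]) -> List[int]:
        items = list(data)
        for step in spec:
            fn = UDFS[step["fn"]]
            if step["op"] == "map":
                items = [fn(x) for x in items]
            elif step["op"] == "filter":
                items = [x for x in items if fn(x)]
            else:
                raise ValueError("unknown op: " + step["op"])
        return items


class LazyRunner:
    """Executes the same spec one element at a time using iterators."""

    def run(self, spec: List[Dict[str, str]], data: Iterable[int]) -> List[int]:
        stream = iter(data)
        for step in spec:
            fn = UDFS[step["fn"]]
            if step["op"] == "map":
                stream = map(fn, stream)
            elif step["op"] == "filter":
                stream = filter(fn, stream)
            else:
                raise ValueError("unknown op: " + step["op"])
        return list(stream)


# The same pipeline description runs unchanged on either runner.
for runner in (EagerRunner(), LazyRunner()):
    print(type(runner).__name__, runner.run(PIPELINE_SPEC, range(1, 7)))
```

Both runners return the same result for the same spec, which is the portability argument in miniature: the pipeline author commits to the description, not to an execution engine.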





