Apache Beam
Apache Beam (Batch + strEAM) is a unified programming model for defining and executing both batch and streaming data processing jobs. It provides SDKs for building data pipelines and runners for executing them.
Apache Beam provides value in use cases that involve moving data between different storage layers, data transformations, and real-time data processing jobs.
There are three fundamental concepts in Apache Beam, namely:
Pipeline — encapsulates the entire data processing task and represents a directed acyclic graph (DAG) of PCollections and PTransforms. It is analogous to a Spark context.
PCollection — represents a data set, which can be a fixed batch or a stream of data. We can think of it as a Spark RDD.
PTransform — a data processing operation that takes one or more PCollections as input and outputs zero or more PCollections. It can be thought of as a Spark transformation/action on RDDs that produces a result.
Apache Beam is designed so that pipelines are portable across different runners. In the example below, the pipeline is executed locally using the DirectRunner, which is great for developing, testing, and debugging.
WordCount Example (“Bigdata Hello World”):
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

with beam.Pipeline(options=PipelineOptions()) as p:
    lines = p | 'Creating PCollection' >> beam.Create(
        ['Hello', 'Hello Good Morning', 'GoodBye'])
    counts = (
        lines
        | 'Tokenizing' >> beam.FlatMap(lambda x: x.split(' '))
        | 'Pairing With One' >> beam.Map(lambda x: (x, 1))
        | 'GroupbyKey And Sum' >> beam.CombinePerKey(sum)
        | 'Printing' >> beam.Map(lambda x: print(x[0], x[1])))
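For intuition, the same chain of transforms can be written in plain Python (no Beam required); each step below mirrors one of the labeled transforms in the pipeline:

```python
# Plain-Python sketch of the WordCount pipeline, for intuition only:
# FlatMap -> Map -> CombinePerKey(sum) over the same three input lines.
lines = ['Hello', 'Hello Good Morning', 'GoodBye']

tokens = [word for line in lines for word in line.split(' ')]  # 'Tokenizing'
pairs = [(word, 1) for word in tokens]                         # 'Pairing With One'

counts = {}
for word, one in pairs:                                        # 'GroupbyKey And Sum'
    counts[word] = counts.get(word, 0) + one

print(counts)  # {'Hello': 2, 'Good': 1, 'Morning': 1, 'GoodBye': 1}
```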
Let’s briefly review what the code is doing.
beam.Pipeline(options=PipelineOptions())
- creates a Beam pipeline from the given configuration options.
beam.Create
- creates a PCollection from in-memory data (here, a list of strings).