Apache Beam, Google Cloud Dataflow, and Creating Custom Templates with Python


Apache Beam

Apache Beam (Batch + Stream) is a unified programming model that defines and executes both batch and streaming data processing jobs. It provides SDKs for building data pipelines and runners for executing them.


Apache Beam provides value in use cases that involve moving data between different storage layers, data transformations, and real-time data processing jobs.


There are three fundamental concepts in Apache Beam, namely:


  • Pipeline — encapsulates the entire data processing task and represents a directed acyclic graph (DAG) of PCollections and PTransforms. It is analogous to Spark's SparkContext.


  • PCollection — represents a data set, which can be a fixed batch or a stream of data. We can think of it as a Spark RDD.


  • PTransform — a data processing operation that takes one or more PCollections and outputs zero or more PCollections. It can be thought of as a Spark transformation/action on RDDs that produces a result.


Apache Beam is designed to make pipelines portable across different runners. In the example below, the pipeline is executed locally using the DirectRunner, which is great for developing, testing, and debugging.

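Switching runners is purely a matter of configuration, not code. A sketch of the idea (the project, region, and bucket names below are placeholders, not real resources):

```python
from apache_beam.options.pipeline_options import PipelineOptions

# Runs the pipeline locally, in-process.
local_options = PipelineOptions(['--runner=DirectRunner'])

# The same pipeline code, shipped to Google Cloud Dataflow instead.
cloud_options = PipelineOptions([
    '--runner=DataflowRunner',
    '--project=my-gcp-project',            # placeholder project id
    '--region=us-central1',
    '--temp_location=gs://my-bucket/tmp',  # placeholder GCS bucket
])
```

The pipeline body stays identical; only the options object passed to beam.Pipeline changes.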

WordCount Example ("Bigdata Hello World"):


import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

with beam.Pipeline(options=PipelineOptions()) as p:
    lines = p | 'Creating PCollection' >> beam.Create(
        ['Hello', 'Hello Good Morning', 'GoodBye'])
    counts = (
        lines
        | 'Tokenizing' >> beam.FlatMap(lambda x: x.split(' '))
        | 'Pairing With One' >> beam.Map(lambda x: (x, 1))
        | 'GroupbyKey And Sum' >> beam.CombinePerKey(sum)
        | 'Printing' >> beam.ParDo(lambda x: print(x[0], x[1])))

Let’s briefly review what the code is doing.

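Conceptually, the pipeline computes the same result as this plain-Python version (shown only to make each step concrete; variable names are mine):

```python
from collections import Counter

lines = ['Hello', 'Hello Good Morning', 'GoodBye']

# 'Tokenizing': FlatMap flattens per-line word lists into one stream of words.
words = [word for line in lines for word in line.split(' ')]

# 'Pairing With One' + 'GroupbyKey And Sum': count occurrences per key.
counts = Counter(words)

print(dict(counts))  # {'Hello': 2, 'Good': 1, 'Morning': 1, 'GoodBye': 1}
```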

beam.Pipeline(options=PipelineOptions()) — creates a Beam pipeline from the given configuration options.


beam.Create — creates a PCollection from in-memory data.
