Introduction to Apache Beam

Original post · 2018-04-15 12:10:59

https://beam.apache.org/get-started/beam-overview/

Apache Beam Overview

Apache Beam is an open source, unified model for defining both batch and streaming data-parallel processing pipelines. Using one of the open source Beam SDKs, you build a program that defines the pipeline. The pipeline is then executed by one of Beam’s supported distributed processing back-ends, which include Apache Apex, Apache Flink, Apache Spark, and Google Cloud Dataflow.

Beam is particularly useful for Embarrassingly Parallel data processing tasks, in which the problem can be decomposed into many smaller bundles of data that can be processed independently and in parallel. You can also use Beam for Extract, Transform, and Load (ETL) tasks and pure data integration. These tasks are useful for moving data between different storage media and data sources, transforming data into a more desirable format, or loading data onto a new system.
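To make the idea of a pipeline defined with an SDK and executed by a runner concrete, here is a minimal sketch of an ETL-style word-count pipeline using the Beam Python SDK. The input and output paths (`input.txt`, `counts`) are placeholder assumptions, not taken from the overview above.

```python
# Minimal ETL-style Beam pipeline sketch (Python SDK).
# The file paths are placeholders; adjust them to your environment.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

with beam.Pipeline(options=PipelineOptions()) as p:
    (
        p
        | "Extract" >> beam.io.ReadFromText("input.txt")          # read lines from a text source
        | "Transform" >> beam.FlatMap(lambda line: line.split())  # split lines into words
        | "Count" >> beam.combiners.Count.PerElement()            # count occurrences per word
        | "Format" >> beam.Map(lambda kv: f"{kv[0]}: {kv[1]}")
        | "Load" >> beam.io.WriteToText("counts")                 # write results to text files
    )
```

The same program can later be pointed at a different back-end simply by changing the pipeline options; the transforms themselves do not change.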

Apache Beam SDKs

The Beam SDKs provide a unified programming model that can represent and transform data sets of any size, whether the input is a finite data set from a batch data source, or an infinite data set from a streaming data source. The Beam SDKs use the same classes to represent both bounded and unbounded data, and the same transforms to operate on that data. You use the Beam SDK of your choice to build a program that defines your data processing pipeline.
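The following sketch illustrates the "same transforms, different sources" point: the counting logic is shared, and only the source decides whether the `PCollection` is bounded (a text file) or unbounded (a stream). The Pub/Sub topic name is a placeholder assumption, and the streaming half requires a streaming-capable runner; this is a sketch, not a ready-to-run job.

```python
# Sketch: one transform chain applied to a bounded and an unbounded source.
import apache_beam as beam
from apache_beam.transforms import window
from apache_beam.options.pipeline_options import PipelineOptions

def count_words(lines):
    """Same transform chain regardless of whether `lines` is bounded or unbounded."""
    return (
        lines
        | "Split" >> beam.FlatMap(lambda line: line.split())
        | "Count" >> beam.combiners.Count.PerElement()
    )

# Bounded: a finite data set from a batch source.
with beam.Pipeline(options=PipelineOptions()) as p:
    counts = count_words(p | "ReadBatch" >> beam.io.ReadFromText("input.txt"))

# Unbounded: an infinite data set from a streaming source;
# aggregations over a stream need a windowing strategy.
with beam.Pipeline(options=PipelineOptions(streaming=True)) as p:
    lines = (
        p
        | "ReadStream" >> beam.io.ReadFromPubSub(topic="projects/my-project/topics/words")  # placeholder topic
        | "Decode" >> beam.Map(lambda b: b.decode("utf-8"))
        | "Window" >> beam.WindowInto(window.FixedWindows(60))  # 60-second fixed windows
    )
    counts = count_words(lines)
```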

Apache Beam Pipeline Runners

The Beam Pipeline Runners translate the data processing pipeline you define with your Beam program into the API compatible with the distributed processing back-end of your choice. When you run your Beam program, you’ll need to specify an appropriate runner for the back-end where you want to execute your pipeline.

Beam currently supports Runners that work with the following distributed processing back-ends:

  • Apache Apex
  • Apache Flink
  • Apache Gearpump (incubating)
  • Apache Spark
  • Google Cloud Dataflow

Note: You can always execute your pipeline locally for testing and debugging purposes.
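In the Python SDK, the runner is selected through pipeline options, typically as command-line flags. The sketch below shows the general pattern; the Flink job-server address is an assumption for illustration, and with no `--runner` flag Beam falls back to the local DirectRunner, which is what the note above refers to for testing and debugging.

```python
# Sketch: choosing a runner through pipeline options (Python SDK).
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Local execution for testing and debugging (DirectRunner).
local_options = PipelineOptions(["--runner=DirectRunner"])

# Targeting a distributed back-end instead, e.g. Apache Flink
# (the flink_master address is a placeholder assumption).
flink_options = PipelineOptions([
    "--runner=FlinkRunner",
    "--flink_master=localhost:8081",
])

with beam.Pipeline(options=local_options) as p:
    p | "Create" >> beam.Create([1, 2, 3]) | "Print" >> beam.Map(print)
```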
