Getting Started with Spark Streaming (Part 3)

This post explains how to configure Apache Flume to integrate with Spark Streaming. There are two approaches: a Flume-style push-based approach, in which Flume pushes data to an Avro receiver set up by Spark Streaming, and a pull-based approach, in which a custom Flume sink is run and Spark Streaming pulls data from it through a reliable receiver, giving stronger reliability and fault tolerance.

Spark Streaming + Flume Integration Guide

Apache Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. Here we explain how to configure Flume and Spark Streaming to receive data from Flume. There are two approaches to this.


Approach 1: Flume-style Push-based Approach

Flume is designed to push data between Flume agents. In this approach, Spark Streaming essentially sets up a receiver that acts as an Avro agent for Flume, to which Flume can push the data. Here are the configuration steps.

General Requirements

Choose a machine in your cluster such that:

- When your Flume + Spark Streaming application is launched, one of the Spark workers must run on that machine.
- Flume can be configured to push data to a port on that machine.

Due to the push model, the streaming application needs to be up, with the receiver scheduled and listening on the chosen port, for Flume to be able to push data.


Configuring Flume

Configure the Flume agent to send data to an Avro sink by adding the following to its configuration file.


agent.sinks = avroSink
agent.sinks.avroSink.type = avro
agent.sinks.avroSink.channel = memoryChannel
agent.sinks.avroSink.hostname = <chosen machine's hostname>
agent.sinks.avroSink.port = <chosen port on the machine>
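
For context, a complete agent definition might look like the sketch below. Only the avroSink settings above are prescribed by the guide; the netcat source, the channel capacity, and the concrete hostname and port are illustrative assumptions.

# Illustrative source: reads newline-separated text from a TCP socket.
agent.sources = netcatSource
agent.channels = memoryChannel
agent.sinks = avroSink

agent.sources.netcatSource.type = netcat
agent.sources.netcatSource.bind = 0.0.0.0
agent.sources.netcatSource.port = 44444
agent.sources.netcatSource.channels = memoryChannel

# In-memory channel buffering events between the source and the sink.
agent.channels.memoryChannel.type = memory
agent.channels.memoryChannel.capacity = 10000

# The Avro sink from the guide, pointed at the chosen Spark worker machine.
agent.sinks.avroSink.type = avro
agent.sinks.avroSink.channel = memoryChannel
agent.sinks.avroSink.hostname = spark-worker-1.example.com
agent.sinks.avroSink.port = 9988

Such an agent can be started with flume-ng agent --conf conf --conf-file example.conf --name agent. Remember that, because of the push model, the Spark Streaming receiver must already be listening on the chosen port before the sink starts pushing.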

Configuring Spark Streaming Application

1. Linking: In your SBT/Maven project definition, link your streaming application against the following artifact (see the Linking section in the main programming guide for further information; a build-definition sketch follows the note below).


groupId = org.apache.spark
artifactId = spark-streaming-flume_2.10
version = 1.6.1

【Note】One caveat here: my remote Maven repository is a Chinese mirror, and version 1.6.1 failed to download every time; switching to 1.6.0 worked. I gave the download half an hour and it still never finished.
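
For illustration, the linking step might look like this in an SBT build (a Maven build would use a <dependency> element with the same coordinates). The spark-streaming core artifact and its "provided" scope are assumptions about a typical cluster deployment, not something the guide mandates:

// build.sbt — a minimal sketch; the flume artifact coordinates come from the guide above.
// spark-streaming itself is assumed to be supplied by the cluster ("provided"),
// while spark-streaming-flume must be bundled with the application jar.
libraryDependencies ++= Seq(
  "org.apache.spark" % "spark-streaming_2.10"       % "1.6.1" % "provided",
  "org.apache.spark" % "spark-streaming-flume_2.10" % "1.6.1"
)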
2. Programming: In the streaming application code, import FlumeUtils and create an input DStream as follows.


import org.apache.spark.streaming.flume._

val flumeStream = FlumeUtils.createStream(streamingContext, [chosen machine's hostname], [chosen port])

Note that the hostname should be the same as the one used by the resource manager in the cluster (Mesos, YARN or Spark Standalone), so that resource allocation can match the names and launch the receiver on the right machine.
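
Putting the pieces together, a minimal push-based application might look like the sketch below. The application name, batch interval, hostname, and port are illustrative assumptions; the hostname and port must match the avroSink settings in the Flume configuration.

import java.nio.charset.StandardCharsets

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.flume._

object FlumePushExample {
  def main(args: Array[String]): Unit = {
    // Assumed host/port: must match avroSink.hostname and avroSink.port,
    // and a Spark worker must run on this host so the receiver can bind it.
    val host = "spark-worker-1.example.com"
    val port = 9988

    val sparkConf = new SparkConf().setAppName("FlumePushExample")
    val ssc = new StreamingContext(sparkConf, Seconds(10))

    // The receiver acts as an Avro agent listening on host:port,
    // to which Flume's Avro sink pushes events.
    val flumeStream = FlumeUtils.createStream(ssc, host, port)

    // Each record is a SparkFlumeEvent wrapping an Avro event; copy the body
    // out of its ByteBuffer and decode it as UTF-8 text.
    val lines = flumeStream.map { e =>
      val buf = e.event.getBody
      val bytes = new Array[Byte](buf.remaining())
      buf.get(bytes)
      new String(bytes, StandardCharsets.UTF_8)
    }
    lines.print()

    ssc.start()
    ssc.awaitTermination()
  }
}

When submitting, the flume connector has to be on the classpath; if the application jar was not assembled with the dependency, one common option is spark-submit --packages org.apache.spark:spark-streaming-flume_2.10:1.6.1 along with the usual submit arguments.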
