Integrating Spark Streaming with Flume

There are two ways to integrate Spark Streaming with Flume; both demos are walked through below.
GitHub repo: https://github.com/2NaCl/spark_flume_demo

Approach 1: Flume-style Push-based Approach

First, a look at the official documentation. The socket and file-system sources introduced earlier are all basic data sources; here the focus is on the advanced data sources.

[Figure: the advanced data sources listed in the official docs]

The official site lists three advanced data sources; of these, let's look at the Flume documentation.

[Figure: the Spark Streaming + Flume integration guide from the official docs]
The gist: data can pass through multiple Flume agents, chained in series or in parallel, and Spark Streaming acts as an Avro receiver that accepts the data Flume collects.

The setup is:

  1. Run Flume and a Spark worker on the same node
  2. Configure Flume to send data to a port on that node

Also, because Spark Streaming is the receiving side, it has to be started first, listening on the port that Flume pushes data into.

  1. Configure Flume first
# Name the components on this agent
simple-agent.sources = netcat-source
simple-agent.sinks = avro-sink
simple-agent.channels = memory-channel

# Describe/configure the source
simple-agent.sources.netcat-source.type = netcat
simple-agent.sources.netcat-source.bind = linux01
simple-agent.sources.netcat-source.port = 44444

# Describe the sink
simple-agent.sinks.avro-sink.type = avro
simple-agent.sinks.avro-sink.hostname = linux01
simple-agent.sinks.avro-sink.port = 41414 

# Use a channel which buffers events in memory
simple-agent.channels.memory-channel.type = memory
simple-agent.channels.memory-channel.capacity = 1000
simple-agent.channels.memory-channel.transactionCapacity = 100

# Bind the source and sink to the channel
simple-agent.sources.netcat-source.channels = memory-channel
simple-agent.sinks.avro-sink.channel = memory-channel
  2. Write the Spark Streaming application, importing FlumeUtils to create the DStream

First add the new dependency:

		<dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-streaming-flume_2.11</artifactId>
            <version>2.1.1</version>
        </dependency>

Then write a push-mode wordcount demo.
As before, start with the configuration and create the streaming context.
Next, take the input and split it. A Flume event carries a header and a body, and we only want the body, so we extract the body bytes (dropping the header) and trim the leading/trailing whitespace.
After that it's an ordinary wordcount; see the sketch below.
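The original screenshots are gone, so here is a minimal sketch of what the push-mode demo looks like; the object name, batch interval, and argument handling are my assumptions, but FlumeUtils.createStream is the call this approach uses.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.flume.FlumeUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Minimal push-mode wordcount sketch (names and interval assumed, not from the post).
object FlumePushWordCount {
  def main(args: Array[String]): Unit = {
    if (args.length != 2) {
      System.err.println("Usage: FlumePushWordCount <hostname> <port>")
      System.exit(1)
    }
    val Array(hostname, port) = args

    val sparkConf = new SparkConf().setAppName("FlumePushWordCount")
    val ssc = new StreamingContext(sparkConf, Seconds(5))

    // Spark Streaming is the Avro receiver here: Flume's avro sink pushes to hostname:port.
    val flumeStream = FlumeUtils.createStream(ssc, hostname, port.toInt)

    // Each SparkFlumeEvent wraps header + body; keep only the body text and trim it.
    flumeStream.map(e => new String(e.event.getBody.array()).trim)
      .flatMap(_.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
      .print()

    ssc.start()
    ssc.awaitTermination()
  }
}

For a local run inside the IDE, add .setMaster("local[2]") to the SparkConf; the receiver occupies one core, so at least two are needed.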

  3. Local testing

For local testing, change the sink's hostname in the Flume config to the development machine's IP address rather than the server's hostname. Then start the Spark Streaming job, start Flume, type data in through telnet, and watch the output in the IDEA console.
simple-agent.conf

# Name the components on this agent
simple-agent.sources = netcat-source
simple-agent.sinks = avro-sink
simple-agent.channels = memory-channel

# Describe/configure the source
simple-agent.sources.netcat-source.type = netcat
simple-agent.sources.netcat-source.bind = linux01
simple-agent.sources.netcat-source.port = 44444

# Describe the sink
simple-agent.sinks.avro-sink.type = avro
simple-agent.sinks.avro-sink.hostname = 192.168.1.101
simple-agent.sinks.avro-sink.port = 41414

# Use a channel which buffers events in memory
simple-agent.channels.memory-channel.type = memory
simple-agent.channels.memory-channel.capacity = 1000
simple-agent.channels.memory-channel.transactionCapacity = 100

# Bind the source and sink to the channel
simple-agent.sources.netcat-source.channels = memory-channel
simple-agent.sinks.avro-sink.channel = memory-channel

Then start Flume:

flume-ng agent \
--name simple-agent \
--conf /home/centos01/modules/apache-flume-1.7.0-bin/conf/ \
--conf-file /home/centos01/modules/apache-flume-1.7.0-bin/conf/flume_push_streaming.conf \
-Dflume.root.logger=INFO,console

One small snag here: to use telnet, the port has to be opened first, and telnet-server has to be running before the connection will go through.
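For reference, feeding test data looks like this, assuming the netcat source above is listening on linux01:44444 (each line typed becomes one Flume event):

telnet linux01 44444
hello spark
hello spark hello flume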

  4. Deploying with spark-submit

After testing comes deployment: change the sink hostname in the Flume config back to linux01, package the Spark Streaming app into a jar with mvn clean package -DskipTests, then launch spark-submit:

[centos01@linux01 spark-2.1.1-bin-hadoop2.7]$ spark-submit \
--name spark_flume \
--class com.fyj.spark.spark_flume \
--master local[*] \
--packages org.apache.spark:spark-streaming-flume_2.11:2.1.1 \
/home/centos01/modules/apache-flume-1.7.0-bin/test_dataSource/flume_spark/target/flume_spark-1.0-SNAPSHOT.jar \
linux01 41414

This step hit a few bugs, so no screenshot here; that's also why there was no update yesterday.


Approach 2: Pull-based Approach using a Custom Sink

[Figure: the pull-based integration section of the official docs]
The opposite of the push approach: here Spark Streaming pulls the data. Flume only pushes events into a buffer (the custom sink), and Spark Streaming uses a reliable Flume receiver to pull them out of that sink; a transaction completes only after the data has been received and replicated by Spark Streaming.
This makes the approach safer and more reliable than the first, with stronger fault-tolerance guarantees. The cost is that Flume has to be configured with a custom sink.

What we need: run a Flume agent on one machine, then have Spark Streaming access the custom sink that agent is running.

  1. First add the sink dependencies to the Spark Streaming pom (the same spark-streaming-flume-sink jar, plus its scala-library and commons-lang3 dependencies, also has to be on the Flume agent's classpath, or the agent can't load SparkSink)
		<dependency>
            <groupId>org.apache.commons</groupId>
            <artifactId>commons-lang3</artifactId>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-streaming-flume-sink_2.11</artifactId>
            <version>2.1.1</version>
        </dependency>
  2. Configure the Flume agent conf
# Name the components on this agent
simple-agent.sources = netcat-source
simple-agent.sinks = spark-sink
simple-agent.channels = memory-channel

# Describe/configure the source
simple-agent.sources.netcat-source.type = netcat
simple-agent.sources.netcat-source.bind = linux01
simple-agent.sources.netcat-source.port = 44444

# Describe the sink
simple-agent.sinks.spark-sink.type = org.apache.spark.streaming.flume.sink.SparkSink
simple-agent.sinks.spark-sink.hostname = linux01
simple-agent.sinks.spark-sink.port = 41414

# Use a channel which buffers events in memory
simple-agent.channels.memory-channel.type = memory
simple-agent.channels.memory-channel.capacity = 1000
simple-agent.channels.memory-channel.transactionCapacity = 100

# Bind the source and sink to the channel
simple-agent.sources.netcat-source.channels = memory-channel
simple-agent.sinks.spark-sink.channel = memory-channel
  3. Configure the Spark Streaming side

[Figure: Spark Streaming code for the pull-based version]
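The screenshot is missing, so here is a minimal sketch of the pull-based side; as with the push demo, the names and batch interval are assumptions. The essential difference is that the stream is created with FlumeUtils.createPollingStream, which polls the SparkSink configured above.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.flume.FlumeUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Minimal pull-mode wordcount sketch (names and interval assumed, not from the post).
object FlumePullWordCount {
  def main(args: Array[String]): Unit = {
    val Array(hostname, port) = args

    val sparkConf = new SparkConf().setAppName("FlumePullWordCount")
    val ssc = new StreamingContext(sparkConf, Seconds(5))

    // Pull events out of the SparkSink running inside the Flume agent.
    val flumeStream = FlumeUtils.createPollingStream(ssc, hostname, port.toInt)

    flumeStream.map(e => new String(e.event.getBody.array()).trim)
      .flatMap(_.split(" "))
      .map((_, 1))
      .reduceByKey(_ + _)
      .print()

    ssc.start()
    ssc.awaitTermination()
  }
}

Everything else (packaging, spark-submit, telnet input) works the same way as in the push demo.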
