1. Integrating Spark Streaming with Flume
Flume is a distributed, reliable, and highly available system for collecting, aggregating, and transporting large volumes of log data. It can ingest source data in many forms (files, socket packets, etc.) and deliver the collected data to many external storage systems, including HDFS, HBase, Hive, and Kafka. The following sections show two ways to connect Flume directly to Spark Streaming: push and poll. (Spark Streaming version 1.6.1, Flume version 1.6.0.)
1.1 Push mode: Flume pushes to Spark Streaming
- Create a configuration file named flume-push.conf under the conf directory of the Flume installation, with the following content:
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# source
a1.sources.r1.type = spooldir
# Directory that Flume monitors for new log files
a1.sources.r1.spoolDir = /home/hadoop/flumespool
a1.sources.r1.fileHeader = true
# Describe the sink
a1.sinks.k1.type = avro
# The receiver: the IP address and port where the Spark Streaming app runs locally
a1.sinks.k1.hostname = 192.168.72.1
a1.sinks.k1.port = 8888
# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
- Write a Streaming word count whose data source is the events Flume pushes to port 8888, and run it in local mode:
import org.apache.spark.streaming.flume.FlumeUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.{SparkConf, SparkContext}

object FlumePushWordCount {
  def main(args: Array[String]): Unit = {
    LoggerLevels.setStreamingLogLevels()
    val conf = new SparkConf().setAppName("FlumePushWordCount").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(5))
    // Push mode: Flume sends data to Spark at this host and port
    val flumeStream = FlumeUtils.createStream(ssc, "192.168.72.1", 8888)
    // The actual payload of a Flume event is obtained via event.getBody()
    val words = flumeStream.flatMap(x => new String(x.event.getBody().array()).split(" ")).map((_, 1))
    val results = words.reduceByKey(_ + _)
    results.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
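The per-event decode-and-count logic in the flatMap/map/reduceByKey chain can be checked without a running Flume agent or Spark context. The sketch below is a hypothetical standalone helper (not part of the original program) that reproduces the same pipeline on plain Scala collections:

```scala
// Reproduces the streaming job's word-count pipeline on plain Scala
// collections: decode each event body, split on spaces, count words.
def countWords(eventBodies: Seq[Array[Byte]]): Map[String, Int] =
  eventBodies
    .flatMap(body => new String(body, "UTF-8").split(" "))
    .map(word => (word, 1))
    .groupBy(_._1)
    .map { case (word, pairs) => (word, pairs.map(_._2).sum) }

// Simulate two Flume event bodies as they would arrive in one batch.
val counts = countWords(Seq("hello tom".getBytes("UTF-8"),
                            "hello jerry".getBytes("UTF-8")))
println(counts)
```

This makes the transformation easy to unit-test before wiring it into the DStream.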
- Start the local Flume agent:
/home/hadoop/apps/flume-1.6.0/bin/flume-ng agent -n a1 -c conf -f /home/hadoop/apps/flume-1.6.0/conf/flume-push.conf
- Create a test file words.txt with the content below and place it in the monitored directory /home/hadoop/flumespool configured in flume-push.conf:
hello tom
hello jerry
hello tom
hello kitty
hello world
1,laozhao,18
2,laoduan,30
3,laomao,28
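Assuming all of words.txt arrives within one batch, the counts the job should print can be worked out with plain Scala, independent of the streaming job (a hypothetical check, not from the original walkthrough):

```scala
val lines = Seq(
  "hello tom", "hello jerry", "hello tom", "hello kitty", "hello world",
  "1,laozhao,18", "2,laoduan,30", "3,laomao,28")

// Same split-on-space logic as the streaming job; the comma-separated
// lines contain no spaces, so each one counts as a single "word".
val counts = lines.flatMap(_.split(" ")).groupBy(identity).map { case (w, ws) => (w, ws.size) }
println(counts)
```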
- Observe the console output: Flume successfully pushed the collected words.txt to Streaming, and Streaming completed the word count.
1.2 Poll mode: Spark Streaming pulls from Flume
- Create a configuration file named flume-poll.conf under the conf directory of the Flume installation, with the following content:
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# source
a1.sources.r1.type = spooldir
# Directory that Flume monitors for new log files
a1.sources.r1.spoolDir = /home/hadoop/flumespool
a1.sources.r1.fileHeader = true
# Describe the sink
a1.sinks.k1.type = org.apache.spark.streaming.flume.sink.SparkSink
# IP address of the Flume agent and the port the SparkSink listens on
a1.sinks.k1.hostname = 192.168.72.128
a1.sinks.k1.port = 8888
# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
- Because the sink class configured above is provided by Spark, you must place spark-streaming-flume-sink_2.10-1.6.1.jar into flume-1.6.0/lib/, along with commons-lang3-3.3.2.jar and scala-library-2.10.5.jar.
- Write a Streaming word count and run it in local mode:
import java.net.InetSocketAddress
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.flume.FlumeUtils

object FlumePollWordCount {
  def main(args: Array[String]): Unit = {
    LoggerLevels.setStreamingLogLevels()
    val conf = new SparkConf().setAppName("FlumePollWordCount").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(5))
    // Pull mode: Spark pulls data from the Flume agent at the address
    // and port configured in flume-poll.conf
    val address = Seq(new InetSocketAddress("192.168.72.128", 8888))
    val flumeStream = FlumeUtils.createPollingStream(ssc, address, StorageLevel.MEMORY_AND_DISK)
    // The actual payload of a Flume event is obtained via event.getBody()
    val words = flumeStream.flatMap(x => new String(x.event.getBody().array()).split(" ")).map((_, 1))
    val results = words.reduceByKey(_ + _)
    results.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
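One detail worth noting in both programs: event.getBody() returns a java.nio.ByteBuffer, and calling .array() hands back the entire backing array, which may be larger than (or offset from) the actual event payload. A more defensive decode (a hedged suggestion, not from the original post) reads only the buffer's remaining bytes:

```scala
import java.nio.ByteBuffer

// Decode only the bytes between position and limit, so a ByteBuffer
// backed by a larger or offset array still yields the right string.
def decodeBody(body: ByteBuffer): String = {
  val bytes = new Array[Byte](body.remaining())
  body.duplicate().get(bytes) // duplicate() leaves the original position untouched
  new String(bytes, "UTF-8")
}

// A buffer wrapping a slice of a larger array, as Avro deserialization can produce.
val backing = "xxhello flumeyy".getBytes("UTF-8")
val sliced = ByteBuffer.wrap(backing, 2, "hello flume".length)
println(decodeBody(sliced))
```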
- Start the Flume agent:
/home/hadoop/apps/flume-1.6.0/bin/flume-ng agent -n a1 -c conf -f /home/hadoop/apps/flume-1.6.0/conf/flume-poll.conf
- Create a test file words.txt with the content below and place it in the monitored directory /home/hadoop/flumespool configured in flume-poll.conf:
hadoop spark sqoop hadoop spark hive hadoop
- Observe the console output: Streaming successfully pulled words.txt from the Flume agent and completed the word count.
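As with the push example, the expected counts for this batch can be verified with plain Scala, independent of the cluster (a hypothetical check, not part of the original walkthrough):

```scala
val line = "hadoop spark sqoop hadoop spark hive hadoop"

// Same tokenization as the streaming job: split on spaces, then count.
val counts = line.split(" ").groupBy(identity).map { case (w, ws) => (w, ws.length) }
println(counts)
```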