1. Integrating Spark Streaming with Flume
Flume is a distributed, reliable, and highly available system for collecting, aggregating, and transporting large volumes of log data. It can ingest source data in many forms (files, socket packets, etc.) and deliver the collected data to many external storage systems, including HDFS, HBase, Hive, and Kafka. The following sections show two ways to connect Flume directly to Spark Streaming: push and poll. (Spark Streaming version 1.6.1, Flume version 1.6.0.)
1.1 Push mode: Flume pushes to Spark Streaming
- Create a configuration file named flume-push.conf under the conf directory of the Flume installation, with the following content:
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# source
a1.sources.r1.type = spooldir
# Directory that Flume monitors for new log files
a1.sources.r1.spoolDir = /home/hadoop/flumespool
a1.sources.r1.fileHeader = true
# Describe the sink
a1.sinks.k1.type = avro
# The receiver: the IP address and port where the Spark Streaming app runs locally
a1.sinks.k1.hostname = 192.168.72.1
a1.sinks.k1.port = 8888
# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
- Write a Streaming word count whose data source is the events Flume pushes to port 8888, and run it in local mode:
import org.apache.spark.streaming.flume.FlumeUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.{SparkConf, SparkContext}

object FlumePushWordCount {
  def main(args: Array[String]): Unit = {
    LoggerLevels.setStreamingLogLevels()
    val conf = new SparkConf().setAppName("FlumePushWordCount").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(5))
    // Push mode: Flume sends data to Spark at this host and port
    val flumeStream = FlumeUtils.createStream(ssc, "192.168.72.1", 8888)
    // The actual payload of a Flume event is obtained via event.getBody()
    val words = flumeStream.flatMap(x => new String(x.event.getBody().array()).split(" ")).map((_, 1))
    val results = words.reduceByKey(_ + _)
    results.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
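The per-event decode-and-count logic in the flatMap/map/reduceByKey chain can be checked without a running Flume agent or Spark context. The sketch below is a hypothetical standalone helper (not part of the original program) that reproduces the same pipeline on plain Scala collections:

```scala
// Reproduces the streaming job's word-count pipeline on plain Scala
// collections: decode each event body, split on spaces, count words.
def countWords(eventBodies: Seq[Array[Byte]]): Map[String, Int] =
  eventBodies
    .flatMap(body => new String(body, "UTF-8").split(" "))
    .map(word => (word, 1))
    .groupBy(_._1)
    .map { case (word, pairs) => (word, pairs.map(_._2).sum) }

// Simulate two Flume event bodies as they would arrive in one batch.
val counts = countWords(Seq("hello tom".getBytes("UTF-8"),
                            "hello jerry".getBytes("UTF-8")))
println(counts)
```

This makes the transformation easy to unit-test before wiring it into the DStream.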
- Start the local Flume agent:
/home/hadoop/apps/flume-1.6.0/bin/flume-ng agent -n a1 -c conf -f /home/hadoop/apps/flume-1.6.0/conf/flume-push.conf
- Create a test file words.txt with the content below and place it in the monitored directory /home/hadoop/flumespool configured in flume-push.conf:
hello tom
hello jerry
hello tom
hello kitty
hello world
1,laozhao,18
2,laoduan,30
3,laomao,28
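Assuming all of words.txt arrives within one batch, the counts the job should print can be worked out with plain Scala, independent of the streaming job (a hypothetical check, not from the original walkthrough):

```scala
val lines = Seq(
  "hello tom", "hello jerry", "hello tom", "hello kitty", "hello world",
  "1,laozhao,18", "2,laoduan,30", "3,laomao,28")

// Same split-on-space logic as the streaming job; the comma-separated
// lines contain no spaces, so each one counts as a single "word".
val counts = lines.flatMap(_.split(" ")).groupBy(identity).map { case (w, ws) => (w, ws.size) }
println(counts)
```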
- Observe the console output: Flume successfully pushed the collected words.txt to Streaming, and Streaming completed the word count.
1.2 Poll mode: Spark Streaming pulls from Flume
- Create a configuration file named flume-poll.conf under the conf directory of the Flume installation, with the following content:
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# source
a1.sources.r1.type = spooldir
# Directory that Flume monitors for new log files
a1.sources.r1.spoolDir = /home/hadoop/flumespool
a1.sources.r1.fileHeader = true
# Describe the sink
a1.sinks.k1.type = org.apache.spark.streaming.flume.sink.SparkSink
# IP address of the Flume agent and the port the SparkSink listens on
a1.sinks.k1.hostname = 192.168.72.128
a1.sinks.k1.port = 8888
# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
- Because the sink class configured above is provided by Spark, you must place spark-streaming-flume-sink_2.10-1.6.1.jar into flume-1.6.0/lib/, along with commons-lang3-3.3.2.jar and scala-library-2.10.5.jar.
- Write a Streaming word count and run it in local mode:
import java.net.InetSocketAddress
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.flume.FlumeUtils

object FlumePollWordCount {
  def main(args: Array[String]): Unit = {
    LoggerLevels.setStreamingLogLevels()
    val conf = new SparkConf().setAppName("FlumePollWordCount").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(5))
    // Pull mode: Spark pulls data from the Flume agent at the address
    // and port configured in flume-poll.conf
    val address = Seq(new InetSocketAddress("192.168.72.128", 8888))
    val flumeStream = FlumeUtils.createPollingStream(ssc, address, StorageLevel.MEMORY_AND_DISK)
    // The actual payload of a Flume event is obtained via event.getBody()
    val words = flumeStream.flatMap(x => new String(x.event.getBody().array()).split(" ")).map((_, 1))
    val results = words.reduceByKey(_ + _)
    results.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
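One detail worth noting in both programs: event.getBody() returns a java.nio.ByteBuffer, and calling .array() hands back the entire backing array, which may be larger than (or offset from) the actual event payload. A more defensive decode (a hedged suggestion, not from the original post) reads only the buffer's remaining bytes:

```scala
import java.nio.ByteBuffer

// Decode only the bytes between position and limit, so a ByteBuffer
// backed by a larger or offset array still yields the right string.
def decodeBody(body: ByteBuffer): String = {
  val bytes = new Array[Byte](body.remaining())
  body.duplicate().get(bytes) // duplicate() leaves the original position untouched
  new String(bytes, "UTF-8")
}

// A buffer wrapping a slice of a larger array, as Avro deserialization can produce.
val backing = "xxhello flumeyy".getBytes("UTF-8")
val sliced = ByteBuffer.wrap(backing, 2, "hello flume".length)
println(decodeBody(sliced))
```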
- Start the Flume agent:
/home/hadoop/apps/flume-1.6.0/bin/flume-ng agent -n a1 -c conf -f /home/hadoop/apps/flume-1.6.0/conf/flume-poll.conf
- Create a test file words.txt with the content below and place it in the monitored directory /home/hadoop/flumespool configured in flume-poll.conf:
hadoop spark sqoop hadoop spark hive hadoop
- Observe the console output: Streaming successfully pulled words.txt from the Flume agent and completed the word count.
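As with the push example, the expected counts for this batch can be verified with plain Scala, independent of the cluster (a hypothetical check, not part of the original walkthrough):

```scala
val line = "hadoop spark sqoop hadoop spark hive hadoop"

// Same tokenization as the streaming job: split on spaces, then count.
val counts = line.split(" ").groupBy(identity).map { case (w, ws) => (w, ws.length) }
println(counts)
```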