I. Integrating Spark Streaming with Flume
Spark 2.2.0 pairs with Flume 1.6.0.
Two integration modes:
1. Flume-style push-based approach:
Flume pushes data to Spark Streaming.
The Streaming receiver acts as an Avro agent for Flume.
A Spark worker should run on the same machine as this Flume agent.
Start the Streaming application first, so the receiver is already listening on the port before Flume pushes data.
Implementation:
Write the Flume configuration file:
netcat source -> memory channel -> avro sink
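A minimal sketch of such a Flume config, assuming a single agent named simple-agent running on localhost, with hypothetical ports (44444 for the netcat source, 41414 for the Avro sink; the Avro sink's hostname/port must match what the Streaming receiver listens on):

```
# Hypothetical names and ports; adjust to your environment
simple-agent.sources = netcat-source
simple-agent.sinks = avro-sink
simple-agent.channels = memory-channel

# netcat source: listens for lines of text on a TCP port
simple-agent.sources.netcat-source.type = netcat
simple-agent.sources.netcat-source.bind = localhost
simple-agent.sources.netcat-source.port = 44444

# avro sink: pushes events to the Spark Streaming receiver
simple-agent.sinks.avro-sink.type = avro
simple-agent.sinks.avro-sink.hostname = localhost
simple-agent.sinks.avro-sink.port = 41414

# memory channel connecting source and sink
simple-agent.channels.memory-channel.type = memory
simple-agent.sources.netcat-source.channels = memory-channel
simple-agent.sinks.avro-sink.channel = memory-channel
```

Because this is the push-based mode, the agent should be started only after the Streaming application is up and listening on the Avro sink's target port.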
Develop in IDEA:
Add the spark-streaming-flume dependency
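The Flume integration ships as a separate artifact from Spark core. Assuming a Maven project on a Scala 2.11 build of Spark 2.2.0, the dependency would look like:

```
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming-flume_2.11</artifactId>
    <version>2.2.0</version>
</dependency>
```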
The corresponding API is FlumeUtils.
Write the code:
import org.apache.spark.SparkConf
import org.apache.spark.streaming.flume.FlumeUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}

/** Spark Streaming integration with Flume, approach 1: push-based */
object FlumePushWordCount {
  def main(args: Array[String]): Unit = {
    // hostname and port are passed in as external arguments
    if (args.length != 2) {
      System.err.println("Usage: FlumePushWordCount <hostname> <port>")
      System.exit(1)
    }
    val Array(hostname, port) = args
    val sparkConf = new SparkConf().setMaster("local[2]").setAppName("FlumePushWordCount")
    val ssc = new StreamingContext(sparkConf, Seconds(5))
    // createStream returns an InputDStream of SparkFlumeEvents
    val flumeStream = FlumeUtils.createStream(ssc, hostname, port.toInt)
    // A Flume event has headers and a body; extract the body bytes and trim whitespace
    flumeStream.map(x => new String(x.event.getBody.array()).trim)
      .flatMap(_.split(" "))
      .map((_, 1))
      .reduceByKey(_ + _)
      .print()
    ssc.start()
    ssc.awaitTermination()
  }
}