Advantages of the Pull-based Approach
Spark Streaming can integrate with Flume in two ways:
- Push-based Approach
- Pull-based Approach
Which one should we choose in production? The official documentation explains the difference: in the pull-based approach, Flume pushes events into a custom SparkSink where they stay buffered, and Spark Streaming pulls them using a reliable receiver with transactions, so events are removed from the sink only after they have been received and replicated.
As a result, the pull-based approach offers stronger reliability and fault-tolerance guarantees than the push-based one, so we integrate using Pull.
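For reference, the two styles correspond to two different `FlumeUtils` factory methods. A minimal sketch, assuming an existing `StreamingContext` (the host and port values here are just this post's example values):

```scala
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.dstream.ReceiverInputDStream
import org.apache.spark.streaming.flume.{FlumeUtils, SparkFlumeEvent}

object FlumeStreams {

  // Push-based: Spark starts an Avro receiver on host:port and Flume's
  // avro sink pushes events to it; if the receiver dies, in-flight
  // events can be lost.
  def pushStream(ssc: StreamingContext, host: String, port: Int): ReceiverInputDStream[SparkFlumeEvent] =
    FlumeUtils.createStream(ssc, host, port)

  // Pull-based: Flume buffers events in a SparkSink; Spark polls the
  // sink and events are dropped only after a successful transaction.
  def pullStream(ssc: StreamingContext, host: String, port: Int): ReceiverInputDStream[SparkFlumeEvent] =
    FlumeUtils.createPollingStream(ssc, host, port)
}
```

Both methods return a `ReceiverInputDStream[SparkFlumeEvent]`, so the downstream transformation code is identical; only the delivery semantics differ.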
Configuring Flume
- Configure the `pom.xml` file:

```xml
<properties>
    <scala.version>2.11.12</scala.version>
    <spark.version>2.4.4</spark.version>
</properties>

<dependencies>
    <!-- Scala -->
    <dependency>
        <groupId>org.scala-lang</groupId>
        <artifactId>scala-library</artifactId>
        <version>${scala.version}</version>
        <!--<scope>provided</scope>-->
    </dependency>
    <!-- Spark Streaming -->
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-streaming_2.11</artifactId>
        <version>${spark.version}</version>
    </dependency>
    <!-- Spark Streaming Flume -->
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-streaming-flume_2.11</artifactId>
        <version>${spark.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-streaming-flume-sink_2.11</artifactId>
        <version>${spark.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.commons</groupId>
        <artifactId>commons-lang3</artifactId>
        <version>3.5</version>
    </dependency>
</dependencies>
```
- Create the Flume configuration file, named `streaming-flume-integration-pull.conf`:

```properties
integration-agent.sources = netcat_source
integration-agent.sinks = spark_sink
integration-agent.channels = memory_channel

# Netcat source listening on hlsijx:9999
integration-agent.sources.netcat_source.type = netcat
integration-agent.sources.netcat_source.bind = hlsijx
integration-agent.sources.netcat_source.port = 9999

# SparkSink buffers events until Spark Streaming pulls them
integration-agent.sinks.spark_sink.type = org.apache.spark.streaming.flume.sink.SparkSink
integration-agent.sinks.spark_sink.hostname = hlsijx
integration-agent.sinks.spark_sink.port = 11111

integration-agent.channels.memory_channel.type = memory

# Wire the source and sink to the channel
integration-agent.sources.netcat_source.channels = memory_channel
integration-agent.sinks.spark_sink.channel = memory_channel
```
Configuring the Spark Application
- Create a Scala object named `FlumePullWordCount`:

```scala
package com.hlsijx.spark.stream.wordcount

import org.apache.spark.SparkConf
import org.apache.spark.streaming.flume.FlumeUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}

object FlumePullWordCount {

  def main(args: Array[String]): Unit = {

    if (args.length != 2) {
      System.err.println("Usage: FlumePullWordCount <hostname> <port>")
      System.exit(1)
    }
    val Array(hostname, port) = args

    val sparkConf = new SparkConf()
    val ssc = new StreamingContext(sparkConf, Seconds(5))

    // Pull events from the Flume SparkSink in 5-second batches
    val lines = FlumeUtils.createPollingStream(ssc, hostname, port.toInt)

    // Event bodies are byte arrays: decode, split into words, and count
    val wordCount = lines.map(x => new String(x.event.getBody.array()).trim)
      .flatMap(_.split(" "))
      .map((_, 1))
      .reduceByKey(_ + _)

    wordCount.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```
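The per-batch transformation chain can be hard to follow in streaming code, so here is the same word-count logic as a plain-Scala sketch over an in-memory collection (Spark's `reduceByKey` is modeled with `groupBy` plus a per-group sum; `WordCountSketch` and `countWords` are illustrative names, not part of the project):

```scala
object WordCountSketch {

  // Mirror of the DStream pipeline: trim each line, split it into words,
  // pair every word with 1, then sum the counts per word.
  def countWords(lines: Seq[String]): Map[String, Int] =
    lines.map(_.trim)
      .flatMap(_.split(" "))
      .filter(_.nonEmpty)
      .map((_, 1))
      .groupBy(_._1)
      .map { case (word, pairs) => (word, pairs.map(_._2).sum) }

  def main(args: Array[String]): Unit = {
    // The same input used in the test step below: "a a a b"
    println(countWords(Seq("a a a b")))
  }
}
```

Feeding in the test line `a a a b` yields a count of 3 for `a` and 1 for `b`, matching what the streaming job prints per batch.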
- Package the project with `mvn clean package -DskipTests` and upload the jar to the server path /data/spark-2.4.4-bin-2.6.0-cdh5.15.1/lib
- Start the Flume agent:

```shell
flume-ng agent \
  --conf $FLUME_HOME/conf \
  --conf-file $FLUME_HOME/conf/streaming-flume-integration-pull.conf \
  --name integration-agent \
  -Dflume.root.logger=INFO,console
```
- Start the Spark application:

```shell
bin/spark-submit \
  --class com.hlsijx.spark.stream.wordcount.FlumePullWordCount \
  --master local[2] \
  --packages org.apache.spark:spark-streaming-flume_2.11:2.4.4 \
  /data/spark-2.4.4-bin-2.6.0-cdh5.15.1/lib/spark-1.0.jar hlsijx 11111
```
Testing
Start telnet with `telnet hlsijx 9999` and type `a a a b`; the word counts appear in the Spark console in real time.