There are two ways to integrate Spark Streaming with Flume.
Approach 1: Flume-style Push-based Approach. Here Flume pushes data to the application: Spark Streaming starts a receiver that looks to Flume like an Avro agent, and Flume's avro sink delivers events to it.
pom file dependencies (excerpt; the two Spark artifacts should share one version, aligned here with the 2.2.0 that spark-submit uses below)
<dependencies>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-streaming-flume_2.11</artifactId>
        <version>2.2.0</version>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-streaming_2.11</artifactId>
        <version>2.2.0</version>
    </dependency>
</dependencies>
<!-- packaging -->
<build>
    <plugins>
        <plugin>
            <artifactId>maven-compiler-plugin</artifactId>
            <version>2.3.2</version>
            <configuration>
                <source>1.8</source>
                <target>1.8</target>
            </configuration>
        </plugin>
        <plugin>
            <artifactId>maven-assembly-plugin</artifactId>
            <configuration>
                <descriptorRefs>
                    <descriptorRef>jar-with-dependencies</descriptorRef>
                </descriptorRefs>
            </configuration>
            <executions>
                <execution>
                    <id>make-assembly</id>
                    <phase>package</phase>
                    <goals>
                        <goal>single</goal>
                    </goals>
                </execution>
            </executions>
        </plugin>
    </plugins>
</build>
Flume agent configuration for the push-based integration
[root@hadoop1 conf]# vim flume_push_streaming.conf
# define the agent's source, sink, and channel
simple-agent.sources = netcat-source
simple-agent.sinks = avro-sink
simple-agent.channels = memory-channel

# netcat source: listens for text lines on hadoop1.x:44444
simple-agent.sources.netcat-source.type = netcat
simple-agent.sources.netcat-source.bind = hadoop1.x
simple-agent.sources.netcat-source.port = 44444

# avro sink: pushes events to the Spark Streaming receiver
# (hostname/port must match what the application binds to below)
simple-agent.sinks.avro-sink.type = avro
simple-agent.sinks.avro-sink.hostname = 192.168.126.171
simple-agent.sinks.avro-sink.port = 41414

simple-agent.channels.memory-channel.type = memory

# wire the source and sink to the channel
simple-agent.sources.netcat-source.channels = memory-channel
simple-agent.sinks.avro-sink.channel = memory-channel
Spark Streaming application code
package com.imooc.spark

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.flume.FlumeUtils

/**
 * Spark Streaming + Flume integration, approach 1: the push-based approach.
 */
object FlumePushWordCount {
  def main(args: Array[String]): Unit = {
    // receiver host/port; must match the avro sink's hostname and port
    val Array(hostname, port) = args

    // master and app name are supplied by spark-submit in production
    val sparkConf = new SparkConf() //.setMaster("local[2]").setAppName("FlumePushWordCount")
    val ssc = new StreamingContext(sparkConf, Seconds(5))

    // start an Avro receiver that Flume's avro sink pushes events into
    val flumeStream = FlumeUtils.createStream(ssc, hostname, port.toInt)

    // a Flume event body is a byte buffer; decode it to text before counting
    flumeStream.map(x => new String(x.event.getBody.array()).trim)
      .flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _).print()

    ssc.start()
    ssc.awaitTermination()
  }
}
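One fragile spot in the code above: the bare pattern match val Array(hostname, port) = args throws a MatchError when the program arguments are missing. A minimal guard, sketched here and not part of the original code, fails fast with a usage hint instead:

// hypothetical guard for the top of main: validate the arguments before use
if (args.length != 2) {
  System.err.println("Usage: FlumePushWordCount <hostname> <port>")
  System.exit(1)
}
val Array(hostname, port) = args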
Running the job (the hostname and port are passed as program arguments)
Start the Flume agent on the VM. With the push-based approach the Spark Streaming receiver should be up before data flows, since the avro sink connects to it and keeps retrying until it can:
[root@hadoop1 bin]# ./flume-ng agent --name simple-agent --conf $FLUME_HOME/conf --conf-file $FLUME_HOME/conf/flume_push_streaming.conf -Dflume.root.logger=INFO,console
(Screenshot of the agent startup log omitted.)
Run the program and the word counts are printed to the console. Test input can be fed through the netcat source, for example with telnet hadoop1.x 44444, then typing a few words.
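Each 5-second batch is summarized on the driver console by print(). The output looks roughly like the following (illustrative only; it assumes the line "hello world hello" was typed into the telnet session):

-------------------------------------------
Time: 1586761200000 ms
-------------------------------------------
(hello,2)
(world,1)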
To submit to a production environment, the application must first be packaged (with the assembly setup above, mvn clean package produces the jar).
Upload the jar to the lib directory yourself:
[root@hadoop1 lib]# rz -be
rz waiting to receive.
Starting zmodem transfer. Press Ctrl+C to cancel.
Transferring sparktrain-1.0-SNAPSHOT.jar...
100% 7 KB 7 KB/sec 00:00:01 0 Errors
[root@hadoop1 lib]# pwd
/home/hadoop/lib
[root@hadoop1 lib]# ll
total 8
-rw-r--r--. 1 root root 7608 Apr 12 13:43 sparktrain-1.0-SNAPSHOT.jar
[root@hadoop1 lib]#
The following processes are now running:
[root@hadoop1 spark]# jps
12801 ResourceManager
12930 NodeManager
12470 DataNode
12646 SecondaryNameNode
13750 Jps
12071 Application
12330 NameNode
Submit the job with spark-submit. The --packages flag resolves spark-streaming-flume and its transitive dependencies from a repository at launch (via Ivy, as the log below shows), which is why the 7 KB application jar is enough:
[root@hadoop1 spark]# spark-submit --class com.imooc.spark.FlumePushWordCount --master local[2] --packages org.apache.spark:spark-streaming-flume_2.11:2.2.0 /home/hadoop/sparktrain-1.0-SNAPSHOT.jar hadoop1.x 41414
Ivy Default Cache set to: /root/.ivy2/cache
The jars for the packages stored in: /root/.ivy2/jars
:: loading settings :: url = jar:file:/usr/local/etc/hadoop/module/spark/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
org.apache.spark#spark-streaming-flume_2.11 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent;1.0
An error occurs:
20/04/13 16:33:29 INFO SparkContext: Successfully stopped SparkContext
Exception in thread "main" java.io.FileNotFoundException: File does not exist: hdfs://hadoop1.x:9000/directory
at org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(DistributedFileSystem.java:1309)
at org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(DistributedFileSystem.java:1301)
at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1317)
at org.apache.spark.scheduler.EventLoggingListener.start(EventLoggingListener.scala:93)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:531)
at org.apache.spark.streaming.StreamingContext$.createNewSparkContext(StreamingContext.scala:836)
at org.apache.spark.streaming.StreamingContext.<init>(StreamingContext.scala:84)
at com.imooc.spark.FlumePushWordCount$.main(FlumePushWordCount.scala:15)
at com.imooc.spark.FlumePushWordCount.main(FlumePushWordCount.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:743)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:187)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:212)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:126)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
20/04/13 16:33:29 INFO ShutdownHookManager: Shutdown hook called
20/04/13 16:33:29 INFO ShutdownHookManager: Deleting directory /tmp/spark-1c4c0da4-e097-48c2-9e91-b97d81965a0a
To be resolved. A likely diagnosis from the stack trace: the failure happens in EventLoggingListener.start, which suggests Spark's event logging (spark.eventLog.enabled) is on and spark.eventLog.dir points to hdfs://hadoop1.x:9000/directory, a path that does not exist on HDFS. Creating it first (hdfs dfs -mkdir -p /directory) or disabling event logging in spark-defaults.conf should allow the job to start.