1. pom.xml:
Note: several versions of the spark-streaming dependency were tried and failed to work; the version used in the snippet below is the one that finally built and ran.
<dependencies>
  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming_2.10</artifactId>
    <version>1.5.2</version>
  </dependency>
</dependencies>
<build>
  <plugins>
    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-shade-plugin</artifactId>
      <version>2.3</version>
      <executions>
        <execution>
          <phase>package</phase>
          <goals>
            <goal>shade</goal>
          </goals>
          <configuration>
            <filters>
              <filter>
                <artifact>*:*</artifact>
                <excludes>
                  <!-- strip signature files so the shaded jar is not rejected at runtime -->
                  <exclude>META-INF/*.SF</exclude>
                  <exclude>META-INF/*.DSA</exclude>
                </excludes>
              </filter>
            </filters>
            <transformers>
              <transformer implementation="org.apache.maven.plugins.shade.resource.AppendingTransformer">
                <resource>META-INF/spring.handlers</resource>
              </transformer>
              <transformer implementation="org.apache.maven.plugins.shade.resource.AppendingTransformer">
                <resource>META-INF/spring.schemas</resource>
              </transformer>
              <transformer implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
                <!-- Main-Class of the shaded jar; JavaStreaming is this project's main class -->
                <mainClass>JavaStreaming</mainClass>
              </transformer>
            </transformers>
            <!-- this element takes a boolean, not a class name -->
            <createDependencyReducedPom>true</createDependencyReducedPom>
          </configuration>
        </execution>
      </executions>
    </plugin>
  </plugins>
</build>
2. Spark Streaming online word count:
import java.util.Arrays;

import org.apache.spark.*;
import org.apache.spark.api.java.function.*;
import org.apache.spark.streaming.*;
import org.apache.spark.streaming.api.java.*;

import scala.Tuple2;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class JavaStreaming {
    private static final Logger logger = LoggerFactory.getLogger(JavaStreaming.class);

    public static void main(String[] args) throws InterruptedException {
        logger.info("start..");
        SparkConf conf = new SparkConf().setAppName("wordCountOnline");
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(10));

        // Spark Streaming input source; ipAddress and port are placeholders,
        // fill in the host and port of your data source
        JavaReceiverInputDStream<String> lines = jssc.socketTextStream(ipAddress, port);
        lines.count().print();

        // Split each line into words, returning an Iterable<String>
        System.out.println("words FlatMapFunction");
        JavaDStream<String> words = lines.flatMap(new FlatMapFunction<String, String>() {
            private static final long serialVersionUID = 1L;
            public Iterable<String> call(String line) {
                return Arrays.asList(line.split(" "));
            }
        });

        // Map each word to a count of 1
        JavaPairDStream<String, Integer> pairs = words.mapToPair(new PairFunction<String, String, Integer>() {
            private static final long serialVersionUID = 1L;
            public Tuple2<String, Integer> call(String word) throws Exception {
                return new Tuple2<String, Integer>(word, 1);
            }
        });

        // Sum the counts of identical words
        JavaPairDStream<String, Integer> wordCount = pairs.reduceByKey(new Function2<Integer, Integer, Integer>() {
            private static final long serialVersionUID = 1L;
            public Integer call(Integer v1, Integer v2) {
                return v1 + v2;
            }
        });

        wordCount.count().print();
        wordCount.print();

        jssc.start();
        jssc.awaitTermination();
    }
}
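Independent of Spark, the per-batch transformation chain above (split lines into words, pair each word with 1, sum by key) can be sketched with plain java.util.stream. This is only an illustrative stand-in for what each micro-batch computes; WordCountSketch and countWords are made-up names, not Spark API:

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class WordCountSketch {
    // One micro-batch: flatMap lines into words, then the mapToPair +
    // reduceByKey pair collapses to "map each word to 1 and sum by key".
    static Map<String, Integer> countWords(List<String> batch) {
        return batch.stream()
                .flatMap(line -> Arrays.stream(line.split(" ")))
                .collect(Collectors.toMap(w -> w, w -> 1, Integer::sum));
    }

    public static void main(String[] args) {
        // e.g. {a=1, b=2, c=1} for the two lines below (map order unspecified)
        System.out.println(countWords(Arrays.asList("a b", "b c")));
    }
}
```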
(1) When the Spark Streaming program above runs, its statements are not carried out sequentially from top to bottom; the work is executed in a distributed fashion.
(2) Port 9999 was tried first but turned out to be occupied; switching to 7777 succeeded.
(3) The real-time computation above runs only when the monitored endpoint produces new data, and each batch counts only the newly arrived portion.
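Because reduceByKey starts fresh every batch, cumulative totals across batches require a stateful operator (Spark Streaming provides updateStateByKey for this). The merge it maintains can be sketched in plain Java; RunningCounts is a hypothetical illustration, not part of Spark:

```java
import java.util.HashMap;
import java.util.Map;

public class RunningCounts {
    private final Map<String, Integer> total = new HashMap<>();

    // Merge one micro-batch's per-batch word counts into the running total.
    // This mirrors the state a stateful operator keeps across batches,
    // whereas plain reduceByKey only sees the current batch.
    public Map<String, Integer> update(Map<String, Integer> batchCounts) {
        for (Map.Entry<String, Integer> e : batchCounts.entrySet()) {
            total.merge(e.getKey(), e.getValue(), Integer::sum);
        }
        return total;
    }
}
```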
(4) To monitor a directory in real time instead, replace the statement JavaReceiverInputDStream<String> lines = jssc.socketTextStream(ipAddress, port) with:
JavaDStream<String> lines = jssc.textFileStream("hdfs:///user/data");
This monitors the target directory for changes. Likewise, the word-count pipeline runs only when files are added to that directory, computing in real time over the newly added files.
Note: use an absolute path, and all files in the directory must have the same format!
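textFileStream feeds each batch only with files that newly appear in the monitored directory. That pickup behavior can be sketched (outside Spark) as a tracker that remembers which files have already been seen; NewFileTracker is a made-up illustrative class:

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class NewFileTracker {
    private final Set<String> seen = new HashSet<>();

    // Returns only the files not seen in any earlier scan, mimicking how
    // textFileStream hands each batch just the files newly added to the
    // directory, never reprocessing ones from previous batches.
    public List<String> newFiles(Collection<String> listing) {
        List<String> fresh = new ArrayList<>();
        for (String f : listing) {
            if (seen.add(f)) {
                fresh.add(f);
            }
        }
        return fresh;
    }
}
```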
3. Submitting and launching the Spark job:
spark-submit --class JavaStreaming --master yarn-cluster spark-streaming-0.0.1-SNAPSHOT.jar
(Here JavaStreaming is the main class and spark-streaming-0.0.1-SNAPSHOT.jar is the jar built above.)
List Spark jobs:
yarn application -list
Kill a Spark job (note that yarn application -kill takes the application ID shown by the list command, not the job name):
yarn application -kill <application-id>