1. Overview
This article covers the second stage of my Spark studies: a real-time count over log data using Spark Streaming + Flume + log4j. I ran into quite a few pitfalls along the way, so this is a summary of the setup for future reference.
2. End-to-End Demo
Demo environment: a 64-bit CentOS virtual machine.
Pipeline:
A. log4jTest produces the log4j.log file;
B. Flume tails log4j.log and forwards the data to an Avro sink (hostname + port);
C. FlumeTest receives the data from the Flume Avro sink, counts it, prints the result to the console, and saves it to HDFS.
3. Virtual Machine Configuration
Two settings deserve special attention:
a. The default number of processors seems to be 1; change it to 2!
b. My VM defaulted to 1 GB of RAM, which became very sluggish once Spark was installed; I first raised it to 1.5 GB, and by the time the demo finally ran end to end it was at 2 GB.
4. Log4jTest
4.1 Code: App.java
package com.leon;

import org.apache.log4j.*;

/**
 * Emits a random word through log4j every few milliseconds.
 */
public class App
{
    public static void main( String[] args ) throws InterruptedException
    {
        System.out.println("Hello World!");

        //PropertyConfigurator.configure("D:\\workroom\\test\\java_workspace\\logTest\\logTest\\log4j.properties");
        //PropertyConfigurator.configure("/mnt/hgfs/share/log4j.properties");
        PropertyConfigurator.configure("/home/hadoop/spark/flume-spark-demo/log4j.properties");
        Logger logger = Logger.getLogger(App.class);

        int dura, interval, i;
        dura = 6000 * 100;   // total run time in ms: 600,000 (about 10 minutes)
        interval = 10;       // sleep 10 ms between log lines
        i = 0;
        while (i < dura)
        {
            // Pick one of five words at random and log it
            String str = "";
            switch ((int) (Math.random() * 5))
            {
                case 0:
                    str = "hello";
                    break;
                case 1:
                    str = "world";
                    break;
                case 2:
                    str = "spark";
                    break;
                case 3:
                    str = "hadoop";
                    break;
                case 4:
                    str = "flume";
                    break;
            }
            logger.info(str);
            Thread.sleep(interval);
            i += interval;
        }
    }
}
4.2 log4j.properties
log4j.rootLogger=DEBUG, stdout, R
log4j.appender.stdout=org.apache.log4j.ConsoleAppender
log4j.appender.stdout.Threshold=DEBUG
log4j.appender.stdout.layout=org.apache.log4j.PatternLayout
log4j.appender.stdout.layout.ConversionPattern=%d[%t] %-5p %c - %m%n
log4j.appender.R=org.apache.log4j.RollingFileAppender
log4j.appender.R.Threshold=INFO
log4j.appender.R.File=/mnt/hgfs/share/spark_dir/data/log4j.log
log4j.appender.R.MaxFileSize=100MB
log4j.appender.R.layout=org.apache.log4j.PatternLayout
log4j.appender.R.layout.ConversionPattern=%d[%t] %-5p %c - %m%n
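One optional line not in the original config: RollingFileAppender keeps only a single backup file by default, so if the generator ever rolls past 100 MB more than once, older logs are discarded. MaxBackupIndex raises that limit (the value 5 here is arbitrary):
log4j.appender.R.MaxBackupIndex=5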
4.3 Notes
Once started, the program writes log lines to /mnt/hgfs/share/spark_dir/data/log4j.log at roughly 100 lines per second (one line per interval = 10 ms sleep; with dura = 600,000 the loop runs about 60,000 times, i.e. for about 10 minutes).
Run command (start-log):
-> java -Djava.ext.dirs=/mnt/hgfs/share/scala_dir/jar/1.2.17/ -cp logTest-1.0.jar com.leon.App
Note: the external libraries log4j-1.2.17.jar and log4j-1.2.17-sources.jar need to be added (only the former is actually required at runtime).
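Since -Djava.ext.dirs is a legacy mechanism, an equivalent invocation puts the log4j jar directly on the classpath; this is a sketch that assumes the jar file sits in the directory used above:
-> java -cp logTest-1.0.jar:/mnt/hgfs/share/scala_dir/jar/1.2.17/log4j-1.2.17.jar com.leon.App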
5. Flume Configuration
5.1 Configuration file: demo.conf
agent.sources=s1
agent.channels=c1
agent.sinks=k1
agent.sources.s1.type=exec
agent.sources.s1.channels=c1
agent.sources.s1.command=tail -F /mnt/hgfs/share/spark_dir/data/log4j.log
agent.channels.c1.type=file
agent.channels.c1.capacity=1000
agent.channels.c1.transactionCapacity=100
agent.sinks.k1.type=avro
agent.sinks.k1.hostname=localhost
agent.sinks.k1.port=11114
agent.sinks.k1.channel=c1
5.2 Notes
Data flows from the exec source s1, through the file channel c1, to the Avro sink k1 (localhost:11114).
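One detail worth knowing about the file channel: it persists its state on disk, by default under ~/.flume/file-channel. The directories can also be set explicitly via two standard FileChannel properties (optional; the paths below are only illustrative):
agent.channels.c1.checkpointDir=/home/hadoop/flume/checkpoint
agent.channels.c1.dataDirs=/home/hadoop/flume/data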
Run script (start-flume):
-> $FLUME_HOME/bin/flume-ng agent -n agent -c $FLUME_HOME/conf -f $FLUME_HOME/conf/demo.conf
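While debugging the agent it helps to see its own log on the console; the standard flume-ng way is to append a logger override to the same command:
-> $FLUME_HOME/bin/flume-ng agent -n agent -c $FLUME_HOME/conf -f $FLUME_HOME/conf/demo.conf -Dflume.root.logger=INFO,console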
6. FlumeTest
6.1 Code: FlumeTest.scala
package com.leon

import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.flume._

object FlumeTest {
  def main(args: Array[String]) {
    if (args.length < 2) {
      println("please enter host and port")
      System.exit(1)
    }
    val sc = new SparkConf().setAppName("FlumeEventCount")
    // Create a StreamingContext with a 20-second batch interval
    val ssc = new StreamingContext(sc, Seconds(20))
    val hostname = args(0)
    val port = args(1).toInt
    val storageLevel = StorageLevel.MEMORY_AND_DISK
    println(hostname + " " + port)
    // Receiver-based stream: Spark starts an Avro server on hostname:port
    // and the Flume avro sink pushes events to it
    val flumeStream = FlumeUtils.createStream(ssc, hostname, port, storageLevel)
    // Count the events in each batch and print the result to the console
    flumeStream.count().map(cnt => "Received " + cnt + " flume events.").print()
    // Also save each batch count; saveAsTextFiles creates one directory
    // per batch, named "<prefix>-<batch time in ms>"
    flumeStream.count().saveAsTextFiles("./flume-spark-demo/")
    //flumeStream.print()
    //flumeStream.saveAsTextFiles("file://mnt/hgfs/share/spark_dir/flumeStream")
    // Start the computation
    ssc.start()
    // Wait for it to finish (runs until killed)
    ssc.awaitTermination()
  }
}
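The demo above only counts events per batch. To get an actual word count over the log lines (the real-time statistics this article is after), the event bodies have to be decoded first. A minimal sketch, not part of the original demo, that could replace the two count() lines above:

    // Decode each Flume event body (the raw log4j line) into a string
    val lines = flumeStream.map { e =>
      val buf = e.event.getBody          // java.nio.ByteBuffer
      val bytes = new Array[Byte](buf.remaining())
      buf.get(bytes)
      new String(bytes, "UTF-8")
    }
    // Classic streaming word count, computed per 20-second batch
    lines.flatMap(_.split("\\s+"))
         .map(word => (word, 1))
         .reduceByKey(_ + _)
         .print()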
6.2 Notes
Run script:
-> $SPARK_HOME/bin/spark-submit --master local[2] --total-executor-cores 2 \
   --class com.leon.FlumeTest \
   --jars /mnt/hgfs/share/flume-spark-demo/jar/spark-streaming-flume-sink_2.11-1.6.0.jar,/mnt/hgfs/share/flume-spark-demo/jar/spark-streaming-flume_2.11-1.6.0.jar,/mnt/hgfs/share/flume-spark-demo/jar/flume-avro-source-1.5.0.jar,/mnt/hgfs/share/flume-spark-demo/jar/flume-ng-sdk-1.5.0.jar \
   --executor-memory 512m \
   /mnt/hgfs/share/flume-spark-demo/FlumeTest.jar localhost 11114
7. Pitfalls
7.1 Virtual machine firewall
It is best to disable it; otherwise connections to the Avro port (11114 here) may be blocked.
7.2 FlumeTest jobs never run
(Error screenshot omitted.)
I chased this problem for a long time, and even ran the demo on a Spark cluster, only to find the error persisted in local mode as well. http://bit1129.iteye.com/blog/2184467 describes the same symptom, but gives neither the cause nor a fix. The script that produced the error was:
-> $SPARK_HOME/bin/spark-submit --master local --total-executor-cores 2 \
   --class com.leon.FlumeTest \
   --jars /mnt/hgfs/share/flume-spark-demo/jar/spark-streaming-flume-sink_2.11-1.6.0.jar,/mnt/hgfs/share/flume-spark-demo/jar/spark-streaming-flume_2.11-1.6.0.jar,/mnt/hgfs/share/flume-spark-demo/jar/flume-avro-source-1.5.0.jar,/mnt/hgfs/share/flume-spark-demo/jar/flume-ng-sdk-1.5.0.jar \
   --executor-memory 512m \
   /mnt/hgfs/share/flume-spark-demo/FlumeTest.jar localhost 11114
Comparing this carefully with the script in section 6.2 reveals the problem: the --master setting must be local[2]. At the time I did not know why, but the Spark Streaming documentation explains it: FlumeUtils.createStream creates a receiver-based stream, and the receiver permanently occupies one thread, so with --master local (a single thread) nothing is left to process the received batches. local[2] leaves one thread free for processing, which is why it fixed the problem.
7.3 Missing libraries when running FlumeTest
The missing libraries must be added with --jars (see the spark-submit command in section 6.2 for the full list).