Case study: compute the total annual precipitation across Russia over 100+ years and list the ten years with the most precipitation.
Data description (sample record): 【20674 1936 1 1 0 -28.0 0 -24.9 0 -20.4 0 0.0 2 0 OOOO】
0. weather station ID
1. year
2. month
3. day
4. air temperature quality flag
5. daily minimum temperature
6. daily minimum temperature flag: 0 = normal, 1 = questionable, 9 = anomalous or missing
7. daily mean temperature
8. daily mean temperature flag: 0 = normal, 1 = questionable, 9 = anomalous or missing
9. daily maximum temperature
10. daily maximum temperature flag: 0 = normal, 1 = questionable, 9 = anomalous or missing
11. daily precipitation
12. daily precipitation flag: 0 = precipitation exceeds 0.1 mm, 1 = total accumulated over several days, 2 = no observation, 2 = precipitation below 0.1 mm
13. daily precipitation quality flag: 0 = normal, 1 = questionable, 9 = anomalous or missing
14. data flag: 4 characters; AAAA = new data specification in use; oooo = comparison value unchanged; RRRR = comparison value may change
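The 15 fields above can be pulled out of a raw record with a whitespace split, the same operation the cleaning step below relies on. A minimal pure-Scala sketch (no Spark required), using the sample line from the data description:

```scala
object ParseDemo {
  // Sample record copied from the data description above
  val sample = "20674 1936 1 1 0 -28.0 0 -24.9 0 -20.4 0 0.0 2 0 OOOO"

  def main(args: Array[String]): Unit = {
    // Split on runs of whitespace so multiple spaces act as one delimiter
    val fields = sample.trim.split("\\s+")
    println(fields.length) // 15
    println(fields(0))     // station ID: 20674
    println(fields(1))     // year: 1936
    println(fields(11))    // daily precipitation: 0.0
  }
}
```

Field index 11 (precipitation) and index 1 (year) are the two values the job actually consumes.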
Solution outline and steps:
Step 1: load the data and clean it (filter out records that fail the completeness and validity checks);
val conf = new SparkConf().setAppName("Precipitation").setMaster("local[2]")
val sc = new SparkContext(conf)
val load_rdd = sc.textFile("file:///F:\\测试数据\\ussr\\f*")
val clean_rdd = load_rdd.map(x => x.trim.split("\\s+"))  // split on runs of whitespace
  .filter(x => x.length == 15)      // completeness: exactly 15 fields
  .filter(x => x(13) != "9")        // drop rows whose precipitation quality flag is anomalous/missing
  .filter(x => x(11) != "999.9")    // drop missing precipitation values
  .filter(x => x(12) != "9")        // drop rows without a precipitation observation
Step 2: extract the year and precipitation, group by year, and sum the precipitation per year;
val year_jyl = clean_rdd.map(x => (x(1), x(11).toDouble))  // (year, daily precipitation)
  .groupByKey()
  .map(x => (x._1, x._2.reduce(_ + _)))                    // total precipitation per year
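groupByKey ships every (year, precipitation) pair across the network before summing; reduceByKey combines values map-side first and is usually the more efficient choice for a plain sum (in Spark: clean_rdd.map(x => (x(1), x(11).toDouble)).reduceByKey(_ + _)). The per-key summation logic itself can be checked locally on plain Scala collections; the sample values below are hypothetical, not real station data:

```scala
object SumByYearDemo {
  def main(args: Array[String]): Unit = {
    // Hypothetical (year, dailyPrecipitation) pairs standing in for clean_rdd's output
    val pairs = List(("1936", 1.2), ("1936", 0.0), ("1937", 3.4), ("1937", 0.6))

    // groupByKey-then-reduce, mirroring the code above: build full value lists, then sum
    val grouped = pairs.groupBy(_._1).map { case (y, vs) => (y, vs.map(_._2).reduce(_ + _)) }

    // reduceByKey-style fold: combine values as they arrive, no intermediate lists
    val reduced = pairs.foldLeft(Map.empty[String, Double]) { case (acc, (y, v)) =>
      acc.updated(y, acc.getOrElse(y, 0.0) + v)
    }

    assert(grouped == reduced)    // both strategies produce the same totals
    println(reduced("1936"))      // 1.2
    println(reduced("1937"))      // 4.0
  }
}
```

Both paths agree on the result; they differ only in how much intermediate data is materialized, which is what makes reduceByKey cheaper at cluster scale.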
Step 3: sort by total precipitation in descending order
val sort_jyl = year_jyl.map(x => (x._2, x._1))  // swap so precipitation becomes the key
  .sortByKey(false)                             // false = descending
  .map(x => (x._2, x._1))                       // swap back to (year, precipitation)
Step 4: take the ten years with the most precipitation
// Note: repartition(1) triggers a full shuffle, which is not guaranteed to preserve the
// sort order; collecting with sort_jyl.take(10) on the driver is the safer top-N idiom.
val top10 = sort_jyl.repartition(1).zipWithIndex().filter(x => x._2 < 10)
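The sort-descending-then-take-the-first-ten logic, including the key-swap trick from Step 3, can be verified locally on plain collections. The (year, total) pairs below are hypothetical stand-ins for year_jyl:

```scala
object TopNDemo {
  def main(args: Array[String]): Unit = {
    // Hypothetical (year, totalPrecipitation) pairs standing in for year_jyl
    val totals = List(("1936", 410.5), ("1937", 388.2), ("1938", 501.9), ("1939", 455.0))

    // Mirror of the swap / sortByKey(false) / swap-back pattern, then a top-N cut
    val top2 = totals.map(_.swap)   // (total, year): total becomes the sort key
      .sortBy(-_._1)                // negate for descending order
      .map(_.swap)                  // back to (year, total)
      .take(2)

    println(top2) // List((1938,501.9), (1939,455.0))
  }
}
```

On an RDD the same shape is sort_jyl followed by take(10), which avoids the extra repartition shuffle entirely.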
Step 5: write the final result out (the sample path below is local; use an hdfs:// URI to write to HDFS)
top10.saveAsTextFile("file:///D://out9")
Inspecting the lineage (dependency) graph:
via the Web UI on port 4040, or by printing top10.toDebugString
(1) MapPartitionsRDD[20] at filter at jsl_Demo.scala:19 []
| ZippedWithIndexRDD[19] at zipWithIndex at jsl_Demo.scala:19 []
| MapPartitionsRDD[18] at repartition at jsl_Demo.scala:19 []
| CoalescedRDD[17] at repartition at jsl_Demo.scala:19 []
| ShuffledRDD[16] at repartition at jsl_Demo.scala:19 []
+-(122) MapPartitionsRDD[15] at repartition at jsl_Demo.scala:19 []
| MapPartitionsRDD[14] at map at jsl_Demo.scala:18 []
| ShuffledRDD[13] at sortByKey at jsl_Demo.scala:17 []
+-(223) MapPartitionsRDD[10] at map at jsl_Demo.scala:16 []
| MapPartitionsRDD[9] at map at jsl_Demo.scala:15 []
| ShuffledRDD[8] at groupByKey at jsl_Demo.scala:14 []
+-(223) MapPartitionsRDD[7] at map at jsl_Demo.scala:13 []
| MapPartitionsRDD[6] at filter at jsl_Demo.scala:12 []
| MapPartitionsRDD[5] at filter at jsl_Demo.scala:11 []
| MapPartitionsRDD[4] at filter at jsl_Demo.scala:10 []
| MapPartitionsRDD[3] at filter at jsl_Demo.scala:9 []
| MapPartitionsRDD[2] at map at jsl_Demo.scala:8 []
| file:///F:\测试数据\ussr\f* MapPartitionsRDD[1] at textFile at jsl_Demo.scala:7 []
| file:///F:\测试数据\ussr\f* HadoopRDD[0] at textFile at jsl_Demo.scala:7 []