Step 1: Read the weather data stored on HDFS
val rddall = sc.textFile("hdfs://hadoop01:9000/ncdc/197*/*")
rddall: org.apache.spark.rdd.RDD[String] = hdfs://hadoop01:9000/ncdc/* MapPartitionsRDD[93] at textFile at <console>:24
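Step 2: Use map on the RDD to build (year, temperature) pairs, keeping only readings that pass the quality check and are not 9999 (the missing-value marker)
scala> val result = rddall.map(x => (x.substring(15,19), if (x.substring(92,93).matches("[01459]")) { if (x.substring(87,88) == "+") { if (x.substring(88,92) != "9999") x.substring(88,92) else "" } else x.substring(87,92) } else ""))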
result: org.apache.spark.rdd.RDD[(String, String)] = MapPartitionsRDD[94] at map at <console>:26
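The field extraction in the map step can be checked without a cluster. The following is a minimal sketch in plain Scala, assuming the same offsets as above (year at 15-19, signed reading at 87-92, quality code at 92-93). `NcdcParse`, `parseRecord` and `sampleLine` are hypothetical names introduced for illustration, and the sample line is synthetic, not real NCDC data. Unlike the map above, it returns an `Int` and drops invalid records instead of emitting an empty string.

```scala
object NcdcParse {
  // Returns Some((year, temperature)) for a usable record, None otherwise.
  // Offsets are the ones used in the map step; they are an assumption
  // about the record layout, not something verified here.
  def parseRecord(line: String): Option[(String, Int)] =
    if (line.length < 93) None
    else {
      val year    = line.substring(15, 19)
      val reading = line.substring(87, 92) // sign plus four digits, e.g. "+0022"
      val quality = line.substring(92, 93)
      if (quality.matches("[01459]") && reading != "+9999")
        scala.util.Try(reading.stripPrefix("+").toInt).toOption.map(t => (year, t))
      else None
    }

  // Hypothetical helper: builds a synthetic 93-character line so that the
  // fields land at the offsets parseRecord expects.
  def sampleLine(year: String, reading: String, quality: String): String =
    ("0" * 15) + year + ("0" * 68) + reading + quality
}
```

For example, `NcdcParse.parseRecord(NcdcParse.sampleLine("1975", "-0011", "1"))` yields `Some(("1975", -11))`, while a `+9999` reading or a quality code outside `[01459]` yields `None`.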
Step 3: Apply reduceByKey to the result to get the maximum temperature for each year
scala> val resultAll = result.filter(_._2.trim.nonEmpty).mapValues(_.toInt).reduceByKey((x, y) => math.max(x, y)).collect
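The per-year maximum can also be tried locally. This is a small sketch using plain Scala collections as a stand-in for reduceByKey. `MaxByYear` and `maxByYear` are hypothetical names, and the input pairs are made up. It also illustrates why the comparison should be numeric: zero-padded strings compare lexicographically, so for two negative readings the string maximum is not the numeric one.

```scala
object MaxByYear {
  // Local stand-in for reduceByKey: group the pairs by year and keep
  // the largest temperature in each group.
  def maxByYear(pairs: Seq[(String, Int)]): Map[String, Int] =
    pairs.groupBy(_._1).map { case (year, rows) => year -> rows.map(_._2).max }

  // Lexicographic maximum of the raw strings, shown only for contrast.
  def stringMax(values: Seq[String]): String = values.max
}
```

With the made-up input `Seq(("1975", -11), ("1975", -22), ("1976", 5))`, `maxByYear` gives -11 for 1975 and 5 for 1976. By contrast, `MaxByYear.stringMax(Seq("-0011", "-0022"))` returns `"-0022"`, which is the colder reading.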
Of course, these three steps can also be combined into a single expression:
val rddall = sc.textFile("hdfs://hadoop01:9000/ncdc/197*/*").m