Data filtering and cleaning
// textFile() loads the data
val data = sc.textFile("/spark/seven.txt")
// Drop empty lines, then keep only lines that contain a GET or POST request and the HTTP/1.0 marker
val filtered = data.filter(_.length > 0).filter(line => (line.contains("GET") || line.contains("POST")) && line.contains("HTTP/1.0"))
// Map each line to a (request, 1) key-value pair
val res = filtered.map(line => {
  if (line.contains("GET")) { // extract the substring from GET up to the protocol marker
    (line.substring(line.indexOf("GET"), line.indexOf("HTTP/1.0")).trim, 1)
  } else { // extract the substring from POST up to the protocol marker
    (line.substring(line.indexOf("POST"), line.indexOf("HTTP/1.0")).trim, 1)
  }
}).reduceByKey(_ + _) // sum the counts per request
// collect() is an action, so it triggers execution
res.collect()
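String.substring throws when an index is -1 or when the end precedes the start, so a malformed log line can abort the whole job. The following is a minimal sketch of a more defensive extraction, not the original author's code: `extractRequest` and the sample log lines are hypothetical and introduced only for illustration, and a plain Scala Seq stands in for the RDD so the snippet runs without a cluster. Unlike the code above, it starts at whichever of GET or POST occurs first.

```scala
// Hypothetical helper: returns the "METHOD url" part of a log line, or None if it is malformed.
def extractRequest(line: String): Option[String] = {
  val end = line.indexOf("HTTP/1.0")
  // Earliest position of either method name; indexes of -1 (not found) are dropped.
  val start = Seq(line.indexOf("GET"), line.indexOf("POST")).filter(_ >= 0).sorted.headOption
  // Only cut the substring when the method really precedes the protocol marker.
  start.filter(_ < end).map(s => line.substring(s, end).trim)
}

// Hypothetical sample lines, not taken from the source data.
val lines = Seq(
  """127.0.0.1 - - [03/Mar/2012] "GET /index.html HTTP/1.0" 200 1024""",
  """127.0.0.1 - - [03/Mar/2012] "POST /login HTTP/1.0" 302 0""",
  """127.0.0.1 - - [03/Mar/2012] "GET /index.html HTTP/1.0" 200 1024""",
  "malformed line without a request"
)

// Count occurrences per request, mirroring map + reduceByKey on local data.
val counts: Map[String, Int] =
  lines.flatMap(l => extractRequest(l).toList).groupBy(identity).map { case (k, v) => (k, v.size) }

counts.foreach(println)
```

On an RDD the same helper could plausibly be applied as `data.flatMap(l => extractRequest(l).toList).map((_, 1)).reduceByKey(_ + _)`.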
Data deduplication
Raw data
file1      | file2
2012-3-1 a | 2012-3-1 b
2012-3-2 b | 2012-3-2 a
2012-3-3 c | 2012-3-3 b
2012-3-4 d | 2012-3-4 d
2012-3-5 a | 2012-3-5 a
2012-3-6 b | 2012-3-6 c
2012-3-7 c | 2012-3-7 d
2012-3-3 c | 2012-3-3 c
Expected output:
2012-3-1 a
2012-3-1 b
2012-3-2 a
2012-3-2 b
2012-3-3 b
2012-3-3 c
2012-3-4 d
2012-3-5 a
2012-3-6 b
2012-3-6 c
2012-3-7 c
2012-3-7 d
The goal of deduplication is that any record appearing more than once in the raw data appears exactly once in the output.
val two = sc.textFile("/tmp/spark/two")
two.filter(_.trim().length>0).map(line=>(line.trim,"")).groupByKey().sortByKey().keys.collect.foreach(println _)
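To check the expected output without a cluster, the same steps (trim, drop blanks, keep one copy of each record, sort) can be reproduced on plain Scala collections. This is only an illustrative sketch; the `file1`, `file2`, and `deduped` names are introduced here, and the records are copied from the sample data above.

```scala
// Sample records copied from file1 and file2.
val file1 = Seq("2012-3-1 a", "2012-3-2 b", "2012-3-3 c", "2012-3-4 d",
                "2012-3-5 a", "2012-3-6 b", "2012-3-7 c", "2012-3-3 c")
val file2 = Seq("2012-3-1 b", "2012-3-2 a", "2012-3-3 b", "2012-3-4 d",
                "2012-3-5 a", "2012-3-6 c", "2012-3-7 d", "2012-3-3 c")

// Trim, drop blank lines, keep one copy of each record, then sort as strings.
val deduped: Seq[String] = (file1 ++ file2).map(_.trim).filter(_.nonEmpty).distinct.sorted

deduped.foreach(println)
```

RDDs provide a `distinct()` method as well, so on the cluster the chain could probably be written as `two.map(_.trim).filter(_.nonEmpty).distinct().sortBy(x => x)` instead of grouping by key.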
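Finding the maximum temperature per year
Raw data
0067011990999991950051507004888888889999999N9+00001+9999999999999999999999
0067011990999991950051512004888888889999999N9+00221+9999999999999999999999
0067011990999991950051518004888888889999999N9-00111+9999999999999999999999
0067011990999991949032412004888888889999999N9+01111+9999999999999999999999
0067011990999991950032418004888888880500001N9+00001+9999999999999999999999
0067011990999991950051507004888888880500001N9+00781+9999999999999999999999
Field layout (zero-based offsets, matching the substring calls in the code):
Characters 15-18 hold the year.
Characters 45-49 hold the temperature: a sign (+ for above zero, - for below zero) followed by four digits. The value 9999 marks invalid data and must be excluded.
Character 50 is the quality code; only the values 0, 1, 4, 5, and 9 are accepted.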
val one = sc.textFile("/tmp/hadoop/one")
// Skip lines too short to hold the quality code, then keep only records
// with a valid temperature and an accepted quality code
val yearAndTemp = one.filter(_.length > 50).filter(line => {
  val quality = line.substring(50, 51)
  var airTemperature = 0
  if (line.charAt(45) == '+') { // skip the explicit plus sign before parsing
    airTemperature = line.substring(46, 50).toInt
  } else {
    airTemperature = line.substring(45, 50).toInt
  }
  airTemperature != 9999 && quality.matches("[01459]")
}).map {
  line => {
    val year = line.substring(15, 19)
    var airTemperature = 0
    if (line.charAt(45) == '+') {
      airTemperature = line.substring(46, 50).toInt
    } else {
      airTemperature = line.substring(45, 50).toInt
    }
    (year, airTemperature)
  }
}
// Keep the larger of the two temperatures for each year
val res = yearAndTemp.reduceByKey(
  (x, y) => if (x > y) x else y
)
res.collect.foreach(x => println("year : " + x._1 + ", max : " + x._2))
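The filter and map steps above parse the temperature twice. As a sketch of one possible refactoring (the `parseRecord` helper is hypothetical, not part of the original code), the parsing can be done once in a function that returns an Option, shown here on the six sample records with plain Scala collections in place of the RDD.

```scala
// Hypothetical helper: parse one record into (year, temperature), or None if it must be skipped.
def parseRecord(line: String): Option[(String, Int)] = {
  if (line.length < 51) None // too short to contain the quality code at offset 50
  else {
    val year = line.substring(15, 19)
    // Skip an explicit '+'; keep '-' so that toInt yields a negative value.
    val temp =
      if (line.charAt(45) == '+') line.substring(46, 50).toInt
      else line.substring(45, 50).toInt
    val quality = line.substring(50, 51)
    if (temp != 9999 && quality.matches("[01459]")) Some((year, temp)) else None
  }
}

// The six sample records from the raw data above.
val records = Seq(
  "0067011990999991950051507004888888889999999N9+00001+9999999999999999999999",
  "0067011990999991950051512004888888889999999N9+00221+9999999999999999999999",
  "0067011990999991950051518004888888889999999N9-00111+9999999999999999999999",
  "0067011990999991949032412004888888889999999N9+01111+9999999999999999999999",
  "0067011990999991950032418004888888880500001N9+00001+9999999999999999999999",
  "0067011990999991950051507004888888880500001N9+00781+9999999999999999999999"
)

// Group the parsed pairs by year and keep the largest temperature in each group.
val maxByYear: Map[String, Int] =
  records.flatMap(r => parseRecord(r).toList).groupBy(_._1).map { case (year, rs) => (year, rs.map(_._2).max) }

maxByYear.foreach { case (year, max) => println("year : " + year + ", max : " + max) }
```

For the sample records this yields a maximum of 78 for 1950 and 111 for 1949. On an RDD, the helper could presumably replace both steps with `one.flatMap(l => parseRecord(l).toList).reduceByKey((x, y) => if (x > y) x else y)`.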
Data sorting
Input files
file1 | file2 | file3
2     | 5956  | 26
32    | 22    | 54
654   | 650   | 6
32    | 92    |
15    |       |
756   |       |
65223 |       |
Sample output (rank, value):
1 2
2 6
3 15
4 22
5 26
6 32
7 32
8 54
9 92
10 650
11 654
12 756
13 5956
14 65223
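val three = sc.textFile("/tmp/spark/three", 3)
// The input files produce several partitions. A counter incremented inside a closure is
// unreliable because each task works on its own copy, so this version replaces the
// HashPartitioner(1) plus mutable-counter approach with sortBy and zipWithIndex:
// sortBy gives a global order and zipWithIndex assigns consecutive zero-based positions
// across partitions, so adding 1 gives the rank.
val ranked = three.filter(_.trim().length > 0).map(_.trim.toInt).sortBy(x => x).zipWithIndex().map { case (num, i) => (i + 1, num) }
ranked.collect.foreach(x => println(x._1 + "\t" + x._2))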
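Average score
Raw data
Math    | China   | English
张三 88 | 张三 78 | 张三 80
李四 99 | 李四 89 | 李四 82
王五 66 | 王五 96 | 王五 84
赵六 77 | 赵六 67 | 赵六 86
Sample output (shown as whole numbers; the code keeps up to two decimal places, for example 76.67 for 赵六):
张三 82
李四 90
王五 82
赵六 76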
val fourth = sc.textFile("/tmp/spark/fourth", 3)
// Parse each tab-separated "name score" line once, then group the scores by student
fourth.filter(_.trim().length > 0).map(line => {
  val fields = line.split("\t")
  (fields(0).trim(), fields(1).trim().toInt)
}).groupByKey().map(x => {
  var num = 0.0
  var sum = 0
  for (i <- x._2) {
    sum = sum + i
    num = num + 1
  }
  val avg = sum / num
  // Round the average to two decimal places
  val format = f"$avg%1.2f".toDouble
  (x._1, format)
}).collect.foreach(x => println(x._1 + "\t" + x._2))
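Grouping collects every score for a student before averaging. An alternative worth considering is to carry a (sum, count) pair per student and merge pairs associatively, which is the shape `reduceByKey` expects. The sketch below illustrates the idea on plain Scala collections with the scores from the sample data; `merge`, `scores`, and `averages` are names introduced for this example only.

```scala
// Scores copied from the three subject files above, as (name, score) pairs.
val scores = Seq(
  ("张三", 88), ("李四", 99), ("王五", 66), ("赵六", 77),
  ("张三", 78), ("李四", 89), ("王五", 96), ("赵六", 67),
  ("张三", 80), ("李四", 82), ("王五", 84), ("赵六", 86)
)

// Associative merge of two (sum, count) accumulators.
def merge(a: (Int, Int), b: (Int, Int)): (Int, Int) = (a._1 + b._1, a._2 + b._2)

// Turn each score into (score, 1), merge per student, then divide sum by count.
val averages: Map[String, Double] =
  scores.map { case (name, score) => (name, (score, 1)) }.groupBy(_._1).map { case (name, pairs) =>
    val (sum, count) = pairs.map(_._2).reduce(merge)
    (name, sum.toDouble / count)
  }

averages.foreach { case (name, avg) => println(name + "\t" + f"$avg%1.2f") }
```

With an RDD of (name, score) pairs, the same `merge` function could be passed to `reduceByKey` after mapping each score to `(score, 1)`.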
Finding the maximum and minimum values
Input data
eightteen_a.txt | eightteen_b.txt
102             | 5
10              | 2
39              | 30
109             | 838
200             | 10005
11              |
3               |
90              |
28              |
Expected result
Max 10005
Min 2
val fifth = sc.textFile("/tmp/spark/fifth", 3)
// Map every number to the same key so that a single group holds all values, then scan it once
fifth.filter(_.trim().length > 0).map(line => ("key", line.trim.toInt)).groupByKey().map(x => {
  var min = Integer.MAX_VALUE
  var max = Integer.MIN_VALUE
  for (num <- x._2) {
    if (num > max) {
      max = num
    }
    if (num < min) {
      min = num
    }
  }
  (max, min)
}).collect.foreach(x => {
  println("max\t" + x._1)
  println("min\t" + x._2)
})
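The scan above can also be expressed as a single fold that carries a (max, min) pair as its accumulator. The following sketch shows this on plain Scala collections with the numbers from the two sample files; the `numbers` name is introduced here for illustration.

```scala
// Numbers copied from eightteen_a.txt and eightteen_b.txt above.
val numbers = Seq(102, 10, 39, 109, 200, 11, 3, 90, 28, 5, 2, 30, 838, 10005)

// Fold once over the data, updating (max, min) for every element.
val (max, min) = numbers.foldLeft((Int.MinValue, Int.MaxValue)) {
  case ((mx, mn), n) => (math.max(mx, n), math.min(mn, n))
}

println("max\t" + max)
println("min\t" + min)
```

RDDs also expose `max()` and `min()` actions, which may be simpler than grouping everything under one key when only these two values are needed.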
Finding the top K values in sorted order
Requirements
#orderid,userid,payment,productid
Find the top N payment values.
a.txt
1,9819,100,121
2,8918,2000,111
3,2813,1234,22
4,9100,10,1101
5,3210,490,111
6,1298,28,1211
7,1010,281,90
8,1818,9000,20
b.txt
100,3333,10,100
101,9321,1000,293
102,3881,701,20
103,6791,910,30
104,8888,11,39
Expected result (top N = 5):
1 9000
2 2000
3 1234
4 1000
5 910
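val six = sc.textFile("/tmp/spark/six")
var idx = 0
// Keep well-formed records, extract payment (the third field), sort in descending order and take the first 5
val top5 = six.filter(x => (x.trim().length > 0) && (x.split(",").length == 4)).map(_.split(",")(2)).map(x => (x.toInt, "")).sortByKey(false).map(x => x._1).take(5)
// take() returns a local Array, so the counter below runs on the driver
top5.foreach(x => {
  idx = idx + 1
  println(idx + "\t" + x)
})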
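The expected result can be checked locally by applying the same steps to the sample records. This is an illustrative sketch on plain Scala collections; the `orders` and `topPayments` names, and the extra malformed line, are introduced here and are not part of the original data.

```scala
// Records copied from a.txt and b.txt above, plus one hypothetical malformed line.
val orders = Seq(
  "1,9819,100,121", "2,8918,2000,111", "3,2813,1234,22", "4,9100,10,1101",
  "5,3210,490,111", "6,1298,28,1211", "7,1010,281,90", "8,1818,9000,20",
  "100,3333,10,100", "101,9321,1000,293", "102,3881,701,20", "103,6791,910,30",
  "104,8888,11,39", "bad,line"
)

// Keep four-field records, parse the payment (third field), sort descending, take five.
val topPayments: Seq[Int] =
  orders.map(_.trim).filter(_.nonEmpty).map(_.split(",")).filter(_.length == 4).map(f => f(2).toInt).sorted(Ordering[Int].reverse).take(5)

// Print rank and payment, starting the rank at 1.
topPayments.zipWithIndex.foreach { case (payment, i) => println((i + 1) + "\t" + payment) }
```

On an RDD of integers, the `top(5)` action returns the five largest elements in descending order, so it could likely stand in for the sortByKey and take combination.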