Raw data
Fields, in order: timestamp, province, city, userid, adid
1562085629599 Hebei Shijiazhuang 564 1
1562085629621 Hunan Changsha 14 6
1562085629636 Hebei Zhangjiakou 265 9
1562085629653 Hunan Changsha 985 4
1562085629677 Jiangsu Nanjing 560 6
1562085629683 Hubei Jingzhou 274 2
1562085629699 Jiangsu Suzhou 29 5
1562085629704 Jiangsu Nanjing 759 3
1562085629706 Hunan Xiangtan 88 8
1562085629713 Hebei Zhangjiakou 102 9
1562085629715 Hebei Zhangjiakou 302 2
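Each line carries five fields; the job below only uses the second (province) and fifth (adid). A minimal parsing sketch, assuming the fields are tab-separated as the split("\t") in the code implies (the object name is illustrative):

```scala
object ParseLine {
  def main(args: Array[String]): Unit = {
    // One raw line; fields assumed tab-separated, matching the job below.
    val line = "1562085629599\tHebei\tShijiazhuang\t564\t1"
    val fields = line.split("\t")
    val province = fields(1) // second field: province
    val adid = fields(4)     // fifth field: ad ID
    println((province, adid)) // (Hebei,1)
  }
}
```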
Requirement
For each province, find the top-3 ad IDs by click count
Reference answer
(Hunan,List((5,2273), (1,2202), (2,2193)))
(Hebei,List((7,2250), (8,2240), (3,2234)))
Analysis
- The requirement is the top-3 clicked ads per province, so group by province together with ad ID
- Sum the click count for every (province, ad) pair
- Only the province and adid fields are needed; the other fields are unused
- Ranking the top 3 requires sorting ads by click count, so each ad ID must be bound together with its click count in a single value (a tuple)
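The steps above can be sketched with plain Scala collections before reaching for Spark. This is a minimal sketch (no Spark; the object name, helper `top3`, and sample pairs are illustrative), mirroring map → reduceByKey → groupByKey → sort:

```scala
object Top3PerProvince {
  // clicks: one (province, adid) pair per click event.
  // Returns province -> top-3 (adid, clickCount), highest count first.
  def top3(clicks: Seq[(String, String)]): Map[String, List[(String, Int)]] =
    clicks
      .map { case (p, a) => ((p, a), 1) }            // ((province, adid), 1)
      .groupBy(_._1).toSeq                           // group by (province, adid)
      .map { case ((p, a), vs) => (p, (a, vs.size)) } // province -> (adid, count)
      .groupBy(_._1)                                  // group by province
      .map { case (p, vs) =>
        (p, vs.map(_._2).toList.sortWith(_._2 > _._2).take(3))
      }

  def main(args: Array[String]): Unit = {
    val sample = Seq(
      ("Hebei", "9"), ("Hebei", "9"), ("Hebei", "2"),
      ("Hunan", "6"), ("Hunan", "4")
    )
    println(top3(sample))
  }
}
```

The `.toSeq` before the second `groupBy` matters: mapping a `Map` directly would collapse all ads of one province onto a single key.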
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

object AdClickTop3 {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("count").setMaster("local")
    val sc = new SparkContext(conf)
    val file: RDD[String] = sc.textFile("D:\\mypc\\Phase03-02-Spark\\day04\\作业\\数据")
    // Fields: timestamp, province, city, userid, adid (tab-separated)
    val arr = file.map(line => {
      val lineArr = line.split("\t")
      val province = lineArr(1)
      val ad_id = lineArr(4)
      ((province, ad_id), 1)
    })
    // Click count per (province, adid)
    val sumed: RDD[((String, String), Int)] = arr.reduceByKey(_ + _)
    // Re-key by province so we can group: province -> (adid, clickCount)
    val resumed: RDD[(String, (String, Int))] = sumed.map(line => {
      val province = line._1._1
      val adid = line._1._2
      val proAdsum = line._2
      (province, (adid, proAdsum))
    })
    // Group all (adid, clickCount) pairs by province
    val resumby_pro: RDD[(String, Iterable[(String, Int)])] = resumed.groupByKey()
    // Sort each province's ads by click count, descending, and keep the top 3
    val sorted = resumby_pro.mapValues(_.toList.sortWith(_._2 > _._2).take(3))
    val map: collection.Map[String, List[(String, Int)]] = sorted.collectAsMap()
    println(map)
    sc.stop()
  }
}
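Note that groupByKey pulls every (adid, count) pair of a province into one in-memory Iterable before sorting; with few distinct ad IDs that is fine, but the top-3 can also be maintained incrementally with a bounded list, which is the shape an aggregateByKey-style solution takes. A plain-Scala sketch of that bounded accumulator (no Spark; names are illustrative):

```scala
object BoundedTop3 {
  // Insert one (adid, count) into an accumulator that never holds
  // more than 3 elements, keeping the 3 largest counts seen so far.
  def insertTop3(acc: List[(String, Int)], x: (String, Int)): List[(String, Int)] =
    (x :: acc).sortWith(_._2 > _._2).take(3)

  def main(args: Array[String]): Unit = {
    val counts = List(("5", 2273), ("1", 2202), ("2", 2193), ("9", 810), ("7", 15))
    val top3 = counts.foldLeft(List.empty[(String, Int)])(insertTop3)
    println(top3) // List((5,2273), (1,2202), (2,2193))
  }
}
```

In Spark, the same insert function would serve as the sequence operation, with a merge of two bounded lists as the combine operation.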