Spark Exercise: Top 10 Categories by Popularity
I. Requirements and Data Preparation
Sample log format:
2019-07-17_95_26070e87-1ad7-49a3-8fb3-cc741facaddf_37_2019-07-17 00:00:02_手机_-1_-1_null_null_null_null_3
2019-07-17_95_26070e87-1ad7-49a3-8fb3-cc741facaddf_48_2019-07-17 00:00:10_null_16_98_null_null_null_null_19
2019-07-17_95_26070e87-1ad7-49a3-8fb3-cc741facaddf_6_2019-07-17 00:00:17_null_19_85_null_null_null_null_7
2019-07-17_38_6502cdc9-cf95-4b08-8854-f03a25baa917_29_2019-07-17 00:00:19_null_12_36_null_null_null_null_5
2019-07-17_38_6502cdc9-cf95-4b08-8854-f03a25baa917_22_2019-07-17 00:00:28_null_-1_-1_null_null_15,1,20,6,4_15,88,75_9
Each line in the data file uses underscores to separate fields.
➢ Each line represents one user action, and the action is exactly one of four types.
➢ If the search keyword is null, the record is not a search action.
➢ If the clicked category ID and product ID are -1, the record is not a click action.
➢ For order actions, one order can cover multiple products, so the category IDs and product IDs may be multiple values separated by commas; if the record is not an order action, these fields are null.
Requirement:
Count clicks, orders, and payments per category, and report the top 10 hottest categories.
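Before the implementations, here is a minimal parsing sketch showing how one sample line splits into the fields we care about. It assumes the field indexes used throughout this post: index 6 = clicked category ID, index 8 = ordered category IDs, index 10 = paid category IDs.
object ParseLineSketch {
  def main(args: Array[String]): Unit = {
    // One sample line from the log above (a payment covering several categories)
    val line = "2019-07-17_38_6502cdc9-cf95-4b08-8854-f03a25baa917_22_2019-07-17 00:00:28_null_-1_-1_null_null_15,1,20,6,4_15,88,75_9"
    val fields = line.split("_")
    val clickCategory   = fields(6)  // "-1" here, so this is not a click
    val orderCategories = fields(8)  // "null" here, so this is not an order
    val payCategories   = fields(10) // "15,1,20,6,4": category IDs that were paid for
    println(payCategories.split(",").toList) // List(15, 1, 20, 6, 4)
  }
}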
II. Implementations
1. Simplest implementation
Code (example):
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

object CategroiesTop10 {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setMaster("local[*]").setAppName("CategroiesTop10")
    val sc = new SparkContext(sparkConf)
    // Read the raw data
    val orgRDD = sc.textFile("datas/user_visit_action.txt")

    // 1. Keep only click records (index 6 is "-1" for non-clicks)
    val clickRDD = orgRDD.filter { line =>
      val datas = line.split("_")
      datas(6) != "-1"
    }
    // 1. Map each click to (categoryId, 1) and aggregate by category
    val clickCountRDD = clickRDD.map { line =>
      val datas = line.split("_")
      (datas(6), 1)
    }.reduceByKey(_ + _)

    // 2. Keep only order records (index 8 is "null" for non-orders)
    val orderRDD = orgRDD.filter { line =>
      val datas = line.split("_")
      datas(8) != "null"
    }
    // 2. One order can contain several category IDs (comma-separated), so flatten before counting
    val orderCountRDD: RDD[(String, Int)] = orderRDD.flatMap { line =>
      val ids = line.split("_")(8).split(",")
      ids.map((_, 1))
    }.reduceByKey(_ + _)

    // 3. Keep only payment records (index 10 is "null" for non-payments)
    val payRDD = orgRDD.filter { line =>
      val datas = line.split("_")
      datas(10) != "null"
    }
    // 3. A payment can also cover several category IDs, so flatten before counting
    val payCountRDD = payRDD.flatMap { line =>
      val ids = line.split("_")(10).split(",")
      ids.map((_, 1))
    }.reduceByKey(_ + _)

    // join: would drop categories that have no clicks
    // leftOuterJoin: same problem for the non-left sides
    // zip: requires identical partition counts and identical element counts per partition
    // cogroup: groups all three RDDs by key, keeping categories that are missing from some of them
    val coGroupRDD: RDD[(String, (Iterable[Int], Iterable[Int], Iterable[Int]))] =
      clickCountRDD.cogroup(orderCountRDD, payCountRDD)
    val coGroupValueRDD = coGroupRDD.mapValues {
      case (click, order, pay) =>
        // Each Iterable holds at most one pre-aggregated count; default to 0 when absent
        (click.headOption.getOrElse(0), order.headOption.getOrElse(0), pay.headOption.getOrElse(0))
    }
    val resultRDD = coGroupValueRDD.sortBy(_._2, false).take(10)
    resultRDD.foreach(println)
    sc.stop()
  }
}
Problems:
1. orgRDD is used three times, so the source file is re-read and re-parsed for each of those uses.
2. cogroup may introduce an extra shuffle.
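Both points can be checked by printing the lineage of the final RDD before the action runs. A debugging sketch, assuming the coGroupValueRDD from the listing above:
// Prints the dependency chain: the same textFile scan appears three times,
// and cogroup shows up as a CoGroupedRDD behind a shuffle boundary.
println(coGroupValueRDD.toDebugString)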
2. Optimizing the first approach
Code (example):
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

object CategroiesTop10Test2 {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setMaster("local[*]").setAppName("CategroiesTop10Test2")
    val sc = new SparkContext(sparkConf)
    // Read the raw data and cache it, since it is reused three times below
    val orgRDD = sc.textFile("datas/user_visit_action.txt")
    orgRDD.cache()

    // 1. Keep only click records
    val clickRDD = orgRDD.filter { line =>
      val datas = line.split("_")
      datas(6) != "-1"
    }
    // 1. Map each click to (categoryId, (clicks, orders, pays)) = (categoryId, (1, 0, 0))
    val clickCountRDD = clickRDD.map { line =>
      val datas = line.split("_")
      (datas(6), (1, 0, 0))
    }

    // 2. Keep only order records
    val orderRDD = orgRDD.filter { line =>
      val datas = line.split("_")
      datas(8) != "null"
    }
    // 2. Flatten the comma-separated category IDs and emit (categoryId, (0, 1, 0))
    val orderCountRDD = orderRDD.flatMap { line =>
      val ids = line.split("_")(8).split(",")
      ids.map((_, (0, 1, 0)))
    }

    // 3. Keep only payment records
    val payRDD = orgRDD.filter { line =>
      val datas = line.split("_")
      datas(10) != "null"
    }
    // 3. Flatten the comma-separated category IDs and emit (categoryId, (0, 0, 1))
    val payCountRDD = payRDD.flatMap { line =>
      val ids = line.split("_")(10).split(",")
      ids.map((_, (0, 0, 1)))
    }

    // union the three RDDs, then sum the counter tuples per category in one reduceByKey
    val unionRDD: RDD[(String, (Int, Int, Int))] =
      clickCountRDD.union(orderCountRDD).union(payCountRDD)
    val reduceRDD = unionRDD.reduceByKey {
      case (o, t) => (o._1 + t._1, o._2 + t._2, o._3 + t._3)
    }
    val resultRDD = reduceRDD.sortBy(_._2, false).take(10)
    resultRDD.foreach(println)
    sc.stop()
  }
}
Compared with the first version, the source RDD is cached, and cogroup is replaced with union.
Problems:
The source RDD is still filtered and restructured three separate times; those conversions could be folded into a single pass.
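A side note on the caching choice: cache() is shorthand for persist(StorageLevel.MEMORY_ONLY). If the source file were too large for memory, a variant like the following (an alternative sketch, not required by this exercise, assuming the orgRDD from the listing above) would spill cached partitions to disk instead of silently dropping them:
import org.apache.spark.storage.StorageLevel

// Same effect as cache(), but partitions that do not fit in memory are
// written to local disk rather than recomputed from the source file.
orgRDD.persist(StorageLevel.MEMORY_AND_DISK)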
3. Optimizing the second approach
Code (example):
import org.apache.spark.{SparkConf, SparkContext}

object CategroiesTop10Test3 {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setMaster("local[*]").setAppName("CategroiesTop10Test3")
    val sc = new SparkContext(sparkConf)
    // Read the raw data
    val orgRDD = sc.textFile("datas/user_visit_action.txt")
    // Classify and convert every line in a single pass:
    // click -> (id, (1,0,0)), order -> (id, (0,1,0)), pay -> (id, (0,0,1)), anything else -> nothing
    val flatMapRDD = orgRDD.flatMap { line =>
      val datas = line.split("_")
      if (datas(6) != "-1") {
        List((datas(6), (1, 0, 0)))
      } else if (datas(8) != "null") {
        datas(8).split(",").map((_, (0, 1, 0))).toList
      } else if (datas(10) != "null") {
        datas(10).split(",").map((_, (0, 0, 1))).toList
      } else {
        Nil
      }
    }
    // Sum the counter tuples per category
    val reduceRDD = flatMapRDD.reduceByKey {
      case (o, t) => (o._1 + t._1, o._2 + t._2, o._3 + t._3)
    }
    val resultRDD = reduceRDD.sortBy(_._2, false).take(10)
    resultRDD.foreach(println)
    sc.stop()
  }
}
The repeated traversal of the source RDD is gone: the data is read and restructured exactly once, and only the reduceByKey shuffle remains.
4. Implementation with a shared write-only variable (accumulator)
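For context before the full solution: Spark's built-in accumulators already follow the same register/add/value pattern that the custom accumulator below implements for a whole map of counters. A minimal sketch with the built-in LongAccumulator (an illustration only, not part of the solution):
import org.apache.spark.{SparkConf, SparkContext}

object LongAccumulatorSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[*]").setAppName("accSketch"))
    // Executors only ever add to the accumulator; the driver reads the
    // merged value after the action completes, with no shuffle involved.
    val lineCount = sc.longAccumulator("lineCount")
    sc.textFile("datas/user_visit_action.txt").foreach(_ => lineCount.add(1))
    println(lineCount.value)
    sc.stop()
  }
}
The full solution with a custom accumulator: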
import org.apache.spark.util.AccumulatorV2
import org.apache.spark.{SparkConf, SparkContext}

import scala.collection.mutable

object CategroiesTop10Test4 {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setMaster("local[*]").setAppName("CategroiesTop10Test4")
    val sc = new SparkContext(sparkConf)
    // Read the raw data
    val orgRDD = sc.textFile("datas/user_visit_action.txt")
    // Register the custom accumulator, then feed it one (action, categoryId) pair per event
    val cateGroyAcc = new CateGroyAccumulator()
    sc.register(cateGroyAcc, "myCate")
    orgRDD.foreach { line =>
      val datas = line.split("_")
      if (datas(6) != "-1") {
        cateGroyAcc.add(("click", datas(6)))
      } else if (datas(8) != "null") {
        datas(8).split(",").foreach(id => cateGroyAcc.add(("order", id)))
      } else if (datas(10) != "null") {
        datas(10).split(",").foreach(id => cateGroyAcc.add(("pay", id)))
      }
    }
    // Read the merged result on the driver and sort by click, then order, then pay
    val grouieses: mutable.Iterable[CateGrouies] = cateGroyAcc.value.map(_._2)
    val result = grouieses.toList.sortWith { (l, r) =>
      if (l.click != r.click) l.click > r.click
      else if (l.order != r.order) l.order > r.order
      else l.pay > r.pay
    }
    val takeResult = result.take(10)
    takeResult.foreach(println)
    sc.stop()
  }

  // Custom accumulator: IN = (action, categoryId), OUT = Map(categoryId -> counters)
  class CateGroyAccumulator extends AccumulatorV2[(String, String), mutable.Map[String, CateGrouies]] {
    private val cateMap = new mutable.HashMap[String, CateGrouies]

    override def isZero: Boolean = cateMap.isEmpty

    override def copy(): AccumulatorV2[(String, String), mutable.Map[String, CateGrouies]] =
      new CateGroyAccumulator

    override def reset(): Unit = cateMap.clear()

    override def add(v: (String, String)): Unit = {
      val (action, cid) = v
      // The category ID (not the action name) is both the map key and the cid field
      val grouies = cateMap.getOrElse(cid, CateGrouies(cid, 0, 0, 0))
      action match {
        case "click" => grouies.click += 1
        case "order" => grouies.order += 1
        case "pay"   => grouies.pay += 1
        case _       =>
      }
      cateMap.update(cid, grouies)
    }

    override def merge(other: AccumulatorV2[(String, String), mutable.Map[String, CateGrouies]]): Unit = {
      other.value.foreach {
        case (key, value) =>
          val grouies = cateMap.getOrElse(key, CateGrouies(key, 0, 0, 0))
          grouies.click += value.click
          grouies.order += value.order
          grouies.pay += value.pay
          cateMap.update(key, grouies)
      }
    }

    override def value: mutable.Map[String, CateGrouies] = cateMap
  }

  // Mutable case class holding the three counters for one category
  case class CateGrouies(var cid: String, var click: Int, var order: Int, var pay: Int)
}
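Note on this final version: because all counting happens inside the accumulator and the partial maps are merged on the driver, no shuffle is needed at all. The trade-off is that the counters for every distinct category must fit in driver memory, and the sort plus take(10) now run as plain Scala collection operations on the driver rather than as distributed operations.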