What does it do?
Consume ad click data from Kafka and compute, in real time, the click count for each ad in each province and city.
Requirement analysis
The data arriving from Kafka is the same as in requirement 6, and the approach is the same: map each record to (key, 1L), then fold it into the running total. The differences are:
- the key now becomes (date_province_city_adid)
- the counting can no longer use reduceByKey; instead we use Spark Streaming's updateStateByKey, which merges the current batch into the state accumulated from all previous batches, so we avoid querying the database on every batch (the window sliding operations serve a similar purpose)
Step-by-step:
- Data ingestion is the same as in the previous requirement.
- Map each record to (key, 1L):
```scala
val key2ProvinceCityDStream = adRealTimeFilterDStream.map { log =>
  // fields used: timestamp(0), province(1), city(2), adid(4)
  val logSplit = log.split(" ")
  val timeStamp = logSplit(0).toLong
  // dateKey: yy-mm-dd
  val dateKey = DateUtils.formatDateKey(new Date(timeStamp))
  val province = logSplit(1)
  val city = logSplit(2)
  val adid = logSplit(4)
  val key = dateKey + "_" + province + "_" + city + "_" + adid
  (key, 1L)
}
```
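To get a concrete feel for what this map emits, here is the same parsing applied to one hypothetical log line. The field layout (timestamp, province, city, userid, adid) is inferred from the indices used above, and a plain SimpleDateFormat stands in for the project's DateUtils, which is not shown here:

```scala
object KeyDemo {
  // Hypothetical log line: timestamp province city userid adid
  val log = "1589126400000 Hebei Shijiazhuang 44 17"

  def toKey(log: String): (String, Long) = {
    val s = log.split(" ")
    // Stand-in for DateUtils.formatDateKey: format epoch millis as yyyy-MM-dd
    val fmt = new java.text.SimpleDateFormat("yyyy-MM-dd")
    val dateKey = fmt.format(new java.util.Date(s(0).toLong))
    (dateKey + "_" + s(1) + "_" + s(2) + "_" + s(4), 1L)
  }

  def main(args: Array[String]): Unit =
    println(toKey(log))
}
```

Note that the userid field (index 3) is deliberately skipped: the key only needs date, province, city, and ad id.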
- Use Spark Streaming's updateStateByKey to fold the current batch into the totals accumulated from earlier batches. Note that updateStateByKey requires a checkpoint directory to be set on the StreamingContext beforehand.

```scala
// use the updateStateByKey operator to maintain the running totals
val key2StateDStream = key2ProvinceCityDStream.updateStateByKey[Long] {
  (values: Seq[Long], state: Option[Long]) => {
    var newValue = state.getOrElse(0L)
    for (v <- values) newValue += v
    Some(newValue)
  }
}
```
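The update function itself can be exercised outside Spark: on each batch, updateStateByKey hands it the new values for a key plus the previous state, and the returned Option becomes the next state. A minimal simulation over three batches for one key (pure Scala, no Spark dependency):

```scala
object UpdateDemo {
  // Same signature as the function passed to updateStateByKey
  def update(values: Seq[Long], state: Option[Long]): Option[Long] =
    Some(state.getOrElse(0L) + values.sum)

  // Feed three consecutive "batches" through the update function,
  // starting from no state (None), and return the final total
  def run(batches: Seq[Seq[Long]]): Long =
    batches.foldLeft(Option.empty[Long])((st, vs) => update(vs, st)).getOrElse(0L)

  def main(args: Array[String]): Unit =
    println(run(Seq(Seq(1L, 1L), Seq(1L), Seq(1L, 1L, 1L)))) // running total: 6
}
```

Returning None from the update function would drop the key from the state, which is how stale keys can be expired.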
- Write the aggregated counts to the database:
```scala
key2StateDStream.foreachRDD {
  rdd => rdd.foreachPartition {
    items =>
      val adStatArray = new ArrayBuffer[AdStat]()
      // key: date_province_city_adid
      for ((key, count) <- items) {
        val keySplit = key.split("_")
        val date = keySplit(0)
        val province = keySplit(1)
        val city = keySplit(2)
        val adid = keySplit(3).toLong
        adStatArray += AdStat(date, province, city, adid, count)
      }
      AdStatDAO.updateBatch(adStatArray.toArray)
      adStatArray.foreach(println)
  }
}
```
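The per-partition loop above just unpacks the composite key back into its four fields before batching the rows for the DAO. A standalone sketch of that unpacking, with a hypothetical AdStat case class assumed to match the project's bean:

```scala
object ParseDemo {
  // Assumed shape of the project's AdStat bean
  case class AdStat(date: String, province: String, city: String,
                    adid: Long, clickCount: Long)

  // Rebuild an AdStat row from a (key, count) pair,
  // where key is date_province_city_adid
  def toAdStat(key: String, count: Long): AdStat = {
    val s = key.split("_")
    AdStat(s(0), s(1), s(2), s(3).toLong, count)
  }

  def main(args: Array[String]): Unit =
    println(toAdStat("2020-05-10_Hebei_Shijiazhuang_17", 6L))
}
```

Collecting rows into an ArrayBuffer and calling updateBatch once per partition keeps the number of database round trips down to one per partition per batch interval.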
Complete code
```scala
import java.util.Date
import scala.collection.mutable.ArrayBuffer
import org.apache.spark.streaming.dstream.DStream

def provinceCityClickStat(adRealTimeFilterDStream: DStream[String]) = {
  val key2ProvinceCityDStream = adRealTimeFilterDStream.map { log =>
    val logSplit = log.split(" ")
    val timeStamp = logSplit(0).toLong
    // dateKey: yy-mm-dd
    val dateKey = DateUtils.formatDateKey(new Date(timeStamp))
    val province = logSplit(1)
    val city = logSplit(2)
    val adid = logSplit(4)
    val key = dateKey + "_" + province + "_" + city + "_" + adid
    (key, 1L)
  }

  // use the updateStateByKey operator to maintain the running totals
  val key2StateDStream = key2ProvinceCityDStream.updateStateByKey[Long] {
    (values: Seq[Long], state: Option[Long]) => {
      var newValue = state.getOrElse(0L)
      for (v <- values) newValue += v
      Some(newValue)
    }
  }

  key2StateDStream.foreachRDD {
    rdd => rdd.foreachPartition {
      items =>
        val adStatArray = new ArrayBuffer[AdStat]()
        // key: date_province_city_adid
        for ((key, count) <- items) {
          val keySplit = key.split("_")
          val date = keySplit(0)
          val province = keySplit(1)
          val city = keySplit(2)
          val adid = keySplit(3).toLong
          adStatArray += AdStat(date, province, city, adid, count)
        }
        // AdStatDAO.updateBatch(adStatArray.toArray)
        adStatArray.foreach(println)
    }
  }
}
```