Requirement: starting from a user's first purchase of the show, the user counts as watching continuously as long as they purchase every day, or miss no more than one day between purchases. A gap of more than one day breaks the streak, and the count starts over. For users who purchased advance-viewing (超前点播) episodes, compute each user's longest streak of consecutive viewing days, then report the distribution of user counts by longest streak length.
Example:
Approach 1:
Standard gap-and-islands idea: sort by date dt, compute rk with a rank() window, and let diff = dt - rk; rows in an unbroken run of days share the same diff. Grouping by (user_id, diff) and taking count(*) gives cnt, the length of each run with no missed days. When runs are separated by exactly one missed day, their diff values form an arithmetic sequence with step 1, so applying rank() again to get rk2 and computing diff2 = diff - rk2 gives those runs a common key. Grouping by (user_id, diff2) and taking sum(cnt) merges them; finally add the one-day gaps inside the group, count(*) - 1 (e.g. for days 3, 5, 7, 9 the three missed days are 4, 6, 8).
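The two-level date-minus-rank trick can be sketched outside SQL as well. Below is a minimal Python sketch of the same logic (the function name max_streak and the use of itertools.groupby are illustrative choices, not part of the original SQL):

```python
from datetime import date, timedelta
from itertools import groupby

def max_streak(dates):
    """Longest streak where at most one day may be missed between views."""
    dates = sorted(dates)
    # Level 1: dt - rank is constant within an unbroken run of days.
    diffs = [d - timedelta(days=rk) for rk, d in enumerate(dates, start=1)]
    # cnt per run (groupby works because diffs is already ordered).
    runs = [(diff, sum(1 for _ in grp)) for diff, grp in groupby(diffs)]
    # Level 2: runs separated by exactly one missed day have diff values
    # stepping by 1, so diff - rank over the runs is constant again.
    keys = [diff - timedelta(days=rk) for rk, (diff, _) in enumerate(runs, start=1)]
    best = 0
    for _, grp in groupby(zip(keys, runs), key=lambda kv: kv[0]):
        merged = list(grp)
        # sum(cnt) plus the count(*) - 1 one-day gaps inside the group.
        total = sum(cnt for _, (_, cnt) in merged) + (len(merged) - 1)
        best = max(best, total)
    return best
```

For the sample user2 dates below (05-01..03, 05-05, 05-07, 05-09, 05-11, 05-14..16) this merges the first five runs into one 11-day streak.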
select
    id,
    sum(ct) + (count(*) - 1)
from (
    select
        id,
        ct,
        date_sub(dateSub1, rank() over (partition by id order by dateSub1)) dateSub2
    from (
        select
            id,
            dateSub1,
            count(*) ct
        from (
            select
                id,
                date_sub(pdate, rank() over (partition by id order by pdate)) dateSub1
            from ten
        ) t1
        group by id, dateSub1
    ) t2
) t3
group by id, dateSub2;
Approach 2:
Use lag() to fetch the previous row's view_dt and take the difference: a difference of at most 2 days means the streak continues; more than 2 days means a break, and the streak is counted afresh.
The idea is to split each user's history into separate continuous segments: a running sum(if(dateDiff > 2, 1, 0)) window marks every row with its segment number.
sum(if(dateDiff > 2, 1, 0)) over (partition by user_id order by view_dt) flag
user_id,view_dt,dateDiff,if(dateDiff>2,1,0),flag
user2,2020-05-01,18383,1,1
user2,2020-05-02,1,0,1
user2,2020-05-03,1,0,1
user2,2020-05-05,2,0,1
user2,2020-05-07,2,0,1
user2,2020-05-09,2,0,1
user2,2020-05-11,2,0,1
user2,2020-05-14,3,1,2
user2,2020-05-15,1,0,2
user2,2020-05-16,1,0,2
user3,2020-05-01,18383,1,1
user3,2020-05-02,1,0,1
user3,2020-05-03,1,0,1
user3,2020-05-04,1,0,1
user3,2020-05-06,2,0,1
user3,2020-05-07,1,0,1
user3,2020-05-08,1,0,1
user3,2020-05-11,3,1,2
user3,2020-05-12,1,0,2
user3,2020-05-13,1,0,2
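The dateDiff/flag columns in those rows can be reproduced with a short Python sketch (the function add_flags is a hypothetical helper; the epoch default mirrors lag(view_dt, 1, "1970-01-01") and the running conditional sum):

```python
from datetime import date

def add_flags(view_dts):
    """view_dts: one user's view dates, sorted ascending."""
    out, prev, flag = [], date(1970, 1, 1), 0  # epoch default = lag()'s fallback
    for d in view_dts:
        diff = (d - prev).days
        bump = 1 if diff > 2 else 0  # 1 starts a new segment (more than one day missed)
        flag += bump                 # running sum(if(...)) = segment id
        out.append((d.isoformat(), diff, bump, flag))
        prev = d
    return out
```

Running it on user2's dates yields the same 18383-day first diff and the flag change from 1 to 2 at 2020-05-14.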
with t1 as (
    select
        user_id,
        view_dt,
        datediff(view_dt, lag(view_dt, 1, "1970-01-01") over (partition by user_id order by view_dt)) dateDiff
    from tencentsql_userview
), t2 as (
    select
        user_id,
        view_dt,
        sum(if(dateDiff > 2, 1, 0)) over (partition by user_id order by view_dt) flag
    from t1
)
select
    user_id,
    datediff(max(view_dt), min(view_dt)) + 1 cnt
from t2
group by user_id, flag;
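Note that this query still returns one row per (user, segment). The requirement's last step, taking the longest streak per user and then the distribution of user counts, is not shown in the SQL; a minimal Python sketch of that final aggregation (streak_distribution is a hypothetical name):

```python
from collections import Counter

def streak_distribution(segments):
    """segments: (user_id, cnt) rows, one per continuous segment."""
    best = {}
    for user, cnt in segments:
        best[user] = max(best.get(user, 0), cnt)  # longest streak per user
    return Counter(best.values())                 # streak length -> user count
```

For the sample data (user2 segments of 11 and 3 days, user3 segments of 8 and 3 days) this gives one user each at 11 and 8 days.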
Approach 3: read the table's underlying data files from Hive and process them with Spark Core RDD operations.
import java.text.SimpleDateFormat
import java.util.Date
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}
import scala.collection.mutable.ListBuffer

def main(args: Array[String]): Unit = {
  // 1. Create a SparkConf and set the application name
  val conf: SparkConf = new SparkConf().setAppName("SparkCoreTest").setMaster("local[*]")
  // 2. Create the SparkContext, the entry point for submitting a Spark app
  val sc: SparkContext = new SparkContext(conf)
  // Read the data files from HDFS
  val fileRdd: RDD[String] = sc.textFile("hdfs://hadoop102:8020/input")
  val result: RDD[(String, Long)] = fileRdd.map(s => {
    val fields: Array[String] = s.split("\t")
    (fields(0), fields(1))
  }).groupByKey().mapValues(iter => {
    // Process one user's dates
    var lastDt = ""                    // previous date seen
    var count = 1L                     // length of the current streak
    val cList = new ListBuffer[Long]() // all streak lengths for this user
    // Sort (ISO dates sort correctly as strings), then scan
    iter.toList.sortWith(_ < _).foreach(dt => {
      if (lastDt.nonEmpty) { // was `!lastDt.eq("")`, which compares references
        val dt_diff: Long = dateDiff(lastDt, dt)
        if (dt_diff > 2) {
          // More than one day missed: close the current streak
          cList.append(count)
          count = 1 // reset for the next streak
        } else {
          // A diff of 1 or 2 stays inside the streak; advance the count by the gap
          count += dt_diff
        }
      }
      lastDt = dt
    })
    cList.append(count) // record the final streak
    cList.max           // longest streak for this user
  })
  result.foreach(println)
  // Distribution: longest streak length -> number of users
  result.map(_.swap).groupByKey().mapValues(_.size).foreach(println)
  // 5. Stop the context
  sc.stop()
}
/**
 * Number of calendar days between two "yyyy-MM-dd" date strings.
 * Uses java.time.LocalDate: parsing with SimpleDateFormat and dividing the
 * millisecond difference by 86,400,000 can be off by a day across DST changes.
 * @param fromDateStr start date
 * @param endDateStr  end date
 * @return days from fromDateStr to endDateStr
 */
def dateDiff(fromDateStr: String, endDateStr: String): Long = {
  java.time.temporal.ChronoUnit.DAYS.between(
    java.time.LocalDate.parse(fromDateStr),
    java.time.LocalDate.parse(endDateStr)
  )
}
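The per-user scan inside mapValues can be checked against the sample data with a Python sketch using the same gap rules (max_streak_scan is a hypothetical name for this check, not part of the Spark job):

```python
from datetime import date

def max_streak_scan(dates):
    """dates: one user's view dates, ascending; mirrors the mapValues body."""
    best, streak, prev = 1, 1, None
    for d in dates:
        if prev is not None:
            gap = (d - prev).days
            if gap > 2:
                best = max(best, streak)  # more than one day missed: close streak
                streak = 1
            else:
                streak += gap             # a gap of 1 or 2 stays in the streak
        prev = d
    return max(best, streak)
```

On the sample rows this yields 11 for user2 and 8 for user3, matching the SQL approaches.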