ClickHouse case study: consecutive-purchase users and the windowFunnel funnel function

This post covers:

1. SQL analysis of consecutive-purchase users in ClickHouse

2. Using the funnel function windowFunnel to analyze user behavior and compute conversion rates

SQL analysis of consecutive-purchase users in ClickHouse

1) Prepare the data on the local host linux01 at /data/user.csv

a,2017-02-05,200
a,2017-02-06,300
a,2017-02-07,200
a,2017-02-08,400
a,2017-02-10,600
b,2017-02-05,200
b,2017-02-06,300
b,2017-02-08,200
b,2017-02-09,400
b,2017-02-10,600
c,2017-01-31,200
c,2017-02-01,300
c,2017-02-02,200
c,2017-02-03,400
c,2017-02-10,600
a,2017-03-01,200
a,2017-03-02,300
a,2017-03-03,200
a,2017-03-04,400
a,2017-03-05,600

2) Create the table in ClickHouse [engine MergeTree, primary key (name, ctime), sort key (name, ctime)]

create table tb_shop
(name String,
ctime Date,
money Float64
)engine=MergeTree
primary key(name,ctime)
order by(name,ctime);

3) Import the data

clickhouse-client -q "INSERT INTO default.tb_shop FORMAT CSV" < user.csv

Alternatively, import the data with: cat user.csv | clickhouse-client -q "INSERT INTO default.tb_shop FORMAT CSV";

Note: both commands must be run from /data/ on linux01.
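
To confirm the import, a quick check (a minimal sketch; with the sample file above, the count should be 20):

select count() from tb_shop;
select * from tb_shop order by name, ctime limit 5;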

4) Group by user and collect each user's purchase dates into an array, together with an index array

select name,
groupArray(ctime) arr,
arrayEnumerate(arr)  arr2
from tb_shop
group by name;

┌─name─┬─arr─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┬─arr2───────────────────┐
│ b    │ ['2017-02-05','2017-02-06','2017-02-08','2017-02-09','2017-02-10']                                                                  │ [1,2,3,4,5]            │
│ c    │ ['2017-01-31','2017-02-01','2017-02-02','2017-02-03','2017-02-10']                                                                  │ [1,2,3,4,5]            │
│ a    │ ['2017-02-05','2017-02-06','2017-02-07','2017-02-08','2017-02-10','2017-03-01','2017-03-02','2017-03-03','2017-03-04','2017-03-05'] │ [1,2,3,4,5,6,7,8,9,10] │
└──────┴─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┴────────────────────────┘

5) Use ARRAY JOIN to unnest arr and arr2 together, pairing each date with its index across multiple rows

select
name,
t,
diff
from(
select
name,
arr,
arr2
from
(
select name,
groupArray(ctime) arr,
arrayEnumerate(arr)  arr2
from tb_shop
group by name))
array join arr as t,
arr2 as diff;
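
With the sample data, the first unnested rows for user a pair each date with its 1-based position:

┌─name─┬─t──────────┬─diff─┐
│ a    │ 2017-02-05 │    1 │
│ a    │ 2017-02-06 │    2 │
│ a    │ 2017-02-07 │    3 │
│ a    │ 2017-02-08 │    4 │
│ a    │ 2017-02-10 │    5 │
└──────┴────────────┴──────┘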

6) Subtract the index from each date: consecutive dates collapse to the same anchor date. Then group by user and this difference (dif), counting the rows in each group.

select 
name,
addDays(t, -diff) as dif,
count(1) as cnt
from
(
select
name,
t,
diff
from(
select
name,
arr,
arr2
from
(
select name,
groupArray(ctime) arr,
arrayEnumerate(arr)  arr2
from tb_shop
group by name))
array join arr as t,
arr2 as diff) group by name,dif;
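
For user a, the sample data groups as follows: the run 2017-02-05..2017-02-08 collapses to anchor 2017-02-04 with cnt = 4, the lone 2017-02-10 to 2017-02-05 with cnt = 1, and the run 2017-03-01..2017-03-05 to 2017-02-23 with cnt = 5.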

7) Aggregate once more by user, taking the maximum count: the longest streak of consecutive purchase days

select 
name,
max(cnt)
from(
select 
name,
addDays(t, -diff) as dif,
count(1) as cnt
from
(
select
name,
t,
diff
from(
select
name,
arr,
arr2
from
(
select name,
groupArray(ctime) arr,
arrayEnumerate(arr)  arr2
from tb_shop
group by name))
array join arr as t,
arr2 as diff) group by name,dif) group by name;
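
Given the sample data, the result should be a → 5 (the run from 2017-03-01 to 2017-03-05), b → 3, and c → 4.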

Key points of this case:

  • Always specify a table engine when creating a table in ClickHouse.
  • The MergeTree engine requires an ORDER BY (sort) key; the primary key, if given, must be a prefix of it. Here both are (name, ctime), which suits the later processing. Remember the parentheses, or the statement fails.
  • In Hive, row_number() over (partition by name order by ctime) numbers the rows within each user's group. In ClickHouse, groupArray(ctime) collects a user's dates into an array and arrayEnumerate(arr) produces a matching array of positions; ARRAY JOIN then explodes the two arrays side by side back into rows (see the Hive sketch after this list).
  • The numbering itself is not the goal: the goal is the difference between date and index, which stays constant across a run of consecutive days and therefore serves as a grouping key.
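
For comparison, a minimal Hive sketch of the same idea (assuming a tb_shop table with the same columns; date_sub shifts each date back by its row number, so consecutive days share one anchor date):

select name, max(cnt) as max_days
from (
  select name, dif, count(1) as cnt
  from (
    select name, ctime, date_sub(ctime, rn) as dif
    from (
      select name, ctime,
             row_number() over (partition by name order by ctime) as rn
      from tb_shop
    ) t0
  ) t1
  group by name, dif
) t2
group by name;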

Using the funnel function windowFunnel to analyze user behavior and compute conversion rates

The funnel function

We have collected user-behavior log data from an app.

We now need to ingest it into the ODS (raw data) layer and compute UV (unique visitors) along with other event-related statistics, such as the conversion rate: the share of users who complete the whole flow (through the final transaction) out of all users.

1) Create the Hive table and load the data

The collected raw data is JSON, which is awkward to work with directly, so we load it into Hive for parsing and then export it in a delimited format.

Prepare the data

Start the Hive metastore service

hive  --service metastore &

Start the Hive CLI

hive

Start the HiveServer2 service

hiveserver2

Connect with the Beeline client

Create the Hive table

create table tb_log(
log string
)
partitioned by (dt string);

Load the data into the Hive table

load  data  local  inpath  "/data/log.log" into table tb_log partition(dt='20201007');

+----------------------------------------------------+------------+
|                     tb_log.log                     | tb_log.dt  |
+----------------------------------------------------+------------+
| {"account":"l4d9TM","appId":"cn.doitedu.app1","appVersion":"2.0","carrier":"小米移动","deviceId":"KVlABkbMqqwn","deviceType":"REDMI-5","eventId":"adClick","ip":"180.62.91.180","latitude":25.213352855705107,"longitude":100.58718180057615,"netType":"WIFI","osName":"android","osVersion":"6.5","properties":{"adCampain":"7","adId":"6","adLocation":"3","pageId":"41"},"releaseChannel":"豌豆荚","resolution":"1024*768","sessionId":"rGeXt8N0rD1","timeStamp":1602063058368} | 20201007   |
| {"account":"DxL36Dom","appId":"cn.doitedu.app1","appVersion":"2.2","carrier":"中国联通","deviceId":"0HtfcnPofgoR","deviceType":"REDMI-5","eventId":"fetchCoupon","ip":"160.171.79.244","latitude":34.21346724565028,"longitude":117.60034950493103,"netType":"4G","osName":"android","osVersion":"6.5","properties":{"couponId":"2","pageId":"393"},"releaseChannel":"百度手机助手","resolution":"1024*768","sessionId":"meO4lXHYmx1","timeStamp":1602063059899} | 20201007   |
+----------------------------------------------------+------------+

2) Pre-process the data and create the raw-layer (ODS) table tb_ods_log

create  table  tb_ods_log as

select 
if(account='' ,deviceId , account) as guid, 
t.*
from
(select 
json_tuple(
log , 
'account' ,'appId' ,'appVersion','carrier','deviceId','deviceType','eventId','ip','latitude','longitude','netType','osName','osVersion','properties','releaseChannel','resolution','sessionId' ,'timeStamp') 
as 
(account ,appId ,appVersion,carrier,deviceId,deviceType,eventId,ip,latitude,longitude,netType,osName,osVersion,properties,releaseChannel,resolution,sessionId ,`timeStamp`)
from
tb_log)t
where account != ''  or  deviceId !='' ;

select * from tb_ods_log limit 2;

3) Export the data to a local directory as TSV (fields separated by tabs); Hive writes it out as files named like 000000_0

insert overwrite local directory '/data/'
row format delimited fields terminated by '\t'
select * from tb_ods_log;

4) Create the ClickHouse ODS table

The raw data has already been pre-processed: the JSON string was parsed into columns and a user identifier guid was added.

Open the client with clickhouse-client

Create the table
create table tb_ods_log(
guid String,
account String,
appId String,
appVersion String ,
carrier String,
deviceId String,
deviceType String,
eventId String,
ip String,
latitude String,
longitude String,
netType String,
osName String,
osVersion String,
properties String,
releaseChannel String,
resolution String,
sessionId String,
`timeStamp` String 
)engine=MergeTree
order by guid;

Import the data

cat 000000_0 | clickhouse-client -q "insert into tb_ods_log FORMAT TSV";

select * from tb_ods_log limit 4;
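
With the data loaded, the UV metric mentioned above is simply a distinct count of guid; a minimal sketch:

select count(distinct guid) as uv from tb_ods_log;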

5) Create the table for the funnel analysis

CREATE TABLE funnel_test ( uid String, eventId String, eventTime UInt64) 
ENGINE = MergeTree 
PARTITION BY (uid, eventTime) 
ORDER BY (uid, eventTime);

Import the data

clickhouse-client -q "insert into funnel_test FORMAT TSV" < event.data

The contents of event.data are as follows:

uid1    event1    1551398404
uid1    event2    1551398406
uid1    event3    1551398408
uid2    event2    1551398412
uid2    event3    1551398415
uid3    event3    1551398417
uid3    event4    1551398419

6) Funnel-chain SQL in practice

select uid,
windowFunnel(4)(toDateTime(eventTime),eventId='event1',eventId='event2',eventId='event3') as funnel
from 
funnel_test
group by uid;

┌─uid──┬─funnel─┐
│ uid3 │      0 │
│ uid1 │      3 │
│ uid2 │      0 │
└──────┴────────┘

Explanation: windowFunnel(n) returns, per user, the number of consecutive steps of the event chain (event1 → event2 → event3) completed within a window of n units of the timestamp column; since toDateTime here has second resolution, the window is 4 seconds. As the output shows, uid1 completed 3 chain steps within the window, while uid2 and uid3 score 0 because their event sequences never begin with event1.

This can be used to analyze how a shop's registered users progress through the steps ad view → detail view → add to cart → submit order → pay, and to compute conversion rates that guide later operations. For example, if 1,000 users viewed the ad and 50 of them completed payment, the end-to-end conversion rate would be 5%.

7) Create the ClickHouse table tb_ods_log2

create table tb_ods_log2 engine=MergeTree order by (guid,`timeStamp`)
as select * from tb_ods_log;

8) Run the funnel query

select `all`,
four,
four+three,
four+three+two,
four+three+two+one
from 
(
select
count(1) as `all`,
sum(if(funnel=4,1,0)) as four,
sum(if(funnel=3,1,0)) as three,
sum(if(funnel=2,1,0)) as two,
sum(if(funnel=1,1,0)) as one
from(
select guid,
windowFunnel(1000000)(toDateTime(cast(`timeStamp` as Int64)),
eventId='adShow' ,
eventId='adClick',
eventId='adCart',
eventId='submitOrder' 
) as funnel
from tb_ods_log2
group by guid));
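
The query above yields cumulative user counts at each funnel depth (four + three = users who reached at least step 3, and so on). To express them as conversion rates, divide each cumulative count by the total; a minimal sketch wrapping the same inner aggregation:

select `all`,
four / `all` as rate_step4,
(four + three) / `all` as rate_step3,
(four + three + two) / `all` as rate_step2,
(four + three + two + one) / `all` as rate_step1
from
(
select
count(1) as `all`,
sum(if(funnel = 4, 1, 0)) as four,
sum(if(funnel = 3, 1, 0)) as three,
sum(if(funnel = 2, 1, 0)) as two,
sum(if(funnel = 1, 1, 0)) as one
from (
select guid,
windowFunnel(1000000)(toDateTime(cast(`timeStamp` as Int64)),
eventId = 'adShow',
eventId = 'adClick',
eventId = 'adCart',
eventId = 'submitOrder'
) as funnel
from tb_ods_log2
group by guid));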
