ClickHouse case study: consecutive-purchase users and the windowFunnel funnel function

This post covers:

1. SQL analysis of consecutive-purchase users in ClickHouse

2. Using the funnel function windowFunnel to analyze user behavior and compute conversion rates

SQL analysis of consecutive-purchase users in ClickHouse

1) Prepare the data on the local host linux01 at /data/user.csv

a,2017-02-05,200
a,2017-02-06,300
a,2017-02-07,200
a,2017-02-08,400
a,2017-02-10,600
b,2017-02-05,200
b,2017-02-06,300
b,2017-02-08,200
b,2017-02-09,400
b,2017-02-10,600
c,2017-01-31,200
c,2017-02-01,300
c,2017-02-02,200
c,2017-02-03,400
c,2017-02-10,600
a,2017-03-01,200
a,2017-03-02,300
a,2017-03-03,200
a,2017-03-04,400
a,2017-03-05,600

2) Create the table in ClickHouse [engine MergeTree, primary key (name, ctime), sort key (name, ctime)]

create table tb_shop
(name String,
ctime Date,
money Float64
)engine=MergeTree
primary key(name,ctime)
order by(name,ctime);

3) Import the data

clickhouse-client -q "INSERT INTO default.tb_shop FORMAT CSV" < user.csv

Alternatively, import the data with: cat user.csv | clickhouse-client -q "INSERT INTO default.tb_shop FORMAT CSV";

Note: both commands must be run from /data/ on linux01.
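
To confirm the import, a quick check (a minimal sketch; with the sample file above, the count should be 20):

select count() from tb_shop;
select * from tb_shop order by name, ctime limit 5;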

4) Group by user and collect each user's purchase dates into an array, together with an index array

select name,
groupArray(ctime) arr,
arrayEnumerate(arr)  arr2
from tb_shop
group by name;

┌─name─┬─arr─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┬─arr2───────────────────┐
│ b    │ ['2017-02-05','2017-02-06','2017-02-08','2017-02-09','2017-02-10']                                                                  │ [1,2,3,4,5]            │
│ c    │ ['2017-01-31','2017-02-01','2017-02-02','2017-02-03','2017-02-10']                                                                  │ [1,2,3,4,5]            │
│ a    │ ['2017-02-05','2017-02-06','2017-02-07','2017-02-08','2017-02-10','2017-03-01','2017-03-02','2017-03-03','2017-03-04','2017-03-05'] │ [1,2,3,4,5,6,7,8,9,10] │
└──────┴─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┴────────────────────────┘

5) Use ARRAY JOIN to unnest arr and arr2 together, pairing each date with its index across multiple rows

select
name,
t,
diff
from(
select
name,
arr,
arr2
from
(
select name,
groupArray(ctime) arr,
arrayEnumerate(arr)  arr2
from tb_shop
group by name))
array join arr as t,
arr2 as diff;
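
With the sample data, the first unnested rows for user a pair each date with its 1-based position:

┌─name─┬─t──────────┬─diff─┐
│ a    │ 2017-02-05 │    1 │
│ a    │ 2017-02-06 │    2 │
│ a    │ 2017-02-07 │    3 │
│ a    │ 2017-02-08 │    4 │
│ a    │ 2017-02-10 │    5 │
└──────┴────────────┴──────┘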

6) Subtract the index from each date: consecutive dates collapse to the same anchor date. Then group by user and this difference (dif), counting the rows in each group.

select 
name,
addDays(t, -diff) as dif,
count(1) as cnt
from
(
select
name,
t,
diff
from(
select
name,
arr,
arr2
from
(
select name,
groupArray(ctime) arr,
arrayEnumerate(arr)  arr2
from tb_shop
group by name))
array join arr as t,
arr2 as diff) group by name,dif;
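
For user a, the sample data groups as follows: the run 2017-02-05..2017-02-08 collapses to anchor 2017-02-04 with cnt = 4, the lone 2017-02-10 to 2017-02-05 with cnt = 1, and the run 2017-03-01..2017-03-05 to 2017-02-23 with cnt = 5.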

7) Aggregate once more by user, taking the maximum count: the longest streak of consecutive purchase days

select 
name,
max(cnt)
from(
select 
name,
addDays(t, -diff) as dif,
count(1) as cnt
from
(
select
name,
t,
diff
from(
select
name,
arr,
arr2
from
(
select name,
groupArray(ctime) arr,
arrayEnumerate(arr)  arr2
from tb_shop
group by name))
array join arr as t,
arr2 as diff) group by name,dif) group by name;
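
Given the sample data, the result should be a → 5 (the run from 2017-03-01 to 2017-03-05), b → 3, and c → 4.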

Key points of this case:

  • Always specify a table engine when creating a table in ClickHouse.
  • The MergeTree engine requires an ORDER BY (sort) key; the primary key, if given, must be a prefix of it. Here both are (name, ctime), which suits the later processing. Remember the parentheses, or the statement fails.
  • In Hive, row_number() over (partition by name order by ctime) numbers the rows within each user's group. In ClickHouse, groupArray(ctime) collects a user's dates into an array and arrayEnumerate(arr) produces a matching array of positions; ARRAY JOIN then explodes the two arrays side by side back into rows (see the Hive sketch after this list).
  • The numbering itself is not the goal: the goal is the difference between date and index, which stays constant across a run of consecutive days and therefore serves as a grouping key.
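
For comparison, a minimal Hive sketch of the same idea (assuming a tb_shop table with the same columns; date_sub shifts each date back by its row number, so consecutive days share one anchor date):

select name, max(cnt) as max_days
from (
  select name, dif, count(1) as cnt
  from (
    select name, ctime, date_sub(ctime, rn) as dif
    from (
      select name, ctime,
             row_number() over (partition by name order by ctime) as rn
      from tb_shop
    ) t0
  ) t1
  group by name, dif
) t2
group by name;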

Using the funnel function windowFunnel to analyze user behavior and compute conversion rates

The funnel function

We have collected user-behavior log data from an app.

We now need to ingest it into the ODS (raw data) layer and compute UV (unique visitors) along with other event-related statistics, such as the conversion rate: the share of users who complete the whole flow (through the final transaction) out of all users.

1) Create the Hive table and load the data

The collected raw data is JSON, which is awkward to work with directly, so we load it into Hive for parsing and then export it in a delimited format.

Prepare the data

Start the Hive metastore service

hive  --service metastore &

Start the Hive CLI

hive

Start the HiveServer2 service

hiveserver2

Connect with the Beeline client

Create the Hive table

create table tb_log(
log string
)
partitioned by (dt string);

Load the data into the Hive table

load  data  local  inpath  "/data/log.log" into table tb_log partition(dt='20201007');

+----------------------------------------------------+------------+
|                     tb_log.log                     | tb_log.dt  |
+----------------------------------------------------+------------+
| {"account":"l4d9TM","appId":"cn.doitedu.app1","appVersion":"2.0","carrier":"小米移动","deviceId":"KVlABkbMqqwn","deviceType":"REDMI-5","eventId":"adClick","ip":"180.62.91.180","latitude":25.213352855705107,"longitude":100.58718180057615,"netType":"WIFI","osName":"android","osVersion":"6.5","properties":{"adCampain":"7","adId":"6","adLocation":"3","pageId":"41"},"releaseChannel":"豌豆荚","resolution":"1024*768","sessionId":"rGeXt8N0rD1","timeStamp":1602063058368} | 20201007   |
| {"account":"DxL36Dom","appId":"cn.doitedu.app1","appVersion":"2.2","carrier":"中国联通","deviceId":"0HtfcnPofgoR","deviceType":"REDMI-5","eventId":"fetchCoupon","ip":"160.171.79.244","latitude":34.21346724565028,"longitude":117.60034950493103,"netType":"4G","osName":"android","osVersion":"6.5","properties":{"couponId":"2","pageId":"393"},"releaseChannel":"百度手机助手","resolution":"1024*768","sessionId":"meO4lXHYmx1","timeStamp":1602063059899} | 20201007   |
+----------------------------------------------------+------------+

2) Pre-process the data and create the raw-layer (ODS) table tb_ods_log

create  table  tb_ods_log as

select 
if(account='' ,deviceId , account) as guid, 
t.*
from
(select 
json_tuple(
log , 
'account' ,'appId' ,'appVersion','carrier','deviceId','deviceType','eventId','ip','latitude','longitude','netType','osName','osVersion','properties','releaseChannel','resolution','sessionId' ,'timeStamp') 
as 
(account ,appId ,appVersion,carrier,deviceId,deviceType,eventId,ip,latitude,longitude,netType,osName,osVersion,properties,releaseChannel,resolution,sessionId ,`timeStamp`)
from
tb_log)t
where account != ''  or  deviceId !='' ;

select * from tb_ods_log limit 2;

3) Export the data to a local directory as TSV (fields separated by tabs); Hive writes it out as files named like 000000_0

insert overwrite local directory '/data/'
row format delimited fields terminated by '\t'
select * from tb_ods_log;

4) Create the ClickHouse ODS table

The raw data has already been pre-processed: the JSON string was parsed into columns and a user identifier guid was added.

Open the client with clickhouse-client

Create the table
create table tb_ods_log(
guid String,
account String,
appId String,
appVersion String ,
carrier String,
deviceId String,
deviceType String,
eventId String,
ip String,
latitude String,
longitude String,
netType String,
osName String,
osVersion String,
properties String,
releaseChannel String,
resolution String,
sessionId String,
`timeStamp` String 
)engine=MergeTree
order by guid;

Import the data

cat 000000_0 | clickhouse-client -q "insert into tb_ods_log FORMAT TSV";

select * from tb_ods_log limit 4;
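
With the data loaded, the UV metric mentioned above is simply a distinct count of guid; a minimal sketch:

select count(distinct guid) as uv from tb_ods_log;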

5) Create the table for the funnel analysis

CREATE TABLE funnel_test ( uid String, eventId String, eventTime UInt64) 
ENGINE = MergeTree 
PARTITION BY (uid, eventTime) 
ORDER BY (uid, eventTime);

Import the data

clickhouse-client -q "insert into funnel_test FORMAT TSV" < event.data

The contents of event.data are as follows:

uid1    event1    1551398404
uid1    event2    1551398406
uid1    event3    1551398408
uid2    event2    1551398412
uid2    event3    1551398415
uid3    event3    1551398417
uid3    event4    1551398419

6) Funnel-chain SQL in practice

select uid,
windowFunnel(4)(toDateTime(eventTime),eventId='event1',eventId='event2',eventId='event3') as funnel
from 
funnel_test
group by uid;

┌─uid──┬─funnel─┐
│ uid3 │      0 │
│ uid1 │      3 │
│ uid2 │      0 │
└──────┴────────┘

Explanation: windowFunnel(n) returns, per user, the number of consecutive steps of the event chain (event1 → event2 → event3) completed within a window of n units of the timestamp column; since toDateTime here has second resolution, the window is 4 seconds. As the output shows, uid1 completed 3 chain steps within the window, while uid2 and uid3 score 0 because their event sequences never begin with event1.

This can be used to analyze how a shop's registered users progress through the steps ad view → detail view → add to cart → submit order → pay, and to compute conversion rates that guide later operations. For example, if 1,000 users viewed the ad and 50 of them completed payment, the end-to-end conversion rate would be 5%.

7) Create the ClickHouse table tb_ods_log2

create table tb_ods_log2 engine=MergeTree order by (guid,`timeStamp`)
as select * from tb_ods_log;

8) Run the funnel query

select `all`,
four,
four+three,
four+three+two,
four+three+two+one
from 
(
select
count(1) as `all`,
sum(if(funnel=4,1,0)) as four,
sum(if(funnel=3,1,0)) as three,
sum(if(funnel=2,1,0)) as two,
sum(if(funnel=1,1,0)) as one
from(
select guid,
windowFunnel(1000000)(toDateTime(cast(`timeStamp` as Int64)),
eventId='adShow' ,
eventId='adClick',
eventId='adCart',
eventId='submitOrder' 
) as funnel
from tb_ods_log2
group by guid));
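
The query above yields cumulative user counts at each funnel depth (four + three = users who reached at least step 3, and so on). To express them as conversion rates, divide each cumulative count by the total; a minimal sketch wrapping the same inner aggregation:

select `all`,
four / `all` as rate_step4,
(four + three) / `all` as rate_step3,
(four + three + two) / `all` as rate_step2,
(four + three + two + one) / `all` as rate_step1
from
(
select
count(1) as `all`,
sum(if(funnel = 4, 1, 0)) as four,
sum(if(funnel = 3, 1, 0)) as three,
sum(if(funnel = 2, 1, 0)) as two,
sum(if(funnel = 1, 1, 0)) as one
from (
select guid,
windowFunnel(1000000)(toDateTime(cast(`timeStamp` as Int64)),
eventId = 'adShow',
eventId = 'adClick',
eventId = 'adCart',
eventId = 'submitOrder'
) as funnel
from tb_ods_log2
group by guid));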
