ClickHouse in Practice: Consecutive-Consumption Users and the windowFunnel Funnel Function


This article covers two topics:

1. SQL analysis of consecutive-consumption users in ClickHouse

2. Using the funnel function windowFunnel to analyze user behavior and compute conversion rates

SQL analysis of consecutive-consumption users in ClickHouse

1) Prepare the data on the local machine linux01 at /data/user.csv

a,2017-02-05,200
a,2017-02-06,300
a,2017-02-07,200
a,2017-02-08,400
a,2017-02-10,600
b,2017-02-05,200
b,2017-02-06,300
b,2017-02-08,200
b,2017-02-09,400
b,2017-02-10,600
c,2017-01-31,200
c,2017-02-01,300
c,2017-02-02,200
c,2017-02-03,400
c,2017-02-10,600
a,2017-03-01,200
a,2017-03-02,300
a,2017-03-03,200
a,2017-03-04,400
a,2017-03-05,600

2) Create the table in ClickHouse (engine MergeTree; primary key (name, ctime); sort key (name, ctime))

create table tb_shop
(
    name  String,
    ctime Date,
    money Float64
) engine = MergeTree
primary key (name, ctime)
order by (name, ctime);

3) Import the data

clickhouse-client -q "INSERT INTO default.tb_shop FORMAT CSV" < user.csv

Equivalently: cat user.csv | clickhouse-client -q "INSERT INTO default.tb_shop FORMAT CSV"

Note: run this command on linux01 in the /data/ directory.
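A quick sanity check that the import worked (the sample file above contains 20 rows):

select count() from tb_shop;   -- should return 20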

4) Group by user and collect each user's purchase dates into an array, together with an index array

select name,
       groupArray(ctime)   as arr,
       arrayEnumerate(arr) as arr2
from tb_shop
group by name;

┌─name─┬─arr─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┬─arr2───────────────────┐
│ b    │ ['2017-02-05','2017-02-06','2017-02-08','2017-02-09','2017-02-10']                                                                 │ [1,2,3,4,5]            │
│ c    │ ['2017-01-31','2017-02-01','2017-02-02','2017-02-03','2017-02-10']                                                                 │ [1,2,3,4,5]            │
│ a    │ ['2017-02-05','2017-02-06','2017-02-07','2017-02-08','2017-02-10','2017-03-01','2017-03-02','2017-03-03','2017-03-04','2017-03-05'] │ [1,2,3,4,5,6,7,8,9,10] │
└──────┴─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┴────────────────────────┘
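arrayEnumerate simply numbers the elements of an array (the ClickHouse counterpart of a per-group row number); seen in isolation:

select arrayEnumerate(['a','b','c']);   -- returns [1,2,3]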

5) Use ARRAY JOIN to unroll arr and arr2 together into one row per element

select name,
       t,
       diff
from
(
    select name,
           groupArray(ctime)   as arr,
           arrayEnumerate(arr) as arr2
    from tb_shop
    group by name
)
array join arr as t, arr2 as diff;
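ARRAY JOIN unrolls the two arrays in lockstep, one output row per position; a minimal standalone illustration of the same clause:

select t, diff
from (select ['2017-02-05','2017-02-06'] as arr, [1,2] as arr2)
array join arr as t, arr2 as diff;
-- two rows: ('2017-02-05', 1) and ('2017-02-06', 2)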

6) Subtract the index (as days) from ctime, then group by name and the resulting difference dif, counting the rows in each group

select name,
       addDays(t, -diff) as dif,
       count(1)          as cnt
from
(
    select name, t, diff
    from
    (
        select name,
               groupArray(ctime)   as arr,
               arrayEnumerate(arr) as arr2
        from tb_shop
        group by name
    )
    array join arr as t, arr2 as diff
)
group by name, dif;
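To see why this works, take user a from the sample data: the dates 2017-02-05 through 2017-02-08 carry indexes 1 through 4, and each date minus its index lands on the same anchor date 2017-02-04, so the whole run falls into one group with cnt = 4. The isolated 2017-02-10 (index 5) maps to 2017-02-05 with cnt = 1, and the 2017-03-01 through 2017-03-05 run (indexes 6 through 10) maps to 2017-02-23 with cnt = 5.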

7) Group and aggregate once more to get each user's maximum number of consecutive purchase days

select name,
       max(cnt) as max_days
from
(
    select name,
           addDays(t, -diff) as dif,
           count(1)          as cnt
    from
    (
        select name, t, diff
        from
        (
            select name,
                   groupArray(ctime)   as arr,
                   arrayEnumerate(arr) as arr2
            from tb_shop
            group by name
        )
        array join arr as t, arr2 as diff
    )
    group by name, dif
)
group by name;
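Given the sample data, the query should return a → 5 (the Mar 1–5 run), b → 3 (Feb 8–10), and c → 4 (Jan 31–Feb 3).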

Key points of this case: a ClickHouse table must always be created with an explicit engine.

The MergeTree engine requires a sort key (and the primary key defaults to it). Here both the primary key and the sort key are (name, ctime), which suits the processing that follows. Remember the parentheses around the key columns, or the DDL fails.

In Hive, the numbering function row_number() combined with the window over(partition by name order by ctime) numbers the rows within each user's group. In ClickHouse, arrayEnumerate(arr) numbers the elements of an array, but its result is itself an array; groupArray(ctime) likewise produces an array of dates. ARRAY JOIN then explodes the two arrays together so that each date lines up with its index on its own row.

The numbering is not the goal in itself: the goal is the difference between the date and its index, which is constant within each run of consecutive dates and therefore works as a grouping key for runs.
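As an aside: on ClickHouse versions with window-function support (enabled by default since around 21.9), the same result can be written without the array detour. A minimal sketch, not from the original article:

-- row_number() gives the per-user index directly; date minus index is
-- again constant within each consecutive run.
select name, max(cnt) as max_days
from
(
    select name, dif, count() as cnt
    from
    (
        select name,
               ctime - row_number() over (partition by name order by ctime) as dif
        from tb_shop
    )
    group by name, dif
)
group by name;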

Using the funnel function windowFunnel to analyze user behavior and compute conversion rates

We have raw user-behavior event logs. They need to be ingested into the ODS (raw data) layer, after which we compute UV (unique visitor count) and various event statistics, such as the conversion rate: the share of users who complete the entire flow and reach the final transaction.

1) Create a Hive table and load the data

The collected data is JSON, which is awkward to work with directly, so we load it into Hive, parse it into columns, and then export the result as TSV.

Prepare the environment:

Start the Hive metastore service:
hive --service metastore &

Start the Hive CLI:
hive

Start the HiveServer2 service:
hiveserver2

Connect with the beeline client.

Create the Hive table:

create table tb_log
(
    log string
)
partitioned by (dt string);

Load the data into the Hive table:

load data local inpath "/data/log.log" into table tb_log partition(dt='20201007');

+----------------------------------------------------+------------+
| tb_log.log                                         | tb_log.dt  |
+----------------------------------------------------+------------+
| {"account":"l4d9TM","appId":"cn.doitedu.app1","appVersion":"2.0","carrier":"小米移动","deviceId":"KVlABkbMqqwn","deviceType":"REDMI-5","eventId":"adClick","ip":"180.62.91.180","latitude":25.213352855705107,"longitude":100.58718180057615,"netType":"WIFI","osName":"android","osVersion":"6.5","properties":{"adCampain":"7","adId":"6","adLocation":"3","pageId":"41"},"releaseChannel":"豌豆荚","resolution":"1024*768","sessionId":"rGeXt8N0rD1","timeStamp":1602063058368} | 20201007 |
| {"account":"DxL36Dom","appId":"cn.doitedu.app1","appVersion":"2.2","carrier":"中国联通","deviceId":"0HtfcnPofgoR","deviceType":"REDMI-5","eventId":"fetchCoupon","ip":"160.171.79.244","latitude":34.21346724565028,"longitude":117.60034950493103,"netType":"4G","osName":"android","osVersion":"6.5","properties":{"couponId":"2","pageId":"393"},"releaseChannel":"百度手机助手","resolution":"1024*768","sessionId":"meO4lXHYmx1","timeStamp":1602063059899} | 20201007 |
+----------------------------------------------------+------------+

2) Pre-process the data and create the raw-layer table tb_ods_log

create table tb_ods_log as
select
    if(account = '', deviceId, account) as guid,
    *
from
(
    select json_tuple(log,
        'account','appId','appVersion','carrier','deviceId','deviceType',
        'eventId','ip','latitude','longitude','netType','osName','osVersion',
        'properties','releaseChannel','resolution','sessionId','timeStamp')
    as (account, appId, appVersion, carrier, deviceId, deviceType, eventId,
        ip, latitude, longitude, netType, osName, osVersion, properties,
        releaseChannel, resolution, sessionId, `timeStamp`)
    from tb_log
) t
where account != '' or deviceId != '';

select * from tb_ods_log limit 2;
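Note that the properties column is still a raw JSON string after json_tuple; a single key inside it can be pulled out with get_json_object if needed, e.g. (a hypothetical follow-up, not part of the original flow):

select get_json_object(properties, '$.pageId') as pageId
from tb_ods_log
limit 2;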

3) Export the data to the local filesystem as TSV (fields separated by the tab character)

Beware that insert overwrite local directory replaces the existing contents of the target directory, so in practice point it at a dedicated export directory.

insert overwrite local directory '/data/'
row format delimited fields terminated by '\t'
select * from tb_ods_log;

4) Create the ClickHouse ODS table

The raw data has already been pre-processed: the JSON string was parsed into columns and the account identifier guid was added.

Open the client with clickhouse-client, then create the table:

create table tb_ods_log
(
    guid           String,
    account        String,
    appId          String,
    appVersion     String,
    carrier        String,
    deviceId       String,
    deviceType     String,
    eventId        String,
    ip             String,
    latitude       String,
    longitude      String,
    netType        String,
    osName         String,
    osVersion      String,
    properties     String,
    releaseChannel String,
    resolution     String,
    sessionId      String,
    `timeStamp`    String
) engine = MergeTree
order by guid;

Import the data (000000_0 is the file Hive wrote into the export directory):

cat 000000_0 | clickhouse-client -q "insert into tb_ods_log FORMAT TSV";

select * from tb_ods_log limit 4;
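With guid in place, the UV figure mentioned at the start of this section is a one-liner; a minimal sketch:

-- uniqExact counts distinct guids exactly; uniq() is a faster approximation.
select uniqExact(guid) as uv from tb_ods_log;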

5) Create the subject table for the funnel requirement

CREATE TABLE funnel_test
(
    uid       String,
    eventId   String,
    eventTime UInt64
)
ENGINE = MergeTree
PARTITION BY (uid, eventTime)
ORDER BY (uid, eventTime);

(Note: partitioning by (uid, eventTime) creates roughly one partition per row; that is tolerable only for a tiny demo table like this one.)

Import the data:

clickhouse-client -q "insert into funnel_test FORMAT TSV" < event.data

event.data contains (tab-separated):

uid1 event1 1551398404
uid1 event2 1551398406
uid1 event3 1551398408
uid2 event2 1551398412
uid2 event3 1551398415
uid3 event3 1551398417
uid3 event4 1551398419

6) Funnel chain SQL in practice

select uid,
       windowFunnel(4)(toDateTime(eventTime),
           eventId = 'event1',
           eventId = 'event2',
           eventId = 'event3') as funnel
from funnel_test
group by uid;

┌─uid──┬─funnel─┐
│ uid3 │      0 │
│ uid1 │      3 │
│ uid2 │      0 │
└──────┴────────┘

Explanation: windowFunnel(n) searches, within a sliding window of n (in the units of the timestamp; with toDateTime that is seconds, not milliseconds), for the longest consecutive prefix of the event chain event1 → event2 → event3, and returns how many steps were matched. Here uid1 completed all 3 steps within the 4-second window; uid2 and uid3 return 0 because the chain must start with event1, which they never triggered.
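windowFunnel also accepts matching modes on versions that support them; for example, 'strict_order' aborts the chain as soon as an unrelated event interleaves. A minimal sketch against the test table above:

select uid,
       windowFunnel(4, 'strict_order')(toDateTime(eventTime),
           eventId = 'event1',
           eventId = 'event2',
           eventId = 'event3') as funnel
from funnel_test
group by uid;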

The same pattern can analyze how store users progress through the steps ad view → detail view → add to cart → submit order → payment, yielding a conversion rate at each step to guide later operations.

7) Create the ClickHouse table tb_ods_log2

create table tb_ods_log2 engine = MergeTree order by (guid, `timeStamp`)
as select * from tb_ods_log;

8) Run the funnel function

The timeStamp field in the exported logs is in milliseconds (see the sample rows in step 1), so it must be divided by 1000 before toDateTime; the funnel window is then expressed in seconds. Each output column is a cumulative count of users who reached at least that funnel level.

select `all`,
       four                     as reached_step4,
       four + three             as reached_step3,
       four + three + two       as reached_step2,
       four + three + two + one as reached_step1
from
(
    select count(1)                  as `all`,
           sum(if(funnel = 4, 1, 0)) as four,
           sum(if(funnel = 3, 1, 0)) as three,
           sum(if(funnel = 2, 1, 0)) as two,
           sum(if(funnel = 1, 1, 0)) as one
    from
    (
        select guid,
               windowFunnel(1000000)(
                   toDateTime(intDiv(cast(`timeStamp` as Int64), 1000)),
                   eventId = 'adShow',
                   eventId = 'adClick',
                   eventId = 'adCart',
                   eventId = 'submitOrder') as funnel
        from tb_ods_log2
        group by guid
    )
);
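To turn these cumulative counts into the conversion rate promised at the start of the section, one more layer of arithmetic suffices. A minimal self-contained sketch (the overall conversion here is step-1 users to step-4 completers; the column names are mine, not the original article's):

select reached_step1,
       reached_step4,
       round(reached_step4 / reached_step1, 4) as overall_conversion
from
(
    select four + three + two + one as reached_step1,
           four                     as reached_step4
    from
    (
        select sum(if(funnel = 4, 1, 0)) as four,
               sum(if(funnel = 3, 1, 0)) as three,
               sum(if(funnel = 2, 1, 0)) as two,
               sum(if(funnel = 1, 1, 0)) as one
        from
        (
            select guid,
                   windowFunnel(1000000)(
                       toDateTime(intDiv(cast(`timeStamp` as Int64), 1000)),
                       eventId = 'adShow',
                       eventId = 'adClick',
                       eventId = 'adCart',
                       eventId = 'submitOrder') as funnel
            from tb_ods_log2
            group by guid
        )
    )
);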

For more study and interview materials, see the WeChat public account: Hadoop大数据开发.
