ClickHouse in Practice: Consecutive-Consumption Users and the windowFunnel Funnel Function


This article covers two topics:

1. SQL analysis of consecutive-consumption users in ClickHouse

2. Using the funnel function windowFunnel to analyze user behavior and compute conversion rates

SQL analysis of consecutive-consumption users in ClickHouse

1) Prepare the data on the local machine linux01 at /data/user.csv

a,2017-02-05,200
a,2017-02-06,300
a,2017-02-07,200
a,2017-02-08,400
a,2017-02-10,600
b,2017-02-05,200
b,2017-02-06,300
b,2017-02-08,200
b,2017-02-09,400
b,2017-02-10,600
c,2017-01-31,200
c,2017-02-01,300
c,2017-02-02,200
c,2017-02-03,400
c,2017-02-10,600
a,2017-03-01,200
a,2017-03-02,300
a,2017-03-03,200
a,2017-03-04,400
a,2017-03-05,600

2) Create the table in ClickHouse (engine MergeTree; primary key (name, ctime); sort key (name, ctime))

create table tb_shop
(
    name  String,
    ctime Date,
    money Float64
) engine = MergeTree
primary key (name, ctime)
order by (name, ctime);

3) Import the data

clickhouse-client -q "INSERT INTO default.tb_shop FORMAT CSV" < user.csv

Equivalently: cat user.csv | clickhouse-client -q "INSERT INTO default.tb_shop FORMAT CSV"

Note: run this command on linux01 in the /data/ directory.
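A quick sanity check that the import worked (the sample file above contains 20 rows):

select count() from tb_shop;   -- should return 20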

4) Group by user and collect each user's purchase dates into an array, together with an index array

select name,
       groupArray(ctime)   as arr,
       arrayEnumerate(arr) as arr2
from tb_shop
group by name;

┌─name─┬─arr─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┬─arr2───────────────────┐
│ b    │ ['2017-02-05','2017-02-06','2017-02-08','2017-02-09','2017-02-10']                                                                 │ [1,2,3,4,5]            │
│ c    │ ['2017-01-31','2017-02-01','2017-02-02','2017-02-03','2017-02-10']                                                                 │ [1,2,3,4,5]            │
│ a    │ ['2017-02-05','2017-02-06','2017-02-07','2017-02-08','2017-02-10','2017-03-01','2017-03-02','2017-03-03','2017-03-04','2017-03-05'] │ [1,2,3,4,5,6,7,8,9,10] │
└──────┴─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┴────────────────────────┘
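arrayEnumerate simply numbers the elements of an array (the ClickHouse counterpart of a per-group row number); seen in isolation:

select arrayEnumerate(['a','b','c']);   -- returns [1,2,3]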

5) Use ARRAY JOIN to unroll arr and arr2 together into one row per element

select name,
       t,
       diff
from
(
    select name,
           groupArray(ctime)   as arr,
           arrayEnumerate(arr) as arr2
    from tb_shop
    group by name
)
array join arr as t, arr2 as diff;
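ARRAY JOIN unrolls the two arrays in lockstep, one output row per position; a minimal standalone illustration of the same clause:

select t, diff
from (select ['2017-02-05','2017-02-06'] as arr, [1,2] as arr2)
array join arr as t, arr2 as diff;
-- two rows: ('2017-02-05', 1) and ('2017-02-06', 2)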

6) Subtract the index (as days) from ctime, then group by name and the resulting difference dif, counting the rows in each group

select name,
       addDays(t, -diff) as dif,
       count(1)          as cnt
from
(
    select name, t, diff
    from
    (
        select name,
               groupArray(ctime)   as arr,
               arrayEnumerate(arr) as arr2
        from tb_shop
        group by name
    )
    array join arr as t, arr2 as diff
)
group by name, dif;
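To see why this works, take user a from the sample data: the dates 2017-02-05 through 2017-02-08 carry indexes 1 through 4, and each date minus its index lands on the same anchor date 2017-02-04, so the whole run falls into one group with cnt = 4. The isolated 2017-02-10 (index 5) maps to 2017-02-05 with cnt = 1, and the 2017-03-01 through 2017-03-05 run (indexes 6 through 10) maps to 2017-02-23 with cnt = 5.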

7) Group and aggregate once more to get each user's maximum number of consecutive purchase days

select name,
       max(cnt) as max_days
from
(
    select name,
           addDays(t, -diff) as dif,
           count(1)          as cnt
    from
    (
        select name, t, diff
        from
        (
            select name,
                   groupArray(ctime)   as arr,
                   arrayEnumerate(arr) as arr2
            from tb_shop
            group by name
        )
        array join arr as t, arr2 as diff
    )
    group by name, dif
)
group by name;
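Given the sample data, the query should return a → 5 (the Mar 1–5 run), b → 3 (Feb 8–10), and c → 4 (Jan 31–Feb 3).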

Key points of this case: a ClickHouse table must always be created with an explicit engine.

The MergeTree engine requires a sort key (and the primary key defaults to it). Here both the primary key and the sort key are (name, ctime), which suits the processing that follows. Remember the parentheses around the key columns, or the DDL fails.

In Hive, the numbering function row_number() combined with the window over(partition by name order by ctime) numbers the rows within each user's group. In ClickHouse, arrayEnumerate(arr) numbers the elements of an array, but its result is itself an array; groupArray(ctime) likewise produces an array of dates. ARRAY JOIN then explodes the two arrays together so that each date lines up with its index on its own row.

The numbering is not the goal in itself: the goal is the difference between the date and its index, which is constant within each run of consecutive dates and therefore works as a grouping key for runs.
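As an aside: on ClickHouse versions with window-function support (enabled by default since around 21.9), the same result can be written without the array detour. A minimal sketch, not from the original article:

-- row_number() gives the per-user index directly; date minus index is
-- again constant within each consecutive run.
select name, max(cnt) as max_days
from
(
    select name, dif, count() as cnt
    from
    (
        select name,
               ctime - row_number() over (partition by name order by ctime) as dif
        from tb_shop
    )
    group by name, dif
)
group by name;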

Using the funnel function windowFunnel to analyze user behavior and compute conversion rates

We have raw user-behavior event logs. They need to be ingested into the ODS (raw data) layer, after which we compute UV (unique visitor count) and various event statistics, such as the conversion rate: the share of users who complete the entire flow and reach the final transaction.

1) Create a Hive table and load the data

The collected data is JSON, which is awkward to work with directly, so we load it into Hive, parse it into columns, and then export the result as TSV.

Prepare the environment:

Start the Hive metastore service:
hive --service metastore &

Start the Hive CLI:
hive

Start the HiveServer2 service:
hiveserver2

Connect with the beeline client.

Create the Hive table:

create table tb_log
(
    log string
)
partitioned by (dt string);

Load the data into the Hive table:

load data local inpath "/data/log.log" into table tb_log partition(dt='20201007');

+----------------------------------------------------+------------+
| tb_log.log                                         | tb_log.dt  |
+----------------------------------------------------+------------+
| {"account":"l4d9TM","appId":"cn.doitedu.app1","appVersion":"2.0","carrier":"小米移动","deviceId":"KVlABkbMqqwn","deviceType":"REDMI-5","eventId":"adClick","ip":"180.62.91.180","latitude":25.213352855705107,"longitude":100.58718180057615,"netType":"WIFI","osName":"android","osVersion":"6.5","properties":{"adCampain":"7","adId":"6","adLocation":"3","pageId":"41"},"releaseChannel":"豌豆荚","resolution":"1024*768","sessionId":"rGeXt8N0rD1","timeStamp":1602063058368} | 20201007 |
| {"account":"DxL36Dom","appId":"cn.doitedu.app1","appVersion":"2.2","carrier":"中国联通","deviceId":"0HtfcnPofgoR","deviceType":"REDMI-5","eventId":"fetchCoupon","ip":"160.171.79.244","latitude":34.21346724565028,"longitude":117.60034950493103,"netType":"4G","osName":"android","osVersion":"6.5","properties":{"couponId":"2","pageId":"393"},"releaseChannel":"百度手机助手","resolution":"1024*768","sessionId":"meO4lXHYmx1","timeStamp":1602063059899} | 20201007 |
+----------------------------------------------------+------------+

2) Pre-process the data and create the raw-layer table tb_ods_log

create table tb_ods_log as
select
    if(account = '', deviceId, account) as guid,
    *
from
(
    select json_tuple(log,
        'account','appId','appVersion','carrier','deviceId','deviceType',
        'eventId','ip','latitude','longitude','netType','osName','osVersion',
        'properties','releaseChannel','resolution','sessionId','timeStamp')
    as (account, appId, appVersion, carrier, deviceId, deviceType, eventId,
        ip, latitude, longitude, netType, osName, osVersion, properties,
        releaseChannel, resolution, sessionId, `timeStamp`)
    from tb_log
) t
where account != '' or deviceId != '';

select * from tb_ods_log limit 2;
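Note that the properties column is still a raw JSON string after json_tuple; a single key inside it can be pulled out with get_json_object if needed, e.g. (a hypothetical follow-up, not part of the original flow):

select get_json_object(properties, '$.pageId') as pageId
from tb_ods_log
limit 2;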

3) Export the data to the local filesystem as TSV (fields separated by the tab character)

Beware that insert overwrite local directory replaces the existing contents of the target directory, so in practice point it at a dedicated export directory.

insert overwrite local directory '/data/'
row format delimited fields terminated by '\t'
select * from tb_ods_log;

4) Create the ClickHouse ODS table

The raw data has already been pre-processed: the JSON string was parsed into columns and the account identifier guid was added.

Open the client with clickhouse-client, then create the table:

create table tb_ods_log
(
    guid           String,
    account        String,
    appId          String,
    appVersion     String,
    carrier        String,
    deviceId       String,
    deviceType     String,
    eventId        String,
    ip             String,
    latitude       String,
    longitude      String,
    netType        String,
    osName         String,
    osVersion      String,
    properties     String,
    releaseChannel String,
    resolution     String,
    sessionId      String,
    `timeStamp`    String
) engine = MergeTree
order by guid;

Import the data (000000_0 is the file Hive wrote into the export directory):

cat 000000_0 | clickhouse-client -q "insert into tb_ods_log FORMAT TSV";

select * from tb_ods_log limit 4;
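With guid in place, the UV figure mentioned at the start of this section is a one-liner; a minimal sketch:

-- uniqExact counts distinct guids exactly; uniq() is a faster approximation.
select uniqExact(guid) as uv from tb_ods_log;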

5) Create the subject table for the funnel requirement

CREATE TABLE funnel_test
(
    uid       String,
    eventId   String,
    eventTime UInt64
)
ENGINE = MergeTree
PARTITION BY (uid, eventTime)
ORDER BY (uid, eventTime);

(Note: partitioning by (uid, eventTime) creates roughly one partition per row; that is tolerable only for a tiny demo table like this one.)

Import the data:

clickhouse-client -q "insert into funnel_test FORMAT TSV" < event.data

event.data contains (tab-separated):

uid1 event1 1551398404
uid1 event2 1551398406
uid1 event3 1551398408
uid2 event2 1551398412
uid2 event3 1551398415
uid3 event3 1551398417
uid3 event4 1551398419

6) Funnel chain SQL in practice

select uid,
       windowFunnel(4)(toDateTime(eventTime),
           eventId = 'event1',
           eventId = 'event2',
           eventId = 'event3') as funnel
from funnel_test
group by uid;

┌─uid──┬─funnel─┐
│ uid3 │      0 │
│ uid1 │      3 │
│ uid2 │      0 │
└──────┴────────┘

Explanation: windowFunnel(n) searches, within a sliding window of n (in the units of the timestamp; with toDateTime that is seconds, not milliseconds), for the longest consecutive prefix of the event chain event1 → event2 → event3, and returns how many steps were matched. Here uid1 completed all 3 steps within the 4-second window; uid2 and uid3 return 0 because the chain must start with event1, which they never triggered.
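windowFunnel also accepts matching modes on versions that support them; for example, 'strict_order' aborts the chain as soon as an unrelated event interleaves. A minimal sketch against the test table above:

select uid,
       windowFunnel(4, 'strict_order')(toDateTime(eventTime),
           eventId = 'event1',
           eventId = 'event2',
           eventId = 'event3') as funnel
from funnel_test
group by uid;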

The same pattern can analyze how store users progress through the steps ad view → detail view → add to cart → submit order → payment, yielding a conversion rate at each step to guide later operations.

7) Create the ClickHouse table tb_ods_log2

create table tb_ods_log2 engine = MergeTree order by (guid, `timeStamp`)
as select * from tb_ods_log;

8) Run the funnel function

The timeStamp field in the exported logs is in milliseconds (see the sample rows in step 1), so it must be divided by 1000 before toDateTime; the funnel window is then expressed in seconds. Each output column is a cumulative count of users who reached at least that funnel level.

select `all`,
       four                     as reached_step4,
       four + three             as reached_step3,
       four + three + two       as reached_step2,
       four + three + two + one as reached_step1
from
(
    select count(1)                  as `all`,
           sum(if(funnel = 4, 1, 0)) as four,
           sum(if(funnel = 3, 1, 0)) as three,
           sum(if(funnel = 2, 1, 0)) as two,
           sum(if(funnel = 1, 1, 0)) as one
    from
    (
        select guid,
               windowFunnel(1000000)(
                   toDateTime(intDiv(cast(`timeStamp` as Int64), 1000)),
                   eventId = 'adShow',
                   eventId = 'adClick',
                   eventId = 'adCart',
                   eventId = 'submitOrder') as funnel
        from tb_ods_log2
        group by guid
    )
);
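To turn these cumulative counts into the conversion rate promised at the start of the section, one more layer of arithmetic suffices. A minimal self-contained sketch (the overall conversion here is step-1 users to step-4 completers; the column names are mine, not the original article's):

select reached_step1,
       reached_step4,
       round(reached_step4 / reached_step1, 4) as overall_conversion
from
(
    select four + three + two + one as reached_step1,
           four                     as reached_step4
    from
    (
        select sum(if(funnel = 4, 1, 0)) as four,
               sum(if(funnel = 3, 1, 0)) as three,
               sum(if(funnel = 2, 1, 0)) as two,
               sum(if(funnel = 1, 1, 0)) as one
        from
        (
            select guid,
                   windowFunnel(1000000)(
                       toDateTime(intDiv(cast(`timeStamp` as Int64), 1000)),
                       eventId = 'adShow',
                       eventId = 'adClick',
                       eventId = 'adCart',
                       eventId = 'submitOrder') as funnel
            from tb_ods_log2
            group by guid
        )
    )
);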

For more study and interview materials, see the WeChat public account: Hadoop大数据开发.
