Implementing a Cumulative-User Calculation Model in ClickHouse

Problem: by matching user identifiers against a history table, determine whether each user is new, at a granularity of one day.

Requirements: the history table is updated incrementally every day; the design must keep updates to the history table idempotent and support a backfill strategy for missing data.

Output:

  • the user's pushid
  • the uid corresponding to the pushid (if the user did not log in that day there is no pushid for the day, so the uid is matched from the history table)
  • whether the pushid and uid are linked on the current day
  • the user's first-seen (new-user) time
  • the user's total historical investment count
  • the user's investment count for the current day
  • the time of each investment (rechargeTime)

 

Notes:

There are two user identifiers, pushid and uid. pushid is the user's registration ID and only exists while the user is logged in; uid is the visitor cookie (which changes frequently).

The pipeline therefore has to maintain the association between the two.
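The association can be recovered on days when both identifiers appear together in the event stream. A minimal sketch of the idea, assuming an events table with both columns (the column and table names follow the queries later in this post):

```sql
-- For each uid seen today, collect the pushids it appeared with while the
-- user was logged in. A uid whose only pushid is the empty string had no
-- login today and must be matched via the history table instead.
SELECT
    jhd_userkey AS uid,
    groupUniqArray(jhd_pushid) AS pushids
FROM ncf_h5.userevent
WHERE (partition = today()) AND (jhd_pushid != '')
GROUP BY jhd_userkey;
```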

 

Create the history table:

CREATE TABLE IF NOT EXISTS `db_name`.`table_name` (
    partition Date DEFAULT '1970-01-01',
    pushid String DEFAULT '',
    opTime DateTime DEFAULT 0,
    rechargeTime DateTime DEFAULT 0, -- a non-default rechargeTime marks the time of an investment event
    appkey String DEFAULT '',
    uid String DEFAULT '',
    ver UInt64 DEFAULT 0
)
ENGINE = ReplacingMergeTree(partition, (pushid, rechargeTime), 8192, ver)

The ReplacingMergeTree engine makes ingestion idempotent: duplicate rows sharing the same sort key are collapsed during background merges (or at query time with FINAL), so running the job more than once does not leave duplicate data.

ver is the version column: among duplicate rows, the one with the largest version wins. Any monotonically increasing number works; in this pipeline the version is the timestamp at insert time.

*Note: in the sort key (pushid, rechargeTime), rechargeTime captures repeat-investment events. If the business has no such per-action requirement, it can be dropped from the key.
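A minimal sketch of the deduplication behavior (the table name follows the DDL above; the concrete values are illustrative):

```sql
-- Insert the same logical row twice with increasing versions; after merging,
-- only the row with the larger ver survives.
INSERT INTO db_name.table_name VALUES
    ('2017-03-28', 'p1', '2017-03-28 10:00:00', '2017-03-28 10:00:00', 'ncf_h5', 'u1', 1001);
INSERT INTO db_name.table_name VALUES
    ('2017-03-28', 'p1', '2017-03-28 10:00:00', '2017-03-28 10:00:00', 'ncf_h5', 'u1', 1002);

-- Deduplication happens during background merges; force it when verifying:
OPTIMIZE TABLE db_name.table_name FINAL;

-- Or collapse duplicates at read time:
SELECT * FROM db_name.table_name FINAL WHERE pushid = 'p1';
```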

 

Daily update of the history table (run once per day):

INSERT INTO db_name.table_name SELECT 
    partition, 
    pushid, 
    opTime, 
    rechargeTime, 
    appkey, 
    uid, 
    ver
FROM 
(
    SELECT 
        partition, 
        jhd_pushid AS pushid, 
        jhd_opTime AS opTime, 
        jhd_opTime AS rechargeTime, 
        jhd_datatype AS appkey, 
        jhd_userkey AS uid, 
        toUInt64(now()) AS ver -- version = insert-time timestamp
    FROM ncf_h5.userevent 
    WHERE (partition = toDate('2017-03-28')) AND (jhd_pushid != '') AND (jhd_opType = 'page') AND (visitParamExtractString(jhd_map, 'uri') LIKE '%/pay_result%')
    UNION ALL 
    SELECT 
        partition, 
        jhd_pushid AS pushid, 
        min(jhd_opTime) AS opTime, 
        toDateTime('1970-01-01 00:00:00') AS rechargeTime, 
        jhd_datatype AS appkey, 
        jhd_userkey AS uid, 
        toUInt64(now()) AS ver -- version = insert-time timestamp
    FROM ncf_h5.userevent 
    WHERE (partition = toDate('2017-03-28')) AND (jhd_pushid != '')
    GROUP BY 
        jhd_datatype, 
        partition, 
        pushid, 
        jhd_userkey
    UNION ALL 
    SELECT 
        partition, 
        jhd_pushid AS pushid, 
        jhd_opTime AS opTime, 
        jhd_opTime AS rechargeTime, 
        jhd_datatype AS appkey, 
        jhd_userkey AS uid, 
        toUInt64(now()) AS ver -- version = insert-time timestamp
    FROM ncf_ws.userevent 
    WHERE (partition = toDate('2017-03-28')) AND (jhd_pushid != '') AND (jhd_opType = 'page') AND (visitParamExtractString(jhd_map, 'uri') LIKE '%/success%')
    UNION ALL 
    SELECT 
        partition, 
        jhd_pushid AS pushid, 
        min(jhd_opTime) AS opTime, 
        toDateTime('1970-01-01 00:00:00') AS rechargeTime, 
        jhd_datatype AS appkey, 
        jhd_userkey AS uid, 
        toUInt64(now()) AS ver -- version = insert-time timestamp
    FROM ncf_ws.userevent 
    WHERE (partition = toDate('2017-03-28')) AND (jhd_pushid != '')
    GROUP BY 
        jhd_datatype, 
        partition, 
        pushid, 
        jhd_userkey
)
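Backfilling a missing day is then just re-running the same INSERT with the target date substituted: the ReplacingMergeTree sort key collapses any rows that were already present. A sketch of how to verify a backfill (names follow the example above):

```sql
-- After re-running the daily job for a past date, force a merge so the
-- duplicates are collapsed immediately rather than at some later merge:
OPTIMIZE TABLE db_name.table_name FINAL;

-- Then check that no sort key still has multiple copies:
SELECT pushid, rechargeTime, count() AS copies
FROM db_name.table_name
GROUP BY pushid, rechargeTime
HAVING copies > 1; -- expected to return no rows after the merge
```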

Export query:

Output columns:

pushid, whether the user logged in today, uid, first-seen (new-user) time, total historical investment count, today's investment count, time of each investment

 

SELECT 
    pushid, 
    1 AS isfind, 
    uids, 
    earliest, 
    recharge_n, 
    recharge_today, 
    recharge_arr
FROM 
(
    SELECT 
        pushid, 
        CAST(earliest AS String) AS earliest, 
        recharge_n, 
        recharge_today, 
        arrayMap(x -> CAST(x AS String), arrayFilter(x -> x != '1970-01-01 00:00:00', recharge_arr)) AS recharge_arr, 
        arrayFilter(x -> x != '', uids) AS uids
    FROM 
    (
        SELECT pushid, groupUniqArray(uid) AS uids
        FROM ncf_common.user_pushid 
        WHERE partition = toDate('2017-04-04')
        GROUP BY pushid
    ) 
    ANY LEFT JOIN 
    (
        SELECT 
            pushid, 
            min(opTime) AS earliest, 
            sumIf(1, rechargeTime != '1970-01-01 00:00:00') AS recharge_n, 
            sumIf(1, toDate(rechargeTime) = toDate('2017-04-04')) AS recharge_today, 
            groupArray(rechargeTime) AS recharge_arr
        FROM ncf_common.user_pushid 
        WHERE (partition <= toDate('2017-04-04')) AND (partition >= (toDate('2017-04-04') - 365))
        GROUP BY pushid
    ) USING (pushid)
) 
ARRAY JOIN uids
UNION ALL 
SELECT 
    pushid, 
    0 AS isfind, 
    uids, 
    CAST(earliest AS String) AS earliest, 
    recharge_n, 
    recharge_today, 
    arrayMap(x -> CAST(x AS String), arrayFilter(x -> x != '1970-01-01 00:00:00', recharge_arr)) AS recharge_arr
FROM 
(
    SELECT 
        pushid, 
        groupUniqArray(uid) AS uids, 
        min(opTime) AS earliest, 
        sumIf(1, rechargeTime != '1970-01-01 00:00:00') AS recharge_n, 
        sumIf(1, toDate(rechargeTime) = toDate('2017-04-04')) AS recharge_today, 
        arrayFilter(x -> x != '1970-01-01 00:00:00', groupArray(rechargeTime)) AS recharge_arr
    FROM 
    (
        SELECT 
            pushid, 
            uid, 
            opTime, 
            rechargeTime
        FROM 
        (
            SELECT 
                jhd_userkey AS uid, 
                groupUniqArray(jhd_pushid) AS pushids, 
                'ncf_ws' AS appkey
            FROM ncf_ws.userevent 
            WHERE partition = toDate('2017-04-04')
            GROUP BY jhd_userkey
            HAVING (length(pushids) = 1) AND has(pushids, '')
            UNION ALL 
            SELECT 
                jhd_userkey AS uid, 
                groupUniqArray(jhd_pushid) AS pushids, 
                'ncf_h5' AS appkey
            FROM ncf_h5.userevent 
            WHERE partition = toDate('2017-04-04')
            GROUP BY jhd_userkey
            HAVING (length(pushids) = 1) AND has(pushids, '')
        ) 
        ALL INNER JOIN 
        (
            SELECT 
                pushid, 
                uid, 
                opTime, 
                rechargeTime, 
                appkey
            FROM ncf_common.user_pushid 
            WHERE partition < toDate('2017-04-04')
        ) USING (uid)
    ) 
    GROUP BY pushid
) 
ARRAY JOIN uids
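The two join variants above behave differently: ANY LEFT JOIN keeps every left-side row and attaches at most one match, while ALL INNER JOIN keeps every matching pair; ARRAY JOIN then unrolls each array element into its own row. A minimal, self-contained illustration of the ARRAY JOIN step:

```sql
-- ARRAY JOIN turns one row holding an N-element array into N rows:
SELECT pushid, uid
FROM
(
    SELECT 'p1' AS pushid, ['u1', 'u2', 'u3'] AS uids
)
ARRAY JOIN uids AS uid;
-- → three rows: ('p1','u1'), ('p1','u2'), ('p1','u3')
```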

 
