5.1.2 电商离线数仓(广告业务)

最新推荐文章于 2023-06-28 17:24:07 发布

weixin_47134119

最新推荐文章于 2023-06-28 17:24:07 发布

阅读量794

点赞数 1

文章标签：数据仓库

本文链接：https://blog.csdn.net/weixin_47134119/article/details/115426896

版权

电商离线数仓

一、电商分析之–广告业务

互联网平台通行的商业模式是利用免费的基础服务吸引凝聚大量用户，并利用这些用
户资源开展广告或其他增值业务实现盈利从而反哺支撑免费服务的生存和发展。广告
收入不仅成为互联网平台的重要收入之一，更决定了互联网平台的发展程度。

电商平台本身就汇聚了海量的商品、店铺的信息，天然适合进行商品的推广。对于电
商和广告主来说，广告投放的目的无非就是吸引更多的用户，最终实现营销转化。因
此非常关注不同位置广告的曝光量、点击量、购买量、点击率、购买率。

第1节需求分析

事件日志数据样例：

{
    "lagou_event":[
        {
            "name":"goods_detail_loading",
            "json":{
                "entry":"3",
                "goodsid":"0",
                "loading_time":"80",
                "action":"4",
                "staytime":"68",
                "showtype":"4"
            },
            "time":1596225273755
        },
        {
            "name":"loading",
            "json":{
                "loading_time":"18",
                "action":"1",
                "loading_type":"2",
                "type":"3"
            },
            "time":1596231657803
        },
        {
            "name":"ad",
            "json":{
                "duration":"17",
                "ad_action":"0",
                "shop_id":"786",
                "event_type":"ad",
                "ad_type":"4",
                "show_style":"1",
                "product_id":"2772",
                "place":"placeindex_left",
                "sort":"0"
            },
            "time":1596278404415
        },
        {
            "name":"favorites",
            "json":{
                "course_id":0,
                "id":0,
                "userid":0
            },
            "time":1596239532527
        },
        {
            "name":"praise",
            "json":{
                "id":2,
                "type":3,
                "add_time":"1596258672095",
                "userid":8,
                "target":6
            },
            "time":1596274343507
        }
    ],
    "attr":{
        "area":"拉萨",
        "uid":"2F10092A86",
        "app_v":"1.1.12",
        "event_type":"common",
        "device_id":"1FB872-9A10086",
        "os_type":"4.1",
        "channel":"KS",
        "language":"chinese",
        "brand":"xiaomi-2"
    }
}

采集的信息包括：

商品详情页加载：goods_detail_loading
商品列表：loading
消息通知：notification
商品评论：comment
收藏：favorites
点赞：praise
广告：ad
- action。用户行为；0 曝光；1 曝光后点击；2 购买
- duration。停留时长
- shop_id。商家id
- event_type。“ad”
- ad_type。格式类型；1 JPG；2 PNG；3 GIF；4 SWF
- show_style。显示风格，0 静态图；1 动态图
- product_id。产品id
- place。广告位置；首页=1，左侧=2，右侧=3，列表页=4
- sort。排序位置

需求指标：
1、点击次数统计(分时统计)
曝光次数、不同用户id数、不同用户数
点击次数、不同用户id数、不同用户数
购买次数、不同用户id数、不同用户数

2、转化率-漏斗分析
点击率 = 点击次数 / 曝光次数
购买率 = 购买次数 / 点击次数

3、活动曝光效果评估：
行为(曝光、点击、购买)、时间段、广告位、产品，统计对应的次数
时间段、广告位、商品，曝光次数最多的前N个

第2节事件日志采集

1、启动Flume Agent（适当的修改参数，128M滚动一次）

# 启动flume
[root@linux122 ~]# flume-ng agent --conf /opt/lagou/servers/flume-1.9.0/conf --conf-file /data/lagoudw/conf/flume-log2hdfs3.conf -name a1 - Dflume.root.logger=INFO,console

2、生成数据（文件大小约640M，100W条事件日志）

cd /data/lagoudw/jars 
java -cp data-generator-1.1-SNAPSHOT-jar-with-dependencies.jar com.lagou.ecommerce.AppEvent 1000000 2020-08-02 > /data/lagoudw/logs/event/events0802.log

3、数据采集完成后，检查HDFS结果

hdfs dfs -ls /user/data/logs/event

第3节 ODS层建表和数据加载

drop table if exists ods.ods_log_event; 
CREATE EXTERNAL TABLE ods.ods_log_event(`str` string) PARTITIONED BY (`dt` string) STORED AS TEXTFILE LOCATION '/user/data/logs/event';

/data/lagoudw/script/advertisement/ods_load_event_log.sh

#!/bin/bash
source /etc/profile
if [ -n "$1" ]
then
do_date=$1
else
do_date=`date -d "-1 day" +%F`
fi
sql="
alter table ods.ods_log_event add partition (dt='$do_date');
"
hive -e "$sql"

第4节 DWD层建表和数据加载

ODS：分区；事件的主要信息在json串中(json数组)，公共信息在另外一个json串
中；
ODS => 解析json，从json串中，提取jsonArray数据；将公共信息从json串中解析
出来 => 所有事件的明细

所有事件的明细，包括：

分区
事件(json串)
公共信息字段

所有事件的明细 => 广告json串解析 => 广告事件的明细
广告事件的明细：

分区
广告信息字段
公共信息字段

4.1、DWD层建表

-- 所有事件明细
drop table if exists dwd.dwd_event_log;
CREATE EXTERNAL TABLE dwd.dwd_event_log(
`device_id` string,
`uid` string,
`app_v` string,
`os_type` string,
`event_type` string,
`language` string,
`channel` string,
`area` string,
`brand` string,
`name` string,
`event_json` string,
`report_time` string)
PARTITIONED BY (`dt` string)
stored as parquet;
-- 与广告点击明细
drop table if exists dwd.dwd_ad;
CREATE TABLE dwd.dwd_ad(
`device_id` string,
`uid` string,
`app_v` string,
`os_type` string,
`event_type` string,
`language` string,
`channel` string,
`area` string,
`brand` string,
`report_time` string,
`duration` int,
`ad_action` int,
`shop_id` int,
`ad_type` int,
`show_style` smallint,
`product_id` int,
`place` string,
`sort` int,
`hour` string
)
PARTITIONED BY (`dt` string)
stored as parquet;

4.2、事件json串解析

内建函数、UDF、SerDe（json是所有的信息）
详细内容参见第三部分电商分析之–会员活跃度 => 第4节 json数据处理 => 使用
UDF（处理jsonArray）

package cn.lagou.dw.hive.udf;

import com.alibaba.fastjson.JSON;
import com.alibaba.fastjson.JSONArray;
import com.alibaba.fastjson.JSONException;
import com.alibaba.fastjson.JSONObject;
import com.google.common.base.Strings;
import org.apache.hadoop.hive.ql.exec.UDF;
import org.junit.Test;

import java.util.ArrayList;

public class ParseJsonArray1 extends UDF {
    public ArrayList<String> evaluate(String jsonStr) {
// 传入空字符串，返回null
        if (Strings.isNullOrEmpty(jsonStr)) {
            return null;
        }
        try {
// 获取jsonArray
            JSONArray jsonArray = JSON.parseArray(jsonStr);
            ArrayList<String> lst = new ArrayList<>();
            for (Object o : jsonArray) {
                lst.add(o.toString());
            }
            return lst;
        } catch (JSONException e) {
            return null;
        }
    }

    @Test
    public void JunitParseJsonArray() {
        String jsonStr = "[{\"name\":\"goods_detail_loading\",\"json\":{\"entry\":\"1\",\"goodsid\":\"0\",\"loading_time\":\"93\",\"action\":\"3\",\"staytime\":\"56\",\"showtype\":\"2\"},\"time\":1596343881690},{\"name\":\"loading\",\"json\":{\"loading_time\":\"15\",\"action\":\"3\",\"loading_type\":\"3\",\"type\":\"1\"},\"time\":1596356988428},{\"name\":\"notification\",\"json\":{\"action\":\"1\",\"type\":\"2\"},\"time\":1596374167278},{\"name\":\"favorites\",\"json\":{\"course_id\":1,\"id\":0,\"userid\":0},\"time\":1596350933962}]";
        ArrayList<String> result = evaluate(jsonStr);
        System.out.println(result.size());
        System.out.println(JSON.toJSONString(result));
    }
}

4.3、DWD层数据加载

主要功能：解析json串；得到全部的事件日志
/data/lagoudw/script/advertisement/dwd_load_event_log.sh

#!/bin/bash
source /etc/profile
if [ -n "$1" ] ;then
do_date=$1
else
do_date=`date -d "-1 day" +%F`
fi
sql="
use dwd;
add jar /data/lagoudw/jars/cn.lagou.dw-1.0-SNAPSHOT-jar-with-dependencies.jar;
create temporary function json_array as'cn.lagou.dw.hive.udf.ParseJsonArray1';
with
tmp_start as
(
select split(str, ' ')[7] as line
from ods.ods_log_event
where dt='$do_date'
)
insert overwrite table dwd.dwd_event_log
PARTITION (dt='$do_date')
select
device_id,
uid,
app_v,
os_type,
event_type,
language,
channel,
area,
brand,
get_json_object(k,'$.name') as name,
get_json_object(k,'$.json') as json,
get_json_object(k,'$.time') as time
from
(
select
get_json_object(line,'$.attr.device_id') as device_id,
get_json_object(line,'$.attr.uid') as uid,
get_json_object(line,'$.attr.app_v') as app_v,
get_json_object(line,'$.attr.os_type') as os_type,
get_json_object(line,'$.attr.event_type') as event_type,
get_json_object(line,'$.attr.language') as language,
get_json_object(line,'$.attr.channel') as channel,
get_json_object(line,'$.attr.area') as area,
get_json_object(line,'$.attr.brand') as brand,
get_json_object(line,'$.lagou_event') as lagou_event
from tmp_start
) A lateral view explode(json_array(lagou_event)) B as k
"
hive -e "$sql"

在这里插入图片描述

从全部的事件日志中获取广告点击事件：
/data/lagoudw/script/advertisement/dwd_load_ad_log.sh

#!/bin/bash
source /etc/profile
if [ -n "$1" ] ;then
do_date=$1
else
do_date=`date -d "-1 day" +%F`
fi
sql="
insert overwrite table dwd.dwd_ad
PARTITION (dt='$do_date')
select
device_id,
uid,
app_v,
os_type,
event_type,
language,
channel,
area,
brand,
report_time,
get_json_object(event_json,'$.duration') ,
get_json_object(event_json,'$.ad_action') ,
get_json_object(event_json,'$.shop_id') ,
get_json_object(event_json,'$.ad_type'),
get_json_object(event_json,'$.show_style'),
get_json_object(event_json,'$.product_id'),
get_json_object(event_json,'$.place'),
get_json_object(event_json,'$.sort'),
from_unixtime(ceil(report_time/1000), 'HH')
from dwd.dwd_event_log
where dt='$do_date' and name='ad';
"
hive -e "$sql"

在这里插入图片描述

日志 => Flume => ODS => 清洗、转换 => 广告事件详细信息

第5节广告点击次数分析

5.1 需求分析

广告：ad

action。用户行为；0 曝光；1 曝光后点击；2 购买
duration。停留时长
shop_id。商家id
event_type。“ad”
ad_type。格式类型；1 JPG；2 PNG；3 GIF；4 SWF
show_style。显示风格，0 静态图；1 动态图
product_id。产品id
place。广告位置；首页=1，左侧=2，右侧=3，列表页=4
sort。排序位置

公共字段

分时统计：
曝光次数、不同用户id数（公共信息中的uid）、不同用户数(公共信息中的
device_id)
点击次数、不同用户id数、不同用户数(device_id)
购买次数、不同用户id数、不同用户数(device_id)
DWD => DWS(不需要) => ADS；在某个分析中不是所有的层都会用到

5.2、创建ADS层表

drop table if exists ads.ads_ad_show;
create table ads.ads_ad_show(
cnt bigint,
u_cnt bigint,
device_cnt bigint,
ad_action tinyint,
hour string
) PARTITIONED BY (`dt` string)
row format delimited fields terminated by ',';

5.3、加载ADS层数据

/data/lagoudw/script/advertisement/ads_load_ad_show.sh

#!/bin/bash
source /etc/profile
if [ -n "$1" ] ;then
do_date=$1
else
do_date=`date -d "-1 day" +%F`
fi
sql="
insert overwrite table ads.ads_ad_show
partition (dt='$do_date')
select count(1),
count(distinct uid),
count(distinct device_id),
ad_action,
hour
from dwd.dwd_ad
where dt='$do_date'
group by ad_action, hour
"
hive -e "$sql"

在这里插入图片描述

第6节漏斗分析(点击率购买率)

6.1、需求分析

分时统计：
点击率 = 点击次数 / 曝光次数
购买率 = 购买次数 / 点击次数

6.2、创建ADS层表

drop table if exists ads.ads_ad_show_rate;
create table ads.ads_ad_show_rate(
hour string,
click_rate double,
buy_rate double
) PARTITIONED BY (`dt` string)
row format delimited fields terminated by ',';

在这里插入图片描述
行转列

-- 方法一
select sum(case when ad_action='0' then cnt end) show_cnt,
sum(case when ad_action='1' then cnt end) click_cnt,
sum(case when ad_action='2' then cnt end) buy_cnt,
hour
from ads.ads_ad_show
where dt='2020-07-21' and hour='02'
group by hour ;
-- 方法二
select max(case when ad_action='0' then cnt end) show_cnt,
max(case when ad_action='1' then cnt end) click_cnt,
max(case when ad_action='2' then cnt end) buy_cnt,
hour
from ads.ads_ad_show
where dt='2020-07-21' and hour='02'
group by hour ;

在这里插入图片描述

6.3、加载ADS层数据

/data/lagoudw/script/advertisement/ads_load_ad_show_rate.sh

#!/bin/bash
source /etc/profile
if [ -n "$1" ] ;
then
do_date=$1
else
do_date=`date -d "-1 day" +%F`
fi
sql="
with tmp as(
select max(case when ad_action='0' then cnt end) show_cnt,
max(case when ad_action='1' then cnt end) click_cnt,
max(case when ad_action='2' then cnt end) buy_cnt,
hour
from ads.ads_ad_show
where dt='$do_date'
group by hour
)
insert overwrite table ads.ads_ad_show_rate
partition (dt='$do_date')
select hour,
click_cnt / show_cnt as click_rate,
buy_cnt / click_cnt as buy_rate
from tmp;
"
hive -e "$sql"

在这里插入图片描述

第7节广告效果分析

7.1、需求分析

活动曝光效果评估：
行为(曝光、点击、购买)、时间段、广告位、商品，统计对应的次数
时间段、广告位、商品，曝光次数最多的前100个

7.2、创建ADS层表

drop table if exists ads.ads_ad_show_place;
create table ads.ads_ad_show_place(
ad_action tinyint,
hour string,
place string,
product_id int,
cnt bigint
)PARTITIONED BY (`dt` string)
row format delimited fields terminated by ',';
drop table if exists ads.ads_ad_show_place_window;
create table ads.ads_ad_show_place_window(
hour string,
place string,
product_id int,
cnt bigint,
rank int
)PARTITIONED BY (`dt` string)
row format delimited fields terminated by ',';

7.2、加载ADS层数据

/data/lagoudw/script/advertisement/ads_load_ad_show_page.sh

#!/bin/bash
source /etc/profile
if [ -n "$1" ] ;
then
do_date=$1
else
do_date=`date -d "-1 day" +%F`
fi
sql="
insert overwrite table ads.ads_ad_show_place
partition (dt='$do_date')
select ad_action,
hour,
place,
product_id,
count(1)
from dwd.dwd_ad
where dt='$do_date'
group by ad_action, hour, place, product_id;
"
hive -e "$sql"

在这里插入图片描述
/data/lagoudw/script/advertisement/ads_load_ad_show_page_window.sh

#!/bin/bash
source /etc/profile
if [ -n "$1" ] ;
then
do_date=$1
else
do_date=`date -d "-1 day" +%F`
fi
sql="
insert overwrite table ads.ads_ad_show_place_window
partition (dt='$do_date')
select *
from (
select hour,
place,
product_id,
cnt,
row_number() over (partition by hour, place,
product_id order by cnt desc) rank
from ads.ads_ad_show_place
where dt='$do_date' and ad_action='0'
) t
where rank <= 100
"
hive -e "$sql"

在这里插入图片描述

小结：分析简单，没有DWS层
Flume、json解析在会员分析讲解

第8节广告分析小结

在这里插入图片描述
脚本调用次序:**

ods_load_event_log.sh
dwd_load_event_log.sh
dwd_load_ad_log.sh
ads_load_ad_show.sh
ads_load_ad_show_rate.sh
ads_load_ad_show_page.sh
ads_load_ad_show_page_window.sh

第9节 ADS层数据导出（DataX）

步骤：

在MySQL创建对应的表
创建配置文件（json）
执行命令，使用json配置文件；测试
编写执行脚本（shell）
shell脚本的测试

1、MySQL 建表

mysql> use dwads;
drop table if exists dwads.ads_ad_show_place;
create table dwads.ads_ad_show_place(
ad_action tinyint,
hour varchar(2),
place varchar(20),
product_id int,
cnt int,
dt varchar(10)
);

2、创建配置文件
/data/lagoudw/script/advertisement/ads_ad_show_place.json

{
    "job":{
        "setting":{
            "speed":{
                "channel":1
            }
        },
        "content":[
            {
                "reader":{
                    "name":"hdfsreader",
                    "parameter":{
                        "path":"/user/hive/warehouse/ads.db/ads_ad_show_place/dt=$do_date/*",
                        "defaultFS":"hdfs://linux121:9000",
                        "column":[
                            {
                                "index":0,
                                "type":"string"
                            },
                            {
                                "index":1,
                                "type":"string"
                            },
                            {
                                "index":2,
                                "type":"string"
                            },
                            {
                                "index":3,
                                "type":"string"
                            },
                            {
                                "index":4,
                                "type":"string"
                            },
                            {
                                "type":"string",
                                "value":"$do_date"
                            }
                        ],
                        "fileType":"text",
                        "encoding":"UTF-8",
                        "fieldDelimiter":","
                    }
                },
                "writer":{
                    "name":"mysqlwriter",
                    "parameter":{
                        "writeMode":"insert",
                        "username":"hive",
                        "password":"12345678",
                        "column":[
                            "ad_action",
                            "hour",
                            "place",
                            "product_id",
                            "cnt",
                            "dt"
                        ],
                        "preSql":[
                            "delete from ads_ad_show_place where dt='$do_date'"
                        ],
                        "connection":[
                            {
                                "jdbcUrl":"jdbc:mysql://linux123:3306/dwads?useUnicode=true&characterEncoding=utf-8",
                                "table":[
                                    "ads_ad_show_place"
                                ]
                            }
                        ]
                    }
                }
            }
        ]
    }
}

3、执行命令(测试)

[root@linux122 ~]# python /opt/lagou/servers/datax/bin/datax.py -p "-Ddo_date=2020-07-21" /data/lagoudw/script/advertisement/ads_ad_show_place.json

在这里插入图片描述

4、编写脚本
/data/lagoudw/script/advertisement/ads_ad_show_place.sh

#!/bin/bash
source /etc/profile
JSON=/data/lagoudw/script
if [ -n "$1" ] ;
then
do_date=$1
else
do_date=`date -d "-1 day" +%F`
fi

python $DATAX_HOME/bin/datax.py -p "-Ddo_date=$do_date" /data/lagoudw/script/advertisement/ads_ad_show_place.json

5、执行脚本

sh /data/lagoudw/script/advertisement/ads_ad_show_place.sh 2020-07-21

第10节高仿日志数据测试

10.1、数据采集

1000W左右日活用户
按 30条日志 / 人天，合计3亿条事件日志
每条日志 650字节左右
总数据量大概在180G
采集数据时间约2.5小时

1、清理环境
2、启动Flume：

nohup flume-ng agent --conf $FLUME_HOME/conf --conf-file /data/lagoudw/conf/flume-log2hdfs4.conf -name a1 -Dflume.root.logger=INFO,console &

日志文件很大，可以将hdfs文件滚动设置为10G甚至更大

3、写日志

cd /data/lagoudw/jars
nohup java -cp data-generator-1.1-SNAPSHOT-jar-with-dependencies.jar com.lagou.ecommerce.AppEvent 300000000 2020-08-03 > /data/lagoudw/logs/event/eventlog0803.log &

10.2、执行脚本

sh ods_load_event_log.sh 2020-08-03
sh dwd_load_event_log.sh 2020-08-03
sh dwd_load_ad_log.sh 2020-08-03
sh ads_load_ad_show.sh 2020-08-03
sh ads_load_ad_show_rate.sh 2020-08-03
sh ads_load_ad_show_page.sh 2020-08-03
sh ads_load_ad_show_page_window.sh 2020-08-03

weixin_47134119

关注

1
点赞
踩
12

收藏

觉得还不错? 一键收藏
0
评论
5.1.2 电商离线数仓(广告业务)

电商离线数仓文章目录电商离线数仓一、电商分析之--广告业务第1节需求分析第2节事件日志采集第3节 ODS层建表和数据加载第4节 DWD层建表和数据加载4.1、DWD层建表4.2、事件json串解析4.3、DWD层数据加载第5节广告点击次数分析5.1 需求分析**广告：ad****公共字段**5.2、创建ADS层表5.3、加载ADS层数据第6节漏斗分析(点击率购买率)6.1、需求分析6.2、创建ADS层表6.3、加载ADS层数据第7节广告效果分析7.1、需求分析7.2、创建ADS层表7.2、加载
复制链接

扫一扫