Business Knowledge Preparation
Business Terms
- User
A user is identified by device: in mobile analytics, each distinct device is treated as one independent user. Android devices are identified by IMEI and iOS devices by OpenUDID, so one handset equals one user.
- New user
A user who goes online with the app for the first time. If a user opens the app for the first time, that user is counted as a new user; a device that uninstalls and then reinstalls is not counted as new again. New users are tracked as daily, weekly and monthly new users.
- Active user
Any user who opens the app is an active user, regardless of how the app is used. A device that opens the app several times in one day still counts as a single active user for that day.
- Weekly (monthly) active user
A user who launches the app within a given calendar week (month); multiple launches within that week (month) count as one active user.
- Monthly active rate
Monthly active users divided by the cumulative total of users up to and including that month.
- Silent user
A user who launches the app only once, on the install day (or the following day), with no further launches afterwards. This metric reflects the quality of new users and how well the app matches its audience.
- Version distribution
Daily new users, active users and launch counts within a week, broken down by app version. Useful for comparing versions and understanding user behavior.
- Returning users this week
Users who did not launch the app last week but launched it this week.
- Users active for n consecutive weeks
Users who launch the app at least once per week for n consecutive weeks.
- Loyal users
Users who have been active for 5 or more consecutive weeks.
- Consecutively active users
Users who have been active for 2 or more consecutive weeks.
- Recently churned users
Users who have not launched the app for n (2 <= n <= 4) consecutive weeks (and still have not launched it in week n+1).
- Retained users
New users from a given period who are still using the app after some time has passed; the share of such users among that period's new users is the retention rate.
For example, if 200 new users arrive in May, and of these 100 launch the app in June, 80 in July and 50 in August, then the one-month retention of May's new users is 50%, the two-month retention is 40% and the three-month retention is 25%. (A HiveQL sketch of this calculation follows the list.)
- User freshness
The ratio of new to existing users among each day's launches, i.e. new users as a share of active users.
- Single-session duration
The length of one session, from launch to exit.
- Daily usage time
The total usage time accumulated within one day.
- Launch counting rule
On iOS, the app going to the background ends one independent launch, so each return to the foreground starts a new one. On Android we treat two launches less than 30 seconds apart as a single launch: if a user leaves the app for under 30 seconds (for example to answer a call or reply to a text) and then comes back, the two uses are a continuation rather than independent sessions, so they count as one launch. 30 seconds is the common industry default, but the interval can be customized.
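The retention-rate arithmetic above maps directly onto HiveQL. The following is only a minimal sketch: the table user_first_launch (device id and first-launch date) is hypothetical and not part of this project, while dws_uv_detail_day is the daily active-device table built later in this article.
-- one-month retention of May's new users (user_first_launch is a hypothetical table)
select
    count(distinct act.mid_id) / count(distinct nu.mid_id) retention_1m
from
(
    select mid_id
    from user_first_launch
    where date_format(first_dt, 'yyyy-MM') = '2019-05'
) nu
left join
(
    select mid_id
    from dws_uv_detail_day
    where date_format(dt, 'yyyy-MM') = '2019-06'
) act
on nu.mid_id = act.mid_id;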
System Functions
The collect_set Function
1) Create the source table
hive (gmall)>
drop table if exists stud;
create table stud (name string, area string, course string, score int);
2) Insert data into the source table
hive (gmall)>
insert into table stud values('zhang3','bj','math',88);
insert into table stud values('li4','bj','math',99);
insert into table stud values('wang5','sh','chinese',92);
insert into table stud values('zhao6','sh','chinese',54);
insert into table stud values('tian7','bj','chinese',91);
3) Query the data in the table
hive (gmall)> select * from stud;
stud.name stud.area stud.course stud.score
zhang3 bj math 88
li4 bj math 99
wang5 sh chinese 92
zhao6 sh chinese 54
tian7 bj chinese 91
4) Aggregate rows from the same group into a set
hive (gmall)> select course, collect_set(area), avg(score) from stud group by course;
chinese ["sh","bj"] 79.0
math ["bj"] 93.5
5) An index can be used to take a single element from the set
hive (gmall)> select course, collect_set(area)[0], avg(score) from stud group by course;
chinese sh 79.0
math bj 93.5
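The DWS script later in this article pairs collect_set with concat_ws to flatten each group into a single delimited string. Applied to the stud table above, the pattern looks like this (the result lines are what the data inserted earlier should produce):
hive (gmall)> select course, concat_ws('|', collect_set(area)) from stud group by course;
chinese sh|bj
math bj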
Date Functions
1) date_format (format a date according to a pattern)
hive (gmall)> select date_format('2019-02-10','yyyy-MM');
2019-02
2) date_add (add or subtract days)
hive (gmall)> select date_add('2019-02-10',-1);
2019-02-09
hive (gmall)> select date_add('2019-02-10',1);
2019-02-11
3) next_day
(1) Get the next Monday after the given date
hive (gmall)> select next_day('2019-02-12','MO');
2019-02-18
Note: the days of the week in English are Monday, Tuesday, Wednesday, Thursday, Friday, Saturday and Sunday.
(2) Get the Monday of the current week
hive (gmall)> select date_add(next_day('2019-02-12','MO'),-7);
2019-02-11
4) last_day (last day of the given month)
hive (gmall)> select last_day('2019-02-10');
2019-02-28
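The DWS and ADS scripts below combine these functions to build partition keys: the Monday and Sunday of the week containing a given date, and the month string. Evaluated by hand for 2019-02-12, the exact expressions used later give:
hive (gmall)> select
              date_add(next_day('2019-02-12','MO'),-7) week_start,
              date_add(next_day('2019-02-12','MO'),-1) week_end,
              concat(date_add(next_day('2019-02-12','MO'),-7),'_',date_add(next_day('2019-02-12','MO'),-1)) wk_dt,
              date_format('2019-02-12','yyyy-MM') mn;
2019-02-11 2019-02-17 2019-02-11_2019-02-17 2019-02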
DWS-Layer Data Loading Script
1) Create the script in the /home/atguigu/bin directory on hadoop102
[atguigu@hadoop102 bin]$ vim dws_uv_log.sh
Write the following content into the script
#!/bin/bash
# Define variables so they are easy to change
APP=gmall
hive=/opt/module/hive/bin/hive
# If a date argument is given, use it; otherwise default to yesterday
if [ -n "$1" ] ;then
do_date=$1
else
do_date=`date -d "-1 day" +%F`
fi
sql="
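-- nonstrict dynamic-partition mode is needed because the weekly (wk_dt) and monthly (mn) inserts below take their partition values from the select list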
set hive.exec.dynamic.partition.mode=nonstrict;
insert overwrite table "$APP".dws_uv_detail_day partition(dt='$do_date')
select
mid_id,
concat_ws('|', collect_set(user_id)) user_id,
concat_ws('|', collect_set(version_code)) version_code,
concat_ws('|', collect_set(version_name)) version_name,
concat_ws('|', collect_set(lang)) lang,
concat_ws('|', collect_set(source)) source,
concat_ws('|', collect_set(os)) os,
concat_ws('|', collect_set(area)) area,
concat_ws('|', collect_set(model)) model,
concat_ws('|', collect_set(brand)) brand,
concat_ws('|', collect_set(sdk_version)) sdk_version,
concat_ws('|', collect_set(gmail)) gmail,
concat_ws('|', collect_set(height_width)) height_width,
concat_ws('|', collect_set(app_time)) app_time,
concat_ws('|', collect_set(network)) network,
concat_ws('|', collect_set(lng)) lng,
concat_ws('|', collect_set(lat)) lat
from "$APP".dwd_start_log
where dt='$do_date'
group by mid_id;
insert overwrite table "$APP".dws_uv_detail_wk partition(wk_dt)
select
mid_id,
concat_ws('|', collect_set(user_id)) user_id,
concat_ws('|', collect_set(version_code)) version_code,
concat_ws('|', collect_set(version_name)) version_name,
concat_ws('|', collect_set(lang)) lang,
concat_ws('|', collect_set(source)) source,
concat_ws('|', collect_set(os)) os,
concat_ws('|', collect_set(area)) area,
concat_ws('|', collect_set(model)) model,
concat_ws('|', collect_set(brand)) brand,
concat_ws('|', collect_set(sdk_version)) sdk_version,
concat_ws('|', collect_set(gmail)) gmail,
concat_ws('|', collect_set(height_width)) height_width,
concat_ws('|', collect_set(app_time)) app_time,
concat_ws('|', collect_set(network)) network,
concat_ws('|', collect_set(lng)) lng,
concat_ws('|', collect_set(lat)) lat,
date_add(next_day('$do_date','MO'),-7),
date_add(next_day('$do_date','MO'),-1),
concat(date_add( next_day('$do_date','MO'),-7), '_' , date_add(next_day('$do_date','MO'),-1)
)
from "$APP".dws_uv_detail_day
where dt>=date_add(next_day('$do_date','MO'),-7) and dt<=date_add(next_day('$do_date','MO'),-1)
group by mid_id;
insert overwrite table "$APP".dws_uv_detail_mn partition(mn)
select
mid_id,
concat_ws('|', collect_set(user_id)) user_id,
concat_ws('|', collect_set(version_code)) version_code,
concat_ws('|', collect_set(version_name)) version_name,
concat_ws('|', collect_set(lang)) lang,
concat_ws('|', collect_set(source)) source,
concat_ws('|', collect_set(os)) os,
concat_ws('|', collect_set(area)) area,
concat_ws('|', collect_set(model)) model,
concat_ws('|', collect_set(brand)) brand,
concat_ws('|', collect_set(sdk_version)) sdk_version,
concat_ws('|', collect_set(gmail)) gmail,
concat_ws('|', collect_set(height_width)) height_width,
concat_ws('|', collect_set(app_time)) app_time,
concat_ws('|', collect_set(network)) network,
concat_ws('|', collect_set(lng)) lng,
concat_ws('|', collect_set(lat)) lat,
date_format('$do_date','yyyy-MM')
from "$APP".dws_uv_detail_day
where date_format(dt,'yyyy-MM') = date_format('$do_date','yyyy-MM')
group by mid_id;
"
$hive -e "$sql"
2) Make the script executable
[atguigu@hadoop102 bin]$ chmod 777 dws_uv_log.sh
3) Run the script
[atguigu@hadoop102 module]$ dws_uv_log.sh 2019-02-11
4) Check the results
hive (gmall)> select count(*) from dws_uv_detail_day where dt='2019-02-11';
hive (gmall)> select count(*) from dws_uv_detail_wk;
hive (gmall)> select count(*) from dws_uv_detail_mn ;
5) Script schedule
In production this script typically runs every day between about 00:30 and 01:00.
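The weekly and monthly partitions can also be checked by key; with the date functions shown earlier, 2019-02-11 falls in the week partition 2019-02-11_2019-02-17 and the month partition 2019-02:
hive (gmall)> select count(*) from dws_uv_detail_wk where wk_dt='2019-02-11_2019-02-17';
hive (gmall)> select count(*) from dws_uv_detail_mn where mn='2019-02';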
ADS-Layer Data Loading Script
1) Create the script in the /home/atguigu/bin directory on hadoop102
[atguigu@hadoop102 bin]$ vi ads_uv_log.sh
Write the following content into the script
#!/bin/bash
# Define variables so they are easy to change
APP=gmall
hive=/opt/module/hive/bin/hive
# If a date argument is given, use it; otherwise default to yesterday
if [ -n "$1" ] ;then
do_date=$1
else
do_date=`date -d "-1 day" +%F`
fi
sql="
set hive.exec.dynamic.partition.mode=nonstrict;
insert into table "$APP".ads_uv_count
select
'$do_date' dt,
daycount.ct,
wkcount.ct,
mncount.ct,
if(date_add(next_day('$do_date','MO'),-1)='$do_date','Y','N') ,
if(last_day('$do_date')='$do_date','Y','N')
from
(
select
'$do_date' dt,
count(*) ct
from "$APP".dws_uv_detail_day
where dt='$do_date'
)daycount join
(
select
'$do_date' dt,
count (*) ct
from "$APP".dws_uv_detail_wk
where wk_dt=concat(date_add(next_day('$do_date','MO'),-7),'_' ,date_add(next_day('$do_date','MO'),-1) )
) wkcount on daycount.dt=wkcount.dt
join
(
select
'$do_date' dt,
count (*) ct
from "$APP".dws_uv_detail_mn
where mn=date_format('$do_date','yyyy-MM')
)mncount on daycount.dt=mncount.dt;
"
$hive -e "$sql"
2) Make the script executable
chmod 777 ads_uv_log.sh
3) Run the script
[atguigu@hadoop102 module]$ ads_uv_log.sh 2019-02-11
4) Script schedule
In production this script typically runs every day between about 00:30 and 01:00.
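After the run, the result can be checked in the same way as the DWS layer. Assuming the ads_uv_count table created earlier in the project, a query like the following should show the daily, weekly and monthly device counts for the imported date:
hive (gmall)> select * from ads_uv_count where dt='2019-02-11';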
Preparing New Data
- First stop the current cluster
cluster.sh stop
- Change the simulated date
dt.sh 2019-02-12
- Start the cluster
cluster.sh start
- Generate log data
lg.sh
- Load the HDFS data into the ODS layer
ods_log.sh 2019-02-12
- Load the ODS data into the DWD layer
[atguigu@hadoop102 ~]$ dwd_start_log.sh 2019-02-12
[atguigu@hadoop102 ~]$ dwd_base_log.sh 2019-02-12
[atguigu@hadoop102 ~]$ dwd_event_log.sh 2019-02-12
- Load the DWD data into the DWS layer
[atguigu@hadoop102 ~]$ dws_uv_log.sh 2019-02-12
- Verify
hive (gmall)> select * from dws_uv_detail_day where dt='2019-02-12' limit 2;
Installing Sqoop
- Download: http://mirrors.hust.edu.cn/apache/sqoop/1.4.6/
- Extract the Sqoop tarball to the target directory, e.g.:
tar -zxvf sqoop-1.4.6.bin__hadoop-2.0.4-alpha.tar.gz -C /opt/module/
- Rename the Sqoop directory
mv sqoop-1.4.6.bin__hadoop-2.0.4-alpha/ sqoop
- Enter Sqoop's conf directory and rename the env template
cd conf/
mv sqoop-env-template.sh sqoop-env.sh
- Configure the sqoop-env.sh file
vi sqoop-env.sh
Add the following configuration
export HADOOP_COMMON_HOME=/opt/module/hadoop-2.7.2
export HADOOP_MAPRED_HOME=/opt/module/hadoop-2.7.2
export HIVE_HOME=/opt/module/hive
export ZOOKEEPER_HOME=/opt/module/zookeeper-3.4.10
export ZOOCFGDIR=/opt/module/zookeeper-3.4.10
export HBASE_HOME=/opt/module/hbase
- Enter the following directory
cd /opt/software/mysql-libs/mysql-connector-java-5.1.27
- Copy the JDBC driver into Sqoop's lib directory, e.g.:
sudo cp mysql-connector-java-5.1.27-bin.jar /opt/module/sqoop/lib/
- Enter Sqoop's lib directory
cd /opt/module/sqoop/lib/
- Change the ownership of the jar
sudo chown atguigu:atguigu mysql-connector-java-5.1.27-bin.jar
- From the Sqoop directory, run the following command to verify Sqoop
bin/sqoop help
- Test whether Sqoop can connect to the database
bin/sqoop list-databases --connect jdbc:mysql://hadoop102:3306/ --username root --password root
Scheduled Sqoop Import Script (adjust the MySQL password as needed)
1) Create the script sqoop_import.sh in /home/atguigu/bin
vi sqoop_import.sh
Write the following content into the script
#!/bin/bash
db_date=$2
echo $db_date
db_name=gmall
import_data() {
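# Sqoop free-form (--query) imports require the literal $CONDITIONS placeholder in the
# WHERE clause for split generation, which is why " and $CONDITIONS;" is appended to the
# query passed in as $2 and why every query below ends with a WHERE clause (e.g. "where 1=1")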
/opt/module/sqoop/bin/sqoop import \
--connect jdbc:mysql://hadoop102:3306/$db_name \
--username root \
--password root \
--target-dir /origin_data/$db_name/db/$1/$db_date \
--delete-target-dir \
--num-mappers 1 \
--fields-terminated-by "\t" \
--query "$2"' and $CONDITIONS;'
}
import_sku_info(){
import_data "sku_info" "select
id, spu_id, price, sku_name, sku_desc, weight, tm_id,
category3_id, create_time
from sku_info where 1=1"
}
import_user_info(){
import_data "user_info" "select
id, name, birthday, gender, email, user_level,
create_time
from user_info where 1=1"
}
import_base_category1(){
import_data "base_category1" "select
id, name from base_category1 where 1=1"
}
import_base_category2(){
import_data "base_category2" "select
id, name, category1_id from base_category2 where 1=1"
}
import_base_category3(){
import_data "base_category3" "select id, name, category2_id from base_category3 where 1=1"
}
import_order_detail(){
import_data "order_detail" "select
od.id,
order_id,
user_id,
sku_id,
sku_name,
order_price,
sku_num,
o.create_time
from order_info o, order_detail od
where o.id=od.order_id
and DATE_FORMAT(create_time,'%Y-%m-%d')='$db_date'"
}
import_payment_info(){
import_data "payment_info" "select
id,
out_trade_no,
order_id,
user_id,
alipay_trade_no,
total_amount,
subject,
payment_type,
payment_time
from payment_info
where DATE_FORMAT(payment_time,'%Y-%m-%d')='$db_date'"
}
import_order_info(){
import_data "order_info" "select
id,
total_amount,
order_status,
user_id,
payment_way,
out_trade_no,
create_time,
operate_time
from order_info
where (DATE_FORMAT(create_time,'%Y-%m-%d')='$db_date' or DATE_FORMAT(operate_time,'%Y-%m-%d')='$db_date')"
}
case $1 in
"base_category1")
import_base_category1
;;
"base_category2")
import_base_category2
;;
"base_category3")
import_base_category3
;;
"order_info")
import_order_info
;;
"order_detail")
import_order_detail
;;
"sku_info")
import_sku_info
;;
"user_info")
import_user_info
;;
"payment_info")
import_payment_info
;;
"all")
import_base_category1
import_base_category2
import_base_category3
import_order_info
import_order_detail
import_sku_info
import_user_info
import_payment_info
;;
esac
2) Make the script executable
chmod 777 sqoop_import.sh
3) Run the script to import data (restart the cluster first)
[atguigu@hadoop102 bin]$ sqoop_import.sh all 2019-02-10
4) Generate data for 2019-02-11 in SQLyog
CALL init_data('2019-02-11',1000,200,300,TRUE);
5) Run the script to import the new data
[atguigu@hadoop102 bin]$ sqoop_import.sh all 2019-02-11
ODS-Layer Data Import Script
1) Create the script ods_db.sh in /home/atguigu/bin
vi ods_db.sh
Write the following content into the script
#!/bin/bash
APP=gmall
hive=/opt/module/hive/bin/hive
# If a date argument is given, use it; otherwise default to yesterday
if [ -n "$1" ] ;then
do_date=$1
else
do_date=`date -d "-1 day" +%F`
fi
sql="
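-- note: 'load data inpath' moves (rather than copies) the HDFS files into the partition directory, so re-running a date requires importing it again with sqoop_import.sh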
load data inpath '/origin_data/$APP/db/order_info/$do_date' OVERWRITE into table "$APP".ods_order_info partition(dt='$do_date');
load data inpath '/origin_data/$APP/db/order_detail/$do_date' OVERWRITE into table "$APP".ods_order_detail partition(dt='$do_date');
load data inpath '/origin_data/$APP/db/sku_info/$do_date' OVERWRITE into table "$APP".ods_sku_info partition(dt='$do_date');
load data inpath '/origin_data/$APP/db/user_info/$do_date' OVERWRITE into table "$APP".ods_user_info partition(dt='$do_date');
load data inpath '/origin_data/$APP/db/payment_info/$do_date' OVERWRITE into table "$APP".ods_payment_info partition(dt='$do_date');
load data inpath '/origin_data/$APP/db/base_category1/$do_date' OVERWRITE into table "$APP".ods_base_category1 partition(dt='$do_date');
load data inpath '/origin_data/$APP/db/base_category2/$do_date' OVERWRITE into table "$APP".ods_base_category2 partition(dt='$do_date');
load data inpath '/origin_data/$APP/db/base_category3/$do_date' OVERWRITE into table "$APP".ods_base_category3 partition(dt='$do_date');
"
$hive -e "$sql"
2) Make the script executable
[atguigu@hadoop102 bin]$ chmod 777 ods_db.sh
3) Run the script to import data
[atguigu@hadoop102 bin]$ ods_db.sh 2019-02-10
[atguigu@hadoop102 bin]$ ods_db.sh 2019-02-11
4) Query the imported data
hive (gmall)>
select * from ods_order_info where dt='2019-02-10' limit 1;
select * from ods_order_info where dt='2019-02-11' limit 1;
DWD-Layer Data Import Script
1) Create the script dwd_db.sh in /home/atguigu/bin
vi dwd_db.sh
Write the following content into the script
#!/bin/bash
# Define variables so they are easy to change
APP=gmall
hive=/opt/module/hive/bin/hive
# If a date argument is given, use it; otherwise default to yesterday
if [ -n "$1" ] ;then
do_date=$1
else
do_date=`date -d "-1 day" +%F`
fi
sql="
set hive.exec.dynamic.partition.mode=nonstrict;
insert overwrite table "$APP".dwd_order_info partition(dt)
select * from "$APP".ods_order_info
where dt='$do_date' and id is not null;
insert overwrite table "$APP".dwd_order_detail partition(dt)
select * from "$APP".ods_order_detail
where dt='$do_date' and id is not null;
insert overwrite table "$APP".dwd_user_info partition(dt)
select * from "$APP".ods_user_info
where dt='$do_date' and id is not null;
insert overwrite table "$APP".dwd_payment_info partition(dt)
select * from "$APP".ods_payment_info
where dt='$do_date' and id is not null;
insert overwrite table "$APP".dwd_sku_info partition(dt)
select
sku.id,
sku.spu_id,
sku.price,
sku.sku_name,
sku.sku_desc,
sku.weight,
sku.tm_id,
sku.category3_id,
c2.id category2_id,
c1.id category1_id,
c3.name category3_name,
c2.name category2_name,
c1.name category1_name,
sku.create_time,
sku.dt
from
"$APP".ods_sku_info sku
join "$APP".ods_base_category3 c3 on sku.category3_id=c3.id
join "$APP".ods_base_category2 c2 on c3.category2_id=c2.id
join "$APP".ods_base_category1 c1 on c2.category1_id=c1.id
where sku.dt='$do_date' and c2.dt='$do_date'
and c3.dt='$do_date' and c1.dt='$do_date'
and sku.id is not null;
"
$hive -e "$sql"
2) Make the script executable
[atguigu@hadoop102 bin]$ chmod 777 dwd_db.sh
3) Run the script to import data
[atguigu@hadoop102 bin]$ dwd_db.sh 2019-02-10
[atguigu@hadoop102 bin]$ dwd_db.sh 2019-02-11
4) Check the imported data
hive (gmall)>
select * from dwd_sku_info where dt='2019-02-10' limit 2;
select * from dwd_sku_info where dt='2019-02-11' limit 2;
User-Behavior Wide-Table Import Script
1) Create the script dws_db_wide.sh in /home/atguigu/bin
vi dws_db_wide.sh
Write the following content into the script
#!/bin/bash
# Define variables so they are easy to change
APP=gmall
hive=/opt/module/hive/bin/hive
# If a date argument is given, use it; otherwise default to yesterday
if [ -n "$1" ] ;then
do_date=$1
else
do_date=`date -d "-1 day" +%F`
fi
sql="
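-- the three CTEs below (orders, payments, comments) are combined with union all and summed per user, producing one wide row per user per day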
with
tmp_order as
(
select
user_id,
sum(oi.total_amount) order_amount,
count(*) order_count
from "$APP".dwd_order_info oi
where date_format(oi.create_time,'yyyy-MM-dd')='$do_date'
group by user_id
) ,
tmp_payment as
(
select
user_id,
sum(pi.total_amount) payment_amount,
count(*) payment_count
from "$APP".dwd_payment_info pi
where date_format(pi.payment_time,'yyyy-MM-dd')='$do_date'
group by user_id
),
tmp_comment as
(
select
user_id,
count(*) comment_count
from "$APP".dwd_comment_log c
where date_format(c.dt,'yyyy-MM-dd')='$do_date'
group by user_id
)
insert overwrite table "$APP".dws_user_action partition(dt='$do_date')
select
user_actions.user_id,
sum(user_actions.order_count),
sum(user_actions.order_amount),
sum(user_actions.payment_count),
sum(user_actions.payment_amount),
sum(user_actions.comment_count)
from
(
select
user_id,
order_count,
order_amount,
0 payment_count,
0 payment_amount,
0 comment_count
from tmp_order
union all
select
user_id,
0,
0,
payment_count,
payment_amount,
0
from tmp_payment
union all
select
user_id,
0,
0,
0,
0,
comment_count
from tmp_comment
) user_actions
group by user_id;
"
$hive -e "$sql"
2) Make the script executable
chmod 777 dws_db_wide.sh
3) Run the script to import data
dws_db_wide.sh 2019-02-11
4) Check the imported data
hive (gmall)>
select * from dws_user_action where dt='2019-02-11' limit 2;
ADS-Layer Data Import Script (GMV)
1) Create the script ads_db_gmv.sh in /home/atguigu/bin
vi ads_db_gmv.sh
Write the following content into the script
#!/bin/bash
# Define variables so they are easy to change
APP=gmall
hive=/opt/module/hive/bin/hive
# If a date argument is given, use it; otherwise default to yesterday
if [ -n "$1" ] ;then
do_date=$1
else
do_date=`date -d "-1 day" +%F`
fi
sql="
insert into table "$APP".ads_gmv_sum_day
select
'$do_date' dt,
sum(order_count) gmv_count,
sum(order_amount) gmv_amount,
sum(payment_amount) payment_amount
from "$APP".dws_user_action
where dt ='$do_date'
group by dt;
"
$hive -e "$sql"
2) Make the script executable
chmod 777 ads_db_gmv.sh
3) Run the script to import data
ads_db_gmv.sh 2019-02-11
4) Check the imported data
hive (gmall)>
select * from ads_gmv_sum_day where dt='2019-02-11' limit 2;
DWS-Layer Data Import Script (Sale Detail)
1) Create the script dws_sale.sh in /home/atguigu/bin
vi dws_sale.sh
Write the following content into the script
#!/bin/bash
# Define variables so they are easy to change
APP=gmall
hive=/opt/module/hive/bin/hive
# If a date argument is given, use it; otherwise default to yesterday
if [ -n "$1" ] ;then
do_date=$1
else
do_date=`date -d "-1 day" +%F`
fi
sql="
set hive.exec.dynamic.partition.mode=nonstrict;
with
tmp_detail as
(
select
user_id,
sku_id,
sum(sku_num) sku_num,
count(*) order_count,
sum(od.order_price*sku_num) order_amount
from "$APP".dwd_order_detail od
where od.dt='$do_date'
group by user_id, sku_id
)
insert overwrite table "$APP".dws_sale_detail_daycount partition(dt='$do_date')
select
tmp_detail.user_id,
tmp_detail.sku_id,
u.gender,
months_between('$do_date', u.birthday)/12 age,
u.user_level,
price,
sku_name,
tm_id,
category3_id,
category2_id,
category1_id,
category3_name,
category2_name,
category1_name,
spu_id,
tmp_detail.sku_num,
tmp_detail.order_count,
tmp_detail.order_amount
from tmp_detail
left join "$APP".dwd_user_info u
on tmp_detail.user_id=u.id and u.dt='$do_date'
left join "$APP".dwd_sku_info s on tmp_detail.sku_id =s.id and s.dt='$do_date';
"
$hive -e "$sql"
2) Make the script executable
chmod 777 dws_sale.sh
3) Run the script to import data
dws_sale.sh 2019-02-11
4) Check the imported data
hive (gmall)>
select * from dws_sale_detail_daycount where dt='2019-02-11' limit 2;
ADS-Layer Data Import Script (Sales by Brand and Category)
1) Create the script ads_sale.sh in /home/atguigu/bin
vi ads_sale.sh
Write the following content into the script
#!/bin/bash
# Define variables so they are easy to change
APP=gmall
hive=/opt/module/hive/bin/hive
# If a date argument is given, use it; otherwise default to yesterday
if [ -n "$1" ] ;then
do_date=$1
else
do_date=`date -d "-1 day" +%F`
fi
sql="
set hive.exec.dynamic.partition.mode=nonstrict;
insert into table "$APP".ads_sale_tm_category1_stat_mn
select
mn.sku_tm_id,
mn.sku_category1_id,
mn.sku_category1_name,
sum(if(mn.order_count>=1,1,0)) buycount,
sum(if(mn.order_count>=2,1,0)) buyTwiceLast,
sum(if(mn.order_count>=2,1,0))/sum( if(mn.order_count>=1,1,0)) buyTwiceLastRatio,
sum(if(mn.order_count>=3,1,0)) buy3timeLast,
sum(if(mn.order_count>=3,1,0))/sum( if(mn.order_count>=1,1,0)) buy3timeLastRatio ,
date_format('$do_date' ,'yyyy-MM') stat_mn,
'$do_date' stat_date
from
(
select
user_id,
od.sku_tm_id,
od.sku_category1_id,
od.sku_category1_name,
sum(order_count) order_count
from "$APP".dws_sale_detail_daycount od
where date_format(dt,'yyyy-MM')=date_format('$do_date' ,'yyyy-MM')
group by user_id, od.sku_tm_id, od.sku_category1_id, od.sku_category1_name
) mn
group by mn.sku_tm_id, mn.sku_category1_id, mn.sku_category1_name;
"
$hive -e "$sql"
2) Make the script executable
chmod 777 ads_sale.sh
3) Run the script to import data
ads_sale.sh 2019-02-11
4) Check the imported data
hive (gmall)>
select * from ads_sale_tm_category1_stat_mn limit 2;
Sqoop Export Script (adjust the MySQL password as needed)
1) Write the Sqoop export script
Create the script sqoop_export.sh in /home/atguigu/bin
vi sqoop_export.sh
Write the following content into the script
#!/bin/bash
db_name=gmall
export_data() {
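# allowinsert update mode: rows identified by the --update-key columns are updated and
# all other rows are inserted (see the notes on update vs. insert after this script)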
/opt/module/sqoop/bin/sqoop export \
--connect "jdbc:mysql://hadoop102:3306/${db_name}?useUnicode=true&characterEncoding=utf-8" \
--username root \
--password root \
--table $1 \
--num-mappers 1 \
--export-dir /warehouse/$db_name/ads/$1 \
--input-fields-terminated-by "\t" \
--update-mode allowinsert \
--update-key "tm_id,category1_id,stat_mn,stat_date" \
--input-null-string '\\N' \
--input-null-non-string '\\N'
}
case $1 in
"ads_uv_count")
export_data "ads_uv_count"
;;
"ads_user_action_convert_day")
export_data "ads_user_action_convert_day"
;;
"ads_gmv_sum_day")
export_data "ads_gmv_sum_day"
;;
"all")
export_data "ads_uv_count"
export_data "ads_user_action_convert_day"
export_data "ads_gmv_sum_day"
;;
esac
Notes on export: update or insert?
- --update-mode:
updateonly: only updates existing rows; new rows are not inserted.
allowinsert: updates existing rows and allows new rows to be inserted.
- --update-key: when updates are allowed, specifies which columns identify a record as the same row, so matching rows are updated instead of inserted. Separate multiple columns with commas.
- --input-null-string and --input-null-non-string: specify which string in the exported files should be interpreted as NULL, for string columns and non-string columns respectively (here '\\N'). See the user guide: http://sqoop.apache.org/docs/1.4.6/SqoopUserGuide.html
- Hive stores NULL on disk as the literal "\N", while in MySQL NULL is simply NULL. To keep the two sides consistent, use --input-null-string and --input-null-non-string when exporting, and --null-string and --null-non-string when importing.
2) Make the script executable and run it
[atguigu@hadoop102 bin]$ chmod 777 sqoop_export.sh
[atguigu@hadoop102 bin]$ sqoop_export.sh all
3) Check the results in MySQL
SELECT * FROM ads_uv_count;
SELECT * FROM ads_user_retention_day_rate;
SELECT * FROM ads_user_action_convert_day;
SELECT * FROM ads_gmv_sum_day;
SELECT * FROM ads_gmv_sum_province;
Results Display
localhost:8080/active