需求
1、在会员分析中计算最近七天连续三天活跃会员数。
2、项目的数据采集过程中,有哪些地方能够优化,如何实现?
解题思路
求最近七天连续三天活跃的会员数
- 从dws.dws_member_start_day 这张每日会员登录表中已经知道了,每日登录的会员
- 从求出的时间往前推7天确认数据范围
- 使用排序函数 row_number,按照device_id分组,dt排序,用dt减去row——number得到分组标识字段gid
- 再根据gid和device_id进行分组,用连续三天这个条件进行过滤
hue测试代码使用2020-07-28的数据
with tmp as (
select
device_id, dt,
date_sub(dt, row_number() over (partition by device_id order by dt)) gid
from
dws.dws_member_start_day
where dt between date_sub("2020-07-28",7) and "2020-07-28"
)
select device_id,count(*) from tmp group by device_id, gid having count(*)>=2;
建表,并编写脚本
drop table if exists dws.dws_member_start_continue_three_day;
create table dws.dws_member_start_continue_three_day(
`device_id` string,
`uid` string,
`app_v` string,
`os_type` string,
`language` string,
`channel` string,
`area` string,
`brand` string,
`date` string,
`count` int
) COMMENT '最近七天连续三天登录会员表'
PARTITIONED BY (`dt` string)
stored as parquet
加载数据脚本
#!/bin/bash
source /etc/profile
# 可以输入日期;如果未输入日期取昨天的时间
if [ -n "$1" ]
then
do_date=$1
else
do_date=`date -d "-1 day" +%F`
fi
# 定义要执行的SQL
sql="
with tmp as (
)
insert overwrite table dws.dws_member_start_continue_three_day
partition(dt='$do_date')
select device_id,
concat_ws('|', collect_set(uid)),
concat_ws('|', collect_set(app_v)),
concat_ws('|', collect_set(os_type)),
concat_ws('|', collect_set(language)),
concat_ws('|', collect_set(channel)),
concat_ws('|', collect_set(area)),
concat_ws('|', collect_set(brand)),
'$do_date',
count(*)
from (
select
device_id, dt,uid,app_v,os_type,language,channel,area,brand,
date_sub(dt, row_number() over (partition by device_id order by dt)) gid
from
dws.dws_member_start_day
where dt between date_sub('$do_date',7) and '$do_date'
) tmp
where dt='$do_date'
group by device_id, gid having count(*)>=3;
"
hive -e "$sql"
项目的数据采集过程中,有哪些地方能够优化,如何实现?
JAVA内存的设计
主要通过修改 conf/flume-env.sh文件实现
主要设计Xmx和Xms两个参数,可以根据OS内存的大小进行合理设置,在flume-env.sh文件中增加参数
export JAVA_OPTS="-Xms4000m -Xmx4000m -Dcom.sun.management.jmxremote"