Background
In development we often face fixed report requirements that must be generated on a set schedule (daily, weekly, monthly).
To run Hive scripts on a schedule, you can use Azkaban when the orchestration is complex, or the more lightweight Linux crontab for simple cases.
This article uses crontab to configure the scheduled job.
Report requirements
1. Summarize last month's order sales details.
2. Generate the report as an Excel file.
3. The data volume may be large; if it exceeds a preset threshold (e.g., 500,000 rows), split the output by that threshold into N+1 files.
4. This works around Excel's file size limit (in Excel 2007 and later, rows beyond the sheet limit of roughly 1,048,576 are simply not shown).
5. The first row of every generated file must carry the complete Chinese column headers.
6. Package the generated report files into a zip archive.
7. Upload the zip archive to a designated HDFS directory.
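The split-by-threshold requirement boils down to a ceiling division over the row count. A quick shell sketch (the 500,000 threshold mirrors the EXCEL_OPEN_MAXNUM variable used later in the script; the row counts are made-up examples):

```shell
# Ceiling division: how many report files a given row count needs.
# THRESHOLD mirrors the script's EXCEL_OPEN_MAXNUM; row counts are examples.
THRESHOLD=500000
for rows in 499999 500000 500001 1040000; do
  files=$(( (rows + THRESHOLD - 1) / THRESHOLD ))
  echo "$rows rows -> $files file(s)"
done
```

Adding `THRESHOLD - 1` before dividing rounds up without needing floating point, and avoids over-counting when the row count is an exact multiple of the threshold.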
Pitfalls encountered during development
1. To make every file carry the fixed header row, a temporary header record is generated (below) and glued onto the real data with UNION ALL.
Note: to keep the Chinese header row always first, a virtual column seq_num is used. The header row is given seq_num 1, the data rows get seq_num 2, and a final sort on seq_num floats the header to the top.
select
'订单日期' as order_date,
'订单编号' as order_no,
'产品名称' as product_name,
'数量' as quantity,
'单位' as unit,
'商品图片' as images,
1 as seq_num
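The seq_num trick can be mimicked outside Hive with a stable sort, which makes it easy to see why the header always lands on top (toy rows, not the real report data):

```shell
# Toy rows: field 1 is seq_num (1 = header, 2 = data).
# A stable numeric sort on field 1 floats the header row to the top
# while keeping the data rows in their original relative order.
printf '2\t2021-03-01\tA001\n1\t订单日期\t订单编号\n2\t2021-03-02\tA002\n' \
  | sort -s -k1,1n \
  | cut -f2-
```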
2. The header row built before the UNION ALL consists entirely of strings, but the quantity and unit columns of dwd.order_info had been defined as decimal(15,2) (the rest are strings). After the UNION ALL and the seq_num sort, the '数量' and '单位' header cells therefore came out as NULL instead of the Chinese field names.
The fix is to cast those two columns to string inside the subquery under the UNION ALL:
cast(a.quantity as string) quantity,
cast(a.unit as string) unit,
3. Likewise, the images column of dwd.order_info is defined as an array type, which does not match the string header row, so the query fails with a type-mismatch error.
The fix is to convert the array to a string with the concat_ws(sep, string*) function:
concat_ws(',', images) images
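As a rough shell analogue of what concat_ws does to the array column, joining several values with a single separator (toy values, not real image paths):

```shell
# Join multiple values into one comma-separated string,
# much like concat_ws(',', images) does to the array column.
printf 'img1.jpg\nimg2.jpg\nimg3.jpg\n' | paste -sd, -
# -> img1.jpg,img2.jpg,img3.jpg
```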
Shell script listing
#!/bin/bash
source /etc/profile
#set -x
function _time(){
ctime=`date +%Y-%m-%dT%k:%M:%S`
echo "$ctime" | sed 's/[ ][ ]*/0/g'
}
echo ">>>>Script $0 start at : "`_time`
SCRIPT_PATH="$(cd $(dirname $0) && pwd)"
echo '>>>>Current dir is:'$SCRIPT_PATH
# Do clean jobs, delete file which generated last time.
##rm -f /var/lib/hadoop-hdfs/my_bigdata_report/REPORT_OUTPUT/*
rm -f $SCRIPT_PATH/REPORT_OUTPUT/*
# eg: 2021
YEAR=$1
# eg: 2021-03-01
ORDER_DATE_START=$2
# eg: 2021-03-31
ORDER_DATE_END=$3
#Give default value
if [ -z "$YEAR" ]; then
YEAR=`date -d "1 month ago" +%Y`
fi
# First day of last month
if [ -z "$ORDER_DATE_START" ]; then
ORDER_DATE_START="`date -d "1 month ago" +%Y-%m`-01"
fi
# Last day of last month
if [ -z "$ORDER_DATE_END" ]; then
ORDER_DATE_END=$(date -d"`date +%Y-%m-01` last day" +%Y-%m-%d)
fi
echo ">>>>YEAR="${YEAR}" ,ORDER_DATE_START="${ORDER_DATE_START}" ,ORDER_DATE_END="${ORDER_DATE_END}
TOP_PID=$$
echo '>>>>TOP_PID'=$TOP_PID
usage(){
echo -e " Usage : \n \t $0 [ usage ] | ( <year> <start_date> <end_date> )"
echo -e "\t ============= Param list : ==============="
echo -e "\t year : Year of this query."
echo -e "\t start_date : Start day of this query"
echo -e "\t end_date : End day of this query"
kill -s TERM $TOP_PID
}
#if [ $# != 3 ] || [ x"$1" = x"usage" ]; then
# usage
#fi
REPORT_OUTPUT_PATH=$SCRIPT_PATH/REPORT_OUTPUT
TODAY=$(date -d "0 days ago" +%Y%m%d)
REPORT_OUTPUT_FILE=report1_${TODAY}生成.xlsx
FILE_PREFIX=${REPORT_OUTPUT_FILE%.*}
FILE_EXTENSION=${REPORT_OUTPUT_FILE##*.}
EXCEL_OPEN_MAXNUM=500000
hive -e "
with tmp as (
select
'订单日期' as order_date,
'订单编号' as order_no,
'产品名称' as product_name,
'数量' as quantity,
'单位' as unit,
'商品图片' as images,
1 as seq_num
UNION ALL
select
a.order_date,
a.order_no,
b.product_name,
cast(a.quantity as string) quantity,
cast(a.unit as string) unit,
a.images,
2 as seq_num
from
(
SELECT
from_unixtime(
unix_timestamp(order_date, 'yyyyMMddHH'),
'yyyy/MM/dd'
) order_date,
order_no,
product_id,
product_name,
quantity,
unit,
concat_ws(',',images) images
FROM
dwd.order_info
where
year = '${YEAR}'
and month = month(add_months(current_date, -1))
and DATE_FORMAT(order_date, 'yyyy-MM-dd HH:mm:ss') >= '${ORDER_DATE_START} 00:00:00'
and DATE_FORMAT(order_date, 'yyyy-MM-dd HH:mm:ss') <= '${ORDER_DATE_END} 23:59:59'
and product_name is not null
and product_name != ''
and product_name != 'null'
) a
left join (
select
product_id,
product_name
from
ods.ods_product_db_product
where
is_out_of_stock = 0 -- sale status: 0 = on sale; -1 = not on sale
and trim(product_id) <> ''
) b on a.product_id = b.product_id
)
select
order_date,
order_no,
product_name,
quantity,
unit,
images
from
(
select
*
from
tmp
order by
seq_num
) tmp
" > $REPORT_OUTPUT_PATH/${REPORT_OUTPUT_FILE}
# Delete "WARN:...." statement generated by hive
#
# Statement like these:
# WARN: The method class org.apache.commons.logging.impl.SLF4JLogFactory#release() was invoked.
# WARN: Please see http://www.slf4j.org/codes.html#release for an explanation.
sed -i '/^WARN:.*/d' $REPORT_OUTPUT_PATH/${REPORT_OUTPUT_FILE}
line_num=`wc -l $REPORT_OUTPUT_PATH/${REPORT_OUTPUT_FILE} | awk '{print $1}'`
# How many excel files should split
file_num=$(( ($line_num + $EXCEL_OPEN_MAXNUM - 1) / $EXCEL_OPEN_MAXNUM ))
echo ">>>> Origin file has $line_num lines, each report file max record is set to : $EXCEL_OPEN_MAXNUM, it will split into $file_num files"
cd $REPORT_OUTPUT_PATH
#INDEX_HEADER="订单日期 订单编号 产品名称 数量 单位 商品图片"
INDEX_HEADER=`head -n 1 ./${REPORT_OUTPUT_FILE}`
echo ">>>>INDEX_HEADER=$INDEX_HEADER"
# Do split big file into small fils operation
split -l $EXCEL_OPEN_MAXNUM -d -a 2 ./${REPORT_OUTPUT_FILE} ${FILE_PREFIX}-
# Add standard header to each split file
i=0
echo i=============$i
#ls | grep -v $REPORT_OUTPUT_FILE | while read f;do sed -i "1i ${INDEX_HEADER}" $f;mv $f $f.$FILE_EXTENSION;done
ls | grep -v $REPORT_OUTPUT_FILE | while read f;do echo ">>>>Loop number is: $i, split file is: $f"; mv $f $f.$FILE_EXTENSION; if [ $i -ne 0 ];then sed -i "1i ${INDEX_HEADER}" $f.$FILE_EXTENSION; fi; ((i++)); done
#tar -czf ${FILE_PREFIX}.tar.gz ${FILE_PREFIX}-part-*
echo ">>>>Start to zip files."
zip -r ${FILE_PREFIX}.zip ${FILE_PREFIX}-*
HDFS_REPORT_PATH=/my_test/report_output/report1
echo ">>>>Start to upload result to hdfs "
hdfs dfs -put ${FILE_PREFIX}.zip $HDFS_REPORT_PATH
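The split-and-reheader portion of the script can be exercised in isolation on a toy file before pointing it at real Hive output. All file names below are made up for the demo, and GNU split/sed are assumed (as in the script itself):

```shell
#!/bin/bash
# Toy re-creation of the split + header step: 1 header line + 5 data rows,
# chunks of 2 lines standing in for EXCEL_OPEN_MAXNUM. GNU split/sed assumed.
set -e
workdir=$(mktemp -d)
cd "$workdir"
printf 'HEADER\nr1\nr2\nr3\nr4\nr5\n' > big.txt
header=$(head -n 1 big.txt)
split -l 2 -d -a 2 big.txt part-
# part-00 already starts with the header; prepend it to the later chunks,
# mirroring the "if [ $i -ne 0 ]" guard in the real script.
for f in part-*; do
  [ "$f" = "part-00" ] || sed -i "1i ${header}" "$f"
done
head -n 1 part-02   # every chunk now begins with HEADER
```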
Finally, configure the crontab schedule:
# Run every day of August at 11:30, once per day
30 11 * 8 * sh /var/lib/hadoop-hdfs/my_bigdata_report/report1_generate_script.sh >> /var/lib/hadoop-hdfs/my_bigdata_report/report1_generate_script.sh.out
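The previous-month boundaries that the script computes as defaults can be sanity-checked on their own. GNU date is assumed; "last day" is GNU date's relative-item syntax for minus one day. Note that "1 month ago" can misbehave when run near the end of a 31-day month, so the cron firing time matters:

```shell
# GNU date assumed. First day of last month:
start="$(date -d "1 month ago" +%Y-%m)-01"
# Last day of last month: "last day" backs up one day from the first
# of the current month (equivalent to "-1 day").
end=$(date -d "$(date +%Y-%m-01) last day" +%Y-%m-%d)
echo "$start .. $end"
```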