1. 静态分区
若分区的值是确定的,那么称为静态分区。下面举例一个业务场景:
统计程序每天都需要统计1号店中由精准化推荐带来的销售额,日期是确定的,现在需要将每天统计好的销售额数据插入到指定的日期分区中
1.1. 实现方式
参见如下的脚本:
hive -e "
--设置Hive Job任务队列
set mapred.job.queue.name=pms;
--创建精准化销售额数据汇总表
create table if not exists pms.rpt_rcmd_gmv
(
page_name string,
section_name string,
order_count bigint,
order_amount double
)
PARTITIONED BY (ds string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;"
#----------------------------------------------
# 获取当天日期
date=`date +"%Y-%m-%d"`
hive -e "
--设置Hive Job任务队列
set mapred.job.queue.name=pms;
--将数据插入到指定的日期分区中
insert overwrite table pms.rpt_rcmd_gmv partition(ds='$date')
select distinct
page_name,
section_name,
order_count,
order_amount
from xxx;"
2. 动态分区
顾名思义,分区的值是非确定的,由输入数据来确定。下面举例一个业务场景:
一个二级类目对应着多个产品,现在想要按二级类目进行分区,问题在于输入数据中包含不止一个二级类目,所以分区的值是非确定的,需要根据输入数据来确定。
2.1. 实现方式
- 动态分区的字段,需要写在
select
语句中所有字段的最后 - hive需要设置
set hive.exec.dynamic.partition=true;
(默认值是false
,表示是否开启动态分区) - [可选]hive需要设置
set hive.exec.dynamic.partition.mode=nonstrict;
(默认是strict
模式,表示至少需要指定一个静态分区;nonstrict
模式表示不需要指定静态分区)
由于上述举例中的业务场景只需要动态设置二级类目分区,所以开启了以下模式:
set hive.exec.dynamic.partition.mode=nonstrict;
完整的脚本如下:
--建立hive分区表,以二级类目id作为分区,字段之间以\t分隔,行之间以\n分隔
drop table if exists pms.dynamic_groupon;
create table if not exists pms.dynamic_groupon
(
product_id bigint,
area_id string,
type int,
original_price double,
price double,
start_time string,
end_time string,
groupon_num int,
ds string
)
PARTITIONED BY (category_lvl2_id bigint)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;
--根据query的结果,动态将数据插入到对应的二级类目分区中
set hive.exec.reducers.max=32;
set mapred.reduce.tasks=32;
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
set mapred.job.name=[HQL]dynamic_groupon;
insert overwrite table pms.dynamic_groupon partition(category_lvl2_id)
select
cast(a.productid as bigint) as product_id,
a.areaid as area_id,
cast(a.type as int) as type,
cast(a.originalprice as double) as original_price,
cast(a.price as double) as price,
a.start_time,
a.end_time,
cast(a.grouponnum as int) as groupon_num,
a.ds,
c.categ_lvl2_id as category_lvl2_id
from pms.pms_all_groupon_items a
left outer join product b on (cast(a.productid as bigint) = b.id)
left outer join (
select distinct
categ_lvl_id,
categ_lvl2_id
from dw.hier_categ_by_orig
) c on (b.category_id = c.categ_lvl_id)
where a.ds = '2015-10-18';
转载自:https://blog.csdn.net/yeweiouyang/article/details/52560078