[Hive]Hive静态分区与动态分区

版权声明:本文为博主原创文章,未经博主允许不得转载。 https://blog.csdn.net/yeweiouyang/article/details/52560078

1. 静态分区

若分区的值是确定的,那么称为静态分区。下面举例一个业务场景:

统计程序每天都需要统计1号店中由精准化推荐带来的销售额,日期是确定的,现在需要将每天统计好的销售额数据插入到指定的日期分区中

1.1. 实现方式

参见如下的脚本:

hive -e "
--设置Hive Job任务队列
set mapred.job.queue.name=pms;

--创建精准化销售额数据汇总表
create table if not exists pms.rpt_rcmd_gmv 
(
    page_name       string,
    section_name    string,
    order_count     bigint,
    order_amount    double
) 
PARTITIONED BY (ds string)
ROW FORMAT DELIMITED 
FIELDS TERMINATED BY '\t' 
LINES TERMINATED BY '\n' 
STORED AS TEXTFILE;"

#----------------------------------------------

# 获取当天日期
date=`date +"%Y-%m-%d"`

hive -e "
--设置Hive Job任务队列
set mapred.job.queue.name=pms;

--将数据插入到指定的日期分区中
insert overwrite table pms.rpt_rcmd_gmv partition(ds='$date') 
select distinct 
    page_name,
    section_name,
    order_count,
    order_amount
from xxx;"

2. 动态分区

顾名思义,分区的值是非确定的,由输入数据来确定。下面举例一个业务场景:

一个二级类目对应着多个产品,现在想要按二级类目进行分区,问题在于输入数据中包含不止一个二级类目,所以分区的值是非确定的,需要根据输入数据来确定。

2.1. 实现方式

  1. 动态分区的字段,需要写在select语句中所有字段的最后
  2. hive需要设置set hive.exec.dynamic.partition=true;(默认值是false,表示是否开启动态分区)
  3. [可选]hive需要设置set hive.exec.dynamic.partition.mode=nonstrict;(默认是strict模式,表示至少需要指定一个静态分区;nonstrict模式表示不需要指定静态分区)

由于上述举例中的业务场景只需要动态设置二级类目分区,所以开启了以下模式:

set hive.exec.dynamic.partition.mode=nonstrict;

完整的脚本如下:

--建立hive分区表,以二级类目id作为分区,字段之间以\t分隔,行之间以\n分隔
drop table if exists pms.dynamic_groupon;
create table if not exists pms.dynamic_groupon 
(
    product_id          bigint,
    area_id             string,
    type                int,
    original_price      double,
    price               double,
    start_time          string,
    end_time            string,
    groupon_num         int,
    ds                  string
) 
PARTITIONED BY (category_lvl2_id bigint) 
ROW FORMAT DELIMITED 
FIELDS TERMINATED BY '\t' 
LINES TERMINATED BY '\n' 
STORED AS TEXTFILE;

--根据query的结果,动态将数据插入到对应的二级类目分区中
set hive.exec.reducers.max=32;
set mapred.reduce.tasks=32;
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
set mapred.job.name=[HQL]dynamic_groupon;

insert overwrite table pms.dynamic_groupon partition(category_lvl2_id)
select 
    cast(a.productid as bigint) as product_id,
    a.areaid as area_id, 
    cast(a.type as int) as type,
    cast(a.originalprice as double) as original_price,
    cast(a.price as double) as price,
    a.start_time,
    a.end_time,
    cast(a.grouponnum as int) as groupon_num,
    a.ds,
    c.categ_lvl2_id as category_lvl2_id 
from pms.pms_all_groupon_items a 
left outer join product b on (cast(a.productid as bigint) = b.id)
left outer join (
    select distinct 
        categ_lvl_id,
        categ_lvl2_id 
    from dw.hier_categ_by_orig  
) c on (b.category_id = c.categ_lvl_id) 
where a.ds = '2015-10-18';

动态分区可以参考Hive Language Manual

阅读更多
想对作者说点什么? 我来说一句

没有更多推荐了,返回首页