跟我一起学【Hive】之——静态分区、动态分区

最新推荐文章于 2024-05-18 09:00:00 发布

置顶 SunWuKong_Hadoop

最新推荐文章于 2024-05-18 09:00:00 发布

阅读量508

点赞数 1

分类专栏： hive spark

本文链接：https://blog.csdn.net/SunWuKong_Hadoop/article/details/85328771

版权

hive 同时被 2 个专栏收录

36 篇文章 0 订阅

订阅专栏

spark

30 篇文章 1 订阅

订阅专栏

虽然之前已经用过很多次hive的分区表，但是还是找时间快速回顾总结一下加深理解.

举个栗子，基本需求就是Hive有一张非常详细的原子数据表original_device_open，而且还在不断随着时间增长，那么我需要给它进行分区，为什么要分区?因为我想缩小查询范围，提高速度和性能.

分区其实是物理上对hdfs不同目录进行数据的load操作，0.7之后的版本都会自动创建不存在的hdfs的目录，不同的目录对应不同的分区字段，当然会有一个处于最顶层的主分区字段.

我这里的分区字段主要是时间，分为年，月（ps:根据自身业务逻辑去设置动态分区的维度）

建表语句(指定动态分区列,源数据时间格式为yyyy-mm-dd hh:mm:ss，此处设定为年、月为动态分区)：
create table lqioc_ioc_ods.cltxjllq_modify2 (
xh string comment '序号',
lastid string comment '记录表的自增序列id',
communityid string comment '社区id',
communityname string comment '社区名称',
communityaddress string comment '社区地址',
communitylongitude string comment '社区经度坐标',
communitylatitude string comment '社区纬度坐标',
lotname string comment '当前车库名称',
gatename string comment '当前车库门口名称',
gatetype string comment '门口出入类型',
card string comment '车牌号',
bz string comment '备注',
createtime string comment '创建时间',
updatetime string comment '修改时间',
seq bigint comment '唯一索引',
updatestatus bigint comment '更新状态',
sys_key bigint comment '系统主键') 
comment '车辆通行记录拉取' 
partitioned by (inouttime_year string,inouttime_month string) 
row format delimited fields terminated by '' tblproperties('creator'='sunruzi','create_at'='2018-12-29 15:56:22');

打开动态分区功能
设置为nonstrict能够让所有的分区都动态被设定，否则的话至少需要指定一个分区值
能够被每个mapper或者reducer创建的最大动态分区的数目，如果一个mapper或者reducer试图创建多余这个值的动态分区数目，会报错
被一条带有动态分区的SQL语句所能创建的动态分区总量，如果超出限制会报错
全局能够被创建文件数目的最大值，专门有一个hadoop计数器来跟踪该值，如果超出会报错
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.max.dynamic.partitions.pernode=600000;
set hive.exec.max.dynamic.partitions=6000000;
set hive.exec.max.created.files=6000000;

设定动态分区、插入数据一起执行
(否则会报错：Dynamic partition strict mode requires at least one static partition column. 
To turn this off set hive.exec.dynamic.partition.mode=nonstrict (state=,code=0))

set hive.exec.dynamic.partition.mode=nonstrict;
INSERT OVERWRITE TABLE lqioc_ioc_ods.CLTXJLLQ_modify2 PARTITION(inouttime_year,inouttime_month)
select 
XH,
LASTID,
COMMUNITYID,
COMMUNITYNAME,
COMMUNITYADDRESS,
COMMUNITYLONGITUDE,
COMMUNITYLATITUDE,
LOTNAME,
GATENAME,
GATETYPE,
CARD,
BZ,
CREATETIME,
UPDATETIME,
SEQ,
UPDATESTATUS,
SYS_KEY,
substr(INOUTTIME,1,4) as inouttime_year,
substr(INOUTTIME,6,2) as inouttime_month from lqioc_ioc_ods.CLTXJLLQ;

查看分区信息
show partitions lqioc_ioc_ods.CLTXJLLQ_modify2;
查看分区数据是否导入成功
select * from lqioc_ioc_ods.CLTXJLLQ_modify2 where inouttime_year=substr(from_unixtime(unix_timestamp()),1,4) 
and inouttime_month=substr(from_unixtime(unix_timestamp()),6,2);

原始表测试
SELECT SUBSTR(M.INOUTTIME,1,4) AS YEAR_ID,SUBSTR(M.INOUTTIME,6,2) AS MONTH_ID,
'鹿泉' AS AREA_NAME,
'130185000000' AS AREA_NO,
M.GATETYPE AS ORDER_TYPE,
'Z64' AS ORDER_TYPE_NO,
COMMUNITYNAME AS ORDER_ITEM,
COUNT(GATETYPE) AS ORDER_ITEM_NUM,
'' AS ORDER_TYPE_COST
FROM LQIOC_IOC_ODS.CLTXJLLQ M 
WHERE 
SUBSTR(M.INOUTTIME,1,4)=SUBSTR(FROM_UNIXTIME(UNIX_TIMESTAMP()),1,4)
AND 
SUBSTR(M.INOUTTIME,6,2)=SUBSTR(FROM_UNIXTIME(UNIX_TIMESTAMP()),6,2)
AND
M.GATETYPE=0
GROUP BY COMMUNITYNAME,SUBSTR(M.INOUTTIME,1,4),SUBSTR(M.INOUTTIME,6,2),GATETYPE;

分区表测试
select inouttime_year as year_id,inouttime_month as month_id,
'鹿泉' as area_name,
'130185000000' as area_no,
m.gatetype as order_type,
'z64' as order_type_no,
communityname as order_item,
count(gatetype) as order_item_num,
'' as order_type_cost
from lqioc_ioc_ods.CLTXJLLQ_modify2 m 
where 
inouttime_year=substr(from_unixtime(unix_timestamp()),1,4) 
and inouttime_month=substr(from_unixtime(unix_timestamp()),6,2)
and
m.gatetype=0
group by communityname,inouttime_year,inouttime_month,gatetype;

插入数据
INSERT INTO LQIOC_IOC_DW.PASQ_MODIFY2 
select inouttime_year as year_id,inouttime_month as month_id,
'鹿泉' as area_name,
'130185000000' as area_no,
m.gatetype as order_type,
'z64' as order_type_no,
communityname as order_item,
count(gatetype) as order_item_num,
'' as order_type_cost
from lqioc_ioc_ods.CLTXJLLQ_modify2 m 
where 
inouttime_year=substr(from_unixtime(unix_timestamp()),1,4) 
and inouttime_month=substr(from_unixtime(unix_timestamp()),6,2)
and
m.gatetype=0
group by communityname,inouttime_year,inouttime_month,gatetype;

测试结果：

单独select语句，提高效率136.4%

结合insert语句，提高效率32.7%

分区注意细节
　　(1)、尽量不要是用动态分区，因为动态分区的时候，将会为每一个分区分配reducer数量，当分区数量多的时候，reducer数量将会增加，对服务器是一种灾难。
　　(2)、动态分区和静态分区的区别，静态分区不管有没有数据都将会创建该分区，动态分区是有结果集将创建，否则不创建。
　　(3)、hive动态分区的严格模式和hive提供的hive.mapred.mode的严格模式。
　　hive提供我们一个严格模式：为了阻止用户不小心提交恶意hql
　　hive.mapred.mode=nostrict : strict
　　如果该模式值为strict，将会阻止以下三种查询：
　　(1)、对分区表查询，where中过滤字段不是分区字段。
　　(2)、笛卡尔积join查询，join查询语句，不带on条件或者 where条件。
　　 (3)、对order by查询，有order by的查询不带limit语句。

使用动态分区可以非常智能的加载表，而在动静结合使用时需要注意静态分区值必须在动态分区值的前面。所以建议使用动态分区，且insert进行分区的使用建议使用OVERWRITE，避免数据重复。

感谢：https://blog.csdn.net/qq_26369213/article/details/78998278

感谢：http://blog.sina.com.cn/s/blog_e98080dd0102x9rv.html

SunWuKong_Hadoop

关注

1
点赞
踩
0

收藏

觉得还不错? 一键收藏
0
评论
跟我一起学【Hive】之——静态分区、动态分区

虽然之前已经用过很多次hive的分区表，但是还是找时间快速回顾总结一下加深理解.举个栗子，基本需求就是Hive有一张非常详细的原子数据表original_device_open，而且还在不断随着时间增长，那么我需要给它进行分区，为什么要分区?因为我想缩小查询范围，提高速度和性能.分区其实是物理上对hdfs不同目录进行数据的load操作，0.7之后的版本都会自动创建不存在的hdfs的目录，不...
复制链接

扫一扫

专栏目录