A Detailed Guide to Partitioned-Table Operations in Hive Development

涤生大数据

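This is an original article by the blogger and is the property of the blogger's wife; it may not be reposted without her permission.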

Article link: https://blog.csdn.net/qq_26442553/article/details/79478888

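In Hive development, data is usually stored in partitioned tables so that it can be queried faster and managed more easily. Within a Hive table, a partition shows up as an extra column. In the underlying storage system, such as HDFS, a partition is a directory: different partitions keep their data in different subdirectories under the table's root directory.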

1. Viewing a table's partitions and partition structure

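A table's partition columns are normally defined when the table is created. Before adding or dropping partitions, you first need to know whether the table is partitioned, what its partition columns are, and which partitions already exist. The following command lists all partitions of a table.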

hive (fdm_sor)> show partitions t_fin_demo;
OK
partition
statis_date=201901
statis_date=201902
Time taken: 0.086 seconds, Fetched: 2 row(s)

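Note: if no data has been added to the table, show partitions returns nothing, so it does not reveal the partition column names. To find the partition columns, use describe formatted table_name. In addition, show partitions FDM_SOR.mytest_deptaddr partition(statis_date='20180303') lists only the specified partition.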
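The two commands mentioned in the note can be run as shown in this minimal sketch. It reuses the table names from this article; the comments describe what to look for and are not captured output.

```sql
-- Show full table metadata; the partition columns appear under
-- the "# Partition Information" section of the output.
DESCRIBE FORMATTED t_fin_demo;

-- List only the partitions that match the given partition spec.
SHOW PARTITIONS FDM_SOR.mytest_deptaddr PARTITION (statis_date='20180303');
```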

2. Adding and dropping partitions

Syntax (parts in [] are optional):
ALTER TABLE table_name ADD [IF NOT EXISTS] partition_spec [ LOCATION 'location1' ] partition_spec [ LOCATION 'location2' ] ...
partition_spec:
: PARTITION (partition_col = partition_col_value, partition_col = partition_col_value, ...)
ALTER TABLE table_name DROP partition_spec, partition_spec, ...

Example: adding a partition to a table

hive (fdm_sor)> alter table mytest_tmp add partition(statis_date='20180304');
OK

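1. Partitions can be added or dropped only if the partition column (statis_date here) was defined when the table was created. Without it, both adding and dropping a partition fail with an error.

2. When adding a partition to an external table, specify LOCATION, that is, the directory where the partition's data is stored.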

alter table abc add partition(year='2018',month='03') location 'hdfs://user/data/2018/03'

hive (fdm_sor)> show partitions mytest_tmp;
OK
partition
statis_date=20180304

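When dropping a partition, note that dropping one that does not exist does not raise an error, as the example below shows. Also, for a managed (internal) table, dropping a partition deletes that partition's data as well. For an external table, dropping a partition removes only the metadata and leaves the data in place.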

hive (fdm_sor)> alter table mytest_tmp drop  partition(statis_date='2018030005');
OK
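The optional clauses in the syntax above can be combined as in the following sketch. The partition values 20180305 and 20180306 are made up for illustration; the table is the mytest_tmp table used earlier.

```sql
-- Add two partitions in one statement; IF NOT EXISTS makes the
-- statement safe to re-run if a partition is already there.
ALTER TABLE mytest_tmp ADD IF NOT EXISTS
  PARTITION (statis_date='20180305')
  PARTITION (statis_date='20180306');

-- Drop both partitions again; IF EXISTS states explicitly that a
-- missing partition is acceptable.
ALTER TABLE mytest_tmp DROP IF EXISTS
  PARTITION (statis_date='20180305'),
  PARTITION (statis_date='20180306');
```

Note that ADD separates partition specs with whitespace, while DROP separates them with commas.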

3. Defining partition columns when creating a table

CREATE  TABLE `FDM_SOR.mytest_deptaddr`(  
      `dept_no` int,   
      `addr` string,   
      `tel` string)
 partitioned by(statis_date string,location string  ) 
 ROW FORMAT DELIMITED FIELDS TERMINATED BY ','

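The CREATE TABLE statement above defines two partition columns. On HDFS this appears as a location subdirectory nested inside each statis_date directory: multiple partition columns correspond to multiple directory levels.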
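The nesting can be made concrete with a small sketch. The partition values ('20180303', 'beijing') are hypothetical, and the directory shown in the comment assumes a managed table under the default warehouse root /user/hive/warehouse; the actual root depends on the cluster configuration.

```sql
-- Add one partition, giving a value for each of the two partition columns.
-- The column name is backquoted because LOCATION is also a Hive keyword.
ALTER TABLE FDM_SOR.mytest_deptaddr
  ADD IF NOT EXISTS PARTITION (statis_date='20180303', `location`='beijing');

-- Assumed directory for this partition (default warehouse root):
--   /user/hive/warehouse/fdm_sor.db/mytest_deptaddr/statis_date=20180303/location=beijing

-- Each row of the listing has the form statis_date=<value>/location=<value>.
SHOW PARTITIONS FDM_SOR.mytest_deptaddr;
```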

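4. Protecting partitions from being dropped or queried

Hive can protect a partitioned table's partitions from being dropped or queried, which is mainly useful for tables holding sensitive data. These protection modes (NO_DROP and OFFLINE) exist only in older Hive releases; they were removed in Hive 2.0.0.

1. Protect a partition from being dropped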
alter table t_fin_demo
partition(statis_date ='201902') ENABLE NO_DROP;

Test result: the drop fails.

hive (fdm_sor)>alter table t_fin_demo
              > partition(statis_date ='201902') ENABLE NO_DROP;
OK
Time taken: 0.267 seconds
hive (fdm_sor)> alter table t_fin_demo drop partition(statis_date ='201902')
              > ;
FAILED: SemanticException [Error 30011]: Partition protected from being dropped fdm_sor@t_fin_demo@statis_date=201902
hive (fdm_sor)> 

2. Protect a partition's data from being queried
alter table t_fin_demo
partition(statis_date ='201902') enable offline

Test result:
hive (fdm_sor)> alter table t_fin_demo                        
              > partition(statis_date ='201903') enable offline
              > ;
OK
Time taken: 0.185 seconds
hive (fdm_sor)>  select * from t_fin_demo where statis_date ='201903' limit 5;
FAILED: SemanticException [Error 10113]: Query against an offline table or partition Table t_fin_demo Partition statis_date=201903


3. Note: to remove either protection, replace enable with disable in the statement and run the alter statement again.
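Applied to the two partitions protected in the tests above, the undo statements would look like the sketch below. Like the enable forms, they are only accepted by Hive versions that still support protection modes.

```sql
-- Allow the 201902 partition to be dropped again.
ALTER TABLE t_fin_demo PARTITION (statis_date='201902') DISABLE NO_DROP;

-- Make the 201903 partition queryable again.
ALTER TABLE t_fin_demo PARTITION (statis_date='201903') DISABLE OFFLINE;
```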

5. Things to watch out for when working with partitioned tables

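1. When querying a partitioned table, you normally add a partition condition to the WHERE clause to say which partition to read. In strict mode (set hive.mapred.mode=strict) a query without one fails with an error; whether strict is the default depends on the Hive version. With set hive.mapred.mode=nonstrict you can query a partitioned table without a partition condition, in which case the data of every partition in the table is returned. This is rarely done in practice because it is inefficient.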
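The difference between the two modes can be sketched as follows, using the t_fin_demo table and its 201901 partition from section 1. The statement that strict mode rejects is left commented out.

```sql
SET hive.mapred.mode=strict;

-- Rejected in strict mode because there is no partition predicate:
-- SELECT * FROM t_fin_demo LIMIT 5;

-- Accepted: the WHERE clause restricts the scan to one partition.
SELECT * FROM t_fin_demo WHERE statis_date='201901' LIMIT 5;

SET hive.mapred.mode=nonstrict;

-- Accepted in nonstrict mode, but reads across all partitions.
SELECT * FROM t_fin_demo LIMIT 5;
```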

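2. When loading data into a partitioned table or exporting data from it, specify the partition. Dynamic partitioning is also an option; see later posts in this blog for details.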

 load data local inpath '/home/robot/111.txt'
 overwrite into table FDM_SOR.mytest_deptaddr partition(statis_date='20180228')
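As a rough sketch of the dynamic-partition alternative mentioned above: Hive takes the partition values from the trailing columns of the SELECT instead of from literals in the PARTITION clause. The staging table FDM_SOR.mytest_deptaddr_stg is hypothetical and is assumed to contain the three data columns plus statis_date and location as ordinary columns.

```sql
-- Enable dynamic partitioning; nonstrict mode allows every partition
-- column to be dynamic (no static value required).
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- The last two SELECT columns feed the two partition columns, in order.
INSERT OVERWRITE TABLE FDM_SOR.mytest_deptaddr PARTITION (statis_date, `location`)
SELECT dept_no, addr, tel, statis_date, `location`
FROM FDM_SOR.mytest_deptaddr_stg;
```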

3. Note: to read the data of all partitions, besides setting nonstrict as described above, you can also use is not null on the partition column.

select * from t_fin_demo where statis_date is not null -- statis_date is the partition column
