hive分区

官网链接地址:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-DropPartitions

partition

个人理解: 分区是真实存在的,对应的是hdfs上的路径,例:

create external table tb_name (
id string comment 'id'
)
partitioned by (prov_id string, day_id string)

对应到hdfs上的路径就是 'hdfs://localhost:9083/tmp_db/tb_name/prov_id=123/day_id=20190530'
优点:便于文件管理,例如有需求‘在tb_name 表中查找查找20190604数据的数据量‘,
	 若没有分区,则需要全表扫描,再进行sum,无疑是增加了运算时间。增加分区字段后,
	 直接映射到hdfs路径下。
	 
	 但粒度不能过细,如有需求分析每天的数据,粒度分到day即可,也可以到hour。




1 分区的增删改查

增:分区数与建表的分区数量一致,顺序可以不同(按字段名进行匹配),不能增加分区字段,增加的是列。
alter table tb_name add partition (prov_id='bj', day_id='20190530')

删:external 表删除之前,先删除分区,再删除表。新建表之后修复分区,msck repair table tb_name ,数据还存在。分区删除的时候会放在.Trash/Current,但是元数据会彻底丢失。(You can use ALTER TABLE DROP PARTITION to drop a partition for a table. This removes the data and metadata for this partition. The data is actually moved to the .Trash/Current directory if Trash is configured, unless PURGE is specified, but the metadata is completely lost (see Drop Table above).)

alter table tb_name drop [if exisits] partition (prov_id='bj', day_id='20190530')

改:
ALTER TABLE table_name PARTITION partition_spec RENAME TO PARTITION partition_spec;

查:
select * from tb_name where day_id='20190530' and prov_id='bj'

2修复分区

如果直接在hdfs种删除或者增加分区,则需要修复,维护的是元数据信息。
MSCK [REPAIR] TABLE table_name [ADD/DROP/SYNC PARTITIONS];

官档:
Hive stores a list of partitions for each table in its metastore. If, however, new partitions are directly added to HDFS (say by using hadoop fs -put command) or removed from HDFS, the metastore (and hence Hive) will not be aware of these changes to partition information unless the user runs ALTER TABLE table_name ADD/DROP PARTITION commands on each of the newly added or removed partitions, respectively.
However, users can run a metastore check command with the repair table option:

which will update metadata about partitions to the Hive metastore for partitions for which such metadata doesn't already exist. The default option for MSC command is ADD PARTITIONS. With this option, it will add any partitions that exist on HDFS but not in metastore to the metastore. The DROP PARTITIONS option will remove the partition information from metastore, that is already removed from HDFS. The SYNC PARTITIONS option is equivalent to calling both ADD and DROP PARTITIONS. See HIVE-874 and HIVE-17824 for more details. When there is a large number of untracked partitions, there is a provision to run MSCK REPAIR TABLE batch wise to avoid OOME (Out of Memory Error). By giving the configured batch size for the property hive.msck.repair.batch.size it can run in the batches internally. The default value of the property is zero, it means it will execute all the partitions at once. MSCK command without the REPAIR option can be used to find details about metadata mismatch metastore.
The equivalent command on Amazon Elastic MapReduce (EMR)'s version of Hive is:
ALTER TABLE table_name RECOVER PARTITIONS;
Starting with Hive 1.3, MSCK will throw exceptions if directories with disallowed characters in partition values are found on HDFS. Use hive.msck.path.validation setting on the client to alter this behavior; "skip" will simply skip the directories. "ignore" will try to create partitions anyway (old behavior). This may or may not work.

3动态分区

作用:插入数据时,一列作为partition的值。
tips : 插入之前,set hive.exec.dunamic.partition.mode=nostrict;

作用:插入数据时,一列作为partition的值。
demo:
insert overwrite table tb_name partition(prov_id, day_id)
select id,name,age,prov_id,day_id from tmp_tb


开启动态分区(默认true)
set hive.exec.dynamic.partition=true;
严格模式: 默认strict(必须先建目录)
set hive.exec.dunamic.partition.mode=nostrict;




  • 0
    点赞
  • 0
    收藏
    觉得还不错? 一键收藏
  • 0
    评论
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值