Hive: dynamic partitioning by year and month, and creating bucketed tables

大胖头leo

Published 2020-07-30 08:32:34


Category: hadoop

Article link: https://blog.csdn.net/a8131357leo/article/details/107680890


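Note: partitioning and bucketing both organize stored data by column. With partitioning, rows that share the same partition column value are stored in the same directory; with bucketing, rows whose column hash maps to the same bucket are stored in the same file.

Goal: partition a table by the year and month of each row's creation time.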

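Hive partitioning: static and dynamic

Concepts

Static partitioning: data is loaded into a partition whose value you specify. (Partitions follow fixed values: with the values 1, 2, 3 there are exactly three partitions.)

Dynamic partitioning: the values are not known in advance, and the partitions to create are determined from the partition column values in the data. (When the value 4 shows up, a new partition is created.)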
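To make the difference concrete, here is a minimal Python sketch of the two behaviours. It is only an in-memory model, not Hive code: the function names, the `zone` column and the sample rows are all made up for illustration.

```python
from collections import defaultdict


def static_insert(rows, partition_value):
    """Static partitioning: the caller names the one target partition."""
    return {partition_value: list(rows)}


def dynamic_insert(rows, partition_key):
    """Dynamic partitioning: each row's own value decides its partition."""
    partitions = defaultdict(list)
    for row in rows:
        # A value seen for the first time creates a new partition.
        partitions[row[partition_key]].append(row)
    return dict(partitions)


# Hypothetical rows; "zone" plays the role of the partition column.
rows = [{"id": 1, "zone": 1}, {"id": 2, "zone": 2}, {"id": 3, "zone": 4}]

static_result = static_insert(rows[:1], 1)
dynamic_result = dynamic_insert(rows, "zone")
print(sorted(dynamic_result))
```

The dynamic version never has to be told that zone 4 exists; the partition appears because a row carries that value.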

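Dynamic partition properties:

Parameter	Default	Description
hive.exec.dynamic.partition	false (true since Hive 0.9.0)	Set to true to enable dynamic partitioning
hive.exec.dynamic.partition.mode	strict	In strict mode at least one partition column must be given a static value; set it to nonstrict to let every partition column be dynamic
hive.exec.max.dynamic.partitions.pernode	100	Maximum number of dynamic partitions each mapper or reducer may create; trying to create more raises an error
hive.exec.max.dynamic.partitions	1000	Maximum total number of dynamic partitions a single SQL statement may create; exceeding it raises an error
hive.exec.max.created.files	100000	Maximum number of files that may be created globally; a dedicated Hadoop counter tracks it, and exceeding it raises an error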

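Static partitioning fixes the partition column to a constant value in the statement; dynamic partitioning is considerably more flexible.

A partition is really just a directory under the table. A table can be partitioned along several dimensions, and the partitions relate to each other as a directory tree.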
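As a rough illustration of that directory tree, the sketch below builds the path of one partition. Hive names partition directories `column=value`; the helper name and the `/user/hive/warehouse` base path are assumptions for the example, and your warehouse location may differ.

```python
from pathlib import PurePosixPath


def partition_path(table_dir, partition_spec):
    """Nest one 'column=value' directory per partition column, in order."""
    path = PurePosixPath(table_dir)
    for column, value in partition_spec:
        path = path / f"{column}={value}"
    return str(path)


# Assumed warehouse location; only the year=/month= nesting is the point.
example = partition_path(
    "/user/hive/warehouse/testtable1", [("year", 2020), ("month", 7)]
)
print(example)
```

The order of the partition columns matters: `year` comes first in the spec, so each `month=` directory sits inside a `year=` directory.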

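Hive dynamic partitioning

1. First import the MySQL table testtable into Hive with Sqoop, letting Sqoop create the Hive table automatically. (If your Hive table already exists, skip this step.)
2. Then copy out the table's DDL (show create table testtable).
3. Create a new table with the same columns, a different table name, and added year and month partition columns, as follows: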

CREATE TABLE `testtable1`(
   `id` bigint, 
   `payorderno` string, 
   `tradeno` string, 
   `tradetype` int, 
   `goodsname` string, 
   `tradetime` bigint, 
   `storename` string, 
   `paymentuser` string, 
   `payway` int, 
   `tradeamount` string, 
   `refundno` string, 
   `servicecharge` string, 
   `remark` string, 
   `bankcode` string, 
   `thirdpayaccountid` string, 
   `creationtime` bigint)
 partitioned by (year int,month int)
 row format delimited fields terminated by '\t' 
 stored as parquetfile;

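4. Turn off strict partition mode

By default, dynamic partitioning runs in strict mode, which requires at least one partition column to be given a static value.

 -- partition mode; the default is strict
 set hive.exec.dynamic.partition.mode=nonstrict;
 -- enable dynamic partitioning
 set hive.exec.dynamic.partition=true;
 -- maximum number of dynamic partitions; the default is 1000
 set hive.exec.max.dynamic.partitions=1000;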

In Spark, the same settings can be passed through config:

    .config("hive.exec.dynamic.partition", True) \
    .config("hive.exec.dynamic.partition.mode", "nonstrict") \

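5. Then insert the data into the new table, partitioning dynamically by time; the new table ends up partitioned.

Conversion when creationtime is a date/time value

insert overwrite table testtable1 partition (year, month)
 select *, year(creationtime) as year, month(creationtime) as month
 from testtable;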

Conversion when creationtime is an epoch timestamp in milliseconds (after a Parquet import, time columns are stored as timestamps by default)

insert overwrite table testtable1 partition (year, month) select
  *, from_unixtime(cast(creationtime/1000 as int), 'yyyy') as year,
 month(from_unixtime(cast(creationtime/1000 as int), 'yyyy-MM-dd HH:mm:ss'))
 as month from testtable;
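To sanity-check the conversion outside Hive, here is a small Python sketch of the same arithmetic: divide the millisecond timestamp by 1000, then read off the year and month. The function name and the sample value are mine, and the sketch assumes UTC; Hive's from_unixtime formats in the session time zone, so rows close to a month boundary can land in a different partition there.

```python
from datetime import datetime, timezone


def year_month_from_millis(epoch_millis):
    """Return (year, month) for a millisecond epoch timestamp, in UTC."""
    seconds = epoch_millis // 1000  # same step as cast(creationtime/1000 as int)
    moment = datetime.fromtimestamp(seconds, tz=timezone.utc)
    return moment.year, moment.month


# Hypothetical sample value: 2020-07-30 08:32:34 UTC in milliseconds.
sample_millis = 1596097954000
sample_partition = year_month_from_millis(sample_millis)
print(sample_partition)
```

A row with this timestamp would go to the year=2020/month=7 partition under the UTC assumption.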

Drop the original table and rename the partitioned table

drop table testtable;
alter table testtable1 rename to testtable;

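After dynamic partitioning, the data is stored with one directory per partition.

[Image: storage path of the data after dynamic partitioning]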

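Bucketed tables

Bucketing takes the hash of a column and sends rows whose hash maps to the same bucket into that bucket.
When you create a bucketed table you must specify the bucketing column and the number of buckets.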
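The bucket assignment can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not Hive's actual code: it assumes the older hash-then-modulo scheme in which an int column hashes to itself and the sign bit is masked off before taking the remainder. Newer Hive versions use a different hash function, so the real file a row lands in may differ. The function name and the sample ids are hypothetical.

```python
def bucket_for_int(value, num_buckets):
    """Assumed model: an int hashes to itself; mask to non-negative, then modulo."""
    return (value & 0x7FFFFFFF) % num_buckets


# Hypothetical userid values spread over 2 buckets, as in "into 2 buckets".
user_ids = [1, 2, 3, 4, 5, 6]
buckets = {}
for user_id in user_ids:
    buckets.setdefault(bucket_for_int(user_id, 2), []).append(user_id)
print(buckets)
```

Rows with the same bucket number end up in the same file, which is why the number of buckets has to be fixed when the table is created.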

The SQL to create a bucketed table looks like this:

create table user_bucket( 
	userid int, 
	username string, 
	fullname string)
clustered by(userid) into 2 buckets  
row format delimited fields terminated by '\t'
lines terminated by '\n'
stored as parquetfile;

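Steps to load data into the user_bucket bucketed table:
Enable bucketing enforcement (needed on Hive 1.x; as far as I know, Hive 2.0 and later always enforce bucketing and no longer have this property):

set hive.enforce.bucketing = true;

Run the SQL statement (user is backquoted because it is a reserved word in newer Hive versions)

insert overwrite table user_bucket select userid,username,fullname from `user`;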

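Hive read/write modes:

Hive is strictly schema-on-read: data is not validated when it is written, and on read any value that does not match the schema is replaced with NULL.
MySQL is schema-on-write: statements are checked when data is written, and an error is raised if something is wrong.

-- Hive: the file is loaded as-is, with no validation
load data local inpath '/home/hivedata/t' into table t_user;
-- MySQL: rejected at write time because of the unquoted abc
insert into stu(id,sex) value(1,abc);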
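The schema-on-read behaviour can be imitated in Python as a rough sketch: the raw line is stored untouched, and only when it is read is each field cast to its declared type, with None standing in for NULL when the cast fails. The function name, the schema and the sample lines are made up for illustration, and the model only covers tab-delimited text.

```python
def read_row(line, schema):
    """Parse one tab-delimited line; fields that do not fit the schema become None."""
    fields = line.rstrip("\n").split("\t")
    row = {}
    for index, (name, caster) in enumerate(schema):
        raw = fields[index] if index < len(fields) else None
        try:
            row[name] = caster(raw) if raw is not None else None
        except ValueError:
            # Read-time mismatch: no error is raised, the value is just NULL.
            row[name] = None
    return row


# Hypothetical schema: an int id and a string name.
schema = [("userid", int), ("username", str)]
good_row = read_row("1\talice", schema)
bad_row = read_row("abc\tbob", schema)
print(good_row, bad_row)
```

The bad line was accepted at write time and only shows up as a NULL userid when queried, which is the opposite of the MySQL behaviour described above.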
