Big Data with Hive: Hive Partitioned Tables

浊酒南街

Article link: https://blog.csdn.net/weixin_43597208/article/details/119943156

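I. Partitioned Tables and Their Purpose

A partitioned table organizes data in a logical way so that the table is easier to manage and faster to query.
Physically, a partition is just a directory under the table's directory. A table can be partitioned along several dimensions, and the partitions then nest like a directory tree.
Depending on how you classify them, partitions are either static or dynamic, and either single-level or multi-level. In practice the most common schemes are partitioning by time and partitioning by a business dimension.

Partitioning lets Hive avoid a full table scan: because a SELECT can name the partitions it needs, the amount of data scanned shrinks and the query runs faster.
To create a partitioned table, add the optional PARTITIONED BY clause to the CREATE TABLE statement.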
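
As a rough illustration of the directory-tree idea, the sketch below declares a two-level partitioned table and shows, in comments, the layout Hive would create for it. The table name access_log, its columns, and the warehouse path /user/hive/warehouse (the usual default of hive.metastore.warehouse.dir) are assumptions made for this example and do not come from the article.

```sql
-- Hypothetical table, used only to illustrate the directory layout.
create table if not exists access_log (
    user_id string,
    url     string
)
partitioned by (dt string, country string);

-- Register two partitions explicitly.
alter table access_log add if not exists
    partition (dt = '2017-01-01', country = 'cn')
    partition (dt = '2017-01-01', country = 'us');

-- Assuming the default warehouse location, the table directory would look like:
--   /user/hive/warehouse/access_log/dt=2017-01-01/country=cn/
--   /user/hive/warehouse/access_log/dt=2017-01-01/country=us/
-- The first partition column (dt) is the parent directory, the second (country) its child.
```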

II. Static Partitions

1. Creating a statically partitioned table:

create table t1(
    id      int
   ,name    string
   ,hobby   array<string>
   ,address map<string,string>
)
partitioned by (day string)
row format delimited
fields terminated by ','
collection items terminated by '-'
map keys terminated by ':'
;

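Note: a partition column must not share its name with any regular column of the table. If it does, Hive rejects the CREATE TABLE statement with an error.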
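
The naming rule can be sketched as follows. The table names t_bad and t_ok are hypothetical; the first statement is left commented out because it is expected to fail.

```sql
-- Rejected: "day" appears both as a regular column and as a partition column.
-- create table t_bad (id int, day string) partitioned by (day string);

-- Accepted: the partition column is declared only in PARTITIONED BY.
create table if not exists t_ok (id int, name string)
partitioned by (day string);

-- The partition column still appears in the schema and can be selected or filtered on.
describe t_ok;
select id, name, day from t_ok where day = '2017-01-01';
```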

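2. Loading data

There are two common ways to load data:
LOAD, which loads a whole data file directly, and INSERT INTO (or INSERT OVERWRITE), which either adds individual rows or, combined with SELECT, loads a whole result set.

load data local inpath '/home/hadoop/Desktop/data' overwrite into table t1 partition (day = '2017-01-01');

-- INSERT ... VALUES does not accept array or map values, so the row is built with SELECT and the array()/map() constructors
insert into table t1 partition (day = '2017-01-01')
select 1, 'xiaoming', array('book','TV','code'), map('beijing','chaoyang','shanghai','pudong');

insert overwrite table emp_partition partition (deptno=10) select * from emp where deptno=10;
-- Note: emp_partition has 7 columns (not counting the partition column deptno), so the SELECT on emp must also return 7 matching columns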
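
For the LOAD path, it may help to see what the input file has to look like. The sketch below infers the file contents from the delimiters declared for t1 and from the rows shown in the query output in step 3; the exact contents of the author's file are not given in the article, so treat them as an assumption.

```sql
-- Assumed contents of /home/hadoop/Desktop/data, following t1's delimiters
-- (',' between fields, '-' between collection items, ':' between map key and value):
--   1,xiaoming,book-TV-code,beijing:chaoyang-shanghai:pudong
--   2,lilei,book-code,nanjing:jiangning-taiwan:taibei
--   3,lihua,music-book,heilongjiang:haerbin
-- The partition value is not stored in the file; it comes from the PARTITION clause.

load data local inpath '/home/hadoop/Desktop/data'
overwrite into table t1 partition (day = '2017-01-01');

-- Spot-check that the complex columns were parsed as expected.
select name, hobby[0], address['beijing'] from t1 where day = '2017-01-01';
```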

3. Viewing data and partitions

select * from t1;
1   xiaoming    ["book","TV","code"]    {"beijing":"chaoyang","shanghai":"pudong"}  2017-01-01
2   lilei   ["book","code"] {"nanjing":"jiangning","taiwan":"taibei"}   2017-01-01
3   lihua   ["music","book"]    {"heilongjiang":"haerbin"}  2017-01-01

List the partitions:

show partitions t1;
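
Two related commands are sketched below, assuming the table t1 and the partition loaded above: one narrows the listing to a given partition spec, the other shows where a partition lives on disk.

```sql
-- Restrict the listing to partitions matching a partition spec.
show partitions t1 partition (day = '2017-01-01');

-- Show storage details for one partition, including its Location directory.
describe formatted t1 partition (day = '2017-01-01');
```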

4. Adding partitions

alter table t1 add partition (day = '2017-01-02');
alter table t1 add if not exists partition (day = '2017-01-03');
alter table t1 add partition (day = '2017-01-04') partition (day = '2017-01-05');
-- Note: the last statement adds several partitions at once
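
The reverse operations are sketched here for completeness. They are not covered in the original text; the directory name in the comment is a made-up example.

```sql
-- Drop a partition; for a managed table this also deletes the partition's data.
alter table t1 drop if exists partition (day = '2017-01-05');

-- If partition directories were created directly on the filesystem
-- (for example .../t1/day=2017-01-06/), register them in the metastore.
msck repair table t1;
```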

5. Querying a single partition

select * from t1 where day= '2017-01-02';
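
A minimal sketch of how to confirm that a filter on the partition column really limits what is read. It assumes the table t1 and the partitions added above; the output of EXPLAIN DEPENDENCY varies between Hive versions, so none is shown.

```sql
-- Only the files under .../t1/day=2017-01-02/ need to be read.
select count(*) from t1 where day = '2017-01-02';

-- List the tables and partitions the query would read, without running it.
explain dependency
select count(*) from t1 where day = '2017-01-02';

-- Range predicates on the partition column are pruned as well.
select * from t1 where day between '2017-01-02' and '2017-01-04';
```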

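A static two-level partition works the same way, except that the order of the partition columns decides which one is the parent directory and which is the child, for example partitioned by (year string, month string).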

insert overwrite table part_test_3 partition(month_id='201805',day_id='20180509') select * from part_test_temp;
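
The article does not show the definitions of part_test_3 or part_test_temp, so the sketch below uses two hypothetical tables, part_demo and part_demo_src, with assumed columns, to show a complete two-level static insert from start to finish.

```sql
-- Hypothetical tables for illustration; names and columns are assumptions.
create table if not exists part_demo_src (id int, name string);

create table if not exists part_demo (id int, name string)
partitioned by (month_id string, day_id string);

-- Both partition values are fixed in the statement, so the SELECT
-- supplies only the regular columns (id, name).
insert overwrite table part_demo partition (month_id = '201805', day_id = '20180509')
select id, name from part_demo_src;

-- Partitions are listed in parent/child form, e.g. month_id=201805/day_id=20180509
show partitions part_demo;
```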

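III. Dynamic Partitions

Why use dynamic partitions? Take an example: suppose China had 50 provinces, each with 50 cities, and each city had 100 districts. Creating every one of those partitions statically, one statement at a time, would take far too long, so we use dynamic partitions and let Hive create them from the data.
Dynamic partitioning is controlled by hive.exec.dynamic.partition, which has defaulted to true since Hive 0.9.0 (older releases defaulted to false). Once enabled, it runs in strict mode by default in Hive releases up to 3.x; in that mode at least one partition column must be static.

Turning off strict mode
In strict mode at least one partition column must be given a static value. To make every partition column dynamic, switch the mode to nonstrict:
-- partition mode; the default is strict in Hive releases up to 3.x
set hive.exec.dynamic.partition.mode=nonstrict;
-- enable dynamic partitioning; default true since Hive 0.9.0
set hive.exec.dynamic.partition=true;
-- maximum total number of dynamic partitions; default 1000
set hive.exec.max.dynamic.partitions=1000;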
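
Before changing anything, it can be useful to check what the current session is using. The sketch below also lists two related limits that commonly matter for large dynamic-partition inserts; the values shown are the documented defaults and are not recommendations from the article.

```sql
-- Print the current value of a property (SET with no "=value" part).
set hive.exec.dynamic.partition;
set hive.exec.dynamic.partition.mode;

-- Limit on dynamic partitions created by a single mapper or reducer (default 100).
set hive.exec.max.dynamic.partitions.pernode=100;
-- Upper bound on the number of files one job may create (default 100000).
set hive.exec.max.created.files=100000;
```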

1. Creating a dynamic-partition table, example 1:

-- Create the dynamically partitioned table
create table if not exists  zxz_5(
 name string,
 nid int,
 phone string,
 ntime date
 ) 
 partitioned by (year int,month int) 
 row format delimited 
 fields terminated by "|"
 lines terminated by "\n"
 stored as textfile;
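-- Insert data
insert overwrite table zxz_5 partition (year, month)
select name, nid, phone, ntime, year(ntime) as year, month(ntime) as month from zxz_dy;
-- The source table zxz_dy already holds the data.
-- The year() and month() functions extract the year and month from the ntime column to serve as partition values, so the partitions are determined by the queried data itself.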
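
The definition of zxz_dy is not shown in the article. The sketch below assumes it has the same four regular columns as zxz_5, and uses a GROUP BY query to preview which (year, month) partitions the insert would produce.

```sql
-- Assumed shape of the source table; the article does not show its definition.
create table if not exists zxz_dy (
    name  string,
    nid   int,
    phone string,
    ntime date
);

-- Preview the partition values the insert would create, with row counts.
select year(ntime) as y, month(ntime) as m, count(*) as cnt
from zxz_dy
group by year(ntime), month(ntime);

-- After the insert, the same combinations should appear as year=.../month=... entries.
show partitions zxz_5;
```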

2. Creating a dynamic-partition table, example 2:

create table orders_part(
order_id string,
user_id string,
eval_set string,
order_number string,
order_hour_of_day string,
days_since_prior_order string
)partitioned by(order_dow string)
row format delimited fields terminated by ',';
 
-- Insert data
insert into table orders_part partition (order_dow) select order_id,user_id,eval_set,order_number,order_hour_of_day,days_since_prior_order,order_dow from orders;
-- The columns of the orders table are, in order:
-- order_id,user_id,eval_set,order_number,order_dow,order_hour_of_day,days_since_prior_order
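
The reason the SELECT lists the columns explicitly is that Hive takes dynamic partition values by position: the trailing columns of the SELECT feed the partition columns, regardless of their names. The sketch below shows the pitfall with the orders table described above; the failing form is left commented out.

```sql
-- Pitfall: "select *" returns order_dow as the 5th column, so the LAST column,
-- days_since_prior_order, would be used as the partition value instead:
-- insert into table orders_part partition (order_dow) select * from orders;

-- Preview the rows in the order the insert expects:
-- six regular columns first, then the partition column order_dow.
select order_id, user_id, eval_set, order_number,
       order_hour_of_day, days_since_prior_order, order_dow
from orders
limit 5;
```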

3. Loading data:

-- Both month_id and day_id are dynamic partitions:
insert overwrite table dynamic_test partition(month_id,day_id)  
select c1,c2,c3,c4,c5,c6,c7,substr(day_id,1,6) as month_id,day_id from kafka_offset;

-- month_id is a static partition, day_id is a dynamic partition:
insert overwrite table dynamic_test partition(month_id='201710',day_id)  
select c1,c2,c3,c4,c5,c6,c7,day_id from kafka_offset
where substr(day_id,1,6)='201710';
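
When static and dynamic partition columns are mixed, the static ones have to come first: a dynamic partition column cannot be the parent of a static one. A short sketch using the same dynamic_test and kafka_offset tables; the day value '20171001' is a made-up example.

```sql
-- Not allowed: dynamic month_id as the parent of static day_id.
-- insert overwrite table dynamic_test partition (month_id, day_id = '20171001')
-- select c1, c2, c3, c4, c5, c6, c7, substr(day_id, 1, 6) from kafka_offset;

-- After a mixed insert with month_id = '201710' fixed,
-- list only the day_id partitions under that static parent.
show partitions dynamic_test partition (month_id = '201710');
```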

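To get rows with the same partition-column values handled by the same reducer as far as possible, so that each reducer creates as few new directories and files as possible, use DISTRIBUTE BY to group rows with equal partition values together.

-- Optimized version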
insert overwrite table dynamic_test partition(month_id,day_id)
select c1,c2,c3,c4,c5,c6,c7,substr(day_id,1,6) as month_id,day_id from kafka_offset
distribute by month_id,day_id;
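
As an alternative sketch, some Hive versions (0.13 and later, up to 3.x) offer the setting hive.optimize.sort.dynamic.partition, which asks Hive to sort rows by the dynamic partition columns itself. Whether it helps depends on the version and the data, so treat this as an option to test and not as the author's method.

```sql
-- Let Hive sort rows by the dynamic partition columns, so that each reducer
-- keeps only one partition writer open at a time.
set hive.optimize.sort.dynamic.partition=true;

insert overwrite table dynamic_test partition (month_id, day_id)
select c1, c2, c3, c4, c5, c6, c7, substr(day_id, 1, 6) as month_id, day_id
from kafka_offset;
```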
