When a Hive table grows large, every query scans the entire table, which puts a heavy load on the cluster. If we instead categorize the data and scan only the category a query actually needs, the cost drops dramatically. Partitioning exists in Hive's table design precisely to meet this need.
Each partition of a partitioned table corresponds to one directory on HDFS.
Partitioned tables come in two kinds, static and dynamic, distinguished by whether partitions are created automatically.
A multi-level partitioned table is declared with something like PARTITIONED BY (event_month string, loc string); directories are then created in declaration order, as event_month=XXX/loc=XXX. Everything else works the same as a single-level partitioned table.
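As an illustration (the table name, warehouse path, and partition values below are hypothetical), the resulting HDFS directory tree nests one directory level per partition column:

```shell
# Hypothetical listing; actual paths depend on your warehouse location.
hdfs dfs -ls -R /user/hive/warehouse/some_table
# .../some_table/event_month=2014-05/loc=beijing/part-00000
# .../some_table/event_month=2014-05/loc=shanghai/part-00000
# .../some_table/event_month=2014-06/loc=beijing/part-00000
```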
1. Static partitions
Create a partitioned table:
hive (default)> create table order_partiton(
> order_no string,
> order_time string
> )
> PARTITIONED BY (event_month string)
> row format delimited fields terminated by '\t';
OK
Time taken: 0.052 seconds
Load data into the Hive table:
hive (default)> load data local inpath '/home/hadoop/order_created.txt' into table order_partiton PARTITION (event_month='2014-05');
Loading data to table default.order_partiton partition (event_month=2014-05)
Partition default.order_partiton{event_month=2014-05} stats: [numFiles=1, numRows=0, totalSize=213, rawDataSize=0]
OK
Time taken: 0.347 seconds
Query data by partition:
hive (default)> select * from order_partiton where event_month='2014-05';
OK
order_partiton.order_no order_partiton.order_time order_partiton.event_month
10703007267488 2014-05-01 06:01:12.334+01 2014-05
10101043505096 2014-05-01 07:28:12.342+01 2014-05
10103043509747 2014-05-01 07:50:12.33+01 2014-05
10103043501575 2014-05-01 09:27:12.33+01 2014-05
10104043514061 2014-05-01 09:03:12.324+01 2014-05
Time taken: 0.163 seconds, Fetched: 5 row(s)
The table's actual columns do not include the partition column; the partition column exists only as a directory name on HDFS.
In production, the typical flow is: cleanse the data, land it in an HDFS directory, and then attach that directory's data to the partitioned table.
Copying files directly into partition directories leaves the metastore without the corresponding partition metadata, so the table must be repaired, either with
ALTER TABLE table_name ADD IF NOT EXISTS PARTITION (event_month='xxx');
or with
hive (default)> msck repair table table_name;
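A minimal sketch of that production flow, reusing the order_partiton table from above (the partition value and file path are hypothetical):

```sql
-- 1. Land the cleansed file directly in the partition directory (from the shell):
--    hdfs dfs -mkdir -p /user/hive/warehouse/order_partiton/event_month=2014-06
--    hdfs dfs -put order_created.txt /user/hive/warehouse/order_partiton/event_month=2014-06/
-- 2. The metastore does not yet know about this partition; register it explicitly:
ALTER TABLE order_partiton ADD IF NOT EXISTS PARTITION (event_month='2014-06');
-- or discover all missing partitions in one pass:
MSCK REPAIR TABLE order_partiton;
```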
Create a multi-level partitioned table:
hive (default)> create table order_mulit_partiton(
> order_no string,
> order_time string
> )
> PARTITIONED BY (event_month string, step string)
> row format delimited fields terminated by '\t';
OK
Time taken: 0.074 seconds
hive (default)> desc formatted order_mulit_partiton;
OK
col_name data_type comment
# col_name data_type comment
order_no string
order_time string
# Partition Information
# col_name data_type comment
event_month string
step string
# Detailed Table Information
Database: default
Owner: hadoop
CreateTime: Mon Jul 22 16:10:11 CST 2019
LastAccessTime: UNKNOWN
Protect Mode: None
Retention: 0
Location: hdfs://hadoop001:9000/user/hive/warehouse/order_mulit_partiton
Table Type: MANAGED_TABLE
Table Parameters:
numPartitions 0
transient_lastDdlTime 1563783011
# Storage Information
SerDe Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat: org.apache.hadoop.mapred.TextInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Compressed: No
Num Buckets: -1
Bucket Columns: []
Sort Columns: []
Storage Desc Params:
field.delim \t
serialization.format \t
Time taken: 0.082 seconds, Fetched: 35 row(s)
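Loading into a multi-level partitioned table works the same way as the single-level case, except that every partition column must be given a value (the file path and partition values here are assumptions for illustration):

```sql
load data local inpath '/home/hadoop/order_created.txt'
into table order_mulit_partiton PARTITION (event_month='2014-05', step='1');
-- creates .../order_mulit_partiton/event_month=2014-05/step=1/ on HDFS
```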
2. Dynamic partitions
Create a partitioned table:
CREATE TABLE emp_dynamic_partition(
empno int,
ename string,
job string,
mgr int,
hiredate string,
sal double,
comm double
)
PARTITIONED BY (deptno int)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
When inserting, do not specify partition values; instead, put the partition column at the end of the SELECT list:
insert overwrite table emp_dynamic_partition partition(deptno)
select empno,ename,job,mgr,hiredate,sal,comm,deptno from ruozedata_emp;
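Note that before the INSERT above will run, dynamic partitioning usually has to be enabled for the session; nonstrict mode allows all partition columns to be dynamic:

```sql
-- Required once per session before a fully dynamic-partition INSERT:
SET hive.exec.dynamic.partition=true;           -- enable dynamic partitioning
SET hive.exec.dynamic.partition.mode=nonstrict; -- allow every partition column to be dynamic
```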
hive (default)> select * from emp_dynamic_partition where deptno='10';
OK
emp_dynamic_partition.empno emp_dynamic_partition.ename emp_dynamic_partition.job emp_dynamic_partition.mgr emp_dynamic_partition.hiredate emp_dynamic_partition.salemp_dynamic_partition.comm emp_dynamic_partition.deptno
7782 'CLARK' 'MANAGER' 7839 '1981-06-09' 2450.0 NULL 10
7839 'KING' 'PRESIDENT' NULL '1981-11-17' 5000.0 NULL 10
7934 'MILLER' 'CLERK' 7782 '1982-01-23' 1300.0 NULL 10
Time taken: 0.058 seconds, Fetched: 3 row(s)
hive (default)> select * from emp_dynamic_partition where deptno='20';
OK
emp_dynamic_partition.empno emp_dynamic_partition.ename emp_dynamic_partition.job emp_dynamic_partition.mgr emp_dynamic_partition.hiredate emp_dynamic_partition.salemp_dynamic_partition.comm emp_dynamic_partition.deptno
7369 'SMITH' 'CLERK' 7902 '1980-12-17' 800.0 NULL 20
7566 'JONES' 'MANAGER' 7839 '1981-04-02' 2975.0 NULL 20
7788 'SCOTT' 'ANALYST' 7566 '1982-12-09' 3000.0 NULL 20
7876 'ADAMS' 'CLERK' 7788 '1983-01-12' 1100.0 NULL 20
7902 'FORD' 'ANALYST' 7566 '1981-12-03' 3000.0 NULL 20
7534 'MANAGER' 'CLERK' 7582 '1982-02-23' 1300.0 NULL 20
Time taken: 0.058 seconds, Fetched: 6 row(s)
hive (default)> select * from emp_dynamic_partition where deptno='30';
OK
emp_dynamic_partition.empno emp_dynamic_partition.ename emp_dynamic_partition.job emp_dynamic_partition.mgr emp_dynamic_partition.hiredate emp_dynamic_partition.salemp_dynamic_partition.comm emp_dynamic_partition.deptno
7499 'ALLEN' 'SALESMAN' 7698 '1981-02-20' 1600.0 300.0 30
7521 'WARD' 'SALESMAN' 7698 '1981-02-22' 1250.0 500.0 30
7654 'MARTIN' 'SALESMAN' 7698 '1981-09-28' 1250.0 1400.0 30
7698 'BLAKE' 'MANAGER' 7839 '1981-05-01' 2850.0 NULL 30
7844 'TURNER' 'SALESMAN' 7698 '1981-09-08' 1500.0 0.0 30
7900 'JAMES' 'CLERK' 7698 '1981-12-03' 950.0 NULL 30
Time taken: 0.056 seconds, Fetched: 6 row(s)
3. Connecting to Hive with Beeline
Start the HiveServer2 service:
[hadoop@hadoop001 hive]$ hiveserver2
which: no hbase in (/home/hadoop/app/hive/bin:/sbin:/home/hadoop/app/hadoop/bin:/home/hadoop/app/hadoop/sbin:/usr/loca/mysql/bin:/usr/java/jdk1.8.0_45/bin/:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin)
Connect to Hive with Beeline. The fixed form is:
beeline -u jdbc:hive2://<hive hostname or IP>:<HS2 port> -n <current user>
[hadoop@hadoop001 ~]$ beeline -u jdbc:hive2://hadoop001:10000/ruozedata_d7 -n hadoop
which: no hbase in (/home/hadoop/app/hive/bin:/sbin:/home/hadoop/app/hadoop/bin:/home/hadoop/app/hadoop/sbin:/usr/loca/mysql/bin:/usr/java/jdk1.8.0_45/bin/:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin)
scan complete in 3ms
Connecting to jdbc:hive2://hadoop001:10000/ruozedata_d7
Connected to: Apache Hive (version 1.1.0-cdh5.15.1)
Driver: Hive JDBC (version 1.1.0-cdh5.15.1)
Transaction isolation: TRANSACTION_REPEATABLE_READ
Beeline version 1.1.0-cdh5.15.1 by Apache Hive
0: jdbc:hive2://hadoop001:10000/ruozedata_d7> show tables;
INFO : Compiling command(queryId=hadoop_20190722165050_4c86c53b-4d43-4e55-8710-d07cf2bb48be): show tables
INFO : Semantic Analysis Completed
INFO : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:tab_name, type:string, comment:from deserializer)], properties:null)
INFO : Completed compiling command(queryId=hadoop_20190722165050_4c86c53b-4d43-4e55-8710-d07cf2bb48be); Time taken: 0.771 seconds
INFO : Concurrency mode is disabled, not creating a lock manager
INFO : Executing command(queryId=hadoop_20190722165050_4c86c53b-4d43-4e55-8710-d07cf2bb48be): show tables
INFO : Starting task [Stage-0:DDL] in serial mode
INFO : Completed executing command(queryId=hadoop_20190722165050_4c86c53b-4d43-4e55-8710-d07cf2bb48be); Time taken: 0.069 seconds
INFO : OK
+------------------+--+
| tab_name |
+------------------+--+
| hive_array |
| hive_map |
| hive_rolenumber |
| hive_rownumber |
| hive_struct |
| rating_json |
| rating_width |
+------------------+--+
7 rows selected (1.348 seconds)
HiveServer2 listens on port 10000 by default; this can be changed in hive-site.xml:
<property>
<name>hive.server2.thrift.port</name>
<value>10001</value>
</property>
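After changing the port and restarting HiveServer2, the Beeline URL changes accordingly (host and database taken from the example above):

```shell
beeline -u jdbc:hive2://hadoop001:10001/ruozedata_d7 -n hadoop
```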