When a Hive table grows large, every query scans the entire table, which puts a heavy load on the cluster. If we instead categorize the data and scan only the category a query actually needs, the cost drops dramatically. Partitioning exists in Hive's table design precisely to meet this need.
Each partition of a partitioned table corresponds to one directory on HDFS.
Partitioned tables come in two kinds, static and dynamic, distinguished by whether partitions are created automatically.
A multi-level partitioned table is declared with something like PARTITIONED BY (event_month string, loc string); directories are then created in declaration order, as event_month=XXX/loc=XXX. Everything else works the same as a single-level partitioned table.
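As an illustration (the table name, warehouse path, and partition values below are hypothetical), the resulting HDFS directory tree nests one directory level per partition column:

```shell
# Hypothetical listing; actual paths depend on your warehouse location.
hdfs dfs -ls -R /user/hive/warehouse/some_table
# .../some_table/event_month=2014-05/loc=beijing/part-00000
# .../some_table/event_month=2014-05/loc=shanghai/part-00000
# .../some_table/event_month=2014-06/loc=beijing/part-00000
```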
1. Static partitions
Create a partitioned table:
hive (default)> create table order_partiton(
> order_no string,
> order_time string
> )
> PARTITIONED BY (event_month string)
> row format delimited fields terminated by '\t';
OK
Time taken: 0.052 seconds
Load data into the Hive table:
hive (default)> load data local inpath '/home/hadoop/order_created.txt' into table order_partiton PARTITION (event_month='2014-05');
Loading data to table default.order_partiton partition (event_month=2014-05)
Partition default.order_partiton{event_month=2014-05} stats: [numFiles=1, numRows=0, totalSize=213, rawDataSize=0]
OK
Time taken: 0.347 seconds
Query data by partition:
hive (default)> select * from order_partiton where event_month='2014-05';
OK
order_partiton.order_no order_partiton.order_time order_partiton.event_month
10703007267488 2014-05-01 06:01:12.334+01 2014-05
10101043505096 2014-05-01 07:28:12.342+01 2014-05
10103043509747 2014-05-01 07:50:12.33+01 2014-05
10103043501575 2014-05-01 09:27:12.33+01 2014-05
10104043514061 2014-05-01 09:03:12.324+01 2014-05
Time taken: 0.163 seconds, Fetched: 5 row(s)
The table's actual columns do not include the partition column; the partition column exists only as a directory name on HDFS.
In production, the typical flow is: cleanse the data, land it in an HDFS directory, and then attach that directory's data to the partitioned table.
Copying files directly into partition directories leaves the metastore without the corresponding partition metadata, so the table must be repaired, either with
ALTER TABLE table_name ADD IF NOT EXISTS PARTITION (event_month='xxx');
or with
hive (default)> msck repair table table_name;
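A minimal sketch of that production flow, reusing the order_partiton table from above (the partition value and file path are hypothetical):

```sql
-- 1. Land the cleansed file directly in the partition directory (from the shell):
--    hdfs dfs -mkdir -p /user/hive/warehouse/order_partiton/event_month=2014-06
--    hdfs dfs -put order_created.txt /user/hive/warehouse/order_partiton/event_month=2014-06/
-- 2. The metastore does not yet know about this partition; register it explicitly:
ALTER TABLE order_partiton ADD IF NOT EXISTS PARTITION (event_month='2014-06');
-- or discover all missing partitions in one pass:
MSCK REPAIR TABLE order_partiton;
```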
Create a multi-level partitioned table:
hive (default)> create table order_mulit_partiton(
> order_no string,
> order_time string
> )
> PARTITIONED BY (event_month string, step string)
> row format delimited fields terminated by '\t';
OK
Time taken: 0.074 seconds
hive (default)> desc formatted order_mulit_partiton;
OK
col_name data_type comment
# col_name data_type comment
order_no string
order_time string
# Partition Information
# col_name data_type comment
event_month string
step string
# Detailed Table Information
Database: default
Owner: hadoop
CreateTime: Mon Jul 22 16:10:11 CST 2019
LastAccessTime: UNKNOWN
Protect Mode: None
Retention: 0
Location: hdfs://hadoop001:9000/user/hive/warehouse/order_mulit_partiton
Table Type: MANAGED_TABLE
Table Parameters:
numPartitions 0
transient_lastDdlTime 1563783011
# Storage Information
SerDe Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat: org.apache.hadoop.mapred.TextInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Compressed: No
Num Buckets: -1
Bucket Columns: []
Sort Columns: []
Storage Desc Params:
field.delim \t
serialization.format \t
Time taken: 0.082 seconds, Fetched: 35 row(s)
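Loading into a multi-level partitioned table works the same way as the single-level case, except that every partition column must be given a value (the file path and partition values here are assumptions for illustration):

```sql
load data local inpath '/home/hadoop/order_created.txt'
into table order_mulit_partiton PARTITION (event_month='2014-05', step='1');
-- creates .../order_mulit_partiton/event_month=2014-05/step=1/ on HDFS
```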
2. Dynamic partitions
Create a partitioned table:
CREATE TABLE emp_dynamic_partition(
empno int,
ename string,
job string,
mgr int,
hiredate string,
sal double,
comm double
)
PARTITIONED BY (deptno int)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
When inserting, do not specify partition values; instead, put the partition column at the end of the SELECT list:
insert overwrite table emp_dynamic_partition partition(deptno)
select empno,ename,job,mgr,hiredate,sal,comm,deptno from ruozedata_emp;
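Note that before the INSERT above will run, dynamic partitioning usually has to be enabled for the session; nonstrict mode allows all partition columns to be dynamic:

```sql
-- Required once per session before a fully dynamic-partition INSERT:
SET hive.exec.dynamic.partition=true;           -- enable dynamic partitioning
SET hive.exec.dynamic.partition.mode=nonstrict; -- allow every partition column to be dynamic
```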
hive (default)> select * from emp_dynamic_partition where deptno='10';
OK
emp_dynamic_partition.empno emp_dynamic_partition.ename emp_dynamic_partition.job emp_dynamic_partition.mgr emp_dynamic_partition.hiredate emp_dynamic_partition.salemp_dynamic_partition.comm emp_dynamic_partition.deptno
7782 'CLARK' 'MANAGER' 7839 '1981-06-09' 2450.0 NULL 10
7839 'KING' 'PRESIDENT' NULL '1981-11-17' 5000.0 NULL 10
7934 'MILLER' 'CLERK' 7782 '1982-01-23' 1300.0 NULL 10
Time taken: 0.058 seconds, Fetched: 3 row(s)
hive (default)> select * from emp_dynamic_partition where deptno='20';
OK
emp_dynamic_partition.empno emp_dynamic_partition.ename emp_dynamic_partition.job emp_dynamic_partition.mgr emp_dynamic_partition.hiredate emp_dynamic_partition.salemp_dynamic_partition.comm emp_dynamic_partition.deptno
7369 'SMITH' 'CLERK' 7902 '1980-12-17' 800.0 NULL 20
7566 'JONES' 'MANAGER' 7839 '1981-04-02' 2975.0 NULL 20
7788 'SCOTT' 'ANALYST' 7566 '1982-12-09' 3000.0 NULL 20
7876 'ADAMS' 'CLERK' 7788 '1983-01-12' 1100.0 NULL 20
7902 'FORD' 'ANALYST' 7566 '1981-12-03' 3000.0 NULL 20
7534 'MANAGER' 'CLERK' 7582 '1982-02-23' 1300.0 NULL 20
Time taken: 0.058 seconds, Fetched: 6 row(s)
hive (default)> select * from emp_dynamic_partition where deptno='30';
OK
emp_dynamic_partition.empno emp_dynamic_partition.ename emp_dynamic_partition.job emp_dynamic_partition.mgr emp_dynamic_partition.hiredate emp_dynamic_partition.salemp_dynamic_partition.comm emp_dynamic_partition.deptno
7499 'ALLEN' 'SALESMAN' 7698 '1981-02-20' 1600.0 300.0 30
7521 'WARD' 'SALESMAN' 7698 '1981-02-22' 1250.0 500.0 30
7654 'MARTIN' 'SALESMAN' 7698 '1981-09-28' 1250.0 1400.0 30
7698 'BLAKE' 'MANAGER' 7839 '1981-05-01' 2850.0 NULL 30
7844 'TURNER' 'SALESMAN' 7698 '1981-09-08' 1500.0 0.0 30
7900 'JAMES' 'CLERK' 7698 '1981-12-03' 950.0 NULL 30
Time taken: 0.056 seconds, Fetched: 6 row(s)
3. Connecting to Hive with Beeline
Start the HiveServer2 service:
[hadoop@hadoop001 hive]$ hiveserver2
which: no hbase in (/home/hadoop/app/hive/bin:/sbin:/home/hadoop/app/hadoop/bin:/home/hadoop/app/hadoop/sbin:/usr/loca/mysql/bin:/usr/java/jdk1.8.0_45/bin/:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin)
Connect to Hive with Beeline. The fixed form is:
beeline -u jdbc:hive2://<hive hostname or IP>:<HS2 port> -n <current user>
[hadoop@hadoop001 ~]$ beeline -u jdbc:hive2://hadoop001:10000/ruozedata_d7 -n hadoop
which: no hbase in (/home/hadoop/app/hive/bin:/sbin:/home/hadoop/app/hadoop/bin:/home/hadoop/app/hadoop/sbin:/usr/loca/mysql/bin:/usr/java/jdk1.8.0_45/bin/:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin)
scan complete in 3ms
Connecting to jdbc:hive2://hadoop001:10000/ruozedata_d7
Connected to: Apache Hive (version 1.1.0-cdh5.15.1)
Driver: Hive JDBC (version 1.1.0-cdh5.15.1)
Transaction isolation: TRANSACTION_REPEATABLE_READ
Beeline version 1.1.0-cdh5.15.1 by Apache Hive
0: jdbc:hive2://hadoop001:10000/ruozedata_d7> show tables;
INFO : Compiling command(queryId=hadoop_20190722165050_4c86c53b-4d43-4e55-8710-d07cf2bb48be): show tables
INFO : Semantic Analysis Completed
INFO : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:tab_name, type:string, comment:from deserializer)], properties:null)
INFO : Completed compiling command(queryId=hadoop_20190722165050_4c86c53b-4d43-4e55-8710-d07cf2bb48be); Time taken: 0.771 seconds
INFO : Concurrency mode is disabled, not creating a lock manager
INFO : Executing command(queryId=hadoop_20190722165050_4c86c53b-4d43-4e55-8710-d07cf2bb48be): show tables
INFO : Starting task [Stage-0:DDL] in serial mode
INFO : Completed executing command(queryId=hadoop_20190722165050_4c86c53b-4d43-4e55-8710-d07cf2bb48be); Time taken: 0.069 seconds
INFO : OK
+------------------+--+
| tab_name |
+------------------+--+
| hive_array |
| hive_map |
| hive_rolenumber |
| hive_rownumber |
| hive_struct |
| rating_json |
| rating_width |
+------------------+--+
7 rows selected (1.348 seconds)
HiveServer2 listens on port 10000 by default; this can be changed in hive-site.xml:
<property>
<name>hive.server2.thrift.port</name>
<value>10001</value>
</property>
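After changing the port and restarting HiveServer2, the Beeline URL changes accordingly (host and database taken from the example above):

```shell
beeline -u jdbc:hive2://hadoop001:10001/ruozedata_d7 -n hadoop
```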