DDL (Data Definition Language)
Hive Data Storage Structure
1. Database: Hive can contain multiple databases. The default database is default, which maps to the HDFS directory /user/hive/warehouse; this location is configured with the hive.metastore.warehouse.dir parameter (set in hive-site.xml).
2. Table: Hive tables are divided into managed (internal) tables and external tables. Each table maps to a directory on HDFS: /user/hive/warehouse/[databasename.db]/table_name.
3. Partition: a table can be partitioned by a column, which makes queries more convenient and efficient. Each partition gets its own directory on HDFS:
/user/hive/warehouse/[databasename.db]/table_name/partition
4. Bucket: within a partition, data can be further divided into buckets for a finer-grained organization. Bucket data lives under the partition directory on HDFS:
/user/hive/warehouse/[databasename.db]/table_name/partition/buckets
Databases
1. Creating a database
CREATE (DATABASE|SCHEMA) [IF NOT EXISTS] database_name
[COMMENT database_comment]
[LOCATION hdfs_path]
[WITH DBPROPERTIES (property_name=property_value, ...)];

IF NOT EXISTS: create the database only if it does not already exist; if it exists, nothing is created and no error is raised.
COMMENT: a description of the database.
LOCATION: the HDFS path under which the database is created; if omitted, it defaults to /user/hive/warehouse/.
WITH DBPROPERTIES: key/value properties of the database.

hive> CREATE DATABASE IF NOT EXISTS hive2
    > COMMENT "it is my database"
    > WITH DBPROPERTIES ("creator"="zhangsan", "date"="2018-08-08")
    > ;
OK
The metadata for the new database is recorded in the DBS table of the basic01 MySQL database (the metastore database name configured in hive-site.xml). It can be inspected with select * from dbs \G:

mysql> select * from dbs \G;
*************************** 1. row ***************************
          DB_ID: 1
           DESC: Default Hive database
DB_LOCATION_URI: hdfs://192.168.137.200:9000/user/hive/warehouse
           NAME: default
     OWNER_NAME: public
     OWNER_TYPE: ROLE

The detailed database properties are recorded in the database_params table of basic01:

mysql> select * from database_params \G;
*************************** 1. row ***************************
      DB_ID: 21
  PARAM_KEY: creator
PARAM_VALUE: zhangsan
*************************** 2. row ***************************
      DB_ID: 21
  PARAM_KEY: date
PARAM_VALUE: 2018-08-08
2 rows in set (0.00 sec)
2. Listing databases
SHOW (DATABASES|SCHEMAS) [LIKE 'identifier_with_wildcards']
LIKE takes a pattern matched against database names and supports wildcards for fuzzy matching.

hive> show databases;
OK
default
word
wordcount
3. Describing a database
DESCRIBE DATABASE [EXTENDED] db_name;
DESCRIBE DATABASE db_name: shows the database's description and its directory location;
EXTENDED: additionally shows the database's properties (DBPROPERTIES).

hive> describe database hive2;
OK
hive2  it is my database  hdfs://192.168.137.200:9000/user/hive/warehouse/hive2.db  hadoop  USER
Time taken: 0.119 seconds, Fetched: 1 row(s)
hive> describe database extended hive2;
OK
hive2  it is my database  hdfs://192.168.137.200:9000/user/hive/warehouse/hive2.db  hadoop  USER  {date=2018-08-08, creator=zhangsan}
Time taken: 0.135 seconds, Fetched: 1 row(s)
4. Altering a database
ALTER (DATABASE|SCHEMA) database_name SET DBPROPERTIES (property_name=property_value, ...);
ALTER (DATABASE|SCHEMA) database_name SET OWNER [USER|ROLE] user_or_role;
ALTER (DATABASE|SCHEMA) database_name SET LOCATION hdfs_path;
hive> alter database hive2 set dbproperties ("update"="lisi");
OK
hive> alter database hive2 set owner user wjx;
OK

Check the database info before the change:

hive> describe database extended hive2;
OK
hive2  it is my database  hdfs://192.168.137.200:9000/user/hive/warehouse/hive2.db  hadoop  USER  {date=2018-08-08, creator=zhangsan}
Time taken: 0.135 seconds, Fetched: 1 row(s)

After the change:

hive> describe database extended hive2;
OK
hive2  it is my database  hdfs://192.168.137.200:9000/user/hive/warehouse/hive2.db  wjx  USER  {date=2018-08-08, creator=zhangsan, update=lisi}
Time taken: 0.235 seconds, Fetched: 1 row(s)
5. Dropping a database
DROP (DATABASE|SCHEMA) [IF EXISTS] database_name [RESTRICT|CASCADE];
RESTRICT: the default; dropping fails with an error if the database still contains tables.
CASCADE: cascade delete; if the database still contains tables, the tables are dropped first and then the database itself.

hive> drop database hive2;
OK
Tables
Column data types
int, bigint, float/double, string, boolean
(Note: Hive's 64-bit integer type is bigint, not long.)

Delimiters
Default row delimiter: "\n"
Field (column) delimiter, commonly specified as: "\t"
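As a sketch of how these types and delimiters combine (the table and column names below are illustrative, not from the original text):

```sql
-- Hypothetical table using the types above; rows end with '\n' by default,
-- so only the field delimiter needs to be declared.
CREATE TABLE IF NOT EXISTS emp_demo (
  empno  int,
  ename  string,
  salary double,
  active boolean
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
```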
1. Creating a table
CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name
  -- (Note: TEMPORARY available in Hive 0.14.0 and later)
[(col_name data_type [COMMENT col_comment], ... [constraint_specification])]
[COMMENT table_comment]
[PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)]
[CLUSTERED BY (col_name, col_name, ...) [SORTED BY (col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS]
[SKEWED BY (col_name, col_name, ...)  -- (Note: Available in Hive 0.10.0 and later)
  ON ((col_value, col_value, ...), (col_value, col_value, ...), ...)
  [STORED AS DIRECTORIES]]
[
  [ROW FORMAT row_format]
  [STORED AS file_format]
  | STORED BY 'storage.handler.class.name' [WITH SERDEPROPERTIES (...)]
  -- (Note: Available in Hive 0.6.0 and later)
]
[LOCATION hdfs_path]
[TBLPROPERTIES (property_name=property_value, ...)]
  -- (Note: Available in Hive 0.6.0 and later)
[AS select_statement];
  -- (Note: Available in Hive 0.5.0 and later; not supported for external tables)
1.1 TEMPORARY (temporary tables)
Since Hive 0.14.0, temporary tables can be created. A temporary table is only visible to the current session and is dropped automatically when the session ends.

CREATE TEMPORARY TABLE ...
1.2 EXTERNAL (external tables)
Hive has two kinds of tables: Managed Tables (the default) and External Tables (created with the EXTERNAL keyword). The main difference shows up on DROP: for a Managed Table, both the data (stored on HDFS) and the metadata (stored in MySQL) are deleted; for an External Table, only the metadata is deleted.
hive> create table managed_table(
    > id int,
    > name string
    > );
Create an external table:
hive> create external table external_table(
    > id int,
    > name string
    > );
Check the data on HDFS:

[hadoop@zydatahadoop001 ~]$ hdfs dfs -ls /user/hive/warehouse
Found 7 items
drwxr-xr-x - hadoop supergroup 0 2017-12-23 00:21 /user/hive/warehouse/external_table
drwxr-xr-x - hadoop supergroup 0 2017-12-22 14:42 /user/hive/warehouse/helloword
drwxr-xr-x - hadoop supergroup 0 2017-12-22 17:28 /user/hive/warehouse/hive1.db
drwxr-xr-x - hadoop supergroup 0 2017-12-22 17:40 /user/hive/warehouse/hive2.db
drwxr-xr-x - hadoop supergroup 0 2017-12-23 00:06 /user/hive/warehouse/managed_table
drwxr-xr-x - hadoop supergroup 0 2017-12-22 14:58 /user/hive/warehouse/word.db
drwxr-xr-x - hadoop supergroup 0 2017-12-22 15:34 /user/hive/warehouse/wordcount.db
Table metadata:

mysql> select * from tbls \G;
*************************** 4. row ***************************
            TBL_ID: 11
       CREATE_TIME: 1513958794
             DB_ID: 1
  LAST_ACCESS_TIME: 0
             OWNER: hadoop
         RETENTION: 0
             SD_ID: 11
          TBL_NAME: managed_table
          TBL_TYPE: MANAGED_TABLE
VIEW_EXPANDED_TEXT: NULL
VIEW_ORIGINAL_TEXT: NULL
*************************** 5. row ***************************
            TBL_ID: 13
       CREATE_TIME: 1513959668
             DB_ID: 1
  LAST_ACCESS_TIME: 0
             OWNER: hadoop
         RETENTION: 0
             SD_ID: 13
          TBL_NAME: external_table
          TBL_TYPE: EXTERNAL_TABLE
VIEW_EXPANDED_TEXT: NULL
VIEW_ORIGINAL_TEXT: NULL
Drop both tables, then check again:

hive> drop table managed_table;
OK
Time taken: 0.807 seconds
hive> drop table external_table;
OK

[hadoop@zydatahadoop001 ~]$ hdfs dfs -ls /user/hive/warehouse
Found 6 items
drwxr-xr-x - hadoop supergroup 0 2017-12-23 00:21 /user/hive/warehouse/external_table
drwxr-xr-x - hadoop supergroup 0 2017-12-22 14:42 /user/hive/warehouse/helloword
drwxr-xr-x - hadoop supergroup 0 2017-12-22 17:28 /user/hive/warehouse/hive1.db
drwxr-xr-x - hadoop supergroup 0 2017-12-22 17:40 /user/hive/warehouse/hive2.db
drwxr-xr-x - hadoop supergroup 0 2017-12-22 14:58 /user/hive/warehouse/word.db
drwxr-xr-x - hadoop supergroup 0 2017-12-22 15:34 /user/hive/warehouse/wordcount.db

After dropping, the managed table's data and metadata are both deleted, while the external table loses only its metadata: its directory external_table is still on HDFS.
1.3 Columns and comments: [(col_name data_type [COMMENT col_comment], ... [constraint_specification])] [COMMENT table_comment]
col_name: column name;
data_type: column type;
COMMENT col_comment: a comment on the column;
[COMMENT table_comment]: a comment on the table.

hive> create table student(
    > id int comment 'student id',
    > name string comment 'student name'
    > )
    > comment 'this is student information table';
OK
1.4 PARTITIONED BY (partitioned tables)
- Partitioning a table by a column speeds up queries over large data sets. Hive supports static and dynamic partitioning; dynamic partitioning is generally more convenient.
Static partitioning
A partitioned table is created with PARTITIONED BY. A table can have one or more partitions, and each partition is stored as a separate subdirectory under the table's directory.
A partition column appears as a regular column in the table schema (visible with describe table), but it does not hold actual data in the files; it only labels the partition.
Partitioned tables come in two forms: single-level partitioning, where the table directory contains a single level of partition subdirectories, and multi-level partitioning, where partition directories are nested.
Create a single-level partitioned table:

hive> CREATE TABLE order_partition (
    > order_number string,
    > event_time string
    > )
    > PARTITIONED BY (event_month string);
OK
Load the data from order.txt into order_partition:

hive> load data local inpath '/home/hadoop/order.txt' overwrite into table order_partition partition (event_month='2014-05');
hive> select * from order_partition;
10703007267488 2014-05-01 06:01:12.334+01 2014-05
10101043505096 2014-05-01 07:28:12.342+01 2014-05
10103043509747 2014-05-01 07:50:12.33+01 2014-05
10103043501575 2014-05-01 09:27:12.33+01 2014-05
10104043514061 2014-05-01 09:03:12.324+01 2014-05
Create a partition directory with hadoop commands and put the data there directly:

[hadoop@zydatahadoop001 ~]$ hadoop fs -mkdir -p /user/hive/warehouse/order_partition/event_month=2014-06
[hadoop@zydatahadoop001 ~]$ hadoop fs -put /home/hadoop/order.txt /user/hive/warehouse/order_partition/event_month=2014-06

After the upload, query order_partition:

hive> select * from order_partition;
OK
10703007267488 2014-05-01 06:01:12.334+01 2014-05
10101043505096 2014-05-01 07:28:12.342+01 2014-05
10103043509747 2014-05-01 07:50:12.33+01 2014-05
10103043501575 2014-05-01 09:27:12.33+01 2014-05
10104043514061 2014-05-01 09:03:12.324+01 2014-05
Time taken: 2.034 seconds, Fetched: 5 row(s)
The data put onto HDFS does not show up: the files do exist on HDFS, but the metastore in MySQL has no record of the new partition. This can be fixed with:

msck repair table order_partition;

Query again:

hive> select * from order_partition;
OK
10703007267488 2014-05-01 06:01:12.334+01 2014-05
10101043505096 2014-05-01 07:28:12.342+01 2014-05
10103043509747 2014-05-01 07:50:12.33+01 2014-05
10103043501575 2014-05-01 09:27:12.33+01 2014-05
10104043514061 2014-05-01 09:03:12.324+01 2014-05
10703007267488 2014-05-01 06:01:12.334+01 2014-06
10101043505096 2014-05-01 07:28:12.342+01 2014-06
10103043509747 2014-05-01 07:50:12.33+01 2014-06
10103043501575 2014-05-01 09:27:12.33+01 2014-06
10104043514061 2014-05-01 09:03:12.324+01 2014-06
Multi-level partitioning

hive> CREATE TABLE order_multi_partition (
    > order_number string,
    > event_time string
    > )
    > PARTITIONED BY (event_month string, step string);
OK
Load data:

hive> load data local inpath '/home/hadoop/order.txt' overwrite into table order_multi_partition partition (event_month='2014-05',step=1);
Query:

hive> select * from order_multi_partition;
OK
10703007267488 2014-05-01 06:01:12.334+01 2014-05 1
10101043505096 2014-05-01 07:28:12.342+01 2014-05 1
10103043509747 2014-05-01 07:50:12.33+01 2014-05 1
10103043501575 2014-05-01 09:27:12.33+01 2014-05 1
10104043514061 2014-05-01 09:03:12.324+01 2014-05 1
Check the corresponding HDFS path:

[hadoop@zydatahadoop001 ~]$ hdfs dfs -ls /user/hive/warehouse/order_multi_partition/event_month=2014-05
Found 1 items
drwxr-xr-x - hadoop supergroup 0 2018-01-09 22:52 /user/hive/warehouse/order_multi_partition/event_month=2014-05/step=1

As shown, multi-level partitions map to nested directories.
Dynamic partitioning
Definition: dynamic-partition (DP) columns are specified the same way as static-partition (SP) columns, in the PARTITION clause; the only difference is that a DP column has no value while an SP column does (after the PARTITION keyword there is only a key, no value).
In an INSERT ... SELECT ... statement, the dynamic partition columns must come last among the SELECT columns, in the same order as they appear in the PARTITION() clause.
Making all partition columns dynamic is only allowed in nonstrict mode; in strict mode it raises an error.
When dynamic and static partitions are mixed, the static partition columns must come before the dynamic ones. Hive defaults to static partitioning; to use dynamic partitioning, run:

set hive.exec.dynamic.partition=true;            -- enable dynamic partitioning
set hive.exec.dynamic.partition.mode=nonstrict;  -- the default is strict, which requires at least one static partition column; nonstrict allows all partition columns to be dynamic

To enable dynamic partitioning permanently, add these settings to hive-site.xml.
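The corresponding hive-site.xml fragment might look like this (a sketch; the property names match the set commands above):

```xml
<!-- Sketch: persist the dynamic-partition settings in hive-site.xml -->
<property>
  <name>hive.exec.dynamic.partition</name>
  <value>true</value>
</property>
<property>
  <name>hive.exec.dynamic.partition.mode</name>
  <value>nonstrict</value>
</property>
```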
Create a dynamic-partition table:

CREATE TABLE emp_dynamic_partition (
  empno int,
  ename string,
  job string,
  mgr int,
  hiredate string,
  salary double,
  comm double
)
PARTITIONED BY (deptno int);
Create a static-partition table for comparison:

CREATE TABLE emp_partition (
  empno int,
  ename string,
  job string,
  mgr int,
  hiredate string,
  salary double,
  comm double
)
PARTITIONED BY (deptno int)
ROW FORMAT DELIMITED FIELDS TERMINATED BY "\t";
insert into table emp_partition partition(deptno=10)
select empno, ename, job, mgr, hiredate, salary, comm from emp where deptno=10;
insert into table emp_partition partition(deptno=20)
select empno, ename, job, mgr, hiredate, salary, comm from emp where deptno=20;
insert into table emp_partition partition(deptno=30)
select empno, ename, job, mgr, hiredate, salary, comm from emp where deptno=30;
Query result:

hive> select * from emp_partition;
OK
7782 CLARK MANAGER 7839 1981/6/9 2450.0 NULL 10
7839 KING PRESIDENT NULL 1981/11/17 5000.0 NULL 10
7934 MILLER CLERK 7782 1982/1/23 1300.0 NULL 10
7369 SMITH CLERK 7902 1980/12/17 800.0 NULL 20
7566 JONES MANAGER 7839 1981/4/2 2975.0 NULL 20
7788 SCOTT ANALYST 7566 1987/4/19 3000.0 NULL 20
7876 ADAMS CLERK 7788 1987/5/23 1100.0 NULL 20
7902 FORD ANALYST 7566 1981/12/3 3000.0 NULL 20
7499 ALLEN SALESMAN 7698 1981/2/20 1600.0 300.0 30
7521 WARD SALESMAN 7698 1981/2/22 1250.0 500.0 30
7654 MARTIN SALESMAN 7698 1981/9/28 1250.0 1400.0 30
7698 BLAKE MANAGER 7839 1981/5/1 2850.0 NULL 30
7844 TURNER SALESMAN 7698 1981/9/8 1500.0 0.0 30
7900 JAMES CLERK 7698 1981/12/3 950.0 NULL 30
Using dynamic partitioning:

insert into table emp_dynamic_partition partition(deptno)
select empno, ename, job, mgr, hiredate, salary, comm, deptno from emp;

hive> select * from emp_dynamic_partition;
OK
7782 CLARK MANAGER 7839 1981/6/9 2450.0 NULL 10
7839 KING PRESIDENT NULL 1981/11/17 5000.0 NULL 10
7934 MILLER CLERK 7782 1982/1/23 1300.0 NULL 10
7369 SMITH CLERK 7902 1980/12/17 800.0 NULL 20
7566 JONES MANAGER 7839 1981/4/2 2975.0 NULL 20
7788 SCOTT ANALYST 7566 1987/4/19 3000.0 NULL 20
7876 ADAMS CLERK 7788 1987/5/23 1100.0 NULL 20
7902 FORD ANALYST 7566 1981/12/3 3000.0 NULL 20
7499 ALLEN SALESMAN 7698 1981/2/20 1600.0 300.0 30
7521 WARD SALESMAN 7698 1981/2/22 1250.0 500.0 30
7654 MARTIN SALESMAN 7698 1981/9/28 1250.0 1400.0 30
7698 BLAKE MANAGER 7839 1981/5/1 2850.0 NULL 30
7844 TURNER SALESMAN 7698 1981/9/8 1500.0 0.0 30
7900 JAMES CLERK 7698 1981/12/3 950.0 NULL 30
With dynamic partitioning, a single statement partitions the whole table, which is much more convenient.
1.5 ROW FORMAT
- Specify the field delimiter:
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
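Beyond the field delimiter, ROW FORMAT DELIMITED can also declare delimiters for collection items and map keys. A sketch (the table and column names below are illustrative):

```sql
-- Hypothetical table with complex types and explicit delimiters.
CREATE TABLE row_format_demo (
  name    string,
  hobbies array<string>,
  scores  map<string, int>
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t'
  COLLECTION ITEMS TERMINATED BY ','
  MAP KEYS TERMINATED BY ':';
```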
1.6 STORED AS (storage format)
- The default storage format is plain text (TEXTFILE).
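Hive also supports columnar and binary formats such as ORC and SEQUENCEFILE via the same clause. A sketch (table names are illustrative):

```sql
-- logs_text behaves the same as omitting STORED AS entirely.
CREATE TABLE logs_text (line string) STORED AS TEXTFILE;
CREATE TABLE logs_orc  (line string) STORED AS ORC;
```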
2. Create Table As Select (CTAS)
- Creates a table from the result of a SELECT statement; the new table takes the schema and data of the query result. This launches a MapReduce job.
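A minimal CTAS sketch, assuming the emp table used earlier in this section:

```sql
-- New table emp_dept30 gets the columns and rows of the query result.
CREATE TABLE emp_dept30
AS SELECT empno, ename, salary FROM emp WHERE deptno = 30;
```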
3. LIKE
- Creating a table with LIKE copies only the table structure, not the data.
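For example (assuming the same emp table; emp_copy is a hypothetical name):

```sql
-- emp_copy gets emp's schema but contains no rows.
CREATE TABLE emp_copy LIKE emp;
```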
4. desc formatted table_name
- Displays the table's information in a formatted layout.
5. Inspecting tables
- show tables;
- show create table table_name;
6. Dropping a table
- DROP TABLE [IF EXISTS] table_name [PURGE]; -- (Note: PURGE available in Hive 0.14.0 and later)
With PURGE specified, the data is deleted immediately instead of being moved to the trash.
DROP TABLE removes both the metadata and the data of the table. If the trash is configured (and PURGE is not specified), the data is actually moved to the .Trash/Current directory; the metadata is lost permanently.
When an EXTERNAL table is dropped, its data is not removed from the file system.
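A sketch of a trash-bypassing drop (the table name is illustrative):

```sql
-- Drop the table if it exists; PURGE skips the .Trash directory entirely.
DROP TABLE IF EXISTS emp_copy PURGE;
```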
7. Altering a table (Alter Table)
- Rename a table:
ALTER TABLE table_name RENAME TO new_table_name;
- Add partitions:
ALTER TABLE table_name ADD [IF NOT EXISTS] PARTITION partition_spec [LOCATION 'location'][, PARTITION partition_spec [LOCATION 'location'], ...];

partition_spec:
  (partition_column = partition_col_value, partition_column = partition_col_value, ...)
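Putting both forms together, a sketch (table names and the partition path are illustrative):

```sql
-- Rename a table, then register a partition at an explicit HDFS location.
ALTER TABLE emp_partition RENAME TO emp_partition_bak;
ALTER TABLE order_partition ADD IF NOT EXISTS
  PARTITION (event_month='2014-07')
  LOCATION '/user/hive/warehouse/order_partition/event_month=2014-07';
```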
Source: @若泽大数据