DDL (Data Definition Language)
Hive Data Storage Structure
1. Database: Hive can contain multiple databases. The default database is default, which maps to the HDFS directory /user/hive/warehouse; this location is configured with the hive.metastore.warehouse.dir parameter (set in hive-site.xml).
2. Table: Hive tables are divided into managed (internal) tables and external tables. Each table maps to a directory on HDFS: /user/hive/warehouse/[databasename.db]/table_name.
3. Partition: a table can be partitioned by a column, which makes queries more convenient and efficient. Each partition gets its own directory on HDFS:
/user/hive/warehouse/[databasename.db]/table_name/partition
4. Bucket: within a partition, data can be further divided into buckets for a finer-grained organization. Bucket data lives under the partition directory on HDFS:
/user/hive/warehouse/[databasename.db]/table_name/partition/buckets
Databases
1. Creating a database
CREATE (DATABASE|SCHEMA) [IF NOT EXISTS] database_name
[COMMENT database_comment]
[LOCATION hdfs_path]
[WITH DBPROPERTIES (property_name=property_value, ...)];

IF NOT EXISTS: create the database only if it does not already exist; if it exists, nothing is created and no error is raised.
COMMENT: a description of the database.
LOCATION: the HDFS path under which the database is created; if omitted, it defaults to /user/hive/warehouse/.
WITH DBPROPERTIES: key/value properties of the database.

hive> CREATE DATABASE IF NOT EXISTS hive2
    > COMMENT "it is my database"
    > WITH DBPROPERTIES ("creator"="zhangsan", "date"="2018-08-08")
    > ;
OK
The metadata for the new database is recorded in the DBS table of the basic01 MySQL database (the metastore database name configured in hive-site.xml). It can be inspected with select * from dbs \G:

mysql> select * from dbs \G;
*************************** 1. row ***************************
          DB_ID: 1
           DESC: Default Hive database
DB_LOCATION_URI: hdfs://192.168.137.200:9000/user/hive/warehouse
           NAME: default
     OWNER_NAME: public
     OWNER_TYPE: ROLE

The detailed database properties are recorded in the database_params table of basic01:

mysql> select * from database_params \G;
*************************** 1. row ***************************
      DB_ID: 21
  PARAM_KEY: creator
PARAM_VALUE: zhangsan
*************************** 2. row ***************************
      DB_ID: 21
  PARAM_KEY: date
PARAM_VALUE: 2018-08-08
2 rows in set (0.00 sec)
2. Listing databases
SHOW (DATABASES|SCHEMAS) [LIKE 'identifier_with_wildcards']
LIKE takes a pattern matched against database names and supports wildcards for fuzzy matching.

hive> show databases;
OK
default
word
wordcount
3. Describing a database
DESCRIBE DATABASE [EXTENDED] db_name;
DESCRIBE DATABASE db_name: shows the database's description and its directory location;
EXTENDED: additionally shows the database's properties (DBPROPERTIES).

hive> describe database hive2;
OK
hive2  it is my database  hdfs://192.168.137.200:9000/user/hive/warehouse/hive2.db  hadoop  USER
Time taken: 0.119 seconds, Fetched: 1 row(s)
hive> describe database extended hive2;
OK
hive2  it is my database  hdfs://192.168.137.200:9000/user/hive/warehouse/hive2.db  hadoop  USER  {date=2018-08-08, creator=zhangsan}
Time taken: 0.135 seconds, Fetched: 1 row(s)
4. Altering a database
ALTER (DATABASE|SCHEMA) database_name SET DBPROPERTIES (property_name=property_value, ...);
ALTER (DATABASE|SCHEMA) database_name SET OWNER [USER|ROLE] user_or_role;
ALTER (DATABASE|SCHEMA) database_name SET LOCATION hdfs_path;
hive> alter database hive2 set dbproperties ("update"="lisi");
OK
hive> alter database hive2 set owner user wjx;
OK

Check the database info before the change:

hive> describe database extended hive2;
OK
hive2  it is my database  hdfs://192.168.137.200:9000/user/hive/warehouse/hive2.db  hadoop  USER  {date=2018-08-08, creator=zhangsan}
Time taken: 0.135 seconds, Fetched: 1 row(s)

After the change:

hive> describe database extended hive2;
OK
hive2  it is my database  hdfs://192.168.137.200:9000/user/hive/warehouse/hive2.db  wjx  USER  {date=2018-08-08, creator=zhangsan, update=lisi}
Time taken: 0.235 seconds, Fetched: 1 row(s)
5. Dropping a database
DROP (DATABASE|SCHEMA) [IF EXISTS] database_name [RESTRICT|CASCADE];
RESTRICT: the default; dropping fails with an error if the database still contains tables.
CASCADE: cascade delete; if the database still contains tables, the tables are dropped first and then the database itself.

hive> drop database hive2;
OK
Tables
Column data types
int, bigint, float/double, string, boolean
(Note: Hive's 64-bit integer type is bigint, not long.)

Delimiters
Default row delimiter: "\n"
Field (column) delimiter, commonly specified as: "\t"
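As a sketch of how these types and delimiters combine (the table and column names below are illustrative, not from the original text):

```sql
-- Hypothetical table using the types above; rows end with '\n' by default,
-- so only the field delimiter needs to be declared.
CREATE TABLE IF NOT EXISTS emp_demo (
  empno  int,
  ename  string,
  salary double,
  active boolean
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
```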
1. Creating a table
CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name
  -- (Note: TEMPORARY available in Hive 0.14.0 and later)
[(col_name data_type [COMMENT col_comment], ... [constraint_specification])]
[COMMENT table_comment]
[PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)]
[CLUSTERED BY (col_name, col_name, ...) [SORTED BY (col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS]
[SKEWED BY (col_name, col_name, ...)  -- (Note: Available in Hive 0.10.0 and later)
  ON ((col_value, col_value, ...), (col_value, col_value, ...), ...)
  [STORED AS DIRECTORIES]]
[
  [ROW FORMAT row_format]
  [STORED AS file_format]
  | STORED BY 'storage.handler.class.name' [WITH SERDEPROPERTIES (...)]
  -- (Note: Available in Hive 0.6.0 and later)
]
[LOCATION hdfs_path]
[TBLPROPERTIES (property_name=property_value, ...)]
  -- (Note: Available in Hive 0.6.0 and later)
[AS select_statement];
  -- (Note: Available in Hive 0.5.0 and later; not supported for external tables)
1.1 TEMPORARY (temporary tables)
Since Hive 0.14.0, temporary tables can be created. A temporary table is only visible to the current session and is dropped automatically when the session ends.

CREATE TEMPORARY TABLE ...
1.2 EXTERNAL (external tables)
Hive has two kinds of tables: Managed Tables (the default) and External Tables (created with the EXTERNAL keyword). The main difference shows up on DROP: for a Managed Table, both the data (stored on HDFS) and the metadata (stored in MySQL) are deleted; for an External Table, only the metadata is deleted.
hive> create table managed_table(
    > id int,
    > name string
    > );
Create an external table:
hive> create external table external_table(
    > id int,
    > name string
    > );
Check the data on HDFS:

[hadoop@zydatahadoop001 ~]$ hdfs dfs -ls /user/hive/warehouse
Found 7 items
drwxr-xr-x - hadoop supergroup 0 2017-12-23 00:21 /user/hive/warehouse/external_table
drwxr-xr-x - hadoop supergroup 0 2017-12-22 14:42 /user/hive/warehouse/helloword
drwxr-xr-x - hadoop supergroup 0 2017-12-22 17:28 /user/hive/warehouse/hive1.db
drwxr-xr-x - hadoop supergroup 0 2017-12-22 17:40 /user/hive/warehouse/hive2.db
drwxr-xr-x - hadoop supergroup 0 2017-12-23 00:06 /user/hive/warehouse/managed_table
drwxr-xr-x - hadoop supergroup 0 2017-12-22 14:58 /user/hive/warehouse/word.db
drwxr-xr-x - hadoop supergroup 0 2017-12-22 15:34 /user/hive/warehouse/wordcount.db
Table metadata:

mysql> select * from tbls \G;
*************************** 4. row ***************************
            TBL_ID: 11
       CREATE_TIME: 1513958794
             DB_ID: 1
  LAST_ACCESS_TIME: 0
             OWNER: hadoop
         RETENTION: 0
             SD_ID: 11
          TBL_NAME: managed_table
          TBL_TYPE: MANAGED_TABLE
VIEW_EXPANDED_TEXT: NULL
VIEW_ORIGINAL_TEXT: NULL
*************************** 5. row ***************************
            TBL_ID: 13
       CREATE_TIME: 1513959668
             DB_ID: 1
  LAST_ACCESS_TIME: 0
             OWNER: hadoop
         RETENTION: 0
             SD_ID: 13
          TBL_NAME: external_table
          TBL_TYPE: EXTERNAL_TABLE
VIEW_EXPANDED_TEXT: NULL
VIEW_ORIGINAL_TEXT: NULL
Drop both tables, then check again:

hive> drop table managed_table;
OK
Time taken: 0.807 seconds
hive> drop table external_table;
OK

[hadoop@zydatahadoop001 ~]$ hdfs dfs -ls /user/hive/warehouse
Found 6 items
drwxr-xr-x - hadoop supergroup 0 2017-12-23 00:21 /user/hive/warehouse/external_table
drwxr-xr-x - hadoop supergroup 0 2017-12-22 14:42 /user/hive/warehouse/helloword
drwxr-xr-x - hadoop supergroup 0 2017-12-22 17:28 /user/hive/warehouse/hive1.db
drwxr-xr-x - hadoop supergroup 0 2017-12-22 17:40 /user/hive/warehouse/hive2.db
drwxr-xr-x - hadoop supergroup 0 2017-12-22 14:58 /user/hive/warehouse/word.db
drwxr-xr-x - hadoop supergroup 0 2017-12-22 15:34 /user/hive/warehouse/wordcount.db

After dropping, the managed table's data and metadata are both deleted, while the external table loses only its metadata: its directory external_table is still on HDFS.
1.3 Columns and comments: [(col_name data_type [COMMENT col_comment], ... [constraint_specification])] [COMMENT table_comment]
col_name: column name;
data_type: column type;
COMMENT col_comment: a comment on the column;
[COMMENT table_comment]: a comment on the table.

hive> create table student(
    > id int comment 'student id',
    > name string comment 'student name'
    > )
    > comment 'this is student information table';
OK
1.4 PARTITIONED BY (partitioned tables)
- Partitioning a table by a column speeds up queries over large data sets. Hive supports static and dynamic partitioning; dynamic partitioning is generally more convenient.
Static partitioning
A partitioned table is created with PARTITIONED BY. A table can have one or more partitions, and each partition is stored as a separate subdirectory under the table's directory.
A partition column appears as a regular column in the table schema (visible with describe table), but it does not hold actual data in the files; it only labels the partition.
Partitioned tables come in two forms: single-level partitioning, where the table directory contains a single level of partition subdirectories, and multi-level partitioning, where partition directories are nested.
Create a single-level partitioned table:

hive> CREATE TABLE order_partition (
    > order_number string,
    > event_time string
    > )
    > PARTITIONED BY (event_month string);
OK
Load the data from order.txt into order_partition:

hive> load data local inpath '/home/hadoop/order.txt' overwrite into table order_partition partition (event_month='2014-05');
hive> select * from order_partition;
10703007267488 2014-05-01 06:01:12.334+01 2014-05
10101043505096 2014-05-01 07:28:12.342+01 2014-05
10103043509747 2014-05-01 07:50:12.33+01 2014-05
10103043501575 2014-05-01 09:27:12.33+01 2014-05
10104043514061 2014-05-01 09:03:12.324+01 2014-05
Create a partition directory with hadoop commands and put the data there directly:

[hadoop@zydatahadoop001 ~]$ hadoop fs -mkdir -p /user/hive/warehouse/order_partition/event_month=2014-06
[hadoop@zydatahadoop001 ~]$ hadoop fs -put /home/hadoop/order.txt /user/hive/warehouse/order_partition/event_month=2014-06

After the upload, query order_partition:

hive> select * from order_partition;
OK
10703007267488 2014-05-01 06:01:12.334+01 2014-05
10101043505096 2014-05-01 07:28:12.342+01 2014-05
10103043509747 2014-05-01 07:50:12.33+01 2014-05
10103043501575 2014-05-01 09:27:12.33+01 2014-05
10104043514061 2014-05-01 09:03:12.324+01 2014-05
Time taken: 2.034 seconds, Fetched: 5 row(s)
The data put onto HDFS does not show up: the files do exist on HDFS, but the metastore in MySQL has no record of the new partition. This can be fixed with:

msck repair table order_partition;

Query again:

hive> select * from order_partition;
OK
10703007267488 2014-05-01 06:01:12.334+01 2014-05
10101043505096 2014-05-01 07:28:12.342+01 2014-05
10103043509747 2014-05-01 07:50:12.33+01 2014-05
10103043501575 2014-05-01 09:27:12.33+01 2014-05
10104043514061 2014-05-01 09:03:12.324+01 2014-05
10703007267488 2014-05-01 06:01:12.334+01 2014-06
10101043505096 2014-05-01 07:28:12.342+01 2014-06
10103043509747 2014-05-01 07:50:12.33+01 2014-06
10103043501575 2014-05-01 09:27:12.33+01 2014-06
10104043514061 2014-05-01 09:03:12.324+01 2014-06
Multi-level partitioning

hive> CREATE TABLE order_multi_partition (
    > order_number string,
    > event_time string
    > )
    > PARTITIONED BY (event_month string, step string);
OK
Load data:

hive> load data local inpath '/home/hadoop/order.txt' overwrite into table order_multi_partition partition (event_month='2014-05',step=1);
Query:

hive> select * from order_multi_partition;
OK
10703007267488 2014-05-01 06:01:12.334+01 2014-05 1
10101043505096 2014-05-01 07:28:12.342+01 2014-05 1
10103043509747 2014-05-01 07:50:12.33+01 2014-05 1
10103043501575 2014-05-01 09:27:12.33+01 2014-05 1
10104043514061 2014-05-01 09:03:12.324+01 2014-05 1
Check the corresponding HDFS path:

[hadoop@zydatahadoop001 ~]$ hdfs dfs -ls /user/hive/warehouse/order_multi_partition/event_month=2014-05
Found 1 items
drwxr-xr-x - hadoop supergroup 0 2018-01-09 22:52 /user/hive/warehouse/order_multi_partition/event_month=2014-05/step=1

As shown, multi-level partitions map to nested directories.
Dynamic partitioning
Definition: dynamic-partition (DP) columns are specified the same way as static-partition (SP) columns, in the PARTITION clause; the only difference is that a DP column has no value while an SP column does (after the PARTITION keyword there is only a key, no value).
In an INSERT ... SELECT ... statement, the dynamic partition columns must come last among the SELECT columns, in the same order as they appear in the PARTITION() clause.
Making all partition columns dynamic is only allowed in nonstrict mode; in strict mode it raises an error.
When dynamic and static partitions are mixed, the static partition columns must come before the dynamic ones. Hive defaults to static partitioning; to use dynamic partitioning, run:

set hive.exec.dynamic.partition=true;            -- enable dynamic partitioning
set hive.exec.dynamic.partition.mode=nonstrict;  -- the default is strict, which requires at least one static partition column; nonstrict allows all partition columns to be dynamic

To enable dynamic partitioning permanently, add these settings to hive-site.xml.
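The corresponding hive-site.xml fragment might look like this (a sketch; the property names match the set commands above):

```xml
<!-- Sketch: persist the dynamic-partition settings in hive-site.xml -->
<property>
  <name>hive.exec.dynamic.partition</name>
  <value>true</value>
</property>
<property>
  <name>hive.exec.dynamic.partition.mode</name>
  <value>nonstrict</value>
</property>
```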
Create a dynamic-partition table:

CREATE TABLE emp_dynamic_partition (
  empno int,
  ename string,
  job string,
  mgr int,
  hiredate string,
  salary double,
  comm double
)
PARTITIONED BY (deptno int);
Create a static-partition table for comparison:

CREATE TABLE emp_partition (
  empno int,
  ename string,
  job string,
  mgr int,
  hiredate string,
  salary double,
  comm double
)
PARTITIONED BY (deptno int)
ROW FORMAT DELIMITED FIELDS TERMINATED BY "\t";
insert into table emp_partition partition(deptno=10)
select empno, ename, job, mgr, hiredate, salary, comm from emp where deptno=10;
insert into table emp_partition partition(deptno=20)
select empno, ename, job, mgr, hiredate, salary, comm from emp where deptno=20;
insert into table emp_partition partition(deptno=30)
select empno, ename, job, mgr, hiredate, salary, comm from emp where deptno=30;
Query result:

hive> select * from emp_partition;
OK
7782 CLARK MANAGER 7839 1981/6/9 2450.0 NULL 10
7839 KING PRESIDENT NULL 1981/11/17 5000.0 NULL 10
7934 MILLER CLERK 7782 1982/1/23 1300.0 NULL 10
7369 SMITH CLERK 7902 1980/12/17 800.0 NULL 20
7566 JONES MANAGER 7839 1981/4/2 2975.0 NULL 20
7788 SCOTT ANALYST 7566 1987/4/19 3000.0 NULL 20
7876 ADAMS CLERK 7788 1987/5/23 1100.0 NULL 20
7902 FORD ANALYST 7566 1981/12/3 3000.0 NULL 20
7499 ALLEN SALESMAN 7698 1981/2/20 1600.0 300.0 30
7521 WARD SALESMAN 7698 1981/2/22 1250.0 500.0 30
7654 MARTIN SALESMAN 7698 1981/9/28 1250.0 1400.0 30
7698 BLAKE MANAGER 7839 1981/5/1 2850.0 NULL 30
7844 TURNER SALESMAN 7698 1981/9/8 1500.0 0.0 30
7900 JAMES CLERK 7698 1981/12/3 950.0 NULL 30
Using dynamic partitioning:

insert into table emp_dynamic_partition partition(deptno)
select empno, ename, job, mgr, hiredate, salary, comm, deptno from emp;

hive> select * from emp_dynamic_partition;
OK
7782 CLARK MANAGER 7839 1981/6/9 2450.0 NULL 10
7839 KING PRESIDENT NULL 1981/11/17 5000.0 NULL 10
7934 MILLER CLERK 7782 1982/1/23 1300.0 NULL 10
7369 SMITH CLERK 7902 1980/12/17 800.0 NULL 20
7566 JONES MANAGER 7839 1981/4/2 2975.0 NULL 20
7788 SCOTT ANALYST 7566 1987/4/19 3000.0 NULL 20
7876 ADAMS CLERK 7788 1987/5/23 1100.0 NULL 20
7902 FORD ANALYST 7566 1981/12/3 3000.0 NULL 20
7499 ALLEN SALESMAN 7698 1981/2/20 1600.0 300.0 30
7521 WARD SALESMAN 7698 1981/2/22 1250.0 500.0 30
7654 MARTIN SALESMAN 7698 1981/9/28 1250.0 1400.0 30
7698 BLAKE MANAGER 7839 1981/5/1 2850.0 NULL 30
7844 TURNER SALESMAN 7698 1981/9/8 1500.0 0.0 30
7900 JAMES CLERK 7698 1981/12/3 950.0 NULL 30
With dynamic partitioning, a single statement partitions the whole table, which is much more convenient.
1.5 ROW FORMAT
- Specify the field delimiter:
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
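Beyond the field delimiter, ROW FORMAT DELIMITED can also declare delimiters for collection items and map keys. A sketch (the table and column names below are illustrative):

```sql
-- Hypothetical table with complex types and explicit delimiters.
CREATE TABLE row_format_demo (
  name    string,
  hobbies array<string>,
  scores  map<string, int>
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t'
  COLLECTION ITEMS TERMINATED BY ','
  MAP KEYS TERMINATED BY ':';
```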
1.6 STORED AS (storage format)
- The default storage format is plain text (TEXTFILE).
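Hive also supports columnar and binary formats such as ORC and SEQUENCEFILE via the same clause. A sketch (table names are illustrative):

```sql
-- logs_text behaves the same as omitting STORED AS entirely.
CREATE TABLE logs_text (line string) STORED AS TEXTFILE;
CREATE TABLE logs_orc  (line string) STORED AS ORC;
```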
2. Create Table As Select (CTAS)
- Creates a table from the result of a SELECT statement; the new table takes the schema and data of the query result. This launches a MapReduce job.
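A minimal CTAS sketch, assuming the emp table used earlier in this section:

```sql
-- New table emp_dept30 gets the columns and rows of the query result.
CREATE TABLE emp_dept30
AS SELECT empno, ename, salary FROM emp WHERE deptno = 30;
```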
3. LIKE
- Creating a table with LIKE copies only the table structure, not the data.
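For example (assuming the same emp table; emp_copy is a hypothetical name):

```sql
-- emp_copy gets emp's schema but contains no rows.
CREATE TABLE emp_copy LIKE emp;
```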
4. desc formatted table_name
- Displays the table's information in a formatted layout.
5. Inspecting tables
- show tables;
- show create table table_name;
6. Dropping a table
- DROP TABLE [IF EXISTS] table_name [PURGE]; -- (Note: PURGE available in Hive 0.14.0 and later)
With PURGE specified, the data is deleted immediately instead of being moved to the trash.
DROP TABLE removes both the metadata and the data of the table. If the trash is configured (and PURGE is not specified), the data is actually moved to the .Trash/Current directory; the metadata is lost permanently.
When an EXTERNAL table is dropped, its data is not removed from the file system.
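A sketch of a trash-bypassing drop (the table name is illustrative):

```sql
-- Drop the table if it exists; PURGE skips the .Trash directory entirely.
DROP TABLE IF EXISTS emp_copy PURGE;
```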
7. Altering a table (Alter Table)
- Rename a table:
ALTER TABLE table_name RENAME TO new_table_name;
- Add partitions:
ALTER TABLE table_name ADD [IF NOT EXISTS] PARTITION partition_spec [LOCATION 'location'][, PARTITION partition_spec [LOCATION 'location'], ...];

partition_spec:
  (partition_column = partition_col_value, partition_column = partition_col_value, ...)
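Putting both forms together, a sketch (table names and the partition path are illustrative):

```sql
-- Rename a table, then register a partition at an explicit HDFS location.
ALTER TABLE emp_partition RENAME TO emp_partition_bak;
ALTER TABLE order_partition ADD IF NOT EXISTS
  PARTITION (event_month='2014-07')
  LOCATION '/user/hive/warehouse/order_partition/event_month=2014-07';
```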
Source: @若泽大数据