Hive DDL数据定义语言详解

最新推荐文章于 2023-01-30 18:19:02 发布

beautiful_huang

最新推荐文章于 2023-01-30 18:19:02 发布

阅读量389

点赞数 1

分类专栏： Hadoop Hive 文章标签： hadoop

本文链接：https://blog.csdn.net/beautiful_huang/article/details/103497957

版权

Hadoop 同时被 2 个专栏收录

27 篇文章 0 订阅

订阅专栏

Hive

13 篇文章 0 订阅

订阅专栏

文章目录

1、database 操作

create database if not exists XXX ;   // 建库
use XXX                               // 使用库
show databases                        // 查看库
describe database default             // 显示库的详细信息
alter database XXX set owner user dayongd // 更改库属性
drop database if exists XXX cascade   // 删除库和里面的表

补充：
hive的注释是“–”
如何查看当前所处的库？

hive> select current_database();
OK
mydemo

2、hive table 操作

1）外部表建立

建立外部表需要用到 location 关键字
对于外部表相当于建立了一个表的快捷方式来管理数据，表的路径是hdfs上的路径，跟数据库的数据路径是不同的。外部表本身来说，表对于里面的数据不具有管理权限，当调用hive语句drop外部表时，表被删除，但是数据还在，对数据起到了相关的保护作用。

create external table  if not exists origninfos(
id string,
name string,
sex string,
age int
)
comment '这是一个注释'
row format delimited fields terminated by ' '
stored as textfile 
location '/orign';

-- 使用关键字 external 来建立外部表，如果不使用则默认建立内部表
-- ow format delimited fields terminated by  为固定格式
-- stored as textfile 对应的表的数据文件作文本文件·

2）内部表建立

表的路径即是设置数据库的默认路径，数据放到内部表时，是完全被hive进行管理，删除表时，是在元数据层面删除，同时删除数据

create  table  if not exists orign(
id string,
name string,
sex string,
age int
)
comment '这是一个注释'
row format delimited fields terminated by ' '
stored as textfile;

插入数据
hive> insert into orign values('1','nini','man',20);

//批量导入数据
load data inpath '/user/a.txt' into table myuser;

3）临时表

create temporary table tmp as ...
create temporary table tmp like ...

4）高阶建表

建立空表

create table employ like employee

建立带数据的表

creaet table empoly as
select * from employee

5) CTE

create table employ
with
r1 as (select name from r2 where name = 'XXX')
r2 as (select name from employee where sex = 'man')

补充：

外部表与内部表的概念

未被external修饰的是内部表（managed table），被external修饰的为外部表（external table）；

外部表与内部表的区别（管理与删除）

内部表数据由Hive自身管理，外部表数据由HDFS管理；
内部表数据存储的位置是hive.metastore.warehouse.dir（默认：/user/hive/warehouse），外部表数据的存储位置由自己制定（如果没有LOCATION，Hive将在HDFS上的/user/hive/warehouse文件夹下以外部表的表名创建一个文件夹，并将属于这个表的数据存放在这里）；
删除内部表会直接删除元数据（metadata）及存储数据；删除外部表仅仅会删除元数据，HDFS上的文件并不会被删除；
对内部表的修改会将修改直接同步给元数据，而对外部表的表结构和分区进行修改，则需要修复（MSCK REPAIR TABLE table_name;）

外部表与内部表的使用情境

外部表：

外部表用来存储元数据；
在共享数据的时候，可以通过新建一个外部表，通过路径来共享

内部表：

在进行数据转换、数据清洗的时候。建立内部表

查看表结构命令：

desc XXX；
desc formatted XXX； //查看详细表结构

显示建表语句

show create table XXX；

3、分区表

分区表实际上就是对应一个HDFS文件系统上的独立的文件夹，该文件夹下是该分区所有的数据文件。Hive中的分区就是分目录，把一个大的数据集根据业务需要分割成小的数据集。在查询时通过WHERE子句中的表达式选择查询所需要的指定的分区，这样的查询效率会提高很多

主要用于提高性能
分区列的值将表划分为segments(文件夹)
查询时使用“分区”列和常规列类似
查询时Hive自动过滤掉不用于提高性能的分区
分为静态分区和动态分区

1）实战演示如何在hive中使用动态分区

1、创建一张分区表，包含两个分区dt和ht表示日期和小时

```bash
CREATE TABLE partition_table001   
(  
    name STRING,  
    ip STRING  
)  
PARTITIONED BY (dt STRING, ht STRING)  
ROW FORMAT DELIMITED FIELDS TERMINATED BY "\t

2、启用hive动态分区，只需要在hive会话中设置两个参数：

set hive.exec.dynamic.partition=true;  
set hive.exec.dynamic.partition.mode=nonstrict;

3、把partition_table001表某个日期分区下的数据load到目标表partition_table002
使用静态分区时，必须指定分区的值，如：

create table if not exists partition_table002 like partition_table001;  
 
insert overwrite table partition_table002 partition (dt='20150617', ht='00') select name, ip from partition_table001 where dt='20150617' and ht='00';

此时我们发现一个问题，如果希望插入每天24小时的数据，则需要执行24次上面的语句。而动态分区会根据select出的结果自动判断数据改load到哪个分区中去。

4、使用动态分区

insert overwrite table partition_table002 partition (dt, ht) select * from partition_table001 where dt='20150617';

hive先获取select的最后两个位置的dt和ht参数值，然后将这两个值填写到insert语句partition中的两个dt和ht变量中，即动态分区是通过位置来对应分区值的。原始表select出来的值和输出partition的值的关系仅仅是通过位置来确定的，和名字并没有关系，比如这里dt和st的名称完全没有关系。

2）静态分区使用

-- 静态分区
hive> create table stu2(
    > id int ,
    > name string,
    > like array<string>,
    > address map<string,string>
    > )
    > partitioned by (age int,sex string)
    > row format delimited
    > fields terminated by ','
    > collection items terminated by '-'
    > map keys terminated by ':'
    > lines terminated by '\n';
OK
Time taken: 0.076 seconds

hive> load data local inpath '/opt/data/stu.csv' into table stu2
    > partition (age=20,sex='male');
	
-- 静态分区删除：
hive> alter table stu2 drop partition(age=10,sex='male');
Dropped the partition age=10/sex=male
OK
Time taken: 0.421 seconds

3）静态分区和动态分区可以混合使用

1、全部DP

INSERT OVERWRITE TABLE T PARTITION (ds, hr)  
SELECT key, value, ds, hr FROM srcpart WHERE ds is not null and hr>10;

2、DP/SP结合

INSERT OVERWRITE TABLE T PARTITION (ds='2010-03-03', hr)  
SELECT key, value, /*ds,*/ hr FROM srcpart WHERE ds is not null and hr>10;

3、当SP是DP的子分区时，以下DML会报错，因为分区顺序决定了HDFS中目录的继承关系，这点是无法改变的

-- throw an exception  
INSERT OVERWRITE TABLE T PARTITION (ds, hr = 11)  
SELECT key, value, ds/*, hr*/ FROM srcpart WHERE ds is not null and hr=11;

4、多张表插入

FROM S  
INSERT OVERWRITE TABLE T PARTITION (ds='2010-03-03', hr)  
SELECT key, value, ds, hr FROM srcpart WHERE ds is not null and hr>10  
INSERT OVERWRITE TABLE R PARTITION (ds='2010-03-03, hr=12)  
SELECT key, value, ds, hr from srcpart where ds is not null and hr = 12;

5、CTAS，（CREATE-AS语句），DP与SP下的CTAS语法稍有不同，因为目标表的schema无法完全的从select语句传递过去。这时需要在create语句中指定partition列

6、上面展示了DP下的CTAS用法，如果希望在partition列上加一些自己的常量，可以这样做

CREATE TABLE T (key int, value string) PARTITIONED BY (ds string, hr int) AS  
SELECT key, value, ds, hr+1 hr1 FROM srcpart WHERE ds is not null and hr>10;

小结：
通过上面的案例，我们能够发现使用hive中的动态分区特性的最佳实践：对于那些存在很大数量的二级分区的表，使用动态分区可以非常智能的加载表，而在动静结合使用时需要注意静态分区值必须在动态分区值的前面

3）分区表基本操作

1．引入分区表（需要根据日期对日志进行管理）

/user/hive/warehouse/log_partition/20170702/20170702.log
/user/hive/warehouse/log_partition/20170703/20170703.log
/user/hive/warehouse/log_partition/20170704/20170704.log

2．创建分区表语法

hive (default)> create table dept_partition(
deptno int, dname string, loc string
)
partitioned by (month string)
row format delimited fields terminated by '\t';

3．加载数据到分区表中

hive (default)> create table dept_partition(
deptno int, dname string, loc string
)
partitioned by (month string)
row format delimited fields terminated by '\t';

4．查询分区表中数据
单分区查询

hive (default)> select * from dept_partition where month='201709';

多分区联合查询

hive (default)> select * from dept_partition where month='201709'
              union
              select * from dept_partition where month='201708'
              union
              select * from dept_partition where month='201707';

_u3.deptno      _u3.dname       _u3.loc _u3.month
10      ACCOUNTING      NEW YORK        201707
10      ACCOUNTING      NEW YORK        201708
10      ACCOUNTING      NEW YORK        201709
20      RESEARCH        DALLAS  201707
20      RESEARCH        DALLAS  201708
20      RESEARCH        DALLAS  201709
30      SALES   CHICAGO 201707
30      SALES   CHICAGO 201708
30      SALES   CHICAGO 201709
40      OPERATIONS      BOSTON  201707
40      OPERATIONS      BOSTON  201708
40      OPERATIONS      BOSTON  201709

5．增加分区
创建单个分区

hive (default)> alter table dept_partition add partition(month='201706') ;

同时创建多个分区

hive (default)> alter table dept_partition add partition(month='201705') partition(month='201704');

6．删除分区
删除单个分区

hive (default)> alter table dept_partition drop partition (month='201704');

同时删除多个分区

hive (default)> alter table dept_partition drop partition (month='201705'), partition (month='201706');

7．查看分区表有多少分区

hive> show partitions dept_partition;

8．查看分区表结构

hive> desc formatted dept_partition;

# Partition Information          
# col_name              data_type               comment             
month                   string

4）分区表注意事项

1．创建二级分区表

hive (default)> create table dept_partition2(
               deptno int, dname string, loc string
               )
               partitioned by (month string, day string)
               row format delimited fields terminated by '\t';

2．正常的加载数据
（1）加载数据到二级分区表中

hive (default)> load data local inpath '/opt/module/datas/dept.txt' into table
 default.dept_partition2 partition(month='201709', day='13');

（2）查询分区数据

hive (default)> select * from dept_partition2 where month='201709' and day='13';

3．把数据直接上传到分区目录上，让分区表和数据产生关联的三种方式
（1）方式一：上传数据后修复
上传数据

hive (default)> dfs -mkdir -p
 /user/hive/warehouse/dept_partition2/month=201709/day=12;
hive (default)> dfs -put /opt/module/datas/dept.txt  /user/hive/warehouse/dept_partition2/month=201709/day=12;

查询数据（查询不到刚上传的数据）

hive (default)> select * from dept_partition2 where month='201709' and day='12';

执行修复命令

hive> msck repair table dept_partition2;

再次查询数据

hive (default)> select * from dept_partition2 where month='201709' and day='12';

（2）方式二：上传数据后添加分区
上传数据

hive (default)> dfs -mkdir -p
 /user/hive/warehouse/dept_partition2/month=201709/day=11;
hive (default)> dfs -put /opt/module/datas/dept.txt  /user/hive/warehouse/dept_partition2/month=201709/day=11;

执行添加分区

hive (default)> alter table dept_partition2 add partition(month='201709',
 day='11');

查询数据

hive (default)> select * from dept_partition2 where month='201709' and day='11';

（3）方式三：创建文件夹后load数据到分区
创建目录

hive (default)> dfs -mkdir -p
 /user/hive/warehouse/dept_partition2/month=201709/day=10;

上传数据

hive (default)> load data local inpath '/opt/module/datas/dept.txt' into table
 dept_partition2 partition(month='201709',day='10');

查询数据

hive (default)> select * from dept_partition2 where month='201709' and day='10';

beautiful_huang

关注

1
点赞
踩
0

收藏

觉得还不错? 一键收藏
0
评论
Hive DDL数据定义语言详解

文章目录一、Apache hive 简介1.1 什么是hive1.2 为什么使用Hive二、Apache hive的架构2.1 hive的组件1）用户接口：2）元数据存储：2.2 hive和hadoop的关系2.3 hive与传统数据库对比2.4 hive的数据模型三、hive安装配置1、mysql安装2、解压、环境变量配置4、引入 mysql 连接 jar包5、格式化删除hadoop安全模...
复制链接

扫一扫