Hive: DDL Operations and Working with Tables

Latest recommended article published on 2024-05-27 09:20:55

CoderBoom

Categories: Big Data, hive. Tags: Hive, DDL operations, table operations

Article link: https://blog.csdn.net/CoderBoom/article/details/84311751

Hive: Basic DDL Operations

Error categories in Hive:

Error while compiling statement: a Hive compiler error, i.e. a SQL syntax problem
Error while processing statement: a Hive execution-time error, i.e. a problem in the application logic

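1. DDL Operations

1.1 Creating Tables

Table creation syntax

create [external] table [if not exists] table_name    -- if not exists: skip creation when the table already exists
[(col_name data_type [comment col_comment], ...)]      -- column name, data type, optional column description
[comment table_comment]                                -- optional table description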
[partitioned by (col_name data_type [comment col_comment], ...)]
[clustered by (col_name, col_name, ...)
[sorted by (col_name [asc|desc], ...)] into num_buckets buckets]
[row format row_format]
[stored as file_format]
[location hdfs_path]

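Notes:

Create a table: create table [if not exists] table_name(col_name data_type, ...);

Example: create a table named t_test with create table t_test(id int,name string,age int); This uses the default field delimiter '\001'. In vi, press Ctrl+v and then Ctrl+a to type '\001'; it is displayed as ^A.

In practice we usually need to specify the delimiter explicitly.

Prepare the data

Create a directory named hivedata to hold the HiveQL test data (the load statements later in this post read from /root/hivedata)
mkdir hivedata
Enter that directory and create the data file
cd hivedata
vi 1.txt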
1,zhangsan,18
2,lisi,19
3,wangwu,20

create table t_test1(id int,name string,age int) row format delimited fields terminated by ",";

Upload the data to the table's directory under the Hive warehouse path on HDFS

hadoop fs -put 1.txt /user/hive/warehouse/test01.db/t_test1

Then query the table from the remote client: select * from t_test1;

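Summary:

The structured file cannot be placed just anywhere: every table created in Hive has a corresponding directory on HDFS.

test01.db --> t_test1 --> /user/hive/warehouse/test01.db/t_test1

A delimiter does not always have to be specified, because there is a default delimiter, '\001'.
The column data types in the table definition must match the structured file, in the same order.

Only after the structured data has been mapped to a table successfully can Hive SQL be used to analyze it.

Hive data types
- In addition to SQL types, Hive also supports Java types, and type names are case-insensitive
Hive delimiters
- row format delimited | serde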

row format delimited
[fields terminated by char]
[collection items terminated by char]
[map keys terminated by char]
[lines terminated by char] | serde serde_name
[with serdeproperties

row format delimited fields terminated by ","
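row format: marks the start of the delimiter specification
delimited: use Hive's built-in delimiter-handling class to split the data (LazySimpleSerDe by default)
fields: specifies the field delimiter
collection: specifies the delimiter between the items of a collection
map: specifies the delimiter between a map key and its value

Exercises:

Simple types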

allen|19|beijing
tom|20|shanghai
  
create table t_t1(name string,age int,city string) row format delimited fields terminated by '|';
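Load the data (if the file is not on HDFS but on the local Linux filesystem of the Hive server, this is a copy operation)
load data local inpath '/root/hivedata/1.txt' into table t_t1;
If the file is in the HDFS root directory, this is a move operation
load data inpath '/1.txt' into table t_t1;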

Complex types

For complex types:
  zhangsan	beijing,shanghai,tianjin,hangzhou
  wangwu	shanghai,chengdu,wuhan,haerbin
  
create table complex_array(name string,work_locations array<string>) row format delimited fields terminated by '\t' collection items terminated by ',';

1,zhangsan,唱歌:非常喜欢-跳舞:喜欢-游泳:一般般
2,lisi,打游戏:非常喜欢-篮球:不喜欢
create table t_map(id int,name string,hobby map<string,string>)
	row format delimited 
	fields terminated by ','
	collection items terminated by '-'
	map keys terminated by ':' ;
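
The three delimiter levels in the two DDL statements above can be sketched in Python. This is only an illustration of how the separators nest, not Hive's actual SerDe code; the function names `parse_complex_array_line` and `parse_t_map_line` are made up for this sketch.

```python
def parse_complex_array_line(line, field_sep="\t", item_sep=","):
    """Mimic complex_array: fields split on tab, array items split on ','."""
    name, locations_text = line.split(field_sep, 1)
    return name, locations_text.split(item_sep)


def parse_t_map_line(line, field_sep=",", item_sep="-", key_sep=":"):
    """Mimic t_map: fields on ',', map entries on '-', key and value on ':'."""
    id_text, name, hobby_text = line.split(field_sep, 2)
    hobby = {}
    for entry in hobby_text.split(item_sep):
        key, value = entry.split(key_sep, 1)
        hobby[key] = value
    return int(id_text), name, hobby


print(parse_complex_array_line("zhangsan\tbeijing,shanghai,tianjin,hangzhou"))
print(parse_t_map_line("2,lisi,打游戏:非常喜欢-篮球:不喜欢"))
```

The point of the sketch is that each level needs its own character: if the field delimiter and the collection delimiter were the same, the array or map column could not be told apart from the next field.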

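If the create table statement has no row format clause, no delimiter is specified, and the table is created with the default delimiter.

'\001': in vi, press Ctrl+v and then Ctrl+a to type it; it is displayed as ^A.

If the structured data happens to be delimited by \001 as well, the mapping succeeds without specifying a delimiter.

The ^A characters are shown in a different color in vi.

[Figures: two screenshots illustrating the \001-delimited data]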

Mapping between Hive databases and tables and their HDFS locations

By default Hive stores data on HDFS under /user/hive/warehouse

database ------>/user/hive/warehouse/db_name.db
database.table---->/user/hive/warehouse/db_name.db/table_name
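
The mapping above is simple enough to write down as two helper functions. This is a minimal sketch that assumes the default warehouse root from this post; `database_path` and `table_path` are illustrative names, and the sketch does not cover databases or tables created with their own location.

```python
WAREHOUSE_ROOT = "/user/hive/warehouse"  # default root, as stated in this post


def database_path(db_name):
    """database -> /user/hive/warehouse/db_name.db"""
    return f"{WAREHOUSE_ROOT}/{db_name}.db"


def table_path(db_name, table_name):
    """database.table -> /user/hive/warehouse/db_name.db/table_name"""
    return f"{database_path(db_name)}/{table_name}"


print(table_path("test01", "t_test1"))
```

This is the directory that the earlier hadoop fs -put command targets for test01.t_test1.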

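Creating an external table (... external ... location ...): the external keyword lets the user create an external table and, at creation time, specify a path (location) that points to the actual data.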

create external table allen(id int,name string,age int) row format delimited fields terminated by ',' location '/allenwoon';

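With a managed (internal) table, Hive moves the data into the path that the data warehouse points to; with an external table, Hive only records where the data lives and does not change its location. When a table is dropped, a managed table's metadata and data are deleted together, whereas an external table loses only its metadata and the data stays.

Test:

First create a directory /hivedata on HDFS to hold the data, then create the test data file 1.txt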

1,hello,2
3,hi,4
5,see,6

Create the table

create external table t_t1(id int,name string,age int) row format delimited fields terminated by "," location '/hivedata';

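An external table means the data can live at any HDFS path and does not need to be moved under Hive's default path.

Benefits of external tables
- Managed table: dropping the table deletes the table information in Hive and also the corresponding structured files
- External table: dropping the table deletes only the table information in Hive, not the table's files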
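
The difference in drop behavior can be modelled with a toy class. This is a sketch under heavy simplification: `TinyWarehouse`, its `metastore` dict, and its `hdfs` set are invented for illustration and have nothing to do with Hive's real metastore API.

```python
class TinyWarehouse:
    """Toy model: `metastore` holds table metadata, `hdfs` holds data directories."""

    def __init__(self):
        self.metastore = {}
        self.hdfs = set()

    def create_table(self, name, location, external=False):
        self.metastore[name] = {"location": location, "external": external}
        self.hdfs.add(location)

    def drop_table(self, name):
        meta = self.metastore.pop(name)  # metadata is always removed
        if not meta["external"]:
            # Only a managed table takes its data directory with it.
            self.hdfs.discard(meta["location"])


wh = TinyWarehouse()
wh.create_table("t_test1", "/user/hive/warehouse/test01.db/t_test1")
wh.create_table("allen", "/allenwoon", external=True)
wh.drop_table("t_test1")
wh.drop_table("allen")
print(wh.metastore, wh.hdfs)
```

After both drops the metastore is empty, but the external table's directory /allenwoon is still present.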

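Partitioned tables: partitioned by

A Hive select query normally scans the whole table, which wastes a lot of time on unnecessary work. Often only a subset of the data is of interest, so the partition concept was introduced at table creation.

A partitioned table is one whose partition space is specified when the table is created. A table can have one or more partitions, and each partition exists as its own directory under the table's directory. Table and column names are case-insensitive. A partition appears in the table schema as a column, visible through the describe table_name command, but that column does not store actual data; it only identifies the partition.

There are two kinds of partitioned tables. A single-partition table has only one level of directories under the table directory. A multi-partition table has nested directories under the table directory.

Single-partition DDL: create table day_table(id int,content string) partitioned by (dt string); ==> a single-partition table, partitioned by day, whose schema has three columns: id, content, and dt
- Load data: load data local inpath '/root/hivedata/dat_table.txt' into table day_table partition(dt='2017-07-07');
- Query: select * from day_table where dt="xxx";
Two-level partition DDL: create table day_hour_table (id int, content string) partitioned by (dt string, hour string); ==> a two-level partitioned table, partitioned by day and hour, whose schema gains the two columns dt and hour
- Load data: load data local inpath '/root/hivedata/dat_table.txt' into table day_hour_table partition(dt='2017-07-07', hour='08');
- Query: select * from day_hour_table where dt="xxx" and hour="yyy";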

Multi-partition table test

Create

create table t_user_duo(id int, name string,country string) partitioned by (guojia string, sheng string) row format delimited fields terminated by ',';
------
create table t_user_duo(id int, name string,city string) partitioned by (province string, xian string) row format delimited fields terminated by ',';

Load

LOAD DATA local INPATH '/root/hivedata/china.txt' INTO TABLE t_user_duo PARTITION(guojia='zhongguo', sheng='hebei');
------------
load data local inpath '/root/hivedata/beijing.txt' into table t_user_duo partition(province='beijing', xian='shunyi');

Query

select * from t_user_duo where guojia ="zhongguo" and sheng ="jiangsu";

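Summary

The examples here use two partition levels (Hive allows more partition columns than that); each later partition further subdivides the one before it.

On disk this shows up as the second partition's directory being created inside the first partition's directory.

Common multi-partition schemes: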

(province, city)

(month ,day)

Partition-based query: select day_table.* from day_table where day_table.dt = '2017-07-07';
List partitions: show partitions day_hour_table;

Note: the value given after dt is used in the name of the partition directory, for example dt=2017-07-07.
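
The nesting of partition directories can be sketched as a small path builder. This is an illustration only; `partition_path` is a made-up helper, and the table directory passed in is the one used for t_user_duo elsewhere in this post.

```python
def partition_path(table_dir, partition_spec):
    """Build the directory of a partition.

    partition_spec is an ordered list of (column, value) pairs; each pair
    adds one directory level named column=value.
    """
    levels = [f"{column}={value}" for column, value in partition_spec]
    return "/".join([table_dir] + levels)


table_dir = "/user/hive/warehouse/test01.db/t_user_duo"
print(partition_path(table_dir, [("province", "hainan"), ("xian", "haizhu")]))
```

The order of the pairs matters: the first partition column is the outer directory and the second one sits inside it.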

Hands-on:

Data: create three data files

vi china.txt
1,zhangsan,china
2,lisi,china
3,wangwu,china
----------
vi usa.txt
4,jack,usa
5,tom,usa
----------
vi japan.txt
6,haoo,japan
7,wood,japan

Both of the following operations are run on node-3

Create a table

create table t_user1(id int,name string,country string) partitioned by (guojia string) row format delimited fields terminated by ",";

Load the data

load data local inpath '/root/hivedata/china.txt' into table t_user1 partition(guojia='zhongguo');

load data local inpath '/root/hivedata/usa.txt' into table t_user1 partition(guojia='meiguo');

load data local inpath '/root/hivedata/japan.txt' into table t_user1 partition(guojia='riben');

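Tip: a plain hadoop fs -put does not work here; the partition column is not specified, so the data cannot be mapped.

Query

select * from t_user1 where guojia="zhongguo"; Querying by the partition column scans only the files of the matching partition
select * from t_user1 where country="china"; Querying by an ordinary table column scans the whole table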
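
The difference between the two queries can be illustrated with a toy model of the table's directories. This is a sketch, not Hive's planner: `PARTITIONS`, `query_by_partition`, and `query_by_column` are invented names, and the rows are the ones from china.txt, usa.txt, and japan.txt above.

```python
# Toy layout: partition value (guojia) -> rows stored in that partition directory.
PARTITIONS = {
    "zhongguo": [(1, "zhangsan", "china"), (2, "lisi", "china"), (3, "wangwu", "china")],
    "meiguo": [(4, "jack", "usa"), (5, "tom", "usa")],
    "riben": [(6, "haoo", "japan"), (7, "wood", "japan")],
}


def query_by_partition(guojia):
    """Filter on the partition column: only the matching directory is read."""
    scanned = [guojia] if guojia in PARTITIONS else []
    rows = [row + (guojia,) for row in PARTITIONS.get(guojia, [])]
    return rows, scanned


def query_by_column(country):
    """Filter on a data column: every directory is read, then rows are filtered."""
    scanned = list(PARTITIONS)
    rows = [
        row + (guojia,)
        for guojia, part_rows in PARTITIONS.items()
        for row in part_rows
        if row[2] == country
    ]
    return rows, scanned


print(query_by_partition("zhongguo")[1])
print(query_by_column("china")[1])
```

Both calls return the same three rows, but the second one has to visit all three partition directories to find them. Note that the partition value is appended to each row rather than read from the file, which matches the idea that the partition column is not stored in the data.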

The directory structure is shown in the figure below:

[Figure: directory structure of the partitioned table]

The query result is shown in the figure below:

[Figure: query result]

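Partitioned table summary:

Partitioning is a query optimization technique: it reduces full-table scans.
A partition column must not have the same name as any column in the table!
A partition column is a virtual column: it is not actually stored in the table data, and the value displayed comes from the partition specified when the data was loaded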
load data local inpath '/root/hivedata/china.txt' into table t_user1 partition(guojia='zhongguo');
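A partition column can be used directly in SQL as a query condition to optimize queries
The point of partitioning is to manage files more finely at the directory level
Overall, a partition assists queries: it narrows the query scope, speeds up data retrieval, and lets data be managed according to defined rules and conditions.

**Extension:** in production, what should be used as the partition column when creating a partitioned table?

If there is one file per day: partitioned by (day string)
If there is one file per province: partitioned by (province string)

What partitioning achieves is shown in the figure below:

[Figure: the effect of partitioning]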

Bucketed tables (clustered tables): clustered by xxx into N buckets

clustered by xxx into N buckets

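Literally: split into N buckets according to xxx

In plain terms: split the table's file into N parts by the xxx column

Split by what: clustered by xxx, i.e. by the xxx column

How many parts: N buckets, where N is the number of buckets

How to split: the default bucketing rule, hashfunc()

If the bucketing column is numeric: hashfunc(xxx) = xxx, and the remainder of xxx % N decides which bucket the row goes to

If the bucketing column is a string: hashfunc(xxx) = xxx.hashcode, and the remainder of xxx.hashcode % N decides which bucket the row goes to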
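
The rule above can be sketched in Python. Two assumptions are made here and are not taken from this post: the string hash is modelled on Java's String.hashCode (valid for characters in the Basic Multilingual Plane), and the hash is masked to a non-negative value before the modulo so that the bucket index is never negative. `java_string_hashcode` and `bucket_for` are illustrative names.

```python
def java_string_hashcode(text):
    """Java-style String.hashCode as a signed 32-bit integer (BMP characters only)."""
    h = 0
    for ch in text:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    return h - 0x100000000 if h >= 0x80000000 else h


def bucket_for(value, num_buckets):
    """Numeric columns hash to themselves; strings use the hash code."""
    h = value if isinstance(value, int) else java_string_hashcode(value)
    # Assumption: mask off the sign bit so the remainder is a valid bucket index.
    return (h & 0x7FFFFFFF) % num_buckets


print(bucket_for(95001, 4))  # Sno 95001 with 4 buckets
print(bucket_for("IS", 4))   # a string column such as Sdept
```

With 4 buckets, student numbers that differ by a multiple of 4 always land in the same bucket, which is why consecutive Sno values spread evenly across the bucket files.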

Creating a bucketed table

Prepare the data: vi students.txt

95001,李勇,男,20,CS
95002,刘晨,女,19,IS
95003,王敏,女,22,MA
95004,张立,男,19,IS
95005,刘刚,男,18,MA
95006,孙庆,男,23,CS
95007,易思玲,女,19,MA
95008,李娜,女,18,CS
95009,梦圆圆,女,18,MA
95010,孔小涛,男,19,CS
95011,包小柏,男,18,MA
95012,孙花,女,20,CS
95013,冯伟,男,21,CS
95014,王小丽,女,19,CS
95015,王君,男,18,MA
95016,钱国,男,21,MA
95017,王风娟,女,18,IS
95018,王一,女,19,IS
95019,邢小丽,女,19,IS
95020,赵钱,男,21,IS
95021,周二,男,17,MA
95022,郑明,男,20,MA

Enable bucketing

set hive.enforce.bucketing = true;

Specify how many buckets to split into (one reducer per bucket)

set mapreduce.job.reduces=4;

Create the bucketed table

create table stu_buck(Sno int,Sname string,Sex string,Sage int,Sdept string)
clustered by(Sno) 
into 4 buckets
row format delimited
fields terminated by ',';

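Loading data into the bucketed table

The proper way to load data into a bucketed table: insert + select

insert + values: the inserted rows come from the values clause
insert + select: the inserted rows come from the result of the select query that follows

First create a staging table, student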
create table student(Sno int,Sname string,Sex string,Sage int,Sdept string)
row format delimited
fields terminated by ',';

Then load data into the staging table
hadoop fs -put students.txt /user/hive/warehouse/itcast.db/student

Finally, query the staging table (student) and insert the rows, bucketed, into the final bucketed table (stu_buck)
insert overwrite table stu_buck select * from student cluster by(Sno);
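Extension:
When the distribute column and the sort column are the same:
cluster by (distributes and sorts, always on the same column) == distribute by (distributes) + sort by (sorts); with distribute by + sort by the two columns may differ
cluster by and sort by cannot be used together
To bucket by one column while sorting by another: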
insert overwrite table stu_buck
select * from student distribute by(Sno) sort by(Sage asc);
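
What distribute by and sort by do together can be simulated in a few lines of Python. This is a rough model, not how Hive runs the job: `distribute_and_sort` is a made-up function, rows are reduced to (Sno, Sage) pairs taken from students.txt, and reducers are assigned by a plain modulo on Sno.

```python
def distribute_and_sort(rows, num_reducers, distribute_key, sort_key):
    """Send each row to reducer distribute_key(row) % num_reducers,
    then sort the rows inside each reducer by sort_key."""
    outputs = [[] for _ in range(num_reducers)]
    for row in rows:
        outputs[distribute_key(row) % num_reducers].append(row)
    return [sorted(part, key=sort_key) for part in outputs]


# (Sno, Sage) pairs for a few students from students.txt
students = [(95001, 20), (95002, 19), (95005, 18), (95006, 23), (95009, 18), (95010, 19)]

# distribute by(Sno) sort by(Sage asc)
by_age = distribute_and_sort(students, 4, lambda r: r[0], lambda r: r[1])
# cluster by(Sno): distribute and sort on the same column
by_sno = distribute_and_sort(students, 4, lambda r: r[0], lambda r: r[0])

print(by_age[1])
print(by_sno[1])
```

Both calls put the same rows in each output; only the order inside an output changes, which is the whole difference between cluster by and distribute by plus sort by on another column.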

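Tips:

A plain put only maps the data successfully; the data is not split into 4 buckets by Sno, which defeats the purpose.

For the same reason the load syntax is not suitable for bucketed tables.

After the insert succeeds, the result looks like the figure below:

[Figure: the stu_buck table directory after the insert]

The data has been split into 4 parts.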

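Bucketed table summary:

A bucketed table splits the table's data into the specified number of buckets (parts) by the specified column
The bucketing column must be an existing column of the table
Bucketing is off by default and must be enabled manually
Benefit of bucketing: it optimizes join queries by reducing the Cartesian product
Bucketing requires an intermediate table to pass the data through, because rows must be queried before they are placed into buckets

The benefit of bucketing is shown in the figure below:

[Figure: the benefit of bucketing]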

1.2 Altering Tables

Add partitions:

Add one partition at a time

alter table table_name add partition (dt='20170101') location '/user/hadoop/warehouse/table_name/dt=20170101';
When the add partition statement is executed, the data under the /table_name directory is not moved, and there is no partition directory dt=20170101.
-----
alter table t_u add partition(province='guangzhou') location '/user/hive/warehouse/test01.db/t_u/province=guangzhou';

Add several partitions at once

alter table table_name add partition (dt='2008-08-08', country='us') location '/path/to/us/part080808' partition (dt='2008-08-09', country='us') location '/path/to/us/part080809';
-----
alter table t_user_duo add partition (province='hainan',xian='haizhu') location '/user/hive/warehouse/test01.db/t_user_duo/province=hainan/xian=haizhu' partition (province='shenzhen2',xian='shen')location '/user/hive/warehouse/test01.db/t_user_duo/province=shenzhen2/xian=shen';

Drop partitions

alter table table_name drop if exists partition (dt='2008-08-08');
alter table table_name drop if exists partition (dt='2008-08-08', country='us');

Rename a partition

alter table table_name partition (dt='2008-08-08') rename to partition (dt='20080808');

Add columns

alter table table_name add|replace columns (col_name string);
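Note: ADD adds a new column, placed after all existing columns (and before the partition columns).
REPLACE replaces all of the table's columns. # use sparingly

Change columns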

alter table test_change change a a1 int; -- starting from (a int,b int,c int): rename column a ==> (a1 int,b int,c int)
alter table test_change change a a1 string after b; -- rename a, change its type, and move it after b ==> (b int,a1 string,c int)
alter table test_change change b b1 int first; -- rename b to b1 and move it to the first position ==> (b1 int,a int,c int)
alter table table_name rename to new_table_name; -- rename the table
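
The effect of change ... after and change ... first on column order can be modelled on a plain list. This is a sketch of the semantics described in the comments above, each applied to a fresh (a int, b int, c int) schema; `change_column` is a hypothetical helper and only touches an in-memory list.

```python
def change_column(schema, old_name, new_name, new_type, after=None, first=False):
    """Return a new schema with one column renamed/retyped and optionally moved.

    schema is a list of (name, type) pairs.
    """
    remaining = [col for col in schema if col[0] != old_name]
    if len(remaining) == len(schema):
        raise ValueError(f"no such column: {old_name}")
    changed = (new_name, new_type)
    if first:
        return [changed] + remaining
    if after is not None:
        index = [name for name, _ in remaining].index(after) + 1
    else:
        # No move requested: keep the column where it was.
        index = [name for name, _ in schema].index(old_name)
    return remaining[:index] + [changed] + remaining[index:]


base = [("a", "int"), ("b", "int"), ("c", "int")]
print(change_column(base, "a", "a1", "int"))
print(change_column(base, "a", "a1", "string", after="b"))
print(change_column(base, "b", "b1", "int", first=True))
```

The sketch only reorders the schema; it says nothing about what happens to data files that were written with the old column order.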

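1.3 Show Commands

show tables; : list all tables in the current database
show databases|schemas; : list all databases
show partitions table_name; : show the table's partitions; fails with an error if the table is not partitioned
show functions; : list all functions supported by the current Hive version
desc extended table_name; : show table information
- Table types in Hive
```
MANAGED_TABLE  managed (internal) table
EXTERNAL_TABLE  external table
```
desc formatted table_name; : show table information (nicely formatted)
- Here we can see the table's details
describe database database_name; : show information about the database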

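How Hive parses data

Hive first uses the InputFormat to read the file (line by line by default). Each line read is handed to the SerDe, which splits it on the delimiter; each piece that is split out corresponds to one column in Hive.

InputFormat: org.apache.hadoop.mapred.TextInputFormat
SerDe: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe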

LazySimpleSerDe is the default delimiter-handling class that the delimited keyword in the syntax stands for.
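
The two-stage read path can be sketched in Python. This is a simplified stand-in, not the real classes: `read_lines` plays the role of the InputFormat and `deserialize` the role of the SerDe. The sketch assumes that a field which is missing or cannot be cast to the column type shows up as NULL (None here), which is why a wrong delimiter produces rows full of NULLs rather than an error.

```python
def read_lines(text):
    """Stand-in for the InputFormat: yield one record per line."""
    for line in text.splitlines():
        yield line


def deserialize(line, columns, field_sep=","):
    """Stand-in for the SerDe: split on the delimiter and cast each piece."""
    pieces = line.split(field_sep)
    row = []
    for i, (_name, cast) in enumerate(columns):
        try:
            row.append(cast(pieces[i]))
        except (IndexError, ValueError):
            row.append(None)  # missing or unparsable field -> NULL
    return tuple(row)


columns = [("id", int), ("name", str), ("age", int)]
data = "1,zhangsan,18\n2,lisi,19\n3,wangwu,20"

rows = [deserialize(line, columns) for line in read_lines(data)]
print(rows)

# A '|'-delimited line read with ',' as the delimiter does not map correctly.
print(deserialize("allen|19|beijing", columns))
```

This is the reason the table definition has to agree with the file on both the delimiter and the column order and types.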