Hive DDL

朵&朵

Published 2024-08-02 17:16:33

Tags: hive, database, hadoop

Article link: https://blog.csdn.net/weixin_65642229/article/details/140864943

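1. Database operations

-- Create a database
create database [if not exists] db_name;
-- Switch to a database
use db_name;
-- Show database information (extended also lists the database properties)
describe database [extended] db_name;
desc database [extended] db_name;
-- Drop a database
drop database [if exists] db_name [cascade];
Note: cascade forces the drop, removing any tables in the database along with it. Without cascade, dropping a database that still contains tables fails.
-- Alter a database (properties, owner, or location)
alter database db_name set dbproperties ('key'='value');
alter database db_name set owner user user_name;
alter database db_name set location 'hdfs_path';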
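
As a concrete illustration of the templates above, here is a minimal sketch of one database's life cycle. The name demo_db, its comment, and the property key are hypothetical and chosen only for this example.

```sql
-- demo_db, its comment and the 'owner_team' property are hypothetical placeholders
create database if not exists demo_db comment 'sandbox for DDL practice';
use demo_db;

-- extended also prints the properties set below
alter database demo_db set dbproperties ('owner_team' = 'analytics');
describe database extended demo_db;

-- switch away before dropping; cascade also removes any tables inside
use default;
drop database if exists demo_db cascade;
```

Leaving off cascade in the last statement is the safer habit on shared clusters, since the drop then fails instead of silently removing tables.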

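2. Hive CREATE TABLE syntax

create [external] table [if not exists] table_name(
column type constraint,
column type constraint,
column type constraint
)
partitioned by (partition_column type)  -- table partitioning
clustered by (bucket_column) sorted by (sort_column) into n buckets   -- table bucketing
row format delimited  .....     -- field delimiters
stored as file_format    -- storage format of the table
location 'table location on HDFS'   -- where the table is stored on HDFS
tblproperties (compression settings)   -- table compression (ORC and Parquet tables must declare it here)
;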
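
To show how the clauses fit together, the following sketch fills in the template with a hypothetical table. The table name orders_demo, its columns, the bucket count, and the /tmp/hive_demo path are all assumptions made for illustration. The row format clause is left out because it only applies to delimited text, not to ORC.

```sql
-- orders_demo, its columns and the location are hypothetical
create external table if not exists orders_demo (
  order_id bigint comment 'order identifier',
  user_id  bigint,
  amount   double
)
partitioned by (dt string)                                   -- one directory per dt value
clustered by (user_id) sorted by (order_id) into 4 buckets   -- 4 bucket files per partition
stored as orc
location '/tmp/hive_demo/orders_demo'
tblproperties ('orc.compress' = 'SNAPPY');

-- register a partition so data can be written into it
alter table orders_demo add if not exists partition (dt = '2024-08-02');
show partitions orders_demo;
```

Note that the partition column dt is declared only in partitioned by, not in the column list.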

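3. Regular table operations

3.1 Creating a table

-- Create a managed (internal) or external table
create [external] table [if not exists] table_name(
column type constraint,
column type constraint,
column type constraint
)row format delimited
fields terminated by 'delimiter'  -- delimiter between fields
collection items terminated by 'delimiter'  -- delimiter between collection elements, used with the array and map types
map keys terminated by 'delimiter'  -- delimiter between key and value in a map
lines terminated by 'delimiter'   -- delimiter between rows
location 'table location on HDFS'
;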

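Common types:

int (integer), double (floating point), date (date), string (string), array<string/int etc.> (array), map<string,string> (dictionary)

Default HDFS path for tables: /user/hive/warehouse/

external:

Without external, the statement creates a managed (internal) table; dropping the table also deletes the data files it maps to. [Not recommended]

With external, the statement creates an external table; dropping the table leaves the data files it maps to in place. [Recommended]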
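
The difference between the two table kinds can be checked directly. This is a small sketch with hypothetical table names (managed_demo, external_demo) and a hypothetical location; it assumes default settings, where an external table's files are not purged on drop.

```sql
-- managed_demo, external_demo and the location are hypothetical
create table if not exists managed_demo (id int);
create external table if not exists external_demo (id int)
location '/tmp/hive_demo/external_demo';

-- the "Table Type" row reports MANAGED_TABLE or EXTERNAL_TABLE
describe formatted managed_demo;
describe formatted external_demo;

drop table managed_demo;    -- metadata and data files are both removed
drop table external_demo;   -- only metadata is removed; the files stay at the location
```

After the second drop, recreating external_demo with the same location makes the old files queryable again, which is the main reason the external form is recommended above.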

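3.2 Altering a table

-- Show table metadata in tabular form
describe formatted table_name;
-- Rename a table
alter table old_table_name rename to new_table_name;
-- Change table settings (table properties, SerDe properties, storage format, storage path)
alter table table_name set tblproperties ('key'='value');
alter table table_name set serdeproperties ('key'='value');
alter table table_name set fileformat file_format;
alter table table_name set location 'hdfs_path';
-- Rename a column
alter table table_name change old_column new_column type [first|after some_column];
-- Add or replace columns
alter table table_name add|replace columns(column type [comment 'text']);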
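
A short sketch of these statements applied in sequence follows. The tables alter_demo and alter_demo2, the columns, and the property value are hypothetical names used only here.

```sql
-- alter_demo / alter_demo2 and their columns are hypothetical
create table if not exists alter_demo (id int, name string);

alter table alter_demo rename to alter_demo2;
alter table alter_demo2 set tblproperties ('comment' = 'renamed for practice');

-- rename column name to full_name, keeping the string type and its position after id
alter table alter_demo2 change name full_name string after id;

-- append a new column at the end of the column list
alter table alter_demo2 add columns (age int comment 'age in years');

describe formatted alter_demo2;
drop table if exists alter_demo2;
```

These statements change metadata only; existing data files are not rewritten, so rows loaded before add columns read as NULL for the new column.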

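3.3 Dropping a table

drop table [if exists] table_name [purge];   -- purge skips the trash and deletes the data immediately
truncate table table_name;   -- removes all rows but keeps the table definition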

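3.4 Loading data

With overwrite, the load replaces the table's existing data with the newly loaded data.

With into alone, the newly loaded data is appended after the existing data.

-- Data file on the local Linux filesystem
In the Linux shell: hadoop fs -put 'linux_file_path' 'path_of_the_table_on_hdfs'
In DataGrip: load data local inpath 'linux_file_path' [overwrite] into table table_name;
-- Data file already on HDFS
In DataGrip: load data inpath 'hdfs_file_path' [overwrite] into table table_name;
Note: loading from HDFS moves the file (a cut rather than a copy), so the source file no longer exists at its original path.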

Example

Suppose the data looks like this:

1,孙悟空,53,西部大镖客:288-大圣娶亲:888-全息碎片:0-至尊宝:888-地狱火:1688
2,鲁班七号,54,木偶奇遇记:288-福禄兄弟:288-黑桃队长:60-电玩小子:2288-星空梦想:0
3,后裔,53,精灵王:288-阿尔法小队:588-辉光之辰:888-黄金射手座:1688

Create an external table stored under the default path:

Default path: /user/hive/warehouse/ on HDFS

create external table if not exists game(
id int,
name string,
win_rate int,
skin_price map<string,int>
)row format delimited
fields terminated by ','
collection items terminated by '-'
map keys terminated by ':'
;

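Load the data:

-- Data file on the local Linux filesystem
In the Linux shell: hadoop fs -put 'linux_file_path' 'path_of_the_table_on_hdfs'
In DataGrip: load data local inpath 'linux_file_path' overwrite into table game;
-- Data file already on HDFS
In DataGrip: load data inpath 'hdfs_file_path' overwrite into table game;
Note: loading from HDFS moves the file (a cut rather than a copy), so the source file no longer exists at its original path.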
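
Once the file is loaded, the three delimiters decide how each line is split: ',' separates the four columns, '-' separates the entries of the map, and ':' separates each key from its value. The queries below are a sketch of how the resulting skin_price map might be inspected; they assume the game table and the sample rows shown above.

```sql
-- look up a single key in the map column
select name, skin_price['至尊宝'] as price
from game
where id = 1;

-- list the keys and count the entries per row
select name, map_keys(skin_price) as skins, size(skin_price) as skin_count
from game;

-- flatten the map into one row per (skin, price) pair
select name, skin, price
from game
lateral view explode(skin_price) s as skin, price;
```

If a delimiter in the table definition does not match the file, the map column typically comes back as NULL or as a single malformed entry, which makes these queries a quick sanity check after a load.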

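4. Storage formats

4.1 Row-oriented storage

Textfile

Stored as textfile (the default storage format)

Pros: related data is kept together, with one line of data per record, which suits insert/update operations

Cons: queries that touch only a few columns are inefficient, and the format is poorly suited to indexing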

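4.2 Column-oriented storage

Orc: Stored as orc

parquet: Stored as parquet

Pros: supports indexing; the values of a column are stored together, so queries that touch only a few columns are very efficient

Cons: not suited to insert/update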

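4.3 Comparing the storage formats

Compression ratio and speed of the stored files: orc > parquet > textfile

For this reason, orc or parquet is usually chosen for storage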
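
One common way to move an existing text table into a columnar format is create table as select. The sketch below assumes the game table from section 3.4; game_orc is a hypothetical name.

```sql
-- game_orc is a hypothetical name; the data is rewritten as ORC by a Hive job
create table if not exists game_orc
stored as orc
as
select * from game;

-- the InputFormat / OutputFormat rows show the ORC classes
describe formatted game_orc;

-- a query that reads only two columns, the case where columnar storage helps most
select name, win_rate from game_orc;
```

Loading a plain text file into an ORC table with load data does not convert it, so the conversion has to go through an insert or create table as select like the one above.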

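5. Compression

Pros: reduces disk space used for storage and reduces network bandwidth used for transfers

Cons: costs extra CPU and time for compression and decompression

Snappy is the usual choice of compression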

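5.1 Declaring the compression format

Compression for the textfile storage format

Stored as textfile

Nothing needs to be declared in the create table statement; the data can be compressed to .snappy directly

Compression for the ORC storage format

Stored as orc

Tblproperties ("orc.compress"="SNAPPY")

The compression must be declared in the create table statement before the data is compressed with Snappy

Compression for the parquet storage format

Stored as parquet

Tblproperties ("parquet.compression"="SNAPPY")

The compression must be declared in the create table statement before the data is compressed with Snappy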
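
Put into full statements, the two declarations might look like the sketch below. The table names game_orc_snappy and game_parquet_snappy are hypothetical, and the columns mirror the game table from section 3.4, which is assumed to exist and hold data.

```sql
-- hypothetical table names; columns copied from the game example
create table if not exists game_orc_snappy (
  id int,
  name string,
  win_rate int,
  skin_price map<string,int>
)
stored as orc
tblproperties ('orc.compress' = 'SNAPPY');

create table if not exists game_parquet_snappy (
  id int,
  name string,
  win_rate int,
  skin_price map<string,int>
)
stored as parquet
tblproperties ('parquet.compression' = 'SNAPPY');

-- the data is compressed when a Hive job writes it
insert overwrite table game_orc_snappy select * from game;
insert overwrite table game_parquet_snappy select * from game;
```

For both formats the compression is applied inside the file, so the resulting files do not carry a .snappy extension the way compressed text output does.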

5.2 Enabling compression of Hive intermediate data (map stage)

1) Enable compression of Hive intermediate data

set hive.exec.compress.intermediate=true;

2) Enable compression of map output in mapreduce

set mapreduce.map.output.compress=true;

3) Set the compression codec for map output in mapreduce

set mapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;

5.3 Enabling compression at the Reduce output stage

1) Enable compression of Hive final output data

set hive.exec.compress.output=true;

2) Enable compression of mapreduce final output data

set mapreduce.output.fileoutputformat.compress=true;

3) Set the compression codec for mapreduce final output data

set mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;

4) Set mapreduce final output compression to block compression

set mapreduce.output.fileoutputformat.compress.type=BLOCK;
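
These settings are session-scoped and take effect on the output of subsequent Hive jobs. The following sketch writes a small export to check the effect; the export directory is a hypothetical path, the game table from section 3.4 is assumed to exist, and the dfs command is assumed to be available in the client being used.

```sql
set hive.exec.compress.output=true;
set mapreduce.output.fileoutputformat.compress=true;
set mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;

-- '/tmp/hive_demo/game_export' is a hypothetical directory; its contents are replaced
insert overwrite directory '/tmp/hive_demo/game_export'
row format delimited fields terminated by ','
select id, name, win_rate from game;

-- list the written files and check their extension
dfs -ls /tmp/hive_demo/game_export;
```

Running the same export with hive.exec.compress.output set to false gives an uncompressed baseline to compare file sizes against.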

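5.4 Verifying the output files

With these settings in place, text files written by Hive jobs (for example by an insert) are stored as .snappy. A plain hadoop fs -put .... only copies the file as-is and does not compress it.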
