hive对库的操作

Aricya

已于 2022-10-21 14:32:55 修改

阅读量619

点赞数 1

文章标签： hadoop hive 大数据

于 2022-10-21 11:12:09 首次发布

本文链接：https://blog.csdn.net/Aricya/article/details/127441345

版权

前言

Apache Hive（TM）数据仓库软件有助于使用 SQL 读取，写入和管理驻留在分布式存储中的大型数据集。建立在 Apache Hadoop（TM）之上，它提供：

通过 SQL 轻松访问数据的工具，从而实现数据仓库任务，例如提取 / 转换 / 加载（ETL），报告和数据分析

一种将结构强加于各种数据格式的机制

访问直接存储在 Apache HDFS（TM）或其他数据存储系统（例如 Apache HBase（TM））中的文件

使用 Apache Hadoop MapReduce，Apache Tez 或 Apache Spark 框架执行查询。

Hive 提供了标准的 SQL 功能，包括 2003 年和 2011 年以后的许多分析功能。其中包括 OLAP 函数，子查询，公用表表达式等。Hive 的 SQL 也可以通过用户定义的函数（UDF），用户定义的集合（UDAF）和用户定义的表函数（UDTF）扩展为用户代码。

一 Hive对库的操作

1.1创建库

简单方式

create database t1;
show databases;

创建库的时候带注释

create database if not exists t2 comment 'learning hive';

创建带属性的库

create database if not exists t3 with dbproperties('creator'='hadoop','date'='2019-01-01');

1.2查看库

查看库

show databases;

在这里插入图片描述

显示数据库的详细属性信息

desc database t3;

desc database extended t3;

查看正在使用哪个库

select current_database();

1.3删除库

默认情况下，hive 不允许删除包含表的数据库。需要使用cascade 关键字

drop database if exists t3 cascade;

1.4切换库

切换到t2数据库

use t2;

二 Hive对表的操作

2.1内部表与外部表

内部表

create table test_1(id int,name string,salary bigint,addr string)
row format delimited
fields terminated by ‘,’;

外部表
创建的时候，需要使用external关键字

create external table test_2(id int,name string,salary bigint,addr string)
row format delimited
fields terminated by ‘,’
location ‘/worker’;

查询数据如图
在这里插入图片描述

2.2数据的导入导出

导入
将hive服务器运行所在节点的本地磁盘上的文件导入表中
这里load的文件是在开启server的节点上
如图所示：创建个文本文档
在这里插入图片描述

load data local inpath '/usr/datadir/worker_1.txt' into table test_1;

在这里插入图片描述

加overwrite可以实现覆盖，不加overwrite是追加到表后面

load data local inpath ‘/usr/datadir/worker_1.txt’ overwrite into table worker_1;

在这里插入图片描述

load data inpath ‘/test_2’ into table test_2; worker_2会被移动到表目录下
加local是复制，不加是移动。从别的表查询数据后插入到一张新建的表中。表会自动生成。
从别的表查询数据后插入到一张新建的表中。表会自动生成

create table worker_3
as
select id,name,salary
from worker_2
where salary>=10000;

从别的表查询数据后插入到一张已经存在的表中。表已经存在。

insert into table worker_1
select id,name,salary,addr
from worker_2
where salary>=15000;

into替换为overwrite，会覆盖掉原表得数据。
导出
将数据从hive的表中导出到hdfs的目录中
insert overwrite directory ‘/test_dir’
select * from test_1;
将数据从hive的表中导出到本地磁盘的目录中
insert overwrite local directory ‘/usr/datadir/test.log’
select * from test_1;

2.3 hive的复杂数据类型

现有数据：
在这里插入图片描述
建表语句：

create table movie_info(
id int, 
name string, 
work_location array<string>, 
piaofang map<string,bigint>, 
address struct<location:string,zipcode:int,phone:string,value:int>)
row format delimited 
fields terminated by " "
collection items terminated by ","
map keys terminated by ":" ;

导入数据：

load data local inpath "/usr/datadir/movie_info.txt" into table movie_info;

查询语句：

select * from movie_info;

在这里插入图片描述

2.4 hive的文件存储格式

创建seq表，对应的文件类型是sequencefile。

create table test_seq(id int,name string)
stored as sequencefile;

将从别的表查询的数据放入到seq中

insert into test_seq 
select id,name from test_1;

将查询出来的数据直接使用sequencefile保存。

create table test_seq
stored as sequencefile
as
select *from test_1;

在这里插入图片描述
查询seq表的信息

select * from test_seq;

将查询出来的数据直接使用orc保存

create table test_orc_1
stored as orc
as
select *from test_1;

在这里插入图片描述
将查询出来的数据直接使用parquet保存

create table test_par
stored as parquet
as
select *from test_1;

在这里插入图片描述

2.5 查看表信息

新建表

create table student(id int,name string,age int)
row format delimited
fields terminated by “,”;

查看表信息

desc student;

查看表的详细信息

desc extended student;
desc formatted student;

查看表的详细建表语句

show create table student;

2.6 修改表

**
修改表名
在这里插入图片描述
修改字段
增加一个字段

alter table new_student add columns (score int);

修改一个字段的定义

alter table new_student change name new_name string;

2.7 删除和清空表

删除表

drop table new_student;

在这里插入图片描述
清空表

truncate table student;

在这里插入图片描述

三 Hive的分区表

3.1 分区表的创建

create table test_4(id int,name string,salary bigint,addr string)
. . . . . . . . . . . . . . . . . . > partitioned by (day string)
. . . . . . . . . . . . . . . . . . > row format delimited
. . . . . . . . . . . . . . . . . . > fields terminated by ‘,’;

在这里插入图片描述

3.2 导入数据到分区

load data local inpath '/usr/datadir/worker_1.txt' into table test_4 partition(day='01');
load data local inpath '/usr/datadir/worker_1.txt' into table test_4 partition(day='02');

例如：查询worker_4下day=01目录下的信息
在这里插入图片描述
**

3.3 增删分区

**
查看分区信息

show partitions test_4;

增加分区：

alter table test_4 add partition(day='03') partition(day='04');

通过加载数据实现添加分区：

load data local inpath '/usr/datadir/test_3.txt' into table test_4 partition(day='05');

每个分区都对应一个表目录下的子目录
在这里插入图片描述
还可以使用insert实现分区：

insert into table test_4 partition(day='06')
select * from test_2 where salary>=5000;

删除分区：

alter table test_4 drop partition(day='06');

3.4 动态分区

新建表

create table student(id int,name string,sex string,age int,department string)
row format delimited fields terminated by ",";
load data local inpath '/usr/datadir/student.txt' into table student;

创建分区表

create table student_ptn_age(id int,name string,sex string,department string) 
partitioned by (age int);

在这里插入图片描述

insert overwrite table student_ptn_age partition(age)
select id,name,sex,department,age from student;

四 Hive的分桶表

4.1分桶的概念

分桶是相对分区进行更细粒度的划分（数据取样更高效）。分桶将整个数据内容按照某列属性值的hash值进行区分，如要安装name属性分为3个桶，就是对name属性值的hash 值对3取摸，按照取模结果对数据分桶。如取模结果为0的数据记录存放到一个文件，取模为1的数据存放到一个文件，取模为2的数据存放到一个文件。

4.2分桶操作

创建分桶表

create table student_bck(id int, name string)
clustered by (id) into 3 buckets #3个桶
row format delimited fields terminated by ",";

向桶中插入数据

insert overwrite table student_bck
select id,name from student;

查看存储信息
在这里插入图片描述
查看分桶数据

select * from student_bck tablesample(bucket 1 out of 2on id);//tablesample (bucket x out of y on id);

在这里插入图片描述
x表示从哪个桶(x-1)开始，y代表分几个桶，也可以理解分x为分子，y为分母，及将表分为y份（桶），取第x份（桶）

4.3 视图

create view v_name(字段)
as
select * from student;

视图不能load数据，也不能insert。只能用来进行查询
视图是一个逻辑的概念，并不是物理上存在的
查看视图
在这里插入图片描述
删除drop view v_name;

name，age，idcard，cardnum
create view v_name(字段)
as
select name,age from table;

4.4 Hive的表关联操作

4.4.1join操作

数据准备：

[root@node01hive]# cat order.txt

在这里插入图片描述

[root@node01hive]#cat goods.txt

在这里插入图片描述

创建表，导入数据：//表名不能跟sql语句名产生冲突比如order

create table t_order(
orderid int,name string)
row format delimited
fields terminated by ",";

load data local inpath '/opt/testData/hive/order.txt' into table t_order;

在这里插入图片描述

create table t_goods(
goodid int,price int)
row format delimited
fields terminated by ",";

load data local inpath '/opt/testData/hive/goods.txt' into table t_goods;

在这里插入图片描述

Hive中的join分为了内连接、左外连接、右外连接、全外连接
内连接：inner join

select * from
t_order inner join t_goods
on order_id = goods_id;

在这里插入图片描述
只会把相同关联条件匹配上的数据保留下来
±-----------------±--------------±----------------±---------------+
| t_order.orderid | t_order.name | t_goods.goodid | t_goods.price |
±-----------------±--------------±----------------±---------------+
| 112 | 皮鞋 | 112 | 500 |
| 114 | 耳机 | 114 | 130 |
| 116 | 可乐 | 116 | 5 |
| 110 | 钢笔 | 110 | 50 |
±-----------------±--------------±----------------±---------------+

左外连接： left join

select * from 
t_order left join t_goods
on order_id = goods_id;

全外连接：

select *from
 t_order full join t_goods
on order_id = goods_id;

在这里插入图片描述
±-----------------±--------------±----------------±---------------+
| t_order.orderid | t_order.name | t_goods.goodid | t_goods.price |
±-----------------±--------------±----------------±---------------+
| 110 | 钢笔 | 110 | 50 |
| 112 | 皮鞋 | 112 | 500 |
| 114 | 耳机 | 114 | 130 |
| 116 | 可乐 | 116 | 5 |
| NULL | NULL | 119 | 800 |
| 121 | 鼠标 | NULL | NULL |
±-----------------±--------------±----------------±---------------+
4.5笛卡尔积//hive本身不支持后面会讲到
笛卡尔积：假设集合A={a, b}，集合B={0, 1, 2}，则两个集合的笛卡尔积为{(a, 0), (a, 1), (a, 2), (b, 0), (b, 1), (b, 2)}
两张表没有关联条件，但是需要关联的时候，就可以使用笛卡尔积。

create table doctor_info(name string,second_dep string,illness string)
row format delimited fields terminated by ",";


load data local inpath '/opt/testData/hive/doctor.txt' into table doctor_info;


create table department_info(first_dep string,keyword string)
row format delimited fields terminated by ",";

load data local inpath '/opt/testData/hive/department.txt' into table department_info;

set hive.mapred.mode=nonstrict;

create table hospital_info
as
select * 
from doctor_info
join department_info
on 1=1;


select * 
from hospital_info
where second_dep regexp keyword;

4.6 union和union all

union关联的时候会对数据进行去重，union all不会

select * from test_5 union select * from test_7;

在这里插入图片描述

select * from test_5 union all select * from test_7;

在这里插入图片描述

五总结

现在虽然有很多SQL ON Hadoop的解决方案，像Spark SQL、Impala、Presto等等，但就目前来看，在基于Hadoop的大数据分析平台、数据仓库中，Hive仍然是不可替代的角色。尽管它的相应延迟大，尽管它启动MapReduce的时间相当长，但是它太方便、功能太强大了，做离线批量计算、ad-hoc查询甚至是实现数据挖掘算法，而且，和HBase、Spark都能整合使用。如果你是做大数据分析平台和数据仓库相关的，就目前来说，我建议，Hive是必须的。

Aricya

关注

1
点赞
踩
0

收藏

觉得还不错? 一键收藏
0
评论
hive对库的操作

现在虽然有很多SQL ON Hadoop的解决方案，像Spark SQL、Impala、Presto等等，但就目前来看，在基于Hadoop的大数据分析平台、数据仓库中，Hive仍然是不可替代的角色。尽管它的相应延迟大，尽管它启动MapReduce的时间相当长，但是它太方便、功能太强大了，做离线批量计算、ad-hoc查询甚至是实现数据挖掘算法，而且，和HBase、Spark都能整合使用。如果你是做大数据分析平台和数据仓库相关的，就目前来说，我建议，Hive是必须的。
复制链接

扫一扫