Hive SQL初识

Quinn-ntmy

已于 2023-10-25 09:51:40 修改

阅读量64

点赞数

文章标签： hive sql hadoop

于 2023-09-02 22:07:10 首次发布

本文链接：https://blog.csdn.net/weixin_43421371/article/details/132642297

版权

本文详细介绍了HiveSQL在数据库和数据表操作中的各个方面，包括创建、修改、查看、删除数据库，以及对内部表和外部表的操作，如建表、插入、删除分区、分桶和数据加载，以及排序方法和Hive与MySQL的区别。

摘要由CSDN通过智能技术生成

Hive

Hive是sql语言，通过数据库的方式来操作HDFS文件系统，为了简化编程，底层计算方式为MapReduce。
Hive是面向行存储的数据库。
Hive本身不存储和计算数据，它完全依赖于HDFS和MapReduce，Hive中的表纯逻辑。

Hive SQL

DDL

一、对数据库的操作

~~以下xx表示数据库名~~

创建数据库：
create database

> create database if not exists xx;
> [note：Hive表存放位置模式是由hive-site.xml当中的属性hive.metastore.warehouse.dir指定的]
> 
> # 创建数据库并指定HDFS存储位置 
> create database xx location '/xx';

修改数据库
alter database

> alter database xx set bdproperties('createtime'='20230405');
> [note：虽然可以修改数据库的一些属性，但元数据信息不会更改的。]

查看数据库信息

查看基本信息
hive > desc database xx;

查看更多详细信息
hive > desc database extended xx;

删除数据库
drop database

> 删除一个空数据库，如果下面有数据表就会报错
> drop database xx;

> 强制删除数据库，包含数据库下面的表一起删除
> drop database xx cascade;

二、对数据表的操作

对管理表（内部表）的操作

建内部表 create table

hive (myhive)> use myhive;  -- 使用myhive数据库
hive (myhive)> create table stu(id int, name string);
hive (myhive)> insert into stu values (1, "zhangsan");
hive (myhive)> insert into stu values (1, "zhangsan"), (2, "lisi");
hive (myhive)> select * from stu;

字段类型
例如decimal(11, 2)代表最多有11位数字，其中后2位是小数，整数部分是9位；如果整数部分超过9位，则这个字段就会变成null；如果小数部分不足2位，则后面用0补齐，如果小数部分超过2位，则超出部分四舍五入。
直接写decimal不指定，默认是decimal(10, 0)。
创建表并指定字段之间的分隔符

create table if not exists stu2(id int, name string) row format delimited fields terminated by '\t' stored as textfile location '/user/stu2';

row format delimited fields terminated by ‘\t’ 指定字段分隔符，默认分隔符是’\001’
stored as 指定存储格式
location 指定存储位置

根据查询结果创建表

create table stu3 as select * from stu2;

根据已经存在的表结构创建表

create table stu4 like stu2;

查询表的结构

> 只查询表内字段及属性
> desc stu2;

> 详细查询
> desc formatted stu2;

查询创建表的语句

show create table stu2;

对外部表操作
外部表是制定其他的HDFS路径的数据加载到表当中来

构建外部表 create external table

create external table student (s_id string, s_name string) row format delimited fields terminated by '\t';

从本地文件系统向表中加载数据

> 追加操作
> load data local inpath '/export/servers/hivedatas/student.csv' into table student;

> 覆盖操作（加关键词overwrite）
> load data local inpath '/export/servers/hivedatas/student.csv' overwrite into table student;

从HDFS文件系统向表中加载数据

> load data inpath '/hivedatas/techer.csv' into table techer;

> 加载数据到指定分区
> load data inpath '/hivedatas/techer.csv' into table techer partition(cur_date=20221112);

（1）使用 load data local 表示从本地文件系统加载，文件会拷贝到HDFS上；
（2）使用 load data 表示从HDFS文件系统加载，文件会直接移动到Hive相关目录下，并不是拷贝；
（3）如果表是分区表，load 时不指定分区会报错；
（4）如果加载相同文件名的文件，会被自动重命名。

对分区表的操作

创建分区表

create table score(s_id string, s_score int) partitioned by (month string);

创建一个表带多个分区

create table score2(s_id string, s_score int) partitioned by (year string, month string, day string);

加载数据到一个分区的表中

load data local inpath '/export/servers/hivedatas/score.csv' into table score partition (month='202006');

加载数据到一个多分区的表中

load data local inpath '/export/servers/hivedatas/score.csv' into table score2 partition (year='2020', month='06', day='01');

查看分区

show partitions score;

添加一个分区

alter table score add partition(month='202006');

同时添加多个分区

alter table score add partition(month='202006') partition(month='202005');

添加分区之后可以在HDFS文件系统中看到下面多了一个文件夹

删除分区

alter table score drop partition(month='202006');

对分桶表操作
将数据按照指定的字段分到多个桶中，就是按照分桶字段进行哈希划分到多个文件当中。
分区就是分文件夹，分桶就是分文件。
分桶优点：
（1）提高 join 查询效率
（2）提高抽样效率

开启Hive桶表功能

set hive.enfore.bucketing=true;

设置reduce的个数

set mapreduce.job.reduces=3'

创建桶表

> create table course (c_id string, c_name string) clustered by(c_id) into 3 buckets;

桶表的数据加载只能通过 insert overwrite 进行加载，所以把文件加载到桶表中，需要先创建普通表，并通过 insert overwrite 的方式将普通表的数据通过查询的方式加载到桶表中。

加载数据

> insert overwrite table course select * from course_common cluster by(c_id);

修改表和删除表

修改表名称

> alter table old_table_name rename to new_table_name;

增加 / 修改列信息

> 查询表结构
> desc score5;

> 添加列
> alter table score5 add columns (mycol string, mysco string);

> 更新列
> alter table score5 change column mysco mysconew int;

删除表操作

> drop table score5;

清空表操作

> truncate table score6;
> 只能清空管理表，即内部表；清空外部表会产生错误

如果HDFS开启了回收站，drop删除的表数据可以从回收站恢复，表结构无法恢复，需要重新创建；truncate清空的表不进回收站，所以无法恢复truncate清空的表。

向Hive表中加载数据

直接向分区表插入

> insert into table score partition(month = '202007') values ('001', '002', '100')

通过load方式加载数据

> load data local inpath '/export/servers/hivedatas/score.csv' overwrite into table score partition(month='202006');

通过查询方式加载数据

> insert overwrite table score2 partition(month='202006') select s_id, c_id, s_score from score1;

查询语句中创建表并加载数据

> create table score2 as select * from score1;

在创建表是通过location指定加载数据的路径

> create external table score6 (s_id string, c_id string, s_score int) row format delimited fields terminated by ',' location '/myscore';

export导出与import导入Hive表数据（内部表操作）

create table teacher2 like teacher;   -- 根据已有表结构创建表
export table teacher to '/export/teacher';
import table teacher2 from '/export/teacher';

Hive表中数据导出

insert导出
Hadoop命令导出到本地
hive shell命令导出
export导出到HDFS上

DQL

单表查询

order by 会对输入做全局排序，因此只有一个reducer，这会导致当输入规模较大时，需要较长的计算时间；
sort by 不是全局排序，在数据进入reducer之前完成排序。因此，如果用sort by进行排序并且设置mapred.reduce.tasks>1，则sort by只保证每个reducer的输出有序，不保证全局有序；
distribute by (字段) 根据指定的字段将数据分到不同的reducer，且分发算法是hash散列；
cluster by (字段) 除了具有distribute by的功能外，还会对该字段进行排序。因此，如果分桶和sort字段是同一个时，cluster by = distribute by + sort by。

基本查询语法和MySQL类似，下面对几个排序进行详细说明。

order by
全局排序，最后只有一个reducer，即在一个节点执行，如果数据量太大就会耗费较长时间。
sort by
局部排序，每个mapreduce内部进行排序，对全局结果集来说不是排序。
distribute by
分区排序，类似MR中的partition，进行分区，结合sort by使用
Hive中要求distribute by写在sort by之前。

Hive和MySQL的区别

1. 语法

Hive语法检测更严格：Hive大小写不敏感，但内容（where，if / case when）和路径名区分大小写。

2. 子查询

MySQL子查询支持完整
Hive
（1）Hive不支持if或case when里的子查询
（2）Hive中主查询的引用仅在子查询的where子句中支持

3. 存储

MySQL：存储数据
Hive：不存储数据，只是映射的HDFS的结构化文件（所以要使表的分隔符与文件的分隔符一致）

4. 数据模型

MySQL：只有基本数据类型
Hive：复合数据类型Array、Map、Struct、Uniontype

5. 排序

MySQL：order by
Hive：order by、sort by、distribute by、cluster by

6. 抽样查询

未完。。。

Quinn-ntmy

关注

0
点赞
踩
0

收藏

觉得还不错? 一键收藏
0
评论
Hive SQL初识

Hive是sql语言，通过数据库的方式来操作HDFS文件系统，为了简化编程，底层计算方式为MapReduce。Hive是面向行存储的数据库。Hive本身不存储和计算数据，它完全依赖于HDFS和MapReduce，Hive中的表纯逻辑。
复制链接

扫一扫