Big Data: Hive DML (Data Manipulation Language)

Latest recommended article published on 2022-04-02 10:33:12

有态度的程序猿

Tags: hive, big data

Article link: https://blog.csdn.net/guoxizhang/article/details/115096221

This article covers Hive data operations: importing data, exporting data, and querying. The query part explains functions such as count(), max, min, sum and avg, as well as Limit, Where, Group By, Having, Join and partitioning. It also covers how Order By, Sort By, Distribute By and Cluster By are used to sort and distribute data.

Summary generated by CSDN's automated tooling.

1. Data import
1.1 Loading data with load

load data [local] inpath 'path to the data' [overwrite] into table student [partition (partcol1=val1, ...)];

-- Test table
create table student (id int,name string) row format delimited fields terminated by '\t';

-- load: append data. A local import copies the file into the table.
load data local inpath '/opt/module/hive/datas/student.txt' into table student;

-- load: overwrite the existing data
load data local inpath '/opt/module/hive/datas/student.txt' overwrite into table student;

-- load: import from HDFS. An HDFS import moves (cuts) the file into the table.
load data inpath '/student.txt' into table student;
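
The syntax line above also allows an optional partition clause, which none of the examples exercise. A minimal sketch of loading into a single partition follows. The table student_part and the partition column dt are hypothetical names introduced only for illustration; the data file is the one used above.

```sql
-- Hypothetical partitioned table (not part of the original article)
create table if not exists student_part (id int, name string)
partitioned by (dt string)
row format delimited fields terminated by '\t';

-- Load the local file into one specific partition of that table
load data local inpath '/opt/module/hive/datas/student.txt'
into table student_part partition (dt='2021-03-22');
```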

1.2 Inserting data with insert

-- Append insert
insert into table student2 values(1,'banzhang'),(2,'haiwang');

-- Overwrite insert
insert overwrite table student values (1,'banzhang'),(2,'haiwang');

-- Insert from a query. Note: the target table must already exist, and the number of selected columns must match the number of columns in the target table.
insert into table student select id,name from student3;

-- Overwrite from a query
insert overwrite table student select id,name from student3;

1.3 as select

create table if not exists student3
as select id,name from student;

Both create table ... as select and insert into table table_name select ... are mainly used to build intermediate tables.

1.4 location

create table if not exists student4(id int,name string)
row format delimited fields terminated by '\t'
location '/student4';
-- The path given to location must be a directory

1.5 Importing with import (the data must have been written by export, and the target table must not already exist or must be empty)

import table student6 from '/user/hive/warehouse/export/student';

2. Data export (rarely used)
2.1 Exporting with insert

-- Export without formatting
insert overwrite local directory '/opt/module/hive/datas/export/student1' select * from student;

-- Export with formatting
insert overwrite local directory '/opt/module/hive/datas/export/student1' row format delimited fields terminated by '\t' select * from student;

-- Without local, the result is written to HDFS
insert overwrite directory '/opt/module/hive/datas/export/export/student1' row format delimited fields terminated by '\t' select * from student;

2.2 Downloading with hadoop

hadoop fs -get /user/hive/warehouse/student/student.txt  /opt/module/hive/datas/export/student3.txt;

2.3 Hive shell command

hive -e 'select * from default.student' > /opt/module/hive/datas/export/student4.txt

2.4 Exporting with export

export table student to '/student';

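Queries

1. Query overview

select [all | distinct] select_expr,select_expr,...  -- distinct removes duplicate rows from the result set

from table_reference                                 -- the table to read data from

[where where_condition]                              -- filter rows before grouping

[group by col_list]                                  -- group by ...

[having having_condition]                            -- filter the results after grouping

[order by col_list]                                  -- sort the whole result set globally

[cluster by col_list | [distribute by col_list] [sort by col_list]]  -- with order by, the four "by" clauses in Hive

[limit number]                                       -- limit the number of rows returned (paging)

from -> join -> where -> group by -> count(*) -> having -> select -> order by -> limit   -- SQL execution order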
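
To make the execution order concrete, here is a sketch of one query with each clause numbered in the order it is logically evaluated. It assumes the emp table has columns sal and deptno; the article queries emp later but never shows its schema, so these column names are assumptions.

```sql
-- Assumed columns on emp: sal (salary), deptno (department number)
select deptno, avg(sal) as avg_sal   -- 5. select: project the grouped result
from emp                             -- 1. from: read the table
where sal > 1000                     -- 2. where: filter rows before grouping
group by deptno                      -- 3. group by: aggregates are computed per group
having avg(sal) > 2000               -- 4. having: filter the groups
order by avg_sal desc                -- 6. order by: global sort
limit 3;                             -- 7. limit: keep the first 3 rows
```

Because where runs before grouping, it cannot see avg(sal); that condition has to live in having.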

Common functions
1. Total row count: count()
For example:

select count(*) cnt from emp;

2. Maximum: max
3. Minimum: min
4. Sum: sum
5. Average: avg
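
The remaining aggregates follow the same pattern as count(). A small sketch, assuming emp has a numeric sal column (the column name is an assumption, since the article does not show the schema):

```sql
-- All four aggregates over the assumed sal column, in one pass over emp
select max(sal) as max_sal,
       min(sal) as min_sal,
       sum(sal) as sum_sal,
       avg(sal) as avg_sal
from emp;
```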
Limit clause
A typical query returns many rows. The limit clause restricts the number of rows returned.
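
A short sketch of limit on the emp table. The two-argument form (offset, then row count) is useful for paging, but it is only available in newer Hive releases (2.0.0 and later, as far as I know), so treat the second statement as version-dependent.

```sql
-- Return at most 5 rows
select * from emp limit 5;

-- Skip the first 2 rows, then return at most 3 (newer Hive versions only)
select * from emp limit 2, 3;
```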

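Where clause
1. Use the where clause to filter out rows that do not satisfy the condition.
2. The where clause immediately follows the from clause.
Comparison operators (Between/In/Is Null)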
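
The three comparison operators named in the heading can be sketched as follows. The columns sal and comm are assumed to exist on emp; they are not defined in this article.

```sql
-- between is inclusive on both ends: 500 <= sal <= 1000
select * from emp where sal between 500 and 1000;

-- is null: rows where the assumed comm column has no value
select * from emp where comm is null;

-- in: sal equals one of the listed values
select * from emp where sal in (1500, 5000);
```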

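Like and RLike
1. Use the like operator to select similar values.
2. The pattern may contain characters or digits:
% stands for zero or more characters (any number of characters).
_ stands for exactly one character.
3. The rlike clause
rlike is a Hive extension of this feature: it lets you specify the match condition with Java regular expressions, a more powerful language.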
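
The wildcards and rlike can be illustrated on an assumed ename (employee name) column of emp; the column name is hypothetical.

```sql
-- Names that start with A
select * from emp where ename like 'A%';

-- Names whose second character is A (_ matches exactly one character)
select * from emp where ename like '_A%';

-- Names that contain an A anywhere, written as a Java regular expression
select * from emp where ename rlike '[A]';
```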
Logical operators (And/Or/Not)
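
A brief sketch of the three logical operators, again assuming sal and deptno columns on emp:

```sql
-- and: both conditions must hold
select * from emp where sal > 1000 and deptno = 30;

-- or: either condition may hold
select * from emp where sal > 1000 or deptno = 30;

-- not: negate a condition (here, everyone outside departments 20 and 30)
select * from emp where deptno not in (30, 20);
```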
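Grouping
Group By clause
The group by clause is usually used together with aggregate functions: it groups the result by one or more columns and then performs the aggregation on each group.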
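
For example, grouping by one column and by two columns might look like this (deptno, job and sal are assumed column names on emp):

```sql
-- Average salary per department
select deptno, avg(sal) as avg_sal
from emp
group by deptno;

-- Highest salary per (department, job) pair
select deptno, job, max(sal) as max_sal
from emp
group by deptno, job;
```

Every non-aggregated column in the select list has to appear in group by.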
Having clause
1. Differences between having and where
(1) Aggregate functions cannot be used after where, but they can be used after having.
(2) having is only used in group by statements.
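
A sketch of the difference, assuming the same sal and deptno columns on emp:

```sql
-- Departments whose average salary is above 2000.
-- The aggregate condition goes in having, because it is evaluated after grouping.
select deptno, avg(sal) as avg_sal
from emp
group by deptno
having avg(sal) > 2000;

-- Writing "where avg(sal) > 2000" instead would be rejected,
-- since where is evaluated before any group exists.
```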
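Join statements
Equi-join
Hive supports the usual SQL join statements. Older Hive versions supported only equi-joins; the version used here also supports non-equi joins.
Table aliases
1. Benefits: ① aliases simplify the query ② prefixing columns with the table name can improve execution efficiency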
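
A minimal equi-join with aliases could look like the following. The dept table and all column names (empno, ename, deptno, dname) are assumptions made for illustration; only emp itself appears elsewhere in the article.

```sql
-- e and d are aliases; every column is prefixed so its source table is explicit
select e.empno, e.ename, d.dname
from emp e
join dept d
  on e.deptno = d.deptno;   -- equi-join: rows match when the keys are equal
```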

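Inner join
Only rows for which both joined tables contain data matching the join condition are kept.
Left outer join
All records from the table on the left of the join operator that satisfy the where clause are returned.
Right outer join
All records from the table on the right of the join operator that satisfy the where clause are returned.
Full outer join
All records from all tables that satisfy the where clause are returned. If either table has no matching value for the specified column, null is used instead.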
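
The outer join variants differ only in which side keeps its unmatched rows. A sketch using the same assumed emp and dept tables (dept and all column names are hypothetical):

```sql
-- Left outer join: every emp row is kept; d.dname is null when no department matches
select e.empno, e.ename, d.dname
from emp e left join dept d on e.deptno = d.deptno;

-- Right outer join: every dept row is kept; the emp columns are null when no employee matches
select e.empno, e.ename, d.dname
from emp e right join dept d on e.deptno = d.deptno;

-- Full outer join: unmatched rows from both sides are kept, padded with null
select e.empno, e.ename, d.dname
from emp e full join dept d on e.deptno = d.deptno;
```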
Multi-table join
Note: joining n tables requires at least n-1 join conditions.
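
With three tables, that rule means two join conditions. In this sketch, dept and dept_location are hypothetical tables, and the columns loc and loc_name are assumed as well:

```sql
-- 3 tables, so at least 2 join conditions
select e.ename, d.dname, l.loc_name
from emp e
join dept d
  on d.deptno = e.deptno        -- condition 1: emp to dept
join dept_location l
  on d.loc = l.loc;             -- condition 2: dept to dept_location
```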

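Cartesian product
1. A Cartesian product is produced under the following conditions:
(1) the join condition is omitted
(2) the join condition is invalid
(3) every row in every table is joined with every other row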
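
The simplest way to see a Cartesian product is to list two tables without any join condition. dept is a hypothetical table here, and depending on configuration Hive's strict checks may refuse to run such a query.

```sql
-- No join condition: every emp row is paired with every dept row,
-- so the result has (rows in emp) x (rows in dept) rows
select e.empno, d.dname
from emp e, dept d;
```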

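Sorting
Global sort
order by: a global sort, which uses only one reducer
1) Sort with the order by clause
ASC (ascend): ascending order (the default)
DESC (descend): descending order
2) The order by clause goes at the end of the select statement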
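
A few order by variants, assuming ename, deptno and sal columns on emp:

```sql
-- Ascending by salary (asc is the default)
select * from emp order by sal;

-- Descending by salary
select * from emp order by sal desc;

-- Several sort keys: by department first, then by salary within each department
select ename, deptno, sal
from emp
order by deptno, sal desc;
```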

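Sorting within each reducer (Sort By)
Sort By: order by is very inefficient on large data sets. In many cases a global sort is not needed, and sort by can be used instead.
Sort by produces one sorted file per reducer. Rows are sorted inside each reducer, but the overall result set is not globally sorted.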
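
To observe the per-reducer ordering, the number of reducers has to be greater than one. In this sketch the output directory is a hypothetical path and deptno is an assumed column on emp:

```sql
-- Use 3 reducers so that 3 separately sorted files are produced
set mapreduce.job.reduces=3;

-- Each reducer sorts its own share of the rows by deptno, descending
select * from emp sort by deptno desc;

-- Writing the result to a directory makes the 3 sorted files visible
insert overwrite local directory '/opt/module/hive/datas/sortby-result'
select * from emp sort by deptno desc;
```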

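Partitioning (Distribute by)

Distribute By: in some cases we need to control which reducer a particular row goes to, usually for a subsequent aggregation. The distribute by clause does this. It is similar to a partitioner (custom partitioning) in MR: it partitions the rows, and it is used together with sort by.
To test distribute by, be sure to allocate multiple reducers; otherwise its effect cannot be seen.

Notes:
distribute by assigns partitions by taking the hash code of the partition column modulo the number of reducers; rows with the same remainder go to the same partition.
Hive requires the DISTRIBUTE BY clause to be written before the SORT BY clause.
After the demonstration, set mapreduce.job.reduces back to -1; otherwise a later load into a partitioned or bucketed table that runs an MR job may fail.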
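
The notes above can be sketched as one short session. The output directory is a hypothetical path, deptno and sal are assumed columns on emp, and the arithmetic in the comments assumes that the hash code of an integer is the integer itself.

```sql
-- Several reducers are needed to see any distribution at all
set mapreduce.job.reduces=3;

-- distribute by comes before sort by:
-- rows are routed by hash(deptno) % 3, then sorted by sal inside each reducer.
-- Under the assumption above: deptno 10 -> 10 % 3 = 1, 20 -> 2, 30 -> 0
insert overwrite local directory '/opt/module/hive/datas/distribute-result'
select * from emp distribute by deptno sort by sal desc;

-- Restore the default so later loads into partitioned or bucketed tables are unaffected
set mapreduce.job.reduces=-1;
```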

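Partitioned sort (Cluster By)

When distribute by and sort by use the same column, cluster by can be used instead.
cluster by provides the function of distribute by and also that of sort by. However, the sort is always ascending; you cannot specify ASC or DESC.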
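
The equivalence can be written out directly; deptno is an assumed column on emp:

```sql
-- These two statements distribute and sort the rows in the same way
select * from emp cluster by deptno;
select * from emp distribute by deptno sort by deptno;
```

If a descending sort is needed, the second form with an explicit sort by deptno desc is the one to use.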
