Hadoop之hive第三天（表连接）

最新推荐文章于 2023-12-23 21:06:34 发布

明檬初见

最新推荐文章于 2023-12-23 21:06:34 发布

阅读量334

点赞数 1

本文链接：https://blog.csdn.net/TylerPY/article/details/101619602

版权

创建个表，加载本地数据。

hive> create table t_a(id string,num int)  row format delimited fields terminated by ',';
hive> load data local inpath '/home/a.txt' into table t_a;

在创建个b表，加载本地数据

hive> create table t_b like t_a;
OK
Time taken: 0.453 seconds
hive> load data local inpath '/home/b.txt' into table t_b;

内连接
交集。
Join 与 inner join 结果一样

hive> select a.*,b.* from t_a a join t_b b where a.id=b.id;

hive> select a.*,b.* from t_a a inner join t_b b on a.id=b.id;

在这里插入图片描述
左连接
左表全部保留，右表没有的用null 表示。
left outer join 与left join 结果一样

hive> select a.*,b.* from t_a a left join t_b b on a.id=b.id;

hive> select a.*,b.* from t_a a left outer join t_b b on a.id=b.id;

在这里插入图片描述
右连接
right outer join 与 right join结果一样

hive> select a.*,b.* from t_a a right join t_b b on a.id=b.id;

hive> select a.*,b.* from t_a a right outer join t_b b on a.id=b.id;

在这里插入图片描述

全集
在hive里有full outer join的mysql没有

hive> select a.*,b.* from t_a a full outer join t_b b on a.id=b.id;

在这里插入图片描述
笛卡尔积
两表关联，把左表的列和右表的列通过笛卡尔积的形式表达出来。

hive> select * from t_a jion t_b;

在这里插入图片描述

左表独有

hive> select a.*,b.* from t_a a left outer join t_b b on a.id=b.id where b.id is null;

在这里插入图片描述

右表独有
两表关联，查询右表独有的数据。

hive> select a.*,b.* from t_a a right join t_b b on a.id=b.id where a.id is null;

在这里插入图片描述
半连接
left semi join 可以提高exist | in 这种查询需求的效率
查询中，无法取到右表的数据

hive> select a.* from t_a a left semi join t_b b on a.id=b.id;

执行以下会报错

hive> select a.*,b.* from t_a a left semi join t_b b on a.id=b.id;
FAILED: SemanticException [Error 10009]: Line 1:11 Invalid table alias 'b'

所以叫半连接，等同于以下代码

hive> select a.* from t_a a join t_b b on a.id=b.id;

这样效率会更高，直接join会产生笛卡尔积，在过滤满足条件的数据，这个left semi join直接过滤需要的数据
老版本中，不支持非等值的join
在新版中：1.2.0后，都支持非等值join，不过写法应该如下：
select a.,b. from t_a a,t_b b where a.id>b.id;
不支持的语法： select a.,b. from t_a a join t_b b on a.id>b.id;

分桶表
将数据按照指定的字段进行分成多个桶中去，就是将数据按照字段进行划分，可以将数据按照字段划分到多个文件当中去
开启hive的桶表功能

set hive.enforce.bucketing=true;

设置reduce的个数

set mapreduce.job.reduces=3;

创建通表

hive> create table course (c_id string,c_name string,t_id string) clustered by(c_id) into 3 buckets row format delimited fields terminated by '\t';

桶表的数据加载，由于通表的数据加载通过hdfs dfs -put文件或者通过load data均不好使，只能通过insert overwrite
创建普通表，并通过insert overwrite的方式将普通表的数据通过查询的方式加载到桶表当中去
创建普通表：

hive> create table course_common (c_id string,c_name string,t_id string) row format delimited fields terminated by '\t';

普通表中加载数据

load data local inpath '/home/course.csv' into table course_common;
insert overwrite course select * from course_common;

GROUP BY语句
创建分区表语法

hive> create table score(s_id string,c_id string, s_score int) partitioned by (month string) row format delimited fields terminated by '\t';

加载数据到分区表中

hive> load data local inpath '/home/score.csv' into table score partition (month='201909');

GROUP BY语句通常会和聚合函数一起使用，按照一个或者多个列队结果进行分组，然后对每个组执行聚合操作。
案例实操：
（1）计算每个学生的平均分数

hive> select s_id ,avg(s_score) from score group by s_id;

注意group by的字段，必须是select后面的字段，select后面的字段不能比group by的字段多
（2）计算每个学生最高成绩

hive> select s_id ,max(s_score) from score group by s_id;

HAVING语句
1）having与where不同点
（1）where针对表中的列发挥作用，查询数据；having针对查询结果中的列发挥作用，筛选数据。
（2）where后面不能写分组函数，而having后面可以使用分组函数。
（3）having只用于group by分组统计语句。
2）案例实操：
求每个学生的平均分数

hive> select s_id ,avg(s_score) from score group by s_id;

求每个学生平均分数大于85的人

hive> select s_id ,avg(s_score) avgscore from score group by s_id having avgscore > 85;

JOIN语句
等值JOIN
老版本中，不支持非等值的join
在新版中：1.2.0后，都支持非等值join，不过写法应该如下：
案例操作
（1）查询分数对应的姓名
创建学生表老师表：

hive> create table student (s_id string,s_name string,s_birth string , s_sex string ) row format delimited fields terminated by '\t';

hive> create table teacher (t_id string,t_name string) row format delimited fields terminated by '\t';

加载数据

hive> load data local inpath '/home/teacher.csv' into table teacher;
hive> load data local inpath '/home/student.csv' into table student;

SELECT s.s_id,s.s_score,stu.s_name,stu.s_birth FROM score s LEFT JOIN student stu ON s.s_id = stu.s_id;
表的别名
1）好处
（1）使用别名可以简化查询。
（2）使用表名前缀可以提高执行效率。
2）案例实操
合并老师与课程表

hive> select * from teacher t join course c on t.t_id = c.t_id;

内连接（INNER JOIN）
内连接：只有进行连接的两个表中都存在与连接条件相匹配的数据才会被保留下来。

hive> select * from teacher t inner join course c on t.t_id = c.t_id;

左外连接（LEFT OUTER JOIN）
左外连接：JOIN操作符左边表中符合WHERE子句的所有记录将会被返回。
查询老师对应的课程

hive> select * from teacher t left join course c on t.t_id = c.t_id;

右外连接（RIGHT OUTER JOIN）
右外连接：JOIN操作符右边表中符合WHERE子句的所有记录将会被返回。

hive> select * from teacher t right join course c on t.t_id = c.t_id;

满外连接（FULL OUTER JOIN）
满外连接：将会返回所有表中符合WHERE语句条件的所有记录。如果任一表的指定字段没有符合条件的值的话，那么就使用NULL值替代。

hive> select * from teacher t full join course c on t.t_id = c.t_id ;

多表连接
注意：连接 n个表，至少需要n-1个连接条件。例如：连接三个表，至少需要两个连接条件。
多表连接查询，查询老师对应的课程，以及对应的分数，对应的学生
设置表头：set hive.cli.print.header=true;

hive> select * from teacher t 
    > left join course c 
    > on t.t_id = c.t_id
    > left join score s 
    > on s.c_id = c.c_id
    > left join student stu 
    > on s.s_id = stu.s_id;

大多数情况下，Hive会对每对JOIN连接对象启动一个MapReduce任务。本例中会首先启动一个MapReduce job对表teacher和表course进行连接操作，然后会再启动一个MapReduce job将第一个MapReduce job的输出和表score;进行连接操作。
排序
全局排序（Order By）
Order By：全局排序，一个reduce
1）使用 ORDER BY 子句排序
ASC（ascend）: 升序（默认）
DESC（descend）: 降序
2）ORDER BY 子句在SELECT语句的结尾。
3）案例实操
（1）查询学生的成绩，并按照分数降序排列

hive> select * from student s left jooin score sco on s.s_id = sco.s_id order by sco.s_score desc;

（2）查询学生的成绩，并按照分数升序排列

hive> select * from student s eft jooin score sco on s.s_id = sco.s_id order by  sco.s_score asc;

按照别名排序
按照分数的平均值排序

hive> select s_id ,avg(s_score) avg from score group by s_id order by avg;

多个列排序
按照学生id和平均成绩进行排序

hive> select s_id ,avg(s_score) avg from score group by s_id order by s_id,avg;

每个MapReduce内部排序（Sort By）局部排序
Sort By：每个MapReduce内部进行排序，对全局结果集来说不是排序。
1）设置reduce个数
set mapreduce.job.reduces=3;
2）查看设置reduce个数

hive> set mapreduce.job.reduces;

3）查询成绩按照成绩降序排列

hive> select * from score sort by s_score;

将查询结果导入到文件中（按照成绩降序排列）

hive> insert overwrite local directory '/export/servers/hivedatas/sort' select * from score sort by s_score;

分区排序（DISTRIBUTE BY）
Distribute By：类似MR中partition，进行分区，结合sort by使用。
注意，Hive要求DISTRIBUTE BY语句要写在SORT BY语句之前。
对于distribute by进行测试，一定要分配多reduce进行处理，否则无法看到distribute by的效果。
案例实操：
（1）先按照学生id进行分区，再按照学生成绩进行排序。

设置reduce的个数，将我们对应的s_id划分到对应的reduce当中去
set mapreduce.job.reduces=7;
通过distribute by 进行数据的分区

hive> insert overwrite local directory '/home/sort' select * from score distribute by s_id sort by s_score;

CLUSTER BY
当distribute by和sort by字段相同时，可以使用cluster by方式。
cluster by除了具有distribute by的功能外还兼具sort by的功能。但是排序只能是倒序排序，不能指定排序规则为ASC或者DESC。
1）以下两种写法等价

hive> select * from score cluster by s_id;
hive> select * from score distribute by s_id sort by s_id;

明檬初见

关注

1
点赞
踩
0

收藏

觉得还不错? 一键收藏
0
评论
Hadoop之hive第三天（表连接）

创建个表，加载本地数据。hive> create table t_a(id string,num int) row format delimited fields terminated by ',';hive> load data local inpath '/home/a.txt' into table t_a;在创建个b表，加载本地数据hive> create t...
复制链接

扫一扫