Hive Multi-Table Joins

SailorGing

Last modified: 2023-02-28 12:52:18

Category: Hive data warehouse tools. Tags: hive, hadoop, big data, data warehouse

First published: 2023-02-28 12:39:37

Article link: https://blog.csdn.net/mrgui008/article/details/129258283

Contents

1) Example
2) Summary
3) Optimization

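1) Example

Three tables are used: a department table, an employee table, and a location table. Their definitions are as follows: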

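-- Department table
create external table if not exists department(
    deptno bigint comment 'department number',
    dname  string comment 'department name',
    loc    bigint comment 'department location'
)comment 'department table'
row format delimited fields terminated by '\t'
location '/test/department';   -- External table: the data usually lives outside Hive's warehouse directory, so its location is given explicitly.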

-- Employee table
create external table if not exists employee(
    empno    bigint comment 'employee number',
    ename    string comment 'employee name',
    job      string comment 'job type',
    mgr      bigint comment 'department manager',
    hiredate string comment 'hire date',
    sal      decimal(16,2) comment 'salary',   -- Monetary fields: double or decimal; decimal avoids floating-point rounding errors.
    comm     decimal(16,2) comment 'bonus',
    deptno   bigint comment 'department number'
)comment 'employee table'
row format delimited fields terminated by '\t'
location '/test/employee';

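-- Location table
create external table if not exists location(
    loc int,          -- location ID; department.loc is bigint, and Hive widens int to bigint when comparing them
    loc_name string   -- location name
)
row format delimited fields terminated by '\t'
location '/test/location';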

Multi-table join query:

SELECT e.ename, d.dname, l.loc_name
FROM   employee e
JOIN   department d
    ON d.deptno = e.deptno
JOIN   location l
    ON d.loc = l.loc;

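How the query executes under the hood:

Hive first launches one MapReduce job to join table e with table d.
It then launches a second MapReduce job to join the output of the first job with table l.
Although two MapReduce jobs run, they run serially rather than in parallel, because the third table must be joined with the result of joining the first two. (This describes the ordinary reduce-side join on the MapReduce engine; if Hive converts a join into a map join because one table is small, the plan can differ.)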
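One way to check this on your own cluster is to prefix the query with EXPLAIN, which prints the stages Hive plans and how they depend on each other. The sketch below assumes the MapReduce engine is available; the two SET lines are assumptions added here so that the plan is not altered by another engine or by automatic map-join conversion.

```sql
-- Assumption: force the MapReduce engine and turn off automatic map-join
-- conversion, so that both joins are planned as ordinary reduce-side joins.
SET hive.execution.engine=mr;
SET hive.auto.convert.join=false;

-- EXPLAIN prints the plan without running the query.
EXPLAIN
SELECT e.ename, d.dname, l.loc_name
FROM   employee e
JOIN   department d
    ON d.deptno = e.deptno
JOIN   location l
    ON d.loc = l.loc;
```

In the output, the STAGE DEPENDENCIES section shows which stage has to wait for which. Stage numbering and the amount of detail vary between Hive versions.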

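2) Summary

Joining n tables requires at least n-1 join conditions. For example, joining three tables requires at least two join conditions.
In a multi-table join, omitting a join condition or writing an invalid one produces a Cartesian product, in which every row of one table is paired with every row of the other.
In most cases Hive launches one MapReduce job for each pair of joined tables. When every join uses the same join key, Hive launches only one MapReduce job.
In the example, Hive first joins table e with table d, then joins that result with table l. Hive always processes joins from left to right.
When a single SQL statement runs on MapReduce, jobs that depend on one another's output, as in this example, run serially rather than in parallel.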
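To make the Cartesian-product point concrete, the sketch below contrasts a join with no condition against the same join with one, using the employee and department tables from the example. The strict-check setting is an assumption about your environment: it applies to Hive 2.x and later, and its default differs between versions.

```sql
-- Cartesian product: every employee row is paired with every department row,
-- so the result has (rows in employee) x (rows in department) rows.
-- If hive.strict.checks.cartesian.product is true, Hive rejects this query.
SELECT e.ename, d.dname
FROM   employee e
CROSS JOIN department d;

-- With a join condition, each employee is paired only with its own department.
SELECT e.ename, d.dname
FROM   employee e
JOIN   department d
    ON d.deptno = e.deptno;
```

Running both with a `count(*)` in place of the column list is a quick way to see how much larger the unconditioned result is.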

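3) Optimization

When three or more tables are joined and every ON clause uses the same join key, Hive produces only one MapReduce job.
Turn serial work into parallel work. Where the requirements allow, avoid chaining many tables in a single sequence of joins. For example, to join 2n tables, join two groups of n tables separately and then join the two results.
Change the execution engine. Replace Hive on MapReduce with Hive on Spark: Spark keeps intermediate data in memory where possible, whereas MapReduce writes intermediate results to disk between jobs.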
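The three ideas above can be sketched as follows. The tables dept_budget (deptno, budget) and location_region (loc, region_name) are hypothetical and exist only for illustration; they are not part of the original example. Whether the planner actually keeps the two groups in (b) as independent stages depends on the Hive version and optimizer settings, so it is worth confirming with EXPLAIN.

```sql
-- (a) Same join key in every ON clause: e.deptno appears in both conditions,
--     so Hive can plan both joins as a single MapReduce job.
--     dept_budget is a hypothetical table keyed by deptno.
SELECT e.ename, d.dname, b.budget
FROM   employee e
JOIN   department d  ON d.deptno = e.deptno
JOIN   dept_budget b ON b.deptno = e.deptno;

-- (b) Serial to parallel: join two groups separately, then join the results.
--     hive.exec.parallel lets stages that do not depend on each other run
--     at the same time. location_region is a hypothetical table keyed by loc.
SET hive.exec.parallel=true;

SELECT ed.ename, ed.dname, lr.loc_name, lr.region_name
FROM  (SELECT e.ename, d.dname, d.loc
       FROM   employee e
       JOIN   department d ON d.deptno = e.deptno) ed
JOIN  (SELECT l.loc, l.loc_name, r.region_name
       FROM   location l
       JOIN   location_region r ON r.loc = l.loc) lr
  ON  ed.loc = lr.loc;

-- (c) Switch the execution engine for the current session.
--     Assumption: Hive on Spark is installed and configured on the cluster.
SET hive.execution.engine=spark;
```

The two inner subqueries in (b) do not read each other's output, which is what gives Hive the chance to run them side by side before the final join.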
