Hive Data Queries: Complex SQL

Latest recommended article published on 2023-04-21 16:01:11

幸运小侯子

Tags: Big Data

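Sorting and Aggregating

When the data set is small, ORDER BY is all that is needed. Because it produces a total ordering, the final sort has to be done in a single reducer.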

from records

select year,temperature

order by year asc,temperature desc;

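When the data set is large and a total ordering is not required, it is enough to sort the rows within each reducer. The query below uses DISTRIBUTE BY to send all rows with the same year to the same reducer, then uses SORT BY to order the rows inside each reducer.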

from records

select year,temperature

distribute by year

sort by year asc,temperature desc;
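The difference from a total ORDER BY can be modeled outside Hive. The Python sketch below is only an illustration under stated assumptions: the sample rows are made up, and `year % num_reducers` is a stand-in for whatever partitioning function Hive actually applies. It shows that each reducer's output is sorted, while the reducers are not ordered relative to each other.

```python
from collections import defaultdict

# Hypothetical sample rows: (year, temperature).
RECORDS = [
    (1950, 0), (1950, 22), (1950, -11),
    (1949, 111), (1949, 78),
]

def distribute_and_sort(rows, num_reducers):
    """Model DISTRIBUTE BY year + SORT BY year ASC, temperature DESC."""
    reducers = defaultdict(list)
    for year, temperature in rows:
        # DISTRIBUTE BY year: rows with the same year reach the same reducer.
        # (year % num_reducers is an assumed stand-in for Hive's partitioner.)
        reducers[year % num_reducers].append((year, temperature))
    # SORT BY: every reducer sorts only the rows it received.
    return {
        reducer_id: sorted(part, key=lambda r: (r[0], -r[1]))
        for reducer_id, part in reducers.items()
    }

result = distribute_and_sort(RECORDS, num_reducers=2)
```

With two reducers, all 1950 rows land in one partition and all 1949 rows in the other, each sorted by descending temperature.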

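When the DISTRIBUTE BY and SORT BY columns are the same and ascending order is acceptable, CLUSTER BY is shorthand for both: rows with the same key go to the same reducer and are sorted in ascending order.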

from records

select year,temperature

cluster by year;

--------------------------------------------------

from records

select year,temperature

cluster by year,temperature;

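MapReduce Scripts

Joins

Inner Joins

Hive compiles a join into a MapReduce job. The mappers emit the join-condition columns as the output key and the rows of the tables being joined as the value; the reducers then perform the actual join.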
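The map and reduce roles described above can be sketched in plain Python. This is a simplified model, not Hive's implementation: the sample rows and the "L"/"R" tags are assumptions made for illustration.

```python
from collections import defaultdict
from itertools import product

# Hypothetical rows. records: (year, temperature); records2: (year, name).
RECORDS = [("1990", 10), ("1990", 25), ("1991", 7)]
RECORDS2 = [("1990", "ruishenh0"), ("1992", "ruishenh2")]

def map_phase(rows, tag):
    # Emit the join column as the key and the tagged row as the value.
    for row in rows:
        yield row[0], (tag, row)

def reduce_side_join(left, right):
    # Shuffle: group every emitted value by its join key.
    groups = defaultdict(lambda: {"L": [], "R": []})
    emitted = list(map_phase(left, "L")) + list(map_phase(right, "R"))
    for key, (tag, row) in emitted:
        groups[key][tag].append(row)
    # Reduce: pair each left row with each right row that shares the key.
    joined = []
    for key in sorted(groups):
        for l_row, r_row in product(groups[key]["L"], groups[key]["R"]):
            joined.append(l_row + r_row)
    return joined

joined = reduce_side_join(RECORDS, RECORDS2)
```

Keys that appear on only one side (1991 and 1992 here) produce no output, which matches inner-join behavior.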

Data Preparation

Data file /root/hcr/tmp/sample2.txt:

1990	ruishenh0
1992	ruishenh2
1991	ruishenh1
1993	ruishenh3
1994	ruishenh4
1995	ruishenh5
1996	ruishenh6
1997	ruishenh7
1998	ruishenh8

create table records2 (year string, name string) row format delimited fields terminated by '\t';

load data local inpath '/root/hcr/tmp/sample2.txt' overwrite into table records2;

JOIN ... ON

select records.*,records2.*

from records join records2 on (records.year=records2.year);

A JOIN ... ON clause in Hive can combine several conditions, for example a join b on a.id=b.aid and a.type=b.atype

select records.*,records2.*

from records join records2 on (records.year=records2.year and records.quality!=1);

Hive also supports joins across more than two tables:

select r1.year,r2.name,r2.year,r4.y,r4.standard from records2 r2 join records r1 on (r1.year=r2.year) join records4 r4 on (r4.y=r2.year);

However, this query failed when it was run. // TODO: find the cause

One hint: in a JOIN clause, the larger tables are generally placed last.

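Outer Joins

A left outer join returns every row of the left table; where no match is found, the right table's columns are NULL.

select * from records r left outer join records2 r2 on r.year=r2.year;

A right outer join returns every row of the right table; where no match is found, the left table's columns are NULL.

select * from records r right outer join records2 r2 on r.year=r2.year;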
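The NULL-filling behavior of the two outer joins can be illustrated with a small Python model. The rows below are made up for the example, and `None` plays the role of NULL.

```python
# Hypothetical rows. records: (year, temperature); records2: (year, name).
RECORDS = [("1990", 10), ("1989", 3)]
RECORDS2 = [("1990", "ruishenh0"), ("1992", "ruishenh2")]

def left_outer_join(left, right):
    out = []
    for l_row in left:
        matches = [r_row for r_row in right if r_row[0] == l_row[0]]
        if matches:
            out.extend(l_row + r_row for r_row in matches)
        else:
            # No match: keep the left row, fill the right columns with None.
            out.append(l_row + (None, None))
    return out

def right_outer_join(left, right):
    out = []
    for r_row in right:
        matches = [l_row for l_row in left if l_row[0] == r_row[0]]
        if matches:
            out.extend(l_row + r_row for l_row in matches)
        else:
            # No match: keep the right row, fill the left columns with None.
            out.append((None, None) + r_row)
    return out

left_result = left_outer_join(RECORDS, RECORDS2)
right_result = right_outer_join(RECORDS, RECORDS2)
```

Here the 1989 row survives only in the left outer join, and the 1992 row survives only in the right outer join.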

Semi Joins

select * from records2 r left semi join records r2 on r.year=r2.year;
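A left semi join keeps each left-table row at most once if at least one matching row exists on the right, and returns only the left table's columns. The Python sketch below contrasts this with an inner join; the sample rows (including the unmatched 1999 row) are assumptions for illustration.

```python
# Hypothetical rows. records2: (year, name); records: (year, temperature).
RECORDS2 = [("1990", "ruishenh0"), ("1991", "ruishenh1"), ("1999", "nobody")]
RECORDS = [("1990", 10), ("1990", 25), ("1991", 7)]

def left_semi_join(left, right):
    # Only the existence of a matching key on the right matters.
    right_keys = {row[0] for row in right}
    return [row for row in left if row[0] in right_keys]

def inner_join(left, right):
    # For comparison: one output row per matching pair.
    return [l_row + r_row for l_row in left for r_row in right
            if l_row[0] == r_row[0]]

semi = left_semi_join(RECORDS2, RECORDS)
inner = inner_join(RECORDS2, RECORDS)
```

The 1990 row appears once in `semi` but twice in `inner`, because the inner join repeats it for each matching temperature row.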

Map Joins: /*+MAPJOIN(records2)*/

from records r join records2 r2 on r.year=r2.year

select /*+MAPJOIN(records2)*/ r2.*,r.*;
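The idea behind a map join is that the small table fits in memory, so each mapper can join rows as it reads the large table and no reduce phase is needed for the join. A rough Python model, with made-up rows and a plain dictionary standing in for the in-memory hash table:

```python
from collections import defaultdict

# Hypothetical rows. records (large): (year, temperature);
# records2 (small): (year, name).
RECORDS = [("1990", 10), ("1991", 7), ("1990", 25), ("1989", 3)]
RECORDS2 = [("1990", "ruishenh0"), ("1991", "ruishenh1")]

def map_join(big_rows, small_rows):
    # Load the small table into an in-memory hash table keyed by year.
    lookup = defaultdict(list)
    for row in small_rows:
        lookup[row[0]].append(row)
    # Stream the large table and probe the hash table row by row.
    for row in big_rows:
        for match in lookup.get(row[0], []):
            # Column order follows the query: r2.* first, then r.*
            yield match + row

map_joined = list(map_join(RECORDS, RECORDS2))
```

Output order follows the order in which the large table is read, since nothing is shuffled or sorted.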

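Subqueries

A subquery is a SELECT statement embedded in another SQL statement. Hive's support for subqueries is limited: the version described here only allows a subquery in the FROM clause of a SELECT statement (Hive 0.13 and later also accept IN and EXISTS subqueries in the WHERE clause).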

from

(

From records r

select r.year,MAX(r.temperature) as max_temperature

where r.temperature !=9999 and (r.quality=0 or r.quality=1 or r.quality=2)

group by r.year

) mt

select mt.year,avg(mt.max_temperature)

group by mt.year ;

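Because the outer query refers to the subquery's columns, the subquery must be given an alias (mt in the example above), and the column names it returns must be unique (for example, it cannot return both records.year and records2.year).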
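The two levels of the query above can be traced with a short Python sketch. The rows are hypothetical (year, temperature, quality) tuples; the inner function plays the role of the subquery `mt` and the outer function plays the role of the enclosing query.

```python
from collections import defaultdict

# Hypothetical rows: (year, temperature, quality).
RECORDS = [
    (1949, 111, 1), (1949, 78, 1), (1949, 9999, 1),
    (1950, 0, 1), (1950, 22, 1), (1950, -11, 4),
]

def max_temperature_by_year(rows):
    """Inner query: filter bad readings, then MAX(temperature) per year."""
    best = {}
    for year, temperature, quality in rows:
        if temperature != 9999 and quality in (0, 1, 2):
            best[year] = max(temperature, best.get(year, temperature))
    return best

def avg_of_max_by_year(mt):
    """Outer query: AVG(max_temperature) grouped by year over the subquery."""
    grouped = defaultdict(list)
    for year, max_temperature in mt.items():
        grouped[year].append(max_temperature)
    return {year: sum(vals) / len(vals) for year, vals in grouped.items()}

mt = max_temperature_by_year(RECORDS)
averages = avg_of_max_by_year(mt)
```

Since the subquery already yields one row per year, the outer average is taken over a single value per group and equals that year's maximum.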

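Views

A view in Hive is a virtual table, essentially a saved SQL query: it is not materialized. A view is read-only, so it cannot be used to load or insert data into its base tables.

Creating a View

create view max_records

as

select r.year,MAX(r.temperature) as max_temperature

from records r

where r.temperature !=9999 and (r.quality=0 or r.quality=1 or r.quality=2)

group by r.year ;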

group by r.year ;

Querying the View

select * from max_records;

Reproducing the subquery example above:

select year,avg(max_temperature)

from max_records

group by year;
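That a view is a saved query rather than stored data can be observed in any SQL engine with views. The sketch below uses Python's built-in sqlite3 module purely as a stand-in for Hive (an assumption made so the example runs anywhere; SQLite and Hive differ in many other respects), with made-up rows. The view's result changes as soon as the base table changes, because the query is re-evaluated on each read.

```python
import sqlite3

# SQLite is used here only as a stand-in engine; this is not Hive.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE records (year TEXT, temperature INTEGER, quality INTEGER)"
)
# Hypothetical sample rows.
conn.executemany(
    "INSERT INTO records VALUES (?, ?, ?)",
    [("1949", 111, 1), ("1949", 78, 1), ("1950", 22, 1), ("1950", 9999, 1)],
)
conn.execute(
    """
    CREATE VIEW max_records AS
    SELECT r.year, MAX(r.temperature) AS max_temperature
    FROM records r
    WHERE r.temperature != 9999
      AND (r.quality = 0 OR r.quality = 1 OR r.quality = 2)
    GROUP BY r.year
    """
)

def query_view():
    # The view's defining query runs each time the view is read.
    return conn.execute(
        "SELECT year, max_temperature FROM max_records ORDER BY year"
    ).fetchall()

before = query_view()
conn.execute("INSERT INTO records VALUES ('1950', 30, 0)")
after = query_view()
```

The new 1950 reading is inserted into the base table, not the view, yet it shows up in the view's next result.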
