hive中order by 语句跟mysql order by 语句的区别

笔者使用hive 语句的过程中发现一些问题,故做简单分析

 

1.问题起因

 hive 查询语句: 

主要逻辑:A表LEFT JOIN B表 然后按照A表中某个字段排序后取指定limit的数据;

错误的执行语句如下:

select A.id, A.tag,A.is_jingpin,A.is_time_sensitive,A.first_publish_time from(
	select data['id'] as id,
	data['intf_qiwen_vertical_level1_category_tag_name'] as tag,
	data['is_video_jingpin_mark'] as is_jingpin,
	data['is_small_video_time_sensitive'] as is_time_sensitive,
	data['first_publish_time']as first_publish_time,
	data['uploader_vertical_iqiyihao_internal_level']as internal_level
	from recindexing_sv_corpus_l1_fields
	where cast(data['business_type'] as int)& 2 > 0
	and(data['is_video_jingpin_mark']='false' or data['is_video_jingpin_mark']='')
	and cast(data['create_time']as bigint)>=(unix_timestamp(current_timestamp,'yyyy-MM-dd HH:mm:ss')-90*24*60*60)
	and (not data['intf_qiwen_vertical_category_tag_name_v2'] like '%无聊日常%')
	and (not data['intf_qiwen_vertical_category_tag_name_v2']like '%儿童%')
	and cast(data['intf_qiwen_vertical_content_pointless_score'] as double)>=0
	and cast(data['intf_qiwen_vertical_content_pointless_score'] as double)<=0.1
	and cast(nvl(data['intf_qiwen_vertical_cover_quality_score_v2'],0)as double)>=0.85
	and cast(nvl(data['intf_qiwen_vertical_content_quality_score_v2'],0)as double)>=0.95
	and cast(nvl(data['intf_qiwen_vertical_cover_quality_resolution_score'],0)as double)>=0.9
)A 
left outer join (
select id from sv_boutique_candidate_produced)B
on A.id = B.id
where B.id is null 
ORDER by A.internal_level desc limit 2500;

 错误信息:

FAILED: SemanticException [Error 10004]: Line 25:28 Invalid table alias or column reference 'internal_level': (possible column names are: id, tag, is_jingpin, is_time_sensitive, first_publish_time)

 修正后的hive 语句如下:

select A.id, A.tag,A.is_jingpin,A.is_time_sensitive,A.first_publish_time,A.internal_level from(
	select data['id'] as id,
	data['intf_qiwen_vertical_level1_category_tag_name'] as tag,
	data['is_video_jingpin_mark'] as is_jingpin,
	data['is_small_video_time_sensitive'] as is_time_sensitive,
	data['first_publish_time']as first_publish_time,
	data['uploader_vertical_iqiyihao_internal_level']as internal_level
	from recindexing_sv_corpus_l1_fields
	where cast(data['business_type'] as int)& 2 > 0
	and(data['is_video_jingpin_mark']='false' or data['is_video_jingpin_mark']='')
	and cast(data['create_time']as bigint)>=(unix_timestamp(current_timestamp,'yyyy-MM-dd HH:mm:ss')-90*24*60*60)
	and (not data['intf_qiwen_vertical_category_tag_name_v2'] like '%无聊日常%')
	and (not data['intf_qiwen_vertical_category_tag_name_v2']like '%儿童%')
	and cast(data['intf_qiwen_vertical_content_pointless_score'] as double)>=0
	and cast(data['intf_qiwen_vertical_content_pointless_score'] as double)<=0.1
	and cast(nvl(data['intf_qiwen_vertical_cover_quality_score_v2'],0)as double)>=0.85
	and cast(nvl(data['intf_qiwen_vertical_content_quality_score_v2'],0)as double)>=0.95
	and cast(nvl(data['intf_qiwen_vertical_cover_quality_resolution_score'],0)as double)>=0.9
)A 
left outer join (
select id from sv_boutique_candidate_produced)B
on A.id = B.id
where B.id is null 
ORDER by A.internal_level desc limit 2500;

主要修改是在select 语句中加入了A.internal_level

2 问题分析

使用mysql 进行类似的查询

select A.feed_id,A.name,A.impression_goal from (
	select feed_id,name,duration,impression_goal, create_time as ct
	from  guaranteed_delivery_plan_conf)A 
    left join (
		select feed_id,conf_id from guaranteed_delivery_plan_item)B 
	on A.feed_id = B.feed_id 
where B.feed_id >3809986964540500 
order by A.ct desc 
limit 10;

mysql 类似语句可以正常执行,但是hive 语句缺无法正常运行;

3 分析缘由

hive ORDER BY 会做产出一个reducer 做全局排序故需要先查询出相应的字段

继续分析hive的order by,sort by,distribute by,cluster by

3.1order by :

select col,col2...
from tableName
where condition
order by col1,col2 [asc|desc

(1):order by后面可以有多列进行排序,默认按字典排序。

(2):order by为全局排序。

(3):order by需要reduce操作,且只有一个reduce,无法配置(因为多个reduce无法完成全局排序)。

图示:

3.2 sort by

sort by不是全局排序,其在数据进入reducer前完成排序,因此,如果用sort by进行排序,并且设置mapred.reduce.tasks>1,则sort by只会保证每个reducer的输出有序,并不保证全局有序。sort by不同于order by,它不受hive.mapred.mode属性的影响,sort by的数据只能保证在同一个reduce中的数据可以按指定字段排序

图示:

3.3 distribute by

distribute by是控制在map端如何拆分数据给reduce端的。hive会根据distribute by后面列,对应reduce的个数进行分发,默认是采用hash算法

图片描述

3.4 cluster by

cluster by除了具有distribute by的功能外还兼具sort by的功能。但是排序只能是倒序,不能指定排序规则为ASC或者DESC。

图示:

参考文章:

1.https://zhuanlan.zhihu.com/p/93747613

2.https://blog.csdn.net/lzm1340458776/article/details/43306115

  • 0
    点赞
  • 0
    收藏
    觉得还不错? 一键收藏
  • 0
    评论
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值