hive中order by 语句跟mysql order by 语句的区别

最新推荐文章于 2023-09-23 16:46:50 发布

ZHEXI_Coding free

最新推荐文章于 2023-09-23 16:46:50 发布

阅读量345

点赞数

分类专栏：数据开发

本文链接：https://blog.csdn.net/chy19911123/article/details/107985847

版权

数据开发专栏收录该内容

2 篇文章 0 订阅

订阅专栏

笔者使用hive 语句的过程中发现一些问题，故做简单分析

1.问题起因

hive 查询语句:

主要逻辑：A表LEFT JOIN B表然后按照A表中某个字段排序后取指定limit的数据；

错误的执行语句如下：

select A.id, A.tag,A.is_jingpin,A.is_time_sensitive,A.first_publish_time from(
	select data['id'] as id,
	data['intf_qiwen_vertical_level1_category_tag_name'] as tag,
	data['is_video_jingpin_mark'] as is_jingpin,
	data['is_small_video_time_sensitive'] as is_time_sensitive,
	data['first_publish_time']as first_publish_time,
	data['uploader_vertical_iqiyihao_internal_level']as internal_level
	from recindexing_sv_corpus_l1_fields
	where cast(data['business_type'] as int)& 2 > 0
	and(data['is_video_jingpin_mark']='false' or data['is_video_jingpin_mark']='')
	and cast(data['create_time']as bigint)>=(unix_timestamp(current_timestamp,'yyyy-MM-dd HH:mm:ss')-90*24*60*60)
	and (not data['intf_qiwen_vertical_category_tag_name_v2'] like '%无聊日常%')
	and (not data['intf_qiwen_vertical_category_tag_name_v2']like '%儿童%')
	and cast(data['intf_qiwen_vertical_content_pointless_score'] as double)>=0
	and cast(data['intf_qiwen_vertical_content_pointless_score'] as double)<=0.1
	and cast(nvl(data['intf_qiwen_vertical_cover_quality_score_v2'],0)as double)>=0.85
	and cast(nvl(data['intf_qiwen_vertical_content_quality_score_v2'],0)as double)>=0.95
	and cast(nvl(data['intf_qiwen_vertical_cover_quality_resolution_score'],0)as double)>=0.9
)A 
left outer join (
select id from sv_boutique_candidate_produced)B
on A.id = B.id
where B.id is null 
ORDER by A.internal_level desc limit 2500;

错误信息:

FAILED: SemanticException [Error 10004]: Line 25:28 Invalid table alias or column reference 'internal_level': (possible column names are: id, tag, is_jingpin, is_time_sensitive, first_publish_time)

修正后的hive 语句如下：

select A.id, A.tag,A.is_jingpin,A.is_time_sensitive,A.first_publish_time,A.internal_level from(
	select data['id'] as id,
	data['intf_qiwen_vertical_level1_category_tag_name'] as tag,
	data['is_video_jingpin_mark'] as is_jingpin,
	data['is_small_video_time_sensitive'] as is_time_sensitive,
	data['first_publish_time']as first_publish_time,
	data['uploader_vertical_iqiyihao_internal_level']as internal_level
	from recindexing_sv_corpus_l1_fields
	where cast(data['business_type'] as int)& 2 > 0
	and(data['is_video_jingpin_mark']='false' or data['is_video_jingpin_mark']='')
	and cast(data['create_time']as bigint)>=(unix_timestamp(current_timestamp,'yyyy-MM-dd HH:mm:ss')-90*24*60*60)
	and (not data['intf_qiwen_vertical_category_tag_name_v2'] like '%无聊日常%')
	and (not data['intf_qiwen_vertical_category_tag_name_v2']like '%儿童%')
	and cast(data['intf_qiwen_vertical_content_pointless_score'] as double)>=0
	and cast(data['intf_qiwen_vertical_content_pointless_score'] as double)<=0.1
	and cast(nvl(data['intf_qiwen_vertical_cover_quality_score_v2'],0)as double)>=0.85
	and cast(nvl(data['intf_qiwen_vertical_content_quality_score_v2'],0)as double)>=0.95
	and cast(nvl(data['intf_qiwen_vertical_cover_quality_resolution_score'],0)as double)>=0.9
)A 
left outer join (
select id from sv_boutique_candidate_produced)B
on A.id = B.id
where B.id is null 
ORDER by A.internal_level desc limit 2500;

主要修改是在select 语句中加入了A.internal_level

2 问题分析

使用mysql 进行类似的查询

select A.feed_id,A.name,A.impression_goal from (
	select feed_id,name,duration,impression_goal, create_time as ct
	from  guaranteed_delivery_plan_conf)A 
    left join (
		select feed_id,conf_id from guaranteed_delivery_plan_item)B 
	on A.feed_id = B.feed_id 
where B.feed_id >3809986964540500 
order by A.ct desc 
limit 10;

mysql 类似语句可以正常执行，但是hive 语句缺无法正常运行；

3 分析缘由

hive ORDER BY 会做产出一个reducer 做全局排序故需要先查询出相应的字段

继续分析hive的order by，sort by，distribute by，cluster by

3.1order by :

select col,col2...
from tableName
where condition
order by col1,col2 [asc|desc

(1)：order by后面可以有多列进行排序，默认按字典排序。

(2)：order by为全局排序。

(3)：order by需要reduce操作，且只有一个reduce，无法配置(因为多个reduce无法完成全局排序)。

图示：

3.2 sort by

sort by不是全局排序，其在数据进入reducer前完成排序，因此，如果用sort by进行排序，并且设置mapred.reduce.tasks>1，则sort by只会保证每个reducer的输出有序，并不保证全局有序。sort by不同于order by，它不受hive.mapred.mode属性的影响，sort by的数据只能保证在同一个reduce中的数据可以按指定字段排序

图示：

3.3 distribute by

distribute by是控制在map端如何拆分数据给reduce端的。hive会根据distribute by后面列，对应reduce的个数进行分发，默认是采用hash算法

图片描述

3.4 cluster by

cluster by除了具有distribute by的功能外还兼具sort by的功能。但是排序只能是倒序，不能指定排序规则为ASC或者DESC。

图示：

参考文章：

1.https://zhuanlan.zhihu.com/p/93747613

2.https://blog.csdn.net/lzm1340458776/article/details/43306115

ZHEXI_Coding free

关注

0
点赞
踩
0

收藏

觉得还不错? 一键收藏
0
评论
hive中order by 语句跟mysql order by 语句的区别

笔者使用hive 语句的过程中发现一些问题，故做简单分析1.问题起因hive 查询语句:主要逻辑：A表LEFT JOIN B表然后按照A表中某个字段排序后取指定limit的数据；错误的执行语句如下：select A.id, A.tag,A.is_jingpin,A.is_time_sensitive,A.first_publish_time from( select data['id'] as id, data['intf_qiwen_vertical_level1_cate..
复制链接

扫一扫

专栏目录