DISTINCT usage in Hive

Welcome to follow my personal WeChat official account: 赶令, where I regularly post big data knowledge and personal notes.

In Hive, DISTINCT removes duplicate rows, and in some cases it serves the same purpose as GROUP BY.

Let's test some of DISTINCT's behavior. First, create a test table:

create table test.trip_tmp(
id int,
user_id int,
salesman_id int,
huose_id int
);

Insert some sample data:

insert into test.trip_tmp values(1, 2, 3, 3);
insert into test.trip_tmp values(1, 2, 3, 3);
insert into test.trip_tmp values(2, 2, 3, 3);
insert into test.trip_tmp values(3, 2, 3, 3);
insert into test.trip_tmp values(4, 2, 5, 3);
insert into test.trip_tmp values(6, 3, 3, 3);
insert into test.trip_tmp values(5, 4, 2, 3);
insert into test.trip_tmp values(5, 2, 3, 3);
insert into test.trip_tmp values(6, 2, 5, 3);
insert into test.trip_tmp values(5, 2, 3, 3);
insert into test.trip_tmp values(5, 2, 5, 3);

View all rows in the table:

select * from test.trip_tmp;
OK
1	2	3	3
1	2	3	3
5	2	5	3
2	2	3	3
3	2	3	3
4	2	5	3
6	3	3	3
5	4	2	3
5	2	3	3
6	2	5	3
5	2	3	3
Time taken: 0.277 seconds, Fetched: 11 row(s)

Deduplicate across all columns of the table:

select distinct id, user_id, salesman_id, huose_id from test.trip_tmp;

OK
1	2	3	3
2	2	3	3
3	2	3	3
4	2	5	3
5	2	3	3
5	2	5	3
5	4	2	3
6	2	5	3
6	3	3	3
Time taken: 13.142 seconds, Fetched: 9 row(s)

After applying DISTINCT, rows that are duplicated across all of the selected columns have been removed.
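For reference, the same full-row deduplication can be written with GROUP BY over every selected column; this is just a sketch of the equivalence mentioned at the start, run against the same test table:

-- grouping by every selected column deduplicates exactly like DISTINCT
select id, user_id, salesman_id, huose_id
from test.trip_tmp
group by id, user_id, salesman_id, huose_id;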

In Hive, DISTINCT must appear at the very beginning of the SELECT list; placing a column name before DISTINCT causes a parse error:

select huose_id, distinct id, user_id, salesman_id from test.trip_tmp;
NoViableAltException(96@[80:1: selectItem : ( ( tableAllColumns )=> tableAllColumns -> ^( TOK_SELEXPR tableAllColumns ) | ( expression ( ( ( KW_AS )? identifier ) | ( KW_AS LPAREN identifier ( COMMA identifier )* RPAREN ) )? ) -> ^( TOK_SELEXPR expression ( identifier )* ) );])
	at org.apache.hadoop.hive.ql.parse.HiveParser_SelectClauseParser$DFA13.specialStateTransition(HiveParser_SelectClauseParser.java:4625)
	at org.antlr.runtime.DFA.predict(DFA.java:80)
	at org.apache.hadoop.hive.ql.parse.HiveParser_SelectClauseParser.selectItem(HiveParser_SelectClauseParser.java:1616)
	at org.apache.hadoop.hive.ql.parse.HiveParser_SelectClauseParser.selectList(HiveParser_SelectClauseParser.java:1177)
	at org.apache.hadoop.hive.ql.parse.HiveParser_SelectClauseParser.selectClause(HiveParser_SelectClauseParser.java:951)
	at org.apache.hadoop.hive.ql.parse.HiveParser.selectClause(HiveParser.java:42192)
	at org.apache.hadoop.hive.ql.parse.HiveParser.atomSelectStatement(HiveParser.java:36852)
	at org.apache.hadoop.hive.ql.parse.HiveParser.selectStatement(HiveParser.java:37119)
	at org.apache.hadoop.hive.ql.parse.HiveParser.regularBody(HiveParser.java:36765)
	at org.apache.hadoop.hive.ql.parse.HiveParser.queryStatementExpressionBody(HiveParser.java:35954)
	at org.apache.hadoop.hive.ql.parse.HiveParser.queryStatementExpression(HiveParser.java:35842)
	at org.apache.hadoop.hive.ql.parse.HiveParser.execStatement(HiveParser.java:2285)
	at org.apache.hadoop.hive.ql.parse.HiveParser.statement(HiveParser.java:1334)
	at org.apache.hadoop.hive.ql.parse.ParseDriver.parse(ParseDriver.java:208)
	at org.apache.hadoop.hive.ql.parse.ParseUtils.parse(ParseUtils.java:77)
	at org.apache.hadoop.hive.ql.parse.ParseUtils.parse(ParseUtils.java:70)
	at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:468)
	at org.apache.hadoop.hive.ql.Driver.compileInternal(Driver.java:1317)
	at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1457)
	at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1237)
	at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1227)
	at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:233)
	at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:184)
	at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:403)
	at org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:821)
	at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:759)
	at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:686)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
	at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
FAILED: ParseException line 1:17 cannot recognize input near 'distinct' 'id' ',' in selection target
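The fix is simply to move DISTINCT to the front; the columns themselves can then be listed in any order after it:

-- distinct placed immediately after select; column order is free
select distinct huose_id, id, user_id, salesman_id from test.trip_tmp;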

DISTINCT can also be used like this, although it is roughly the same as putting all the columns together: the parenthesized columns are grouped into a single struct, and deduplication runs over that struct plus salesman_id.

select distinct (id, user_id, huose_id), salesman_id from test.trip_tmp;
OK
{"col1":1,"col2":2,"col3":3}	3
{"col1":2,"col2":2,"col3":3}	3
{"col1":3,"col2":2,"col3":3}	3
{"col1":4,"col2":2,"col3":3}	5
{"col1":5,"col2":2,"col3":3}	3
{"col1":5,"col2":2,"col3":3}	5
{"col1":5,"col2":4,"col3":3}	2
{"col1":6,"col2":2,"col3":3}	5
{"col1":6,"col2":3,"col3":3}	3
Time taken: 9.201 seconds, Fetched: 9 row(s)
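If the intent is to deduplicate only on (id, user_id, huose_id) while still returning one salesman_id per group, the usual approach is GROUP BY plus an aggregate over the remaining column; a sketch (arbitrarily keeping the max salesman_id) might look like this:

-- deduplicate on three columns, keep one representative salesman_id per group
select id, user_id, huose_id, max(salesman_id) as salesman_id
from test.trip_tmp
group by id, user_id, huose_id;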

DISTINCT cannot be combined with an aggregate function in the same select list; doing so raises an error:

select distinct id, user_id, salesman_id, count(huose_id) from test.trip_tmp;
FAILED: SemanticException [Error 10128]: Line 1:42 Not yet supported place for UDAF 'count'
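What this query presumably wants, one row per (id, user_id, salesman_id) with a count of huose_id, is written with GROUP BY instead:

-- group by the non-aggregated columns instead of using distinct
select id, user_id, salesman_id, count(huose_id) as house_cnt
from test.trip_tmp
group by id, user_id, salesman_id;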

However, DISTINCT can be used inside an aggregate function:

select count(distinct id) from test.trip_tmp;
OK
6
Time taken: 4.775 seconds, Fetched: 1 row(s)

Finally, use GROUP BY instead of DISTINCT whenever you can: GROUP BY generally performs better than DISTINCT, and the difference is especially noticeable on large data volumes.
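For example, the count(distinct id) query above can be hand-rewritten as a subquery that first groups by id and then counts the groups; this is only a sketch of the common rewrite, and the actual benefit depends on the Hive version and the data:

-- deduplicate id in a subquery, then count the remaining rows
select count(*) from (
  select id from test.trip_tmp group by id
) t;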
