今天发现一个问题,将数据文件导入hive,然后校验数据,具体如下:
step1:
select count(1),count(distinct sys_code,order_no,hotel_seq) from dw_source.oc_hotel_order where dt='20150530';
结果: 57736688 57656110
说明:sys_code,order_no,hotel_seq 存在重复的,分组不唯一!
推论 select sys_code,order_no,hotel_seq,count(1) from dw_source.oc_hotel_order where dt='20150530' group by sys_code,order_no,hotel_seq having count(1) > 1 应该有个结果集!
step2:
select count(distinct concat(sys_code,order_no,hotel_seq)) from dw_source.oc_hotel_order where dt='20150530';
结果: 57656110
说明: 二次验证 step1
step3
select count(1) from
(
select sys_code,order_no,hotel_seq,count(1) from dw_source.oc_hotel_order where dt='20150530' group by sys_code,order_no,hotel_seq having count(1) > 1
) t;
结果: 0
说明:这说明step1的推论不成立
如果推论不成立,就说明 sys_code,order_no,hotel_seq 不应该存在重复的,和step1的结论又矛盾。
这是为什么???
猜了N久是什么原因,也试了很多方法测。
最后发现是空值 NULL 造成的问题,然后试了一把
create table temp.test_distinct
as
select sys_code,order_no,hotel_seq
from dw_source.oc_hotel_order
where dt='20150530'
and hotel_seq is null
limit 10;
hive> select * from temp.test_distinct;
OK
1001 1337418073974 NULL
1001 1322452118 NULL
1001 1321665107 NULL
1001 1320940345 NULL
1001 1320654060 NULL
1001 1320091070 NULL
1001 100152004170 NULL
1001 100148616269 NULL
1001 100141108024 NULL
1001 100136889859 NULL
select count(1),count(distinct sys_code,order_no,hotel_seq) from temp.test_distinct;
Total MapReduce CPU Time Spent: 2 seconds 620 msec
OK
10 0
尼玛,果然奇迹粗线了!!! count(distinct sys_code,order_no,hotel_seq) 为 0 有木有!
最后佐证一下:
select count(1) from dw_source.oc_hotel_order where dt='20150530' and hotel_seq is null;
Total MapReduce CPU Time Spent: 4 minutes 14 seconds 350 msec
OK
57656110 + 80578 = 57736688 完全吻合
step1:
select count(1),count(distinct sys_code,order_no,hotel_seq) from dw_source.oc_hotel_order where dt='20150530';
结果: 57736688 57656110
说明:sys_code,order_no,hotel_seq 存在重复的,分组不唯一!
推论 select sys_code,order_no,hotel_seq,count(1) from dw_source.oc_hotel_order where dt='20150530' group by sys_code,order_no,hotel_seq having count(1) > 1 应该有个结果集!
step2:
select count(distinct concat(sys_code,order_no,hotel_seq)) from dw_source.oc_hotel_order where dt='20150530';
结果: 57656110
说明: 二次验证 step1
step3
select count(1) from
(
select sys_code,order_no,hotel_seq,count(1) from dw_source.oc_hotel_order where dt='20150530' group by sys_code,order_no,hotel_seq having count(1) > 1
) t;
结果: 0
说明:这说明step1的推论不成立
如果推论不成立,就说明 sys_code,order_no,hotel_seq 不应该存在重复的,和step1的结论又矛盾。
这是为什么???
猜了N久是什么原因,也试了很多方法测。
最后发现是空值 NULL 造成的问题,然后试了一把
create table temp.test_distinct
as
select sys_code,order_no,hotel_seq
from dw_source.oc_hotel_order
where dt='20150530'
and hotel_seq is null
limit 10;
hive> select * from temp.test_distinct;
OK
1001 1337418073974 NULL
1001 1322452118 NULL
1001 1321665107 NULL
1001 1320940345 NULL
1001 1320654060 NULL
1001 1320091070 NULL
1001 100152004170 NULL
1001 100148616269 NULL
1001 100141108024 NULL
1001 100136889859 NULL
select count(1),count(distinct sys_code,order_no,hotel_seq) from temp.test_distinct;
Total MapReduce CPU Time Spent: 2 seconds 620 msec
OK
10 0
尼玛,果然奇迹粗线了!!! count(distinct sys_code,order_no,hotel_seq) 为 0 有木有!
最后佐证一下:
select count(1) from dw_source.oc_hotel_order where dt='20150530' and hotel_seq is null;
Total MapReduce CPU Time Spent: 4 minutes 14 seconds 350 msec
OK
57656110 + 80578 = 57736688 完全吻合
来自 “ ITPUB博客 ” ,链接:http://blog.itpub.net/28929558/viewspace-1678913/,如需转载,请注明出处,否则将追究法律责任。
转载于:http://blog.itpub.net/28929558/viewspace-1678913/