在上一篇文章中,实现了查询重复数据与删除多余重复数据的sql编写:http://blog.csdn.net/u011099093/article/details/78596034;
但是经过导入百万数据测试,查询语句直接卡住没有结果,于是又花了一天时间对sql进行优化,在借鉴了如下地址的方法后,编写出优化后的查重及去重sql:
https://yq.aliyun.com/articles/68224#15
优化查询重复数据:SELECT A,B,C,D,E FROM( SELECT A,B,C,D,E,COUNT(*) OVER(PARTITION BY A,B,C )AS tmp FROM TABLE WHERE CONDITION) t WHERE t.tmp>=2;
优化处理重复数据:
DELETE FROM TABLE WHERE ID NOT IN(SELECT ID FROM
(SELECT MIN(ID) ID,A,B,C FROM TABLE WHERE CONDITION GROUP BY A,B,C HAVING COUNT(*)>1) C)
AND ID IN( SELECT ID FROM
(SELECT ID,COUNT(*) OVER(PARTITION BY A,B,C) AS tmp FROM TABLE WHERE CONDITION)t WHERE t.tmp>=2);
对于over partition的用法我真的不是很了解,而上述地址中对重复 数据清洗讲的比较详细,有兴趣的同学可移步去看看。
<