有时候,因为开发时,考虑不当,会造成数据重复。
背景:
前几天做项目的时候,有一个给ip打标签的操作,但是当数据数量多了以后,发现,同一个ip被打了两次相同的标签,导致数据显示重复,然而,已经使用了一段时间了,只能在原有数据库上删除重复数据,然后修改添加标签的逻辑了,去重操作如下:
mysql> select * from vip_port_tag;
+-------+----------------+-------+-------------------+
| id | vip | port | tag |
+-------+----------------+-------+-------------------+
| 32446 | 10.172.95.38 | 19527 | ttttttttttttttttt |
| 32447 | 111.206.52.82 | 80 | ttttttttttttttttt |
| 32448 | 221.181.72.212 | 443 | ttttttttttttttttt |
| 32449 | 182.118.31.28 | 2229 | ttttttttttttttttt |
| 32450 | 10.172.95.38 | 19527 | ttttttttttttttttt |
| 32451 | 111.206.52.82 | 80 | ttttttttttttttttt |
+-------+----------------+-------+-------------------+
6 rows in set
mysql>
可以很明显的看到,vip为10.172.95.38,port为19527还有vip为111.206.52.82,port为90的被连续两次打了相同的标签。
去重思想,根据同一tag,还有vip,port,分组,然后统计每组的个数,大于1的,即为有重复的。
mysql> select id,count(*) as cnt,tag,vip,port from vip_port_tag group by tag,vip,port;
+-------+-----+-------------------+----------------+-------+
| id | cnt | tag | vip | port |
+-------+-----+-------------------+----------------+-------+
| 32446 | 2 | ttttttttttttttttt | 10.172.95.38 | 19527 |
| 32447 | 2 | ttttttttttttttttt | 111.206.52.82 | 80 |
| 32449 | 1 | ttttttttttttttttt | 182.118.31.28 | 2229 |
| 32448 | 1 | ttttttttttttttttt | 221.181.72.212 | 443 |
+-------+-----+-------------------+----------------+-------+
4 rows in set
cnt那一列,大于1的就为重复列
mysql> select id,count(*) as cnt,tag,vip,port from vip_port_tag where id>32445 group by tag,vip,port having cnt>1;
+-------+-----+-------------------+---------------+-------+
| id | cnt | tag | vip | port |
+-------+-----+-------------------+---------------+-------+
| 32446 | 2 | ttttttttttttttttt | 10.172.95.38 | 19527 |
| 32447 | 2 | ttttttttttttttttt | 111.206.52.82 | 80 |
+-------+-----+-------------------+---------------+-------+
2 rows in set
检验
mysql> select * from vip_port_tag where vip="10.172.95.38" and port="19527";
+-------+--------------+-------+-------------------+
| id | vip | port | tag |
+-------+--------------+-------+-------------------+
| 32446 | 10.172.95.38 | 19527 | ttttttttttttttttt |
| 32450 | 10.172.95.38 | 19527 | ttttttttttttttttt |
+-------+--------------+-------+-------------------+
3 rows in set
然后就可以考虑删除哪一项,比如保留id较小的那个,则选出id最大的那一项
mysql> select max(id) as id,count(*) as cnt,tag,vip,port from vip_port_tag where id>32445 group by tag,vip,port having cnt>1;
+-------+-----+-------------------+---------------+-------+
| id | cnt | tag | vip | port |
+-------+-----+-------------------+---------------+-------+
| 32450 | 2 | ttttttttttttttttt | 10.172.95.38 | 19527 |
| 32451 | 2 | ttttttttttttttttt | 111.206.52.82 | 80 |
+-------+-----+-------------------+---------------+-------+
2 rows in set
然后根据id执行删除操作即可
mysql> delete from vip_port_tag where id in (select id from (select max(id) as id,count(*) as cnt,tag,vip,port from vip_port_tag group by tag,vip,port having cnt>1) as t);
Query OK, 2 rows affected
再查看原数据
mysql> select * from vip_port_tag;
+-------+----------------+-------+-------------------+
| id | vip | port | tag |
+-------+----------------+-------+-------------------+
| 32446 | 10.172.95.38 | 19527 | ttttttttttttttttt |
| 32447 | 111.206.52.82 | 80 | ttttttttttttttttt |
| 32448 | 221.181.72.212 | 443 | ttttttttttttttttt |
| 32449 | 182.118.31.28 | 2229 | ttttttttttttttttt |
+-------+----------------+-------+-------------------+
4 rows in set
OK,重复数据,就这么被干掉了。
知识点总结
- group by语句,可以为多个字段
- having 选出符合条件的分组
- count 统计分组个数
- max 选出分组中id最大的那个
- 临时表