Problem: the `card_number` column of the `card` table is missing a unique constraint, which allowed a large amount of duplicate data to accumulate; the duplicates need to be removed.
1 Preparing test data
Create the table:

CREATE TABLE `card` (
  `card_id` bigint(20) NOT NULL AUTO_INCREMENT COMMENT 'ID',
  `card_number` varchar(100) DEFAULT NULL COMMENT 'card number',
  `other_field` varchar(100) DEFAULT NULL COMMENT 'other field',
  PRIMARY KEY (`card_id`)
) ENGINE=InnoDB AUTO_INCREMENT=142665 DEFAULT CHARSET=utf8;
Create a stored procedure to generate 100,000 rows of test data (note: `SET AUTOCOMMIT` is issued before switching the delimiter, and the procedure ends with `END$$`, not `END;$$`, which would be a syntax error):

DROP PROCEDURE IF EXISTS proc1;
SET AUTOCOMMIT = 0;
DELIMITER $$
CREATE PROCEDURE proc1()
BEGIN
    DECLARE v_cnt DECIMAL(10) DEFAULT 0;
    dd:LOOP
        INSERT INTO card (card_number, other_field) VALUES (UUID(), UUID());
        COMMIT;
        SET v_cnt = v_cnt + 1;
        IF v_cnt = 100000 THEN
            LEAVE dd;
        END IF;
    END LOOP dd;
END$$
DELIMITER ;
Call the stored procedure:

call proc1;

After the procedure finishes, check the row count:

select count(card_id) from card;
Build 10,000 rows with duplicate card numbers (roughly 10% of the table):

insert into card(card_number) select card_number from card limit 5000;
commit;
insert into card(card_number) select card_number from card limit 5000;
commit;
2 Detecting duplicate data
Query the duplicated card numbers and how many times each occurs:

select card_number, count(card_number) as count from card group by card_number having count(card_number) > 1;
3 Data deduplication
3.1 Option 1: offline execution
Offline execution: while the deduplication runs, no other program performs inserts, updates, or deletes on the `card` table.
Approach: import all rows into an empty temporary table with the same structure, after first creating a unique index on the duplicated column in the temporary table; during the import, ignore duplicate-key errors so duplicates are silently dropped. Finally, rename the tables.
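The offline flow can be sketched end to end with Python's built-in sqlite3 module as a stand-in for MySQL (SQLite writes `INSERT OR IGNORE` where MySQL writes `INSERT IGNORE`; the table and sample values here are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Original table containing duplicate card numbers.
cur.execute("CREATE TABLE card (card_id INTEGER PRIMARY KEY AUTOINCREMENT,"
            " card_number TEXT, other_field TEXT)")
cur.executemany("INSERT INTO card (card_number, other_field) VALUES (?, ?)",
                [("A", "x"), ("B", "y"), ("A", "z"), ("C", "w"), ("B", "v")])

# Empty copy of the table, with a unique index on the duplicated column.
cur.execute("CREATE TABLE card_temp (card_id INTEGER PRIMARY KEY,"
            " card_number TEXT, other_field TEXT)")
cur.execute("CREATE UNIQUE INDEX idx_card_number ON card_temp (card_number)")

# Import while ignoring unique-key violations: only the first row seen for
# each card_number survives.
cur.execute("INSERT OR IGNORE INTO card_temp"
            " SELECT card_id, card_number, other_field FROM card")

# Swap the tables, then drop the old one.
cur.execute("ALTER TABLE card RENAME TO card_old")
cur.execute("ALTER TABLE card_temp RENAME TO card")
cur.execute("DROP TABLE card_old")

print(cur.execute("SELECT COUNT(*) FROM card").fetchone()[0])  # 3
```

With the five sample rows above, the three distinct card numbers survive (ids 1, 2, 4), matching the behavior of the MySQL steps that follow.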
Create the temporary table `card_temp`:

CREATE TABLE `card_temp` (
  `card_id` bigint(20) NOT NULL AUTO_INCREMENT COMMENT 'ID',
  `card_number` varchar(100) DEFAULT NULL COMMENT 'card number',
  `other_field` varchar(100) DEFAULT NULL COMMENT 'other field',
  PRIMARY KEY (`card_id`)
) ENGINE=InnoDB AUTO_INCREMENT=142665 DEFAULT CHARSET=utf8;

Add a unique index on the card number in the temporary table:

alter table card_temp add unique index(card_number);
Import the data from `card` into `card_temp`, ignoring duplicates:

insert ignore into card_temp(card_id, card_number, other_field) select card_id, card_number, other_field from card order by null;
commit;

TIPS: in actual testing, 500,000 rows took about 40 seconds. If the table holds millions of rows or more, run the import in batches.
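One way to batch the import is to walk the table in `card_id` ranges, committing after each range so every transaction stays small. A minimal sketch, again using sqlite3 in place of MySQL (the batch size of 2 is artificially small for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE card (card_id INTEGER PRIMARY KEY,"
            " card_number TEXT, other_field TEXT)")
cur.executemany("INSERT INTO card VALUES (?, ?, ?)",
                [(i, n, "") for i, n in
                 enumerate(["A", "B", "A", "C", "B", "C"], start=1)])
cur.execute("CREATE TABLE card_temp (card_id INTEGER PRIMARY KEY,"
            " card_number TEXT UNIQUE, other_field TEXT)")

BATCH = 2
max_id = cur.execute("SELECT MAX(card_id) FROM card").fetchone()[0]
for lo in range(1, max_id + 1, BATCH):
    # Each batch covers the half-open id range [lo, lo + BATCH);
    # committing per batch keeps the transaction (and, in MySQL,
    # its undo log) small.
    cur.execute(
        "INSERT OR IGNORE INTO card_temp"
        " SELECT card_id, card_number, other_field FROM card"
        " WHERE card_id >= ? AND card_id < ?", (lo, lo + BATCH))
    conn.commit()

print(cur.execute("SELECT COUNT(*) FROM card_temp").fetchone()[0])  # 3
```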
Check the row count in the temporary table:

select count(card_id) from card_temp;

Rename the tables:

alter table card rename to card_old;
alter table card_temp rename to card;

After verifying the result, drop the old table:

drop table card_old;
3.2 Option 2: online execution
Online execution: while the deduplication runs, other programs may continue to insert, update, and delete rows in the `card` table.
3.2.1 Via an intermediate table
Approach: import the duplicate rows that should be kept into an empty temporary table with the same structure, after first creating a unique index on the duplicated column; once the data is verified, delete the redundant duplicates from the original table by comparing it against the temporary table. Finally, adjust the original table's indexes.
Create the temporary table `card_temp`:

CREATE TABLE `card_temp` (
  `card_id` bigint(20) NOT NULL AUTO_INCREMENT COMMENT 'ID',
  `card_number` varchar(100) DEFAULT NULL COMMENT 'card number',
  `other_field` varchar(100) DEFAULT NULL COMMENT 'other field',
  PRIMARY KEY (`card_id`)
) ENGINE=InnoDB AUTO_INCREMENT=142665 DEFAULT CHARSET=utf8;

Add a unique index on the card number in the temporary table:

alter table card_temp add unique index(card_number);
Add an ordinary index on `card_number` in the `card` table to speed up the deduplication steps below:

alter table card add index (card_number);
Check the cost of the statement with EXPLAIN:

explain insert into card_temp
select b.card_id, a.card_number, a.other_field
from card a inner join (select max(card_id) card_id from card group by card_number having count(card_id)>1 order by null) b
on a.card_id = b.card_id;
Import the duplicate rows that should be kept (the one with the largest `card_id` per card number) from `card` into `card_temp`:

insert into card_temp
select b.card_id, a.card_number, a.other_field
from card a inner join (select max(card_id) card_id from card group by card_number having count(card_id)>1 order by null) b
on a.card_id = b.card_id;
commit;
Check the cost of the delete statement:

explain delete a.* from card a, card_temp b where b.card_number = a.card_number and a.card_id < b.card_id;
Delete the duplicate rows from `card`:

delete a.* from card a, card_temp b where b.card_number = a.card_number and a.card_id < b.card_id;
commit;

TIPS: in actual testing, 500,000 rows took about 20 seconds. If the table holds millions of rows or more, run the delete in batches.
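The delete can be batched by repeatedly removing a limited number of duplicate rows and committing until nothing is left. A sketch of that loop with sqlite3 standing in for MySQL (in MySQL the inner SELECT would be the same join against `card_temp` as above; `BATCH = 2` is artificially small for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE card (card_id INTEGER PRIMARY KEY, card_number TEXT)")
cur.executemany("INSERT INTO card VALUES (?, ?)",
                [(i, n) for i, n in
                 enumerate(["A", "B", "A", "C", "B", "A"], start=1)])
# card_temp holds, per duplicated card_number, the row (max card_id) to keep.
cur.execute("CREATE TABLE card_temp AS"
            " SELECT card_number, MAX(card_id) AS card_id"
            " FROM card GROUP BY card_number HAVING COUNT(*) > 1")

BATCH = 2
while True:
    # Delete at most BATCH redundant duplicates per round, then commit.
    cur.execute(
        "DELETE FROM card WHERE card_id IN ("
        " SELECT a.card_id FROM card a JOIN card_temp b"
        " ON a.card_number = b.card_number"
        " WHERE a.card_id < b.card_id LIMIT ?)", (BATCH,))
    conn.commit()
    if cur.rowcount == 0:  # no duplicates were deleted this round: done
        break

print(cur.execute("SELECT COUNT(*) FROM card").fetchone()[0])  # 3
```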
Drop the ordinary index on `card_number` in the `card` table and replace it with a unique index so duplicates cannot recur:

alter table card drop index card_number;
alter table card add unique index(card_number);

After verifying the result, drop the temporary table:

drop table card_temp;
3.2.2 Direct delete
Compute and delete the duplicates in a single statement, keeping the row with the largest `card_id` per card number:

DELETE a
FROM card a
JOIN (
    SELECT card_number, MAX(card_id) AS card_id
    FROM card
    GROUP BY card_number
    HAVING COUNT(card_number) > 1
) b ON a.card_number = b.card_number
WHERE a.card_id < b.card_id;
commit;
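The same keep-the-max-`card_id` logic can be verified on sample data with sqlite3 (SQLite's DELETE cannot join another table directly, so the join is folded into an IN-subquery; the effect is identical):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE card (card_id INTEGER PRIMARY KEY, card_number TEXT)")
cur.executemany("INSERT INTO card VALUES (?, ?)",
                [(i, n) for i, n in
                 enumerate(["A", "B", "A", "C", "B"], start=1)])

# Delete every row whose card_number is duplicated and whose card_id is
# smaller than the maximum card_id for that card_number.
cur.execute("""
    DELETE FROM card
    WHERE card_id IN (
        SELECT a.card_id
        FROM card a
        JOIN (SELECT card_number, MAX(card_id) AS card_id
              FROM card
              GROUP BY card_number
              HAVING COUNT(card_number) > 1) b
          ON a.card_number = b.card_number
        WHERE a.card_id < b.card_id)""")
conn.commit()

print(sorted(r[0] for r in cur.execute("SELECT card_id FROM card")))  # [3, 4, 5]
```

Only the latest occurrence of each duplicated card number remains, so a subsequent unique index on `card_number` can be added without errors.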