mysql 列去重复数据库_MySQL数据库行去重复和列去重复

数据库表中数据行去重复

0.起因

我在创建唯一索引时,提示有重复数据,创建失败。于是,我得先去重,然后再创建唯一索引。

1.建表

CREATE TABLE `demo_table` (

`id` int(10) unsigned NOT NULL,

`name` char(255) NOT NULL,

`email` char(255) NOT NULL,

PRIMARY KEY (`id`)

) ENGINE=InnoDB DEFAULT CHARSET=utf8;

2.插入数据,并尝试建立唯一索引

INSERT INTO `test`.`demo_table` (`id`, `name`, `email`) VALUES ('1', 'u1', 'u1@email.com');

INSERT INTO `test`.`demo_table` (`id`, `name`, `email`) VALUES ('2', 'u2', 'u2@email.com');

INSERT INTO `test`.`demo_table` (`id`, `name`, `email`) VALUES ('3', 'u3', 'u3@email.com');

INSERT INTO `test`.`demo_table` (`id`, `name`, `email`) VALUES ('4', 'u4', 'u4@email.com');

INSERT INTO `test`.`demo_table` (`id`, `name`, `email`) VALUES ('5', 'u5', 'u5@email.com');

INSERT INTO `test`.`demo_table` (`id`, `name`, `email`) VALUES ('6', 'u6', 'u6@email.com');

INSERT INTO `test`.`demo_table` (`id`, `name`, `email`) VALUES ('7', 'u7', 'u7@email.com');

INSERT INTO `test`.`demo_table` (`id`, `name`, `email`) VALUES ('8', 'u8', 'u8@email.com');

INSERT INTO `test`.`demo_table` (`id`, `name`, `email`) VALUES ('9', 'u9', 'u9@email.com');

INSERT INTO `test`.`demo_table` (`id`, `name`, `email`) VALUES ('10', 'u10', 'u10@email.com');

INSERT INTO `test`.`demo_table` (`id`, `name`, `email`) VALUES ('11', 'u11', 'u11@email.com');

INSERT INTO `test`.`demo_table` (`id`, `name`, `email`) VALUES ('12', 'u12', 'u12@email.com');

INSERT INTO `test`.`demo_table` (`id`, `name`, `email`) VALUES ('13', 'u13', 'u13@email.com');

INSERT INTO `test`.`demo_table` (`id`, `name`, `email`) VALUES ('14', 'u14', 'u14@email.com');

INSERT INTO `test`.`demo_table` (`id`, `name`, `email`) VALUES ('15', 'u15', 'u15@email.com');

INSERT INTO `test`.`demo_table` (`id`, `name`, `email`) VALUES ('16', 'u16', 'u16@email.com');

INSERT INTO `test`.`demo_table` (`id`, `name`, `email`) VALUES ('17', 'u17', 'u17@email.com');

INSERT INTO `test`.`demo_table` (`id`, `name`, `email`) VALUES ('18', 'u18', 'u18@email.com');

INSERT INTO `test`.`demo_table` (`id`, `name`, `email`) VALUES ('19', 'u19', 'u19@email.com');

INSERT INTO `test`.`demo_table` (`id`, `name`, `email`) VALUES ('20', 'u20', 'u20@email.com');

INSERT INTO `test`.`demo_table` (`id`, `name`, `email`) VALUES ('21', 'u21', 'u1@email.com');

INSERT INTO `test`.`demo_table` (`id`, `name`, `email`) VALUES ('22', 'u22', 'u2@email.com');

INSERT INTO `test`.`demo_table` (`id`, `name`, `email`) VALUES ('23', 'u23', 'u3@email.com');

INSERT INTO `test`.`demo_table` (`id`, `name`, `email`) VALUES ('24', 'u24', 'u4@email.com');

INSERT INTO `test`.`demo_table` (`id`, `name`, `email`) VALUES ('25', 'u25', 'u5@email.com');

INSERT INTO `test`.`demo_table` (`id`, `name`, `email`) VALUES ('26', 'u26', 'u1@email.com');

INSERT INTO `test`.`demo_table` (`id`, `name`, `email`) VALUES ('27', 'u27', 'u2@email.com');

INSERT INTO `test`.`demo_table` (`id`, `name`, `email`) VALUES ('28', 'u28', 'u3@email.com');

INSERT INTO `test`.`demo_table` (`id`, `name`, `email`) VALUES ('29', 'u29', 'u1@email.com');

INSERT INTO `test`.`demo_table` (`id`, `name`, `email`) VALUES ('30', 'u30', 'u2@email.com');

创建唯一索引

CREATE UNIQUE INDEX idx_email ON demo_table(email);

执行结果:

ERROR 1062 (23000): Duplicate entry 'u2@email.com' for key 'idx_email'

提示有重复数据,导致创建唯一索引失败。

3.看看我们插入的数据

select * from demo_table;

执行结果:

+----+------+---------------+

| id | name | email |

+----+------+---------------+

| 1 | u1 | u1@email.com |

| 2 | u2 | u2@email.com |

| 3 | u3 | u3@email.com |

| 4 | u4 | u4@email.com |

| 5 | u5 | u5@email.com |

| 6 | u6 | u6@email.com |

| 7 | u7 | u7@email.com |

| 8 | u8 | u8@email.com |

| 9 | u9 | u9@email.com |

| 10 | u10 | u10@email.com |

| 11 | u11 | u11@email.com |

| 12 | u12 | u12@email.com |

| 13 | u13 | u13@email.com |

| 14 | u14 | u14@email.com |

| 15 | u15 | u15@email.com |

| 16 | u16 | u16@email.com |

| 17 | u17 | u17@email.com |

| 18 | u18 | u18@email.com |

| 19 | u19 | u19@email.com |

| 20 | u20 | u20@email.com |

| 21 | u21 | u1@email.com |

| 22 | u22 | u2@email.com |

| 23 | u23 | u3@email.com |

| 24 | u24 | u4@email.com |

| 25 | u25 | u5@email.com |

| 26 | u26 | u1@email.com |

| 27 | u27 | u2@email.com |

| 28 | u28 | u3@email.com |

| 29 | u29 | u1@email.com |

| 30 | u30 | u2@email.com |

+----+------+---------------+

30 rows in set (0.00 sec)

可以看到,我们插入的email字段的确有重复数据。当然,我们是故意的。

4.使用group by看看email字段完全不重复的数据。为了让结果看起来比较顺眼,我们使用order by对结果排一下序,这个不是必选项啦。

select * from demo_table group by email order by id;

执行结果:

+----+------+---------------+

| id | name | email |

+----+------+---------------+

| 1 | u1 | u1@email.com |

| 2 | u2 | u2@email.com |

| 3 | u3 | u3@email.com |

| 4 | u4 | u4@email.com |

| 5 | u5 | u5@email.com |

| 6 | u6 | u6@email.com |

| 7 | u7 | u7@email.com |

| 8 | u8 | u8@email.com |

| 9 | u9 | u9@email.com |

| 10 | u10 | u10@email.com |

| 11 | u11 | u11@email.com |

| 12 | u12 | u12@email.com |

| 13 | u13 | u13@email.com |

| 14 | u14 | u14@email.com |

| 15 | u15 | u15@email.com |

| 16 | u16 | u16@email.com |

| 17 | u17 | u17@email.com |

| 18 | u18 | u18@email.com |

| 19 | u19 | u19@email.com |

| 20 | u20 | u20@email.com |

+----+------+---------------+

20 rows in set (0.00 sec)

可以发现,表中原本有30条记录,group by得出的结果是只有20条记录,这就说明有10条重复记录。

不相信?好吧,我们再使用having看看吧。group by后,分组记录条数是多少就表示该字段存在多少条。大于1就表示出现了重复。

select *,count(id) from demo_table group by email having count(id)>1 order by id;

执行结果:

+----+------+--------------+-----------+

| id | name | email | count(id) |

+----+------+--------------+-----------+

| 1 | u1 | u1@email.com | 4 |

| 2 | u2 | u2@email.com | 4 |

| 3 | u3 | u3@email.com | 3 |

| 4 | u4 | u4@email.com | 2 |

| 5 | u5 | u5@email.com | 2 |

+----+------+--------------+-----------+

5 rows in set (0.00 sec)

这下子,我们不仅仅知道重复记录的条数是10条,而且还知道是哪些email出现了重复,并且出现了多次数。

还不相信?好吧,那你自己数吧

5.删除重复数据

我们只保留id最小的数据,其他重复的都给删掉。

delete from demo_table where id not in (select min(id) from demo_table group by email);

执行结果:

ERROR 1093 (HY000): You can't specify target table 'demo_table' for update in FROM clause

貌似是MySQL有保护机制,不可以这样操作。

变通方式:

delete from demo_table where id not in ( select min_id from ( select min(id) as min_id from demo_table group by email ) as tmp );

题外话:

如果不是纯使用SQL方式处理,而是通过编程处理,可以把select min(id) from demo_table group by email查询出来的id结果集保存在变量中,然后再执行delete from demo_table where id not id结果集。有一点需要注意的是,可能id结果集很大很大,保存在变量中会吃内存。

6.把需要保留的数据id存在临时表中,然后删除重复数据。

create table tmp_table as select min(id) from demo_table group by email;

执行结果:

Query OK, 20 rows affected (0.43 sec)

Records: 20 Duplicates: 0 Warnings: 0

看看我们创建的临时表:

select * from tmp_table;

执行结果:

+---------+

| min(id) |

+---------+

| 10 |

| 11 |

| 12 |

| 13 |

| 14 |

| 15 |

| 16 |

| 17 |

| 18 |

| 19 |

| 1 |

| 20 |

| 2 |

| 3 |

| 4 |

| 5 |

| 6 |

| 7 |

| 8 |

| 9 |

+---------+

20 rows in set (0.00 sec)

这就是我们要保留的记录的id列表。

删除重复数据:

delete from demo_table where id not in (select * from tmp_table);

执行结果:

Query OK, 10 rows affected (0.05 sec)

让我们再看一眼demo_table表吧:

select * from demo_table;

执行结果:

+----+------+---------------+

| id | name | email |

+----+------+---------------+

| 1 | u1 | u1@email.com |

| 2 | u2 | u2@email.com |

| 3 | u3 | u3@email.com |

| 4 | u4 | u4@email.com |

| 5 | u5 | u5@email.com |

| 6 | u6 | u6@email.com |

| 7 | u7 | u7@email.com |

| 8 | u8 | u8@email.com |

| 9 | u9 | u9@email.com |

| 10 | u10 | u10@email.com |

| 11 | u11 | u11@email.com |

| 12 | u12 | u12@email.com |

| 13 | u13 | u13@email.com |

| 14 | u14 | u14@email.com |

| 15 | u15 | u15@email.com |

| 16 | u16 | u16@email.com |

| 17 | u17 | u17@email.com |

| 18 | u18 | u18@email.com |

| 19 | u19 | u19@email.com |

| 20 | u20 | u20@email.com |

+----+------+---------------+

20 rows in set (0.00 sec)

好的,重复的数据已经被我们成功干掉了。大功告成。等等,我们还得把那个已经没有用处的临时表给删除。貌似有点卸磨杀驴的感觉啊。

drop table tmp_table;

执行结果:

Query OK, 0 rows affected (0.12 sec)

7.总结一下吧

其实就三条SQL语句:

create table tmp_table as select min(id) from demo_table group by email;

delete from demo_table where id not in (select * from tmp_table);

drop table tmp_table;

可能有童鞋要问,如果要去重的不是一个字段,而是多个怎么办?比如,某2条记录的email相同并且name也相同,就算为重复。

处理方式:

group by email,name

再说说数据列去重复吧

1.故意插入一些列重复的数据

INSERT INTO `test`.`demo_table` (`id`, `name`, `email`) VALUES ('31', '31@email.com', '31@email.com');

INSERT INTO `test`.`demo_table` (`id`, `name`, `email`) VALUES ('32', '32@email.com', '32@email.com');

INSERT INTO `test`.`demo_table` (`id`, `name`, `email`) VALUES ('33', '33@email.com', '33@email.com');

INSERT INTO `test`.`demo_table` (`id`, `name`, `email`) VALUES ('34', '34@email.com', '34@email.com');

INSERT INTO `test`.`demo_table` (`id`, `name`, `email`) VALUES ('35', '35@email.com', '35@email.com');

查找表中name字段和email字段相同的数据:

select * from demo_table as a, demo_table as b where a.id=b.id and a.name=b.email;

执行结果:

+----+--------------+--------------+----+--------------+--------------+

| id | name | email | id | name | email |

+----+--------------+--------------+----+--------------+--------------+

| 31 | 31@email.com | 31@email.com | 31 | 31@email.com | 31@email.com |

| 32 | 32@email.com | 32@email.com | 32 | 32@email.com | 32@email.com |

| 33 | 33@email.com | 33@email.com | 33 | 33@email.com | 33@email.com |

| 34 | 34@email.com | 34@email.com | 34 | 34@email.com | 34@email.com |

| 35 | 35@email.com | 35@email.com | 35 | 35@email.com | 35@email.com |

+----+--------------+--------------+----+--------------+--------------+

5 rows in set (0.00 sec)

重复记录的id都已经找出来了,删除就参考上述方式处理了。

  • 0
    点赞
  • 0
    收藏
    觉得还不错? 一键收藏
  • 0
    评论

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值