SQL删除重复记录

最新推荐文章于 2024-09-06 22:15:10 发布

newnazi

最新推荐文章于 2024-09-06 22:15:10 发布

阅读量890

点赞数

分类专栏：代码数据库文章标签： select sql 数据库数据

代码同时被 2 个专栏收录

123 篇文章

订阅专栏

数据库

67 篇文章

订阅专栏

http://blog.csdn.net/sqlserverdiscovery/article/details/6685860

星期五去一家公司面试SQL数据库工程师，有一份笔试题目，其中有一题是：现在有一个表t（姓名，手机号，联系方式），手机号字段存在重复记录，要求用一句SQL删除重复记录。

我想了想写了一句创建唯一索引的语句：

create unique index t_index on t(手机号)

交卷了，后来想想也不对，这样只能保证以后不能添加重复的记录，这样写会导致：当添加了重复记录时，发生错误，导致整个添加操作被回滚。实际上应该写成这样：

create unique index t_index on t(手机号) with ignore_dup_key

这样才能保证当添加重复记录时候，忽略重复记录，如果有2条记录是重复的，那么只会添加其中的一条，而且不会在执行的时候发生错误。

面试人员看了我做的题，说让我把这一题再重新做做，然后再给他看，我就重新想了2种写法。我按照第一种写法写得语句，面试人员说这种办法一看就是错误的。

第二种写法，面试人员看说：基本是对的，就是效率差点。呵呵，他说的很对，效率确实差，

下面把两种办法都写出来，测试环境SQL Server 2005：

首先建表：

create table t (ename varchar(10),mobile char(5),relation varchar(5))

insert into t values('abcd','12345','cc') --和第3个重复
insert into t values('ab', '12358','dd') --和第5个重复
insert into t values('abc', '12345','gg')
insert into t values('abde','24685','hh')
insert into t values('abdf', '12358','ag')
insert into t values('a', '54321','op')

select * into tt from t --以下的操作都在tt上进行

1.先说第二种写法，就是通过子查询，查到count大于1的记录，不过这个语句经过测试，是错误的。

delete tt from tt t1
where
(select count(mobile) from tt t2 where t2.mobile = t1.mobile)>1

结果：(4 行受影响)

一下子删除了4条记录，也就是只剩下第4和第6条记录，为什么会这样呢？主要是子查询没有变化，虽然tt一开始会删除一条重复的记录，但是子查询的表没变化，所以导致凡是tt表中count(mobile)>1的，也就是有重复的记录，整个都删除了。

2.第二种写法，虽然语法有点问题，但是回家后，经过测试，应该说基本的想法是对的，就是查了一下资料，把语法改了一下。下面的语句适合mobile字段重复的情况，如果还有其它字段也重复，稍加修改就行。

delete t from t t1 inner join
(
select *,row_number() over(partition by mobile order by mobile) as row from t
) t2
on t1.ename = t2.ename and t1.mobile = t2.mobile and t1.relation = t2.relation
where t2.row >1

SQL Server 分析和编译时间:
   CPU 时间 = 0 毫秒，占用时间 = 1 毫秒。

SQL Server 执行时间:
   CPU 时间 = 0 毫秒，占用时间 = 1 毫秒。
SQL Server 分析和编译时间:
   CPU 时间 = 47 毫秒，占用时间 = 53 毫秒。
表 'tt'。扫描计数 2，逻辑读取 6 次，物理读取 0 次，预读 0 次，lob 逻辑读取 0 次，lob 物理读取 0 次，lob 预读 0 次。
表 'Worktable'。扫描计数 2，逻辑读取 22 次，物理读取 0 次，预读 0 次，lob 逻辑读取 0 次，lob 物理读取 0 次，lob 预读 0 次。

(2 行受影响)

(1 行受影响)

SQL Server 执行时间:
   CPU 时间 = 0 毫秒，占用时间 = 0 毫秒。
SQL Server 分析和编译时间:
   CPU 时间 = 0 毫秒，占用时间 = 1 毫秒。

SQL Server 执行时间:
   CPU 时间 = 0 毫秒，占用时间 = 1 毫秒。

从效率上看，对数据缓冲的读取次数比较多，特别是Worktble的读取次数特别高，毕竟表中只有6条记录。

呵呵，面试人员后来问我，为什么我做的几道题里都用row_number，我没回答，其实是最近正好看到这个函数，确实改进了sql server 2000中显示行号的效率，挺有用的。

另一种更为简单的删除重复记录的写法：

[sql] view plain copy print ?

drop table rt
create table rt(a varchar(10),b varchar(10))
insert into rt
select 'a1', 'b1' union all
select 'a1', 'b1' union all
select 'a1', 'b2'
;with t
as
(
select *,
row_number() over(partition by a,b order by a) rownum
from rt
)
delete from t
where rownum = 2
select *
from rt
/*
a b
a1 b1
a1 b2
*/
/*
;with t
as
(
select *,
row_number() over(partition by a,b order by a) rownum
from rt
)
--很有意思的时候,这种写法会把所有的3条数据都删除掉,所以这种写法是有问题的
delete from t
from t t2
inner join t t1
on t1.a = t2.a
and t1.b = t2.b
and t1.rownum = t2.rownum
and t1.rownum = 2
*/

drop table rt

create table rt(a varchar(10),b varchar(10))

insert into rt
select 'a1',   'b1' union all
select 'a1',  'b1'  union all
select 'a1', 'b2'


;with t
as
(
select *,
       row_number() over(partition by a,b order by a) rownum
from rt
)


delete from t
where rownum = 2


select *
from rt
/*
a	b
a1	b1
a1	b2
*/


/*

;with t
as
(
select *,
       row_number() over(partition by a,b order by a) rownum
from rt
)

--很有意思的时候,这种写法会把所有的3条数据都删除掉,所以这种写法是有问题的
delete from t 
from t t2
inner join t t1
        on t1.a = t2.a
           and t1.b = t2.b
           and t1.rownum = t2.rownum
           and t1.rownum = 2
*/