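In practice, when the data quality of an information system is poor, tables may contain duplicate records. Methods:
1. Keep one record out of each set of duplicates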
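delete from t where rowid not in (select min(rowid) from t group by dedup_columns);
NOT IN can be replaced with != only if the subquery is rewritten as a correlated subquery that returns a single rowid for each outer row; with the grouped subquery above, != fails as soon as there is more than one group.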
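2. select col1, col2, count(*) from table_name group by col1, col2 having count(*) > 1;
Changing the > in the query above to = returns the data that has no duplicates.
3. To delete duplicate data, it is recommended to work through a temporary table, which improves performance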
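Methods 1 and 2 can be tried without an Oracle instance. The following is a minimal sketch using Python's sqlite3 module as a stand-in, on the assumption that SQLite's rowid behaves closely enough to Oracle's for this purpose; the table t, its columns, and the sample rows are hypothetical.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for Oracle (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (name TEXT, city TEXT)")
conn.executemany(
    "INSERT INTO t (name, city) VALUES (?, ?)",
    [("ann", "x"), ("bob", "y"), ("ann", "x"), ("ann", "x"), ("cid", "z")],
)

# Method 2: list the column values that occur more than once.
dup_groups = conn.execute(
    "SELECT name, city, COUNT(*) FROM t "
    "GROUP BY name, city HAVING COUNT(*) > 1"
).fetchall()

# Method 1: keep the row with the smallest rowid in each group, delete the rest.
deleted = conn.execute(
    "DELETE FROM t WHERE rowid NOT IN "
    "(SELECT MIN(rowid) FROM t GROUP BY name, city)"
).rowcount

remaining = conn.execute("SELECT name, city FROM t ORDER BY rowid").fetchall()
print(dup_groups)  # the duplicated groups with their counts
print(deleted)     # number of extra copies removed
print(remaining)   # one row per (name, city) pair
```

Running the GROUP BY query first is a cheap way to see how many rows the delete is going to touch before committing to it.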
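CREATE TABLE temp_table AS (select col1, col2, count(*) cnt from table_name group by col1, col2 having count(*) > 1);
delete from table_name a where (col1, col2) in (select col1, col2 from temp_table);
Note that this delete removes every copy of the duplicated rows, not only the extra ones.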
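4. To improve performance, create the temporary table with NOLOGGING and without indexes, then gather statistics on it.
1). Use create table ... as select to rebuild the non-duplicated records into table T_TEST_1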
create table T_TEST_1 nologging tablespace &tablespace_name as
select col_id1, col_id2, col_3, col_4, col_5
from (select col_id1,
col_id2,
col_3,
col_4,
col_5,
updatetime,
row_number() over(partition by col_id1, col_id2 order by updatetime desc) rn
from T_TEST)
where rn = 1;
2). Rebuild the indexes on the new table; create as many indexes on the new table as the original table has
create index IND_T_TEST_1 on T_TEST_1(col_id1, col_id2)
nologging tablespace &ind_tablespace_name;
3). Gather statistics on the new table so that SELECT queries use a correct and efficient execution plan
declare
BEGIN
dbms_stats.gather_table_stats(ownname => '&user',
tabname => 'T_TEST_1',
estimate_percent => DBMS_STATS.AUTO_SAMPLE_SIZE,
cascade => true,
method_opt => 'FOR ALL COLUMNS SIZE 1',
granularity => 'all');
END;
/
4). Switch the new table and the new index back to logging mode
alter table T_TEST_1 logging;
alter index IND_T_TEST_1 logging;
5). Back up the old table and switch the new table into production
alter table T_TEST rename to T_TEST_BAK0902;
alter table T_TEST_1 rename to T_TEST;
Not recommended: running DELETE directly against the original table T_TEST
===================================divider=========================================
In Oracle, row_number() combined with a subquery can handle this
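The rebuild-and-swap steps above can be sketched end to end as follows. This is only an illustration run against SQLite through Python's sqlite3 module (it assumes SQLite 3.25 or later for window functions). NOLOGGING, tablespaces and dbms_stats are Oracle-specific and have no counterpart here, so steps 3) and 4) are left out; the sample rows and the backup name T_TEST_BAK are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE T_TEST (col_id1 INT, col_id2 INT, col_3 TEXT, updatetime TEXT)"
)
conn.executemany(
    "INSERT INTO T_TEST VALUES (?, ?, ?, ?)",
    [
        (1, 1, "old", "2023-01-01"),
        (1, 1, "new", "2023-02-01"),
        (2, 1, "only", "2023-01-15"),
    ],
)

# Step 1): build the de-duplicated copy, keeping the latest row per key.
conn.execute(
    """
    CREATE TABLE T_TEST_1 AS
    SELECT col_id1, col_id2, col_3, updatetime
    FROM (SELECT col_id1, col_id2, col_3, updatetime,
                 ROW_NUMBER() OVER (PARTITION BY col_id1, col_id2
                                    ORDER BY updatetime DESC) AS rn
          FROM T_TEST)
    WHERE rn = 1
    """
)

# Step 2): recreate the index on the new table.
conn.execute("CREATE INDEX IND_T_TEST_1 ON T_TEST_1 (col_id1, col_id2)")

# Step 5): keep the old table as a backup and swap the new one in.
conn.execute("ALTER TABLE T_TEST RENAME TO T_TEST_BAK")
conn.execute("ALTER TABLE T_TEST_1 RENAME TO T_TEST")

rows = conn.execute(
    "SELECT col_id1, col_id2, col_3 FROM T_TEST ORDER BY col_id1"
).fetchall()
backup_count = conn.execute("SELECT COUNT(*) FROM T_TEST_BAK").fetchone()[0]
print(rows)          # one row per (col_id1, col_id2), the most recent one
print(backup_count)  # the backup still holds every original row
```

Because the original rows stay in the backup table, the swap can be undone by renaming the tables back.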
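select distinct table1.id, table1.name
from (select a.id, b.name,
-- alias c is not defined in this query as given; point wzbah at the table that owns it
row_number() over (partition by c.wzbah order by b.id desc) rn
from T1 a, T2 b
where a.id = b.id) table1
where rn = 1
ps:
partition by splits the rows into groups by the columns that follow it; rn is the row number within each group
so only the row numbered 1 in each group is returned
the max() function can also be used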
1. Requirement: one column of a table contains repeated values. Remove the repeats while still displaying all columns
SELECT * FROM (select a1,a2,a3,
Row_number() OVER (PARTITION BY a1 ORDER BY a1) rn
from a
) where RN = 1
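What Row_number() OVER (PARTITION BY a1 ORDER BY a1) does: usage of the Oracle analytic functions RANK(), ROW_NUMBER(), LAG(), etc.
ROW_NUMBER() OVER (PARTITION BY COL1 ORDER BY COL2) groups the rows by COL1, sorts each group by COL2, and returns each row's sequence number in that sorted order (consecutive and unique within the group)
RANK() is similar, except that it ranks the way competition places are awarded: ties share a rank, so after two rows tied for 1st the next row is 3rd
LAG returns the value from a preceding row in the group after sorting (it is often used to compute the difference between a row and the one before it); for the first row of a group it returns NULL by default
BTW: EXPERT ONE ON ONE covers this in the most detail, along with many related features; the documentation is rather hard going
row_number() is much like rownum but more powerful (it can number rows starting from 1 within each group)
rank() skips ranks: when two rows tie for 2nd, the next one is 4th (again within each group)
dense_rank() ranks consecutively: when two rows tie for 2nd, the next one is still 3rd.
By contrast, row_number never produces repeated values
lag(arg1,arg2,arg3):
arg1 is the expression to return from the other row
arg2 is the offset from the current row within the partition: a positive number of rows to look back (default 1).
arg3 is the value returned when the offset given by arg2 falls outside the group (default NULL).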
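The differences between these functions are easiest to see side by side. The sketch below runs them in SQLite through Python's sqlite3 module (assuming SQLite 3.25 or later, where the window functions used here behave the same way as in Oracle); the scores table and its rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scores (team TEXT, player TEXT, score INT)")
conn.executemany(
    "INSERT INTO scores VALUES (?, ?, ?)",
    [
        ("a", "p1", 90),
        ("a", "p2", 80),
        ("a", "p3", 80),
        ("a", "p4", 70),
        ("b", "q1", 60),
    ],
)

rows = conn.execute(
    """
    SELECT team, player, score,
           ROW_NUMBER() OVER (PARTITION BY team ORDER BY score DESC, player) AS rn,
           RANK()       OVER (PARTITION BY team ORDER BY score DESC) AS rk,
           DENSE_RANK() OVER (PARTITION BY team ORDER BY score DESC) AS drk,
           -- previous row's score within the team, or -1 when there is none
           LAG(score, 1, -1) OVER (PARTITION BY team
                                   ORDER BY score DESC, player) AS prev
    FROM scores
    ORDER BY team, rn
    """
).fetchall()

# Columns: team, player, score, row_number, rank, dense_rank, lag
for row in rows:
    print(row)
# ('a', 'p1', 90, 1, 1, 1, -1)
# ('a', 'p2', 80, 2, 2, 2, 90)
# ('a', 'p3', 80, 3, 2, 2, 80)
# ('a', 'p4', 70, 4, 4, 3, 80)
# ('b', 'q1', 60, 1, 1, 1, -1)
```

The two players tied at 80 get distinct row numbers, share rank 2, and are followed by rank 4 but dense rank 3. The player column is added to the ORDER BY of row_number and lag only to make the tie order deterministic.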
=====================================divider========================================
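1. Find the redundant duplicate records in a table, where duplicates are determined by a single field (peopleId)
select * from people
where peopleId in (select peopleId from people group by peopleId having count(peopleId) > 1)
2. Delete the redundant duplicate records in a table, where duplicates are determined by a single field (peopleId), keeping only the record with the smallest rowid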
delete from people
where peopleId in (select peopleId from people group by peopleId having count(peopleId) > 1)
and rowid not in (select min(rowid) from people group by peopleId having count(peopleId )>1)
Note: rowid is built into Oracle and does not need to be changed.....
3. Find the redundant duplicate records in a table (multiple fields)
select * from vitae a
where (a.peopleId,a.seq) in (select peopleId,seq from vitae group by peopleId,seq having count(*) > 1)
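4. Delete the redundant duplicate records in a table (multiple fields), keeping only the record with the smallest rowid
delete from vitae a
where (a.peopleId,a.seq) in (select peopleId,seq from vitae group by peopleId,seq having count(*) > 1)
and rowid not in (select min(rowid) from vitae group by peopleId,seq having count(*)>1)
5. Find the redundant duplicate records in a table (multiple fields), excluding the record with the smallest rowid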
select * from vitae a
where (a.peopleId,a.seq) in (select peopleId,seq from vitae group by peopleId,seq having count(*) > 1)
and rowid not in (select min(rowid) from vitae group by peopleId,seq having count(*)>1)
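The multi-field queries above share one filter: the row belongs to a duplicated (peopleId, seq) group and is not the smallest-rowid row of that group. A minimal sketch of that filter, run in SQLite through Python's sqlite3 module (assuming SQLite 3.15 or later for the row-value IN syntax), with hypothetical sample rows and a hypothetical note column:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE vitae (peopleId INT, seq INT, note TEXT)")
conn.executemany(
    "INSERT INTO vitae VALUES (?, ?, ?)",
    [
        (1, 1, "first"),
        (1, 1, "second"),
        (1, 2, "unique"),
        (2, 1, "first"),
        (2, 1, "second"),
        (2, 1, "third"),
    ],
)

# Shared filter: in a duplicated group, but not that group's smallest rowid.
dup_filter = (
    "WHERE (peopleId, seq) IN "
    "(SELECT peopleId, seq FROM vitae "
    " GROUP BY peopleId, seq HAVING COUNT(*) > 1) "
    "AND rowid NOT IN "
    "(SELECT MIN(rowid) FROM vitae "
    " GROUP BY peopleId, seq HAVING COUNT(*) > 1)"
)

# Inspect the extra copies first, then delete exactly those rows.
extras = conn.execute(
    "SELECT peopleId, seq, note FROM vitae " + dup_filter + " ORDER BY rowid"
).fetchall()
conn.execute("DELETE FROM vitae " + dup_filter)
survivors = conn.execute(
    "SELECT peopleId, seq, note FROM vitae ORDER BY rowid"
).fetchall()
print(extras)     # the rows that will be removed
print(survivors)  # one row per (peopleId, seq) pair
```

Reusing the same WHERE clause for the SELECT and the DELETE means the rows previewed are the rows deleted.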
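For example,
table A has a field "name",
and different records may share the same "name" value;
the task is to find the "name" values that are repeated across the records of the table;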
Select Name,Count(*) From A Group By Name Having Count(*) > 1
Select Name,sex,Count(*) From A Group By Name,sex Having Count(*) > 1
(3)
Method 1
Query duplicates