Data deduplication: which method works best when the data volume exceeds 100 million rows?

Latest recommended article published on 2024-05-14 11:15:57

小邱继续努力

Tags: oracle sql mysql big data

Article link: https://blog.csdn.net/qq_41804037/article/details/130439782

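(1) When the data volume is small, delete the duplicate rows directly in the original table using rowid.

Approach: group by the columns that define a duplicate, then keep only the row with the smallest rowid in each group (deduplicate and keep one).

Reference code: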

delete 
  from table_name a
 where (a.id,a.name,a.info) in (
                                select id
                                       ,name
                                       ,info
                                  from table_name
                                 group by id
                                          ,name
                                          ,info
                                having count(1)>1 
                                )
   and rowid not in (
                     select min(rowid)
                       from table_name
                      group by id
                               ,name
                               ,info
                     having count(1)>1 
                     );
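
To check the keep-the-smallest-rowid logic on toy data, here is a minimal sketch that uses Python's built-in sqlite3 module as a stand-in, since SQLite also exposes a rowid. This is an assumption made for illustration, not the Oracle setup above: the sample rows are invented, and the delete is written in a simplified but equivalent form (groups with a single row keep their only rowid, so the `having count(1)>1` filter is not needed).

```python
import sqlite3

# Toy stand-in for table_name; SQLite is used only because it also exposes rowid.
conn = sqlite3.connect(":memory:")
conn.execute("create table table_name (id integer, name text, info text)")
conn.executemany(
    "insert into table_name (id, name, info) values (?, ?, ?)",
    [(1, "a", "x"), (1, "a", "x"), (2, "b", "y"), (1, "a", "x"), (3, "c", "z")],
)

# Keep the smallest rowid in every (id, name, info) group and delete the rest.
conn.execute(
    """
    delete from table_name
     where rowid not in (
                         select min(rowid)
                           from table_name
                          group by id, name, info
                         )
    """
)
conn.commit()

remaining = conn.execute(
    "select id, name, info from table_name order by id"
).fetchall()
print(remaining)
```

One row per (id, name, info) group is left. The sketch only confirms the logic; it says nothing about Oracle performance at scale.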

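(2) Note: approach (1) is acceptable for small data volumes, but it becomes very slow as the data grows (more than 12 hours in my test), so it needs to be optimized. Combining ROWID with DDL (Data Definition Language) produced the result in under 1 hour. The difference: (1) deletes the duplicate rows in place in the original table, while (2) creates a new table holding the filtered data, which then becomes the target table.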

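Approach: 1. Use rowid to create a new staging table populated with the deduplicated data;

2. Use alter to rename the original table so that it becomes the history table (this avoids a table name clash);

3. Use alter to rename the staging table to the original table name so that it becomes the target table.

Reference code: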

--Create the new table that will become the target table (keeps the row with the smallest rowid in each group)
create table new_table_name as
  select *
    from table_name
   where rowid in (
                   select min(rowid)
                     from table_name
                    group by id
                             ,name
                             ,info 
                   );

--Rename the original table (avoids a table name clash)
alter table table_name rename to old_table_name;

--Rename the new table to the original table name so that it becomes the target table
alter table new_table_name rename to table_name;
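
The three steps above can be sketched end to end on toy data. The following is a minimal sketch, again assuming Python's sqlite3 as a stand-in for Oracle; the sample rows and the row-count check before the swap are additions for illustration and are not part of the original procedure.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table table_name (id integer, name text, info text)")
conn.executemany(
    "insert into table_name (id, name, info) values (?, ?, ?)",
    [(1, "a", "x"), (1, "a", "x"), (2, "b", "y"), (2, "b", "y"), (3, "c", "z")],
)

# Step 1: build the deduplicated copy (one row per group, smallest rowid wins).
conn.execute(
    """
    create table new_table_name as
      select *
        from table_name
       where rowid in (
                       select min(rowid)
                         from table_name
                        group by id, name, info
                       )
    """
)

# Sanity check before swapping: the copy must hold exactly one row per distinct group.
distinct_groups = conn.execute(
    "select count(*) from (select distinct id, name, info from table_name)"
).fetchone()[0]
new_rows = conn.execute("select count(*) from new_table_name").fetchone()[0]
if new_rows != distinct_groups:
    raise RuntimeError("deduplicated copy does not match the distinct group count")

# Steps 2 and 3: rename the original to a history table, then promote the copy.
conn.execute("alter table table_name rename to old_table_name")
conn.execute("alter table new_table_name rename to table_name")
conn.commit()

final_rows = conn.execute("select count(*) from table_name").fetchone()[0]
old_rows = conn.execute("select count(*) from old_table_name").fetchone()[0]
print(final_rows, old_rows)
```

Keeping old_table_name around until the new table has been checked makes the swap easy to undo. On Oracle, a table created with create table ... as select does not carry over the original table's indexes, triggers, grants, or most constraints, so those would need to be recreated on the new table.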

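(3) If the database server has enough capacity, parallel execution can make this even faster (this adds parallel statements on top of (2)).

Approach: 1. Enable parallel DML (Data Manipulation Language) for the current session

2. On top of (2), add the hint /*+ parallel(table_name, degree_of_parallelism) */

Reference code: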

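--Enable parallel DML in the current session
--(create table ... as select is itself DDL; this setting applies to parallel insert/update/delete/merge)
alter session enable parallel dml;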

--Create the new table that will become the target table (keeps the row with the smallest rowid in each group)
create table new_table_name as
  select /*+ parallel(table_name, 8) */ *
    from table_name
   where rowid in (
                   select min(rowid)
                     from table_name
                    group by id
                             ,name
                             ,info 
                   );

--Rename the original table (avoids a table name clash)
alter table table_name rename to old_table_name;

--Rename the new table to the original table name so that it becomes the target table
alter table new_table_name rename to table_name;
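
Because the hint is just a comment to the parser, a misspelled hint is silently ignored rather than reported as an error, so it can help to generate the statements instead of typing them by hand. Below is a minimal sketch of such a generator; `build_dedup_statements` is a hypothetical helper invented for illustration, the `new_`/`old_` naming simply mirrors the example above, and the table and column names are interpolated as-is, so they must come from a trusted source.

```python
def build_dedup_statements(table, key_columns, degree=8):
    """Return the CTAS and rename statements for the swap-based dedupe (hypothetical helper)."""
    if degree < 1:
        raise ValueError("degree must be a positive integer")
    group_by = ", ".join(key_columns)
    new_table = f"new_{table}"
    old_table = f"old_{table}"
    # The hint names the table and the requested degree of parallelism.
    ctas = (
        f"create table {new_table} as\n"
        f"  select /*+ parallel({table}, {degree}) */ *\n"
        f"    from {table}\n"
        f"   where rowid in (\n"
        f"                   select min(rowid)\n"
        f"                     from {table}\n"
        f"                    group by {group_by}\n"
        f"                   )"
    )
    return [
        ctas,
        f"alter table {table} rename to {old_table}",
        f"alter table {new_table} rename to {table}",
    ]


statements = build_dedup_statements("table_name", ["id", "name", "info"], degree=8)
for statement in statements:
    print(statement + ";")
```

The printed statements can be reviewed and then run in an Oracle session; the helper itself does not connect to any database.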