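Sometimes, for business reasons or otherwise, a table ends up with duplicate records, and we need to keep just one row of each and delete the rest. How should we write the SQL for that? Let's take a look.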
🎃 First, we create the following table, members.
create table members (
    id int primary key,
    member_id int,
    member_name varchar(60),
    member_age smallint,
    create_at datetime default current_timestamp
);
Then we load the following test data into members:
select * from members;
--
|id |member_id|member_name|member_age|create_at |
|---|---------|-----------|----------|-------------------|
|1 |1001 |a |14 |2024-01-25 05:50:03|
|2 |1001 |a |14 |2024-01-25 05:50:03|
|3 |1002 |b |20 |2024-01-25 05:50:03|
|4 |1003 |c |20 |2024-01-25 05:50:03|
|5 |1004 |h |20 |2024-01-25 05:50:03|
|6 |1004 |h |20 |2024-01-25 05:50:03|
|8 |1004 |h |20 |2024-01-25 05:50:03|
⁉️: We need to find the records whose member_id and member_name are duplicated.
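Before picking out individual rows, it can help to see which (member_id, member_name) pairs repeat and how often. The sketch below is one way to check that, under some assumptions: it uses Python's built-in sqlite3 module with an in-memory SQLite database as a stand-in for the article's database, and the names ROWS, build_members and dup_groups are made up for illustration.

```python
import sqlite3

# Assumption: an in-memory SQLite database stands in for the article's database.
# The rows mirror the test data shown in the article.
ROWS = [
    (1, 1001, "a", 14),
    (2, 1001, "a", 14),
    (3, 1002, "b", 20),
    (4, 1003, "c", 20),
    (5, 1004, "h", 20),
    (6, 1004, "h", 20),
    (8, 1004, "h", 20),
]


def build_members() -> sqlite3.Connection:
    """Create the members table and load the sample rows (hypothetical helper)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        create table members (
            id int primary key,
            member_id int,
            member_name varchar(60),
            member_age smallint,
            create_at datetime default current_timestamp
        )
        """
    )
    conn.executemany(
        "insert into members (id, member_id, member_name, member_age, create_at) "
        "values (?, ?, ?, ?, '2024-01-25 05:50:03')",
        ROWS,
    )
    return conn


conn = build_members()

# One row per duplicated (member_id, member_name) pair, with its row count.
dup_groups = conn.execute(
    """
    select member_id, member_name, count(*) as cnt
    from members
    group by member_id, member_name
    having count(*) > 1
    order by member_id
    """
).fetchall()

print(dup_groups)  # [(1001, 'a', 2), (1004, 'h', 3)]
```

This only reports the duplicated keys; it does not say which of the rows in each group should be kept.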
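🐛 Method 1: window function
Partition the rows by member_id and member_name, number the rows inside each partition with row_number, and then select the rows whose row number is greater than 1. Those rows are the duplicates.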
with tmp_members as (
    select
        *,
        -- create_at is identical within each group, so id is added as a
        -- tiebreaker to make the numbering deterministic (lowest id gets 1)
        row_number() over (
            partition by member_id, member_name
            order by create_at, id
        ) as row_num
    from members
)
select
    t.id,
    t.member_id,
    t.member_name,
    t.member_age,
    t.create_at
from tmp_members t
where t.row_num > 1;
Query result:
|id |member_id|member_name|member_age|create_at |
|---|---------|-----------|----------|-------------------|
|2 |1001 |a |14 |2024-01-25 05:50:03|
|6 |1004 |h |20 |2024-01-25 05:50:03|
|8 |1004 |h |20 |2024-01-25 05:50:03|
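The query above only lists the duplicates, while the stated goal is to delete them and keep one row per group. A minimal sketch of that step follows, again assuming an in-memory SQLite database (version 3.25 or later, which is when window functions arrived) accessed through Python's sqlite3 module. Ordering by id inside each partition is an assumption made here so that the row with the smallest id is the one kept; the variable names are illustrative.

```python
import sqlite3

# Assumption: in-memory SQLite (3.25+ for window functions) as a stand-in database.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    create table members (
        id int primary key,
        member_id int,
        member_name varchar(60),
        member_age smallint,
        create_at datetime default current_timestamp
    )
    """
)
conn.executemany(
    "insert into members (id, member_id, member_name, member_age, create_at) "
    "values (?, ?, ?, ?, '2024-01-25 05:50:03')",
    [
        (1, 1001, "a", 14),
        (2, 1001, "a", 14),
        (3, 1002, "b", 20),
        (4, 1003, "c", 20),
        (5, 1004, "h", 20),
        (6, 1004, "h", 20),
        (8, 1004, "h", 20),
    ],
)

# Delete every row that is not the first one in its (member_id, member_name) group.
# The inner query numbers the rows; id breaks ties between equal create_at values.
cursor = conn.execute(
    """
    delete from members
    where id in (
        select id
        from (
            select
                id,
                row_number() over (
                    partition by member_id, member_name
                    order by create_at, id
                ) as row_num
            from members
        ) ranked
        where row_num > 1
    )
    """
)
deleted_count = cursor.rowcount
conn.commit()

remaining_ids = [row[0] for row in conn.execute("select id from members order by id")]
print(deleted_count, remaining_ids)  # 3 [1, 3, 4, 5]
```

On a real table it is worth running the select form first and checking the rows it returns before switching it to a delete.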
So how can we meet this requirement without a window function?
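🐛 Method 2: plain aggregation
Group the rows by member_id and member_name, treat the smallest id in each group as the row to keep, and return every row whose id is not in that set.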
select m2.*
from members m2
where m2.id not in (
    select min(m.id)
    from members m
    group by m.member_id, m.member_name
);
Query result:
|id |member_id|member_name|member_age|create_at |
|---|---------|-----------|----------|-------------------|
|2 |1001 |a |14 |2024-01-25 05:50:03|
|6 |1004 |h |20 |2024-01-25 05:50:03|
|8 |1004 |h |20 |2024-01-25 05:50:03|
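The same aggregation can drive the actual cleanup. The sketch below turns the select into a delete, assuming once more an in-memory SQLite database through Python's sqlite3 module. The extra derived table (aliased keepers, a made-up name) is not needed by SQLite. It is there because the article's DDL looks like MySQL, and MySQL rejects a delete whose subquery reads the target table directly (error 1093), which wrapping the aggregate in a derived table avoids. Treat the MySQL part as an assumption about the author's environment.

```python
import sqlite3

# Assumption: in-memory SQLite as a stand-in for the article's database.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    create table members (
        id int primary key,
        member_id int,
        member_name varchar(60),
        member_age smallint,
        create_at datetime default current_timestamp
    )
    """
)
conn.executemany(
    "insert into members (id, member_id, member_name, member_age, create_at) "
    "values (?, ?, ?, ?, '2024-01-25 05:50:03')",
    [
        (1, 1001, "a", 14),
        (2, 1001, "a", 14),
        (3, 1002, "b", 20),
        (4, 1003, "c", 20),
        (5, 1004, "h", 20),
        (6, 1004, "h", 20),
        (8, 1004, "h", 20),
    ],
)

# Keep the smallest id of each (member_id, member_name) group and delete the rest.
# The aggregate sits inside a derived table ("keepers") so that the delete does
# not read its own target table directly in the subquery.
cursor = conn.execute(
    """
    delete from members
    where id not in (
        select keep_id
        from (
            select min(id) as keep_id
            from members
            group by member_id, member_name
        ) keepers
    )
    """
)
deleted_count = cursor.rowcount
conn.commit()

remaining = conn.execute(
    "select id, member_id, member_name from members order by id"
).fetchall()
print(deleted_count, remaining)
```

Because id is the primary key, min(id) is never null, so the not in comparison behaves as expected here. With a nullable column in that position, not exists would be the safer form.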