Delete Duplicate Emails, Keeping the Row with the Smallest id

yyyyyyyxy

Last modified: 2023-09-15 16:33:48

Tags: sql

First published: 2023-09-15 15:07:33

Article link: https://blog.csdn.net/yyyyyyyxy/article/details/132902571

I. Problem

Write a solution that deletes all duplicate emails, keeping for each email only the one row with the smallest id.
Input:
Person table:

+----+------------------+
| id | email            |
+----+------------------+
| 1  | john@example.com |
| 2  | bob@example.com  |
| 3  | john@example.com |
+----+------------------+

Output:

+----+------------------+
| id | email            |
+----+------------------+
| 1  | john@example.com |
| 2  | bob@example.com  |
+----+------------------+

Explanation: john@example.com appears twice. We keep the row with the smallest id, id = 1.

II. Solutions

1. Using SQL

The code is as follows (example):

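Method 1
-- MySQL multi-table DELETE with an explicit self-join:
-- p1 is deleted whenever another row p2 has the same email and a smaller id.
DELETE p1
FROM Person p1
JOIN Person p2 ON p1.email = p2.email
WHERE p1.id > p2.id;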

Method 2
-- Same logic as Method 1, written with a comma (implicit) join.
DELETE p1
FROM Person p1, Person p2
WHERE p1.email = p2.email
  AND p1.id > p2.id;
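To see which rows the self-join in Methods 1 and 2 targets, the same join condition can be run as a SELECT first. The following is a minimal sketch, assuming an in-memory SQLite database (via Python's `sqlite3` module) stands in for the real database; the variable name `rows_to_delete` is only illustrative.

```python
import sqlite3

# In-memory database holding the example Person table from the problem statement.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Person (id INTEGER PRIMARY KEY, email TEXT)")
conn.executemany(
    "INSERT INTO Person (id, email) VALUES (?, ?)",
    [(1, "john@example.com"), (2, "bob@example.com"), (3, "john@example.com")],
)

# Same join condition as Methods 1 and 2, but as a SELECT, so nothing is deleted.
# DISTINCT avoids listing a row several times when an email occurs 3+ times.
rows_to_delete = conn.execute(
    """
    SELECT DISTINCT p1.id, p1.email
    FROM Person p1
    JOIN Person p2 ON p1.email = p2.email
    WHERE p1.id > p2.id
    ORDER BY p1.id
    """
).fetchall()

print(rows_to_delete)  # [(3, 'john@example.com')]
```

Only the row with id = 3 has a partner row with the same email and a smaller id, so it is the only row the DELETE would remove.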

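Method 3
-- Keep only the rows whose id is the minimum id for its email.
-- The extra derived table (aliased t) lets MySQL read from Person
-- inside a DELETE on the same table.
DELETE
FROM Person
WHERE id NOT IN (
    SELECT min_id
    FROM (
        SELECT email, MIN(id) AS min_id
        FROM Person
        GROUP BY email
    ) t
);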
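The `NOT IN` approach of Method 3 can be tried out end to end without a MySQL server. The sketch below assumes SQLite (through Python's `sqlite3` module) as a stand-in; SQLite does not need the extra derived table, but it accepts it, so the statement is kept in the same shape.

```python
import sqlite3

# In-memory database holding the example Person table from the problem statement.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Person (id INTEGER PRIMARY KEY, email TEXT)")
conn.executemany(
    "INSERT INTO Person (id, email) VALUES (?, ?)",
    [(1, "john@example.com"), (2, "bob@example.com"), (3, "john@example.com")],
)

# Delete every row whose id is not the smallest id within its email group.
conn.execute(
    """
    DELETE FROM Person
    WHERE id NOT IN (
        SELECT min_id
        FROM (
            SELECT email, MIN(id) AS min_id
            FROM Person
            GROUP BY email
        ) AS t
    )
    """
)
conn.commit()

remaining = conn.execute("SELECT id, email FROM Person ORDER BY id").fetchall()
print(remaining)  # [(1, 'john@example.com'), (2, 'bob@example.com')]
```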

2. Using pandas

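We are asked to modify Person in place. We can therefore call the drop method with inplace=True and pass removed_person.index, the index labels of every row whose id is not the smallest for its email. The complete code is as follows: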

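import pandas as pd

def delete_duplicate_emails(person: pd.DataFrame) -> None:
    # For every row, the smallest id among rows sharing the same email.
    min_id = person.groupby('email')['id'].transform('min')
    # Rows whose id is not that minimum are the duplicates to remove.
    removed_person = person[person['id'] != min_id]
    # Drop them by index label, modifying the caller's DataFrame in place.
    person.drop(removed_person.index, inplace=True)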
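An alternative way to get the same in-place result is to sort by id and then keep the first row for each email. This is a sketch rather than the approach used above; the function name `delete_duplicate_emails_sorted` is hypothetical, and the sample data is the Person table from the problem statement.

```python
import pandas as pd

def delete_duplicate_emails_sorted(person: pd.DataFrame) -> None:
    # Sort by id so that, for each email, the smallest id comes first.
    person.sort_values(by="id", inplace=True)
    # Keep the first row per email and drop the later duplicates in place.
    person.drop_duplicates(subset="email", keep="first", inplace=True)

# Example Person table from the problem statement.
person = pd.DataFrame(
    {
        "id": [1, 2, 3],
        "email": ["john@example.com", "bob@example.com", "john@example.com"],
    }
)

delete_duplicate_emails_sorted(person)
print(person)
```

Unlike the groupby version, this variant also reorders the rows by id, which may or may not matter depending on how the result is checked.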
# Summary