Removing Duplicate Data in Pandas

JasonLiu1919

Last modified 2022-11-03 14:45:58

First published 2021-09-29 11:18:53

Article link: https://blog.csdn.net/ljp1919/article/details/120544746

For more frequent and timely posts, follow the WeChat official account 小窗幽记机器学习.

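Background

Duplicate records come up constantly in data processing. This post briefly walks through the duplicate-data problems I have run into and how to handle each one depending on the requirement.

Selecting rows that are duplicated on specified columns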

import pandas as pd

student_dict = {"name": ["Joe", "Nat", "Harry", "Nat"], "age": [20, 21, 19, 21], "marks": [85.10, 77.80, 91.54, 77.80]}

# Create DataFrame from dict
student_df = pd.DataFrame(student_dict)
print(student_df)

duplicated_task_df = student_df[student_df.duplicated(subset=["age"], keep=False)]
print("duplicated_task_df:")
print(duplicated_task_df)

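In the output, the first print shows the full DataFrame, and duplicated_task_df contains only rows 1 and 3 (both Nat, age 21): with keep=False, every row whose age value occurs more than once is flagged.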
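To make the role of the keep argument concrete, here is a minimal sketch that prints the boolean mask returned by DataFrame.duplicated for each of its three settings. The DataFrame is a trimmed copy of the one above; the variable names mask_first, mask_last and mask_all are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({"name": ["Joe", "Nat", "Harry", "Nat"], "age": [20, 21, 19, 21]})

# keep="first" (the default): only the later occurrences are flagged
mask_first = df.duplicated(subset=["age"])
# keep="last": only the earlier occurrences are flagged
mask_last = df.duplicated(subset=["age"], keep="last")
# keep=False: every member of a duplicate group is flagged
mask_all = df.duplicated(subset=["age"], keep=False)

print(mask_first.tolist())  # [False, False, False, True]
print(mask_last.tolist())   # [False, True, False, False]
print(mask_all.tolist())    # [False, True, False, True]
```

Indexing the DataFrame with mask_all is what selects both copies of the duplicated row; the other two masks would select only one of them.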

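Dropping every copy of a duplicated row

Sometimes a row that appears more than once should be removed entirely, with no copy kept. This typically comes up when the same text appears with inconsistent labels: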

import pandas as pd

student_dict = {"name": ["Joe", "Nat", "Harry", "Nat"], "age": [20, 21, 19, 21], "marks": [85.10, 77.80, 91.54, 77.80]}

# Create DataFrame from dict
student_df = pd.DataFrame(student_dict)
print(student_df)

# drop all duplicate rows
student_df = student_df.drop_duplicates(keep=False)
print("drop all duplicate rows:")
print(student_df)

Output:

    name  age  marks
0    Joe   20  85.10
1    Nat   21  77.80
2  Harry   19  91.54
3    Nat   21  77.80
drop all duplicate rows:
    name  age  marks
0    Joe   20  85.10
2  Harry   19  91.54
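For the "same text, different labels" scenario, whole-row comparison is not enough, because rows whose labels differ are not identical. A possible sketch, assuming a hypothetical DataFrame with text and label columns (neither appears in the original example), is to compare on the text column only:

```python
import pandas as pd

# Hypothetical labelled data: "good movie" appears with two different labels
samples = pd.DataFrame({
    "text": ["good movie", "bad plot", "good movie", "fine"],
    "label": [1, 0, 0, 1],
})

# Whole-row comparison finds no duplicates, because the labels differ
print(len(samples.drop_duplicates(keep=False)))  # 4

# Comparing on "text" only drops every copy of the repeated text
clean = samples.drop_duplicates(subset=["text"], keep=False)
print(clean["text"].tolist())  # ['bad plot', 'fine']
```

Note that this also drops texts that are repeated with consistent labels. If those should survive, one option is to remove exact duplicates with a plain drop_duplicates() first and then apply the subset version.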

In-place operation

The deduplication calls above return a new DataFrame (a copy), which is the default behavior of DataFrame.drop_duplicates. To modify the existing DataFrame directly, set inplace=True.

import pandas as pd

student_dict = {"name": ["Joe", "Nat", "Harry", "Joe", "Nat"], "age": [20, 21, 19, 20, 21],
                "marks": [85.10, 77.80, 91.54, 85.10, 77.80]}

# Create DataFrame from dict
student_df = pd.DataFrame(student_dict)
print(student_df)

# drop duplicate rows
student_df.drop_duplicates(inplace=True)
print("drop duplicate rows with inplace=True:")
print(student_df)

Output:

    name  age  marks
0    Joe   20  85.10
1    Nat   21  77.80
2  Harry   19  91.54
3    Joe   20  85.10
4    Nat   21  77.80
drop duplicate rows with inplace=True:
    name  age  marks
0    Joe   20  85.10
1    Nat   21  77.80
2  Harry   19  91.54
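One detail worth keeping in mind with inplace=True is that the call returns None, so its result should not be assigned back to the variable. A short sketch on a smaller, made-up DataFrame:

```python
import pandas as pd

df = pd.DataFrame({"name": ["Joe", "Nat", "Joe"], "age": [20, 21, 20]})

# With inplace=True the DataFrame is modified and the call returns None
result = df.drop_duplicates(inplace=True)
print(result)   # None
print(len(df))  # 2

# Common mistake: `df = df.drop_duplicates(inplace=True)` would leave df set to None.
```

Either style works; the reassignment form `df = df.drop_duplicates()` is the other common way to get the same end state.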

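Deduplicating on specified columns and resetting the index

By default, DataFrame.drop_duplicates keeps the original row index. When later steps rely on a consecutive 0..N index, the index has to be reset.

When ignore_index=True, the row index is reset to 0, 1, ..., n - 1.
When ignore_index=False, the original row index is kept. This is the default.

Example: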

import pandas as pd

student_dict = {"name": ["Joe", "Nat", "Harry", "Nat"], "age": [20, 21, 19, 21], "marks": [85.10, 77.80, 91.54, 77.80]}

# Create DataFrame from dict
student_df = pd.DataFrame(student_dict, index=['a', 'b', 'c', 'd'])
print(student_df)

# drop duplicate rows
student_df0 = student_df.drop_duplicates(keep=False)
print("drop duplicate rows with ignore_index=False:")
print(student_df0)

# drop duplicate rows
student_df1 = student_df.drop_duplicates(keep=False, ignore_index=True)
print("drop duplicate rows with ignore_index=True:")
print(student_df1)

Output:

    name  age  marks
a    Joe   20  85.10
b    Nat   21  77.80
c  Harry   19  91.54
d    Nat   21  77.80
drop duplicate rows with ignore_index=False:
    name  age  marks
a    Joe   20  85.10
c  Harry   19  91.54
drop duplicate rows with ignore_index=True:
    name  age  marks
0    Joe   20  85.10
1  Harry   19  91.54
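The example above compares whole rows. As a sketch of deduplicating on a specific column while resetting the index, the snippet below passes subset together with ignore_index=True, and also shows the equivalent reset_index(drop=True) form. The ignore_index parameter was added in pandas 1.0, so the second form is the one to use on older versions. The variable names by_name and same are only illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    {"name": ["Joe", "Nat", "Harry", "Nat"], "age": [20, 21, 19, 21]},
    index=["a", "b", "c", "d"],
)

# Deduplicate on "name" only, keep the first occurrence, renumber the rows
by_name = df.drop_duplicates(subset=["name"], ignore_index=True)
print(by_name.index.tolist())    # [0, 1, 2]
print(by_name["name"].tolist())  # ['Joe', 'Nat', 'Harry']

# Equivalent: deduplicate first, then reset the index and discard the old labels
same = df.drop_duplicates(subset=["name"]).reset_index(drop=True)
print(by_name.equals(same))  # True
```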
