Handling duplicate rows in a pandas DataFrame

所有不开心都是闲出来的

Category: Data Analysis. Tags: python

Article link: https://blog.csdn.net/qq_42931634/article/details/134361227

Preface

Data cleaning frequently involves dealing with duplicate rows. In pandas, this is usually done with the DataFrame.drop_duplicates method.

Official pandas documentation for drop_duplicates:

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop_duplicates.html
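
As a starting point, here is a minimal sketch of the simplest call, with no arguments at all. The sample data is hypothetical and chosen so that two rows are identical in every column.

```python
import pandas as pd

# Hypothetical sample data: rows 0 and 2 are identical in every column.
df = pd.DataFrame({
    'id': ['1', '2', '1'],
    'email': ['bob@email.com', 'james@email.com', 'bob@email.com'],
})

# With no arguments, rows are compared on all columns and the
# first occurrence of each duplicated row is kept.
deduped = df.drop_duplicates()
print(deduped)
```

The parameters described in the following sections adjust which rows count as duplicates and which of them survive.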

The keep parameter

When removing duplicates, you have to decide which occurrence of each duplicated row to keep. The keep parameter controls that choice.

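keep = 'first': keep the first occurrence of each duplicate **(default)**
keep = 'last': keep the last occurrence of each duplicate
keep = False: drop every row that has a duplicate, keeping none of them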
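
One way to see what each keep value does is to look at the boolean mask returned by DataFrame.duplicated, which accepts the same subset and keep arguments and marks the rows that would be dropped. The following is a small sketch on sample data chosen for illustration, not a required step in deduplication.

```python
import pandas as pd

df = pd.DataFrame({
    'id': ['1', '2', '3', '4'],
    'email': ['bob@email.com', 'james@email.com',
              'john@email.com', 'james@email.com'],
})

# True marks a row that drop_duplicates() would remove for the same keep value.
mask_first = df.duplicated(subset=['email'], keep='first')
mask_last = df.duplicated(subset=['email'], keep='last')
mask_none = df.duplicated(subset=['email'], keep=False)

print(mask_first.tolist())  # [False, False, False, True]
print(mask_last.tolist())   # [False, True, False, False]
print(mask_none.tolist())   # [False, True, False, True]

# Filtering with the inverted mask selects the same rows as drop_duplicates().
same = df[~mask_last].equals(df.drop_duplicates(subset=['email'], keep='last'))
print(same)  # True
```

The mask is also handy when you want to inspect the duplicated rows before deleting anything, for example with df[mask_none].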

import pandas as pd

if __name__ == '__main__':
	dic = {'id': ['1', '2', '3', '4'], 'email': ['bob@email.com', 'james@email.com', 'john@email.com', 'james@email.com']}
	df = pd.DataFrame(dic)
	df1 = df.drop_duplicates(subset = ['email'], keep = 'first')
	df2 = df.drop_duplicates(subset = ['email'], keep = 'last')
	df3 = df.drop_duplicates(subset = ['email'], keep = False)

	print('keep = \'first\'')
	print(df1)
	print("-------------------------------")
	print('keep = \'last\'')
	print(df2)
	print("-------------------------------")
	print('keep = False')
	print(df3)
"""
result of output:

keep = 'first'
  id            email
0  1    bob@email.com
1  2  james@email.com
2  3   john@email.com
-------------------------------
keep = 'last'
  id            email
0  1    bob@email.com
2  3   john@email.com
3  4  james@email.com
-------------------------------
keep = False
  id           email
0  1   bob@email.com
2  3  john@email.com

"""

The subset parameter

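subset selects the columns used to identify duplicates. It can name a single column or several columns; when it is omitted, all columns are compared.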
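
As a brief sketch of the accepted forms: subset can be given as a single column label or as a list of labels, and leaving it out compares all columns. The variable names below are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    'id': ['1', '1', '3', '4'],
    'email': ['bob@email.com', 'james@email.com',
              'john@email.com', 'james@email.com'],
})

by_label = df.drop_duplicates(subset='id')    # a single column label
by_list = df.drop_duplicates(subset=['id'])   # the equivalent one-element list
all_columns = df.drop_duplicates()            # subset omitted: all columns compared

print(by_label.equals(by_list))  # True
print(len(all_columns))          # 4, because no two rows match in every column
```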

import pandas as pd

if __name__ == '__main__':
	dic = {'id': ['1', '1', '3', '4'], 'email': ['bob@email.com', 'james@email.com', 'john@email.com', 'james@email.com']}
	df = pd.DataFrame(dic)
	df1 = df.drop_duplicates(subset = ['id'])
	df2 = df.drop_duplicates(subset = ['email'])
	df3 = df.drop_duplicates(subset = ['id', 'email'])
	
	print('subset = [\'id\']')
	print(df1)
	print("-------------------------------")
	print('subset = [\'email\']')
	print(df2)
	print("-------------------------------")
	print('subset = [\'id\', \'email\']')
	print(df3)
"""
result of output:

subset = ['id']
  id            email
0  1    bob@email.com
2  3   john@email.com
3  4  james@email.com
-------------------------------
subset = ['email']
  id            email
0  1    bob@email.com
1  1  james@email.com
2  3   john@email.com
-------------------------------
subset = ['id', 'email']
  id            email
0  1    bob@email.com
1  1  james@email.com
2  3   john@email.com
3  4  james@email.com
"""

The inplace parameter

The inplace parameter controls whether the original DataFrame is modified.

inplace = False: return a new DataFrame and leave the original unchanged **(default)**
inplace = True: modify the original DataFrame directly
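
The two styles can be compared side by side. This is a sketch under the assumption that you only need the deduplicated result: reassigning the returned DataFrame and modifying in place end up with the same rows. The names df_a and df_b are hypothetical.

```python
import pandas as pd

data = {
    'id': ['1', '2', '3', '4'],
    'email': ['bob@email.com', 'james@email.com',
              'john@email.com', 'james@email.com'],
}
df_a = pd.DataFrame(data)
df_b = pd.DataFrame(data)

# Style 1: reassign the returned DataFrame.
df_a = df_a.drop_duplicates(subset=['email'])

# Style 2: modify in place. Do not rely on the return value here;
# the pandas documentation describes it as None when inplace=True.
df_b.drop_duplicates(subset=['email'], inplace=True)

print(df_a.equals(df_b))  # True
```

Reassignment tends to be easier to chain with other method calls, while inplace = True avoids introducing a second name for the result.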

import pandas as pd

if __name__ == '__main__':
	dic = {'id': ['1', '2', '3', '4'], 'email': ['bob@email.com', 'james@email.com', 'john@email.com', 'james@email.com']}
	df = pd.DataFrame(dic)
	# inplace defaults to False: the deduplicated copy is returned (and discarded here), so df is unchanged
	df.drop_duplicates(subset = ['email'])
	print('inplace = False')
	print(df)
	print("-------------------------------")
	df.drop_duplicates(subset = ['email'], inplace = True)
	print('inplace = True')
	print(df)
"""
inplace = False
  id            email
0  1    bob@email.com
1  2  james@email.com
2  3   john@email.com
3  4  james@email.com
-------------------------------
inplace = True
  id            email
0  1    bob@email.com
1  2  james@email.com
2  3   john@email.com
"""

The ignore_index parameter

The ignore_index parameter controls whether the index is rebuilt after duplicates are removed.

ignore_index = False: keep the original index labels of the remaining rows **(default)**
ignore_index = True: relabel the remaining rows 0, 1, 2, ..., n-1
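
To illustrate the relabeling, the sketch below compares ignore_index = True with calling reset_index(drop=True) on the result, which should give the same DataFrame. It assumes a pandas version that supports ignore_index in drop_duplicates (1.0 or later).

```python
import pandas as pd

df = pd.DataFrame({
    'id': ['1', '2', '3', '4'],
    'email': ['bob@email.com', 'james@email.com',
              'john@email.com', 'james@email.com'],
})

# Rebuild the index as part of the drop_duplicates call.
with_flag = df.drop_duplicates(subset=['email'], keep='last', ignore_index=True)

# Drop duplicates first, then discard the old index in a second step.
with_reset = df.drop_duplicates(subset=['email'], keep='last').reset_index(drop=True)

print(with_flag.index.tolist())      # [0, 1, 2]
print(with_flag.equals(with_reset))  # True
```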

import pandas as pd

if __name__ == '__main__':
	dic = {'id': ['1', '2', '3', '4'], 'email': ['bob@email.com', 'james@email.com', 'john@email.com', 'james@email.com']}
	df = pd.DataFrame(dic)
	df1 = df.drop_duplicates(subset = ['email'], keep = 'last')
	df2 = df.drop_duplicates(subset = ['email'], keep = 'last', ignore_index = True)

	print('ignore_index = False')
	print(df1)
	print("-------------------------------")
	print('ignore_index = True')
	print(df2)
"""
ignore_index = False
  id            email
0  1    bob@email.com
2  3   john@email.com
3  4  james@email.com
-------------------------------
ignore_index = True
  id            email
0  1    bob@email.com
1  3   john@email.com
2  4  james@email.com
"""