Handling Duplicate Rows in a pandas DataFrame

Preface

Data cleaning often involves dealing with duplicate rows. In pandas, the usual tool for this is the DataFrame.drop_duplicates method.

Official pandas documentation for drop_duplicates:

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop_duplicates.html
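
Before looking at the individual parameters, here is a minimal sketch of the default behavior. The sample data is a hypothetical variation of the data used later in this post, chosen so that two rows are identical in every column. It assumes only that pandas is installed; DataFrame.duplicated is the companion method that returns the boolean mask of rows that would be dropped.

```python
import pandas as pd

# Hypothetical sample data: rows 1 and 3 are identical in every column.
df = pd.DataFrame({
    'id': ['1', '2', '3', '2'],
    'email': ['bob@email.com', 'james@email.com', 'john@email.com', 'james@email.com'],
})

# True for every row that repeats an earlier row (all columns compared).
mask = df.duplicated()
print(mask.tolist())

# With no arguments, all columns are compared and the first occurrence is kept.
deduped = df.drop_duplicates()
print(deduped)
```

Checking the mask first is a cheap way to see how many rows would be removed before actually removing them.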

The keep parameter

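When duplicates are removed, you have to decide which occurrence (if any) to keep. The keep parameter controls that choice:

  • keep = 'first': keep the first occurrence of each duplicate **(default)**
  • keep = 'last': keep the last occurrence of each duplicate
  • keep = False: drop every row that has a duplicate, keeping none of them
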
import pandas as pd

if __name__ == '__main__':
    # 'james@email.com' appears twice (rows 1 and 3)
    dic = {'id': ['1', '2', '3', '4'], 'email': ['bob@email.com', 'james@email.com', 'john@email.com', 'james@email.com']}
    df = pd.DataFrame(dic)
    df1 = df.drop_duplicates(subset=['email'], keep='first')  # keeps row 1
    df2 = df.drop_duplicates(subset=['email'], keep='last')   # keeps row 3
    df3 = df.drop_duplicates(subset=['email'], keep=False)    # drops rows 1 and 3

    print("keep = 'first'")
    print(df1)
    print("-------------------------------")
    print("keep = 'last'")
    print(df2)
    print("-------------------------------")
    print('keep = False')
    print(df3)
"""
Output:

keep = 'first'
  id            email
0  1    bob@email.com
1  2  james@email.com
2  3   john@email.com
-------------------------------
keep = 'last'
  id            email
0  1    bob@email.com
2  3   john@email.com
3  4  james@email.com
-------------------------------
keep = False
  id           email
0  1   bob@email.com
2  3  john@email.com

"""
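
One way to reason about keep is through DataFrame.duplicated, which accepts the same subset and keep arguments and marks the rows that drop_duplicates would remove. The sketch below reuses the sample data from the example above and checks, for each keep value, that filtering with the inverted mask gives the same frame as drop_duplicates. The names results, via_mask, and via_drop are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    'id': ['1', '2', '3', '4'],
    'email': ['bob@email.com', 'james@email.com', 'john@email.com', 'james@email.com'],
})

results = {}
for keep in ('first', 'last', False):
    # True marks the rows that would be removed for this keep setting.
    mask = df.duplicated(subset=['email'], keep=keep)
    via_mask = df[~mask]
    via_drop = df.drop_duplicates(subset=['email'], keep=keep)
    results[keep] = (mask.tolist(), via_mask.equals(via_drop))
    print(keep, results[keep])
```

Working with the mask directly is handy when you want to inspect or log the duplicated rows (df[mask]) rather than silently discard them.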
  

The subset parameter

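subset selects which columns are compared when looking for duplicates. You can pass a single column or several; two rows count as duplicates only if they match in every listed column.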

import pandas as pd

if __name__ == '__main__':
    # id '1' repeats (rows 0 and 1); 'james@email.com' repeats (rows 1 and 3)
    dic = {'id': ['1', '1', '3', '4'], 'email': ['bob@email.com', 'james@email.com', 'john@email.com', 'james@email.com']}
    df = pd.DataFrame(dic)
    df1 = df.drop_duplicates(subset=['id'])           # compare only 'id'
    df2 = df.drop_duplicates(subset=['email'])        # compare only 'email'
    df3 = df.drop_duplicates(subset=['id', 'email'])  # compare both columns

    print("subset = ['id']")
    print(df1)
    print("-------------------------------")
    print("subset = ['email']")
    print(df2)
    print("-------------------------------")
    print("subset = ['id', 'email']")
    print(df3)
"""
Output:

subset = ['id']
  id            email
0  1    bob@email.com
2  3   john@email.com
3  4  james@email.com
-------------------------------
subset = ['email']
  id            email
0  1    bob@email.com
1  1  james@email.com
2  3   john@email.com
-------------------------------
subset = ['id', 'email']
  id            email
0  1    bob@email.com
1  1  james@email.com
2  3   john@email.com
3  4  james@email.com
"""
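
Two details of subset that the example above does not show: it also accepts a single column label instead of a list, and leaving it out (the default, None) compares all columns. The following sketch, using the same sample data, checks both points. The variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'id': ['1', '1', '3', '4'],
    'email': ['bob@email.com', 'james@email.com', 'john@email.com', 'james@email.com'],
})

# A single label and a one-element list behave the same way.
by_label = df.drop_duplicates(subset='id')
by_list = df.drop_duplicates(subset=['id'])
print(by_label.equals(by_list))

# Omitting subset compares every column, same as listing them all.
all_columns = df.drop_duplicates()
explicit_all = df.drop_duplicates(subset=['id', 'email'])
print(all_columns.equals(explicit_all))
```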

The inplace parameter

inplace controls whether the original DataFrame is modified:

  • inplace = False: leave the original DataFrame untouched and return a new one **(default)**
  • inplace = True: modify the original DataFrame

import pandas as pd

if __name__ == '__main__':
    dic = {'id': ['1', '2', '3', '4'], 'email': ['bob@email.com', 'james@email.com', 'john@email.com', 'james@email.com']}
    df = pd.DataFrame(dic)
    # inplace defaults to False: the result is returned (and discarded here), df is unchanged
    df.drop_duplicates(subset=['email'])
    print('inplace = False')
    print(df)
    print("-------------------------------")
    # inplace=True: df itself is modified
    df.drop_duplicates(subset=['email'], inplace=True)
    print('inplace = True')
    print(df)
"""
inplace = False
  id            email
0  1    bob@email.com
1  2  james@email.com
2  3   john@email.com
3  4  james@email.com
-------------------------------
inplace = True
  id            email
0  1    bob@email.com
1  2  james@email.com
2  3   john@email.com
"""
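
A common slip with inplace = True is to also assign the return value back to the variable. The pandas 1.x and 2.x documentation describes the return value as None when inplace=True, so that assignment would lose the data; other versions are not covered by this sketch. The example below contrasts the two safe patterns on copies of the same sample data: modify in place and ignore the return value, or leave inplace at its default and reassign. The names df_a and df_b are illustrative.

```python
import pandas as pd

data = {
    'id': ['1', '2', '3', '4'],
    'email': ['bob@email.com', 'james@email.com', 'john@email.com', 'james@email.com'],
}

# Pattern A: modify in place and do not use the return value
# (documented as None in pandas 1.x/2.x when inplace=True).
df_a = pd.DataFrame(data)
df_a.drop_duplicates(subset=['email'], inplace=True)

# Pattern B: keep the default inplace=False and reassign the result.
df_b = pd.DataFrame(data)
df_b = df_b.drop_duplicates(subset=['email'])

print(df_a.equals(df_b))
print(len(df_a), len(df_b))
```

Pattern B is often easier to read in method chains, since each step returns a DataFrame.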

The ignore_index parameter

ignore_index controls whether the DataFrame gets a fresh index after duplicates are removed:

  • ignore_index = False: keep the original index labels of the surviving rows **(default)**
  • ignore_index = True: relabel the surviving rows 0, 1, ..., n-1

import pandas as pd

if __name__ == '__main__':
    dic = {'id': ['1', '2', '3', '4'], 'email': ['bob@email.com', 'james@email.com', 'john@email.com', 'james@email.com']}
    df = pd.DataFrame(dic)
    # keep='last' drops row 1, leaving a gap in the index (0, 2, 3)
    df1 = df.drop_duplicates(subset=['email'], keep='last')
    # ignore_index=True relabels the remaining rows 0, 1, 2
    df2 = df.drop_duplicates(subset=['email'], keep='last', ignore_index=True)

    print('ignore_index = False')
    print(df1)
    print("-------------------------------")
    print('ignore_index = True')
    print(df2)
"""
ignore_index = False
  id            email
0  1    bob@email.com
2  3   john@email.com
3  4  james@email.com
-------------------------------
ignore_index = True
  id            email
0  1    bob@email.com
1  3   john@email.com
2  4  james@email.com
"""
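
For comparison, the same relabeling can be done as a separate step with reset_index(drop=True). The sketch below, which assumes a pandas version recent enough to support ignore_index (1.0 or later), checks that the two approaches give the same result on the sample data. The names one_step and two_step are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'id': ['1', '2', '3', '4'],
    'email': ['bob@email.com', 'james@email.com', 'john@email.com', 'james@email.com'],
})

# Relabel as part of the drop_duplicates call.
one_step = df.drop_duplicates(subset=['email'], keep='last', ignore_index=True)

# Drop first, then discard the old index and number the rows from 0.
two_step = df.drop_duplicates(subset=['email'], keep='last').reset_index(drop=True)

print(one_step.index.tolist())
print(one_step.equals(two_step))
```

Keeping the original index (the default) is useful when you still need to trace surviving rows back to their positions in the source data.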