Preface
Data cleaning often involves handling duplicate rows. In pandas, this is usually done with the drop_duplicates function.
Official pandas documentation for drop_duplicates:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop_duplicates.html
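The keep parameter
When duplicates are removed, you need to decide which occurrence (if any) to retain. The keep parameter controls that choice:
- keep = 'first': keep the first occurrence in each group of duplicates **(default)**
- keep = 'last': keep the last occurrence in each group of duplicates
- keep = False: drop every row that has a duplicate, keeping none of them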
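One way to see what each keep option would discard, before actually dropping anything, is DataFrame.duplicated, which accepts the same subset and keep arguments and returns a boolean mask. The snippet below is only a minimal sketch; the sample data and variable names are assumptions chosen for illustration.

```python
import pandas as pd

# Illustrative sample data: rows 1 and 3 share the same email.
df = pd.DataFrame({
    'id': ['1', '2', '3', '4'],
    'email': ['bob@email.com', 'james@email.com',
              'john@email.com', 'james@email.com'],
})

# True marks a row that drop_duplicates would remove for the same keep value.
mask_first = df.duplicated(subset=['email'], keep='first')
mask_last = df.duplicated(subset=['email'], keep='last')
mask_all = df.duplicated(subset=['email'], keep=False)

print(mask_first.tolist())  # [False, False, False, True]
print(mask_last.tolist())   # [False, True, False, False]
print(mask_all.tolist())    # [False, True, False, True]

# Filtering with the inverted mask selects the same rows as drop_duplicates.
same = df[~mask_first].equals(df.drop_duplicates(subset=['email'], keep='first'))
print(same)  # True
```

Checking the mask first can be handy when you want to review the rows about to be removed rather than silently lose them.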
import pandas as pd

if __name__ == '__main__':
    # Rows 1 and 3 share the same email address
    dic = {'id': ['1', '2', '3', '4'], 'email': ['bob@email.com', 'james@email.com', 'john@email.com', 'james@email.com']}
    df = pd.DataFrame(dic)
    df1 = df.drop_duplicates(subset=['email'], keep='first')
    df2 = df.drop_duplicates(subset=['email'], keep='last')
    df3 = df.drop_duplicates(subset=['email'], keep=False)
    print("keep = 'first'")
    print(df1)
    print("-------------------------------")
    print("keep = 'last'")
    print(df2)
    print("-------------------------------")
    print("keep = False")
    print(df3)
"""
result of output:
keep = 'first'
  id            email
0  1    bob@email.com
1  2  james@email.com
2  3   john@email.com
-------------------------------
keep = 'last'
  id            email
0  1    bob@email.com
2  3   john@email.com
3  4  james@email.com
-------------------------------
keep = False
  id            email
0  1    bob@email.com
2  3   john@email.com
"""
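The subset parameter
subset specifies which columns are compared when identifying duplicates. It can name a single column or several columns; with several columns, rows count as duplicates only if they match in all of them.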
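As a small sketch of the remaining cases: when subset is omitted (its default is None), all columns are compared, and a single column label can be passed directly instead of a list. The sample data below is hypothetical and picked so that one row repeats another in every column.

```python
import pandas as pd

# Hypothetical data: row 3 repeats row 1 in every column.
df = pd.DataFrame({
    'id': ['1', '1', '3', '1'],
    'email': ['bob@email.com', 'james@email.com',
              'john@email.com', 'james@email.com'],
})

# No subset: a row is a duplicate only if all columns match.
all_cols = df.drop_duplicates()

# A single label is accepted as well as a list of labels.
by_id = df.drop_duplicates(subset='id')

print(all_cols.index.tolist())  # [0, 1, 2]
print(by_id.index.tolist())     # [0, 2]
```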
import pandas as pd

if __name__ == '__main__':
    # Rows 0 and 1 share an id; rows 1 and 3 share an email
    dic = {'id': ['1', '1', '3', '4'], 'email': ['bob@email.com', 'james@email.com', 'john@email.com', 'james@email.com']}
    df = pd.DataFrame(dic)
    df1 = df.drop_duplicates(subset=['id'])
    df2 = df.drop_duplicates(subset=['email'])
    df3 = df.drop_duplicates(subset=['id', 'email'])
    print("subset = ['id']")
    print(df1)
    print("-------------------------------")
    print("subset = ['email']")
    print(df2)
    print("-------------------------------")
    print("subset = ['id', 'email']")
    print(df3)
"""
result of output:
subset = ['id']
  id            email
0  1    bob@email.com
2  3   john@email.com
3  4  james@email.com
-------------------------------
subset = ['email']
  id            email
0  1    bob@email.com
1  1  james@email.com
2  3   john@email.com
-------------------------------
subset = ['id', 'email']
  id            email
0  1    bob@email.com
1  1  james@email.com
2  3   john@email.com
3  4  james@email.com
"""
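The inplace parameter
The inplace parameter controls whether the original dataframe is modified:
- inplace = False: leave the original dataframe unchanged and return a new one **(default)**
- inplace = True: modify the original dataframe directly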
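A detail that is easy to trip over: with inplace = True the call returns None, so its result should not be assigned back to a variable. The following is a minimal sketch with assumed sample data and variable names, contrasting that with the more common reassignment style.

```python
import pandas as pd

data = {
    'id': ['1', '2', '3', '4'],
    'email': ['bob@email.com', 'james@email.com',
              'john@email.com', 'james@email.com'],
}

# In-place: df itself shrinks and the call returns None.
df = pd.DataFrame(data)
result = df.drop_duplicates(subset=['email'], inplace=True)
print(result is None)  # True
print(len(df))         # 3

# Reassignment: keep the default inplace=False and rebind the name.
df2 = pd.DataFrame(data)
df2 = df2.drop_duplicates(subset=['email'])
print(len(df2))        # 3
```

Both styles end with the same rows; the reassignment form tends to chain more easily with other DataFrame methods.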
import pandas as pd

if __name__ == '__main__':
    dic = {'id': ['1', '2', '3', '4'], 'email': ['bob@email.com', 'james@email.com', 'john@email.com', 'james@email.com']}
    df = pd.DataFrame(dic)
    # inplace defaults to False: the result is returned (and discarded here), df is unchanged
    df.drop_duplicates(subset=['email'])
    print('inplace = False')
    print(df)
    print("-------------------------------")
    # inplace=True: df itself is modified
    df.drop_duplicates(subset=['email'], inplace=True)
    print('inplace = True')
    print(df)
"""
result of output:
inplace = False
  id            email
0  1    bob@email.com
1  2  james@email.com
2  3   john@email.com
3  4  james@email.com
-------------------------------
inplace = True
  id            email
0  1    bob@email.com
1  2  james@email.com
2  3   john@email.com
"""
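The ignore_index parameter
The ignore_index parameter controls whether the index is rebuilt after duplicates are dropped:
- ignore_index = False: keep the original index labels of the remaining rows **(default)**
- ignore_index = True: relabel the remaining rows 0, 1, 2, ..., n-1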
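For comparison, the same relabeling can be obtained by calling reset_index(drop=True) on the deduplicated result. The sketch below assumes pandas 1.0 or later (where ignore_index is available for drop_duplicates) and uses illustrative sample data.

```python
import pandas as pd

df = pd.DataFrame({
    'id': ['1', '2', '3', '4'],
    'email': ['bob@email.com', 'james@email.com',
              'john@email.com', 'james@email.com'],
})

# Two-step version: drop duplicates, then discard the old index labels.
two_step = df.drop_duplicates(subset=['email'], keep='last').reset_index(drop=True)

# One-step version using ignore_index.
one_step = df.drop_duplicates(subset=['email'], keep='last', ignore_index=True)

print(one_step.index.tolist())    # [0, 1, 2]
print(two_step.equals(one_step))  # True
```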
import pandas as pd

if __name__ == '__main__':
    dic = {'id': ['1', '2', '3', '4'], 'email': ['bob@email.com', 'james@email.com', 'john@email.com', 'james@email.com']}
    df = pd.DataFrame(dic)
    # Default: surviving rows keep their original index labels
    df1 = df.drop_duplicates(subset=['email'], keep='last')
    # ignore_index=True: surviving rows are relabeled from 0
    df2 = df.drop_duplicates(subset=['email'], keep='last', ignore_index=True)
    print('ignore_index = False')
    print(df1)
    print("-------------------------------")
    print('ignore_index = True')
    print(df2)
"""
result of output:
ignore_index = False
  id            email
0  1    bob@email.com
2  3   john@email.com
3  4  james@email.com
-------------------------------
ignore_index = True
  id            email
0  1    bob@email.com
1  3   john@email.com
2  4  james@email.com
"""