DataFrame.duplicated(subset=None, keep='first')
Returns a boolean Series that marks duplicate rows.
Parameters:
1) subset: column label or sequence of labels, optional
# Selects the columns to compare; defaults to all columns
Only consider certain columns for identifying duplicates, by default use all of the columns.
2) keep: {'first', 'last', False}, default 'first'
# With the default, the first occurrence is left unmarked and only later repeats are flagged
Determines which duplicates (if any) to mark.
- first : Mark duplicates as True except for the first occurrence.
- last : Mark duplicates as True except for the last occurrence.
# keep='last' effectively scans from the bottom up, so among identical rows the ones with the smaller index return True.
- False : Mark all duplicates as True.
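To make the parameter descriptions above concrete, here is a minimal sketch that compares the three keep settings and a subset restriction. The frame df and its values are hypothetical sample data chosen for illustration, not taken from the pandas documentation.

```python
import pandas as pd

# Hypothetical sample data: rows 0 and 1 are identical,
# row 2 repeats only the district, row 3 is unique.
df = pd.DataFrame({
    "district": ["A", "A", "A", "B"],
    "count": [50, 50, 70, 60],
})

# keep='first' (the default): the later copy of a repeated row is marked.
first = df.duplicated().tolist()

# keep='last': the earlier copy is marked instead.
last = df.duplicated(keep="last").tolist()

# keep=False: every member of a duplicate group is marked.
all_marked = df.duplicated(keep=False).tolist()

# subset='district': only the district column is compared.
by_district = df.duplicated(subset="district").tolist()

print(first)        # [False, True, False, False]
print(last)         # [True, False, False, False]
print(all_marked)   # [True, True, False, False]
print(by_district)  # [False, True, True, False]
```

Row 2 is marked only in the last case, because it matches the earlier rows on district but not on count.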
Example:
import pandas as pd
data=pd.DataFrame({'district':['A','A','B','B','C','C'],'count':[50,50,60,60,80,80]})
Duplicate rows return True
data.duplicated()
Use drop_duplicates() to remove the duplicate rows
data.drop_duplicates()
The row index is not renumbered after the rows are removed, so use reset_index(drop=True) to reset it
data.drop_duplicates().reset_index(drop=True)
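As a possible shortcut for the two-step call, drop_duplicates() also accepts ignore_index=True. The sketch below assumes pandas 1.0 or later, where that parameter is available, and reuses the example data to check that both forms give the same result. The names two_step and one_step are illustrative only.

```python
import pandas as pd

data = pd.DataFrame({
    "district": ["A", "A", "B", "B", "C", "C"],
    "count": [50, 50, 60, 60, 80, 80],
})

# Two-step form: drop the duplicates, then renumber the index.
two_step = data.drop_duplicates().reset_index(drop=True)

# One-step form (assumes pandas >= 1.0): renumber while dropping.
one_step = data.drop_duplicates(ignore_index=True)

print(one_step)
print(two_step.equals(one_step))  # True
```

Without either option, the surviving rows keep their original labels 0, 2 and 4.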