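To deal with duplicated data, we can use the duplicated() and drop_duplicates() methods.
Note: both are methods of the DataFrame object (Series provides them as well).
duplicated() checks whether each row repeats an earlier row and returns a boolean Series, one value per row.
drop_duplicates() removes the duplicate rows; by passing an argument you can restrict which columns are used for the comparison.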
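Using duplicated():
import pandas as pd

# Rows 2 and 3 are exact copies of rows 0 and 1
df1 = pd.DataFrame({'ID': [1.0, 2.0, 1.0, 2.0, 3.0], 'lenget': [5.1, 4.9, 5.1, 4.9, 4.7], 'width': [3.5, 3.0, 3.5, 3.0, 3.2],
                    'lenhet1': [1.4, 1.4, 1.4, 1.4, 1.3], 'width2': [0.2, 0.2, 0.2, 0.2, 0.2], 'Test': ['test', 'test', 'test', 'test', 'test']})
print(df1)
print(df1.duplicated())  # True marks a row identical to an earlier row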
Output:
    ID  lenget  width  lenhet1  width2  Test
0  1.0     5.1    3.5      1.4     0.2  test
1  2.0     4.9    3.0      1.4     0.2  test
2  1.0     5.1    3.5      1.4     0.2  test
3  2.0     4.9    3.0      1.4     0.2  test
4  3.0     4.7    3.2      1.3     0.2  test
0    False
1    False
2     True
3     True
4    False
dtype: bool
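Because duplicated() returns a boolean Series, the result can be used to count or select the repeated rows. The following is a minimal sketch on a trimmed-down version of the data above (three columns only, for brevity); the variable names are illustrative and not part of the original example, while `subset` and `keep` are parameters of duplicated() itself.

```python
import pandas as pd

# Trimmed version of the example data (assumption: three columns are enough to illustrate)
df = pd.DataFrame({
    'ID': [1.0, 2.0, 1.0, 2.0, 3.0],
    'lenget': [5.1, 4.9, 5.1, 4.9, 4.7],
    'width': [3.5, 3.0, 3.5, 3.0, 3.2],
})

# Default keep='first': the first occurrence is not flagged, later copies are
mask = df.duplicated()

# Number of rows that repeat an earlier row
n_dup = int(mask.sum())

# Select only the repeated rows
repeats = df[mask]

# keep=False flags every member of a duplicated group, including the first
all_copies = df[df.duplicated(keep=False)]

# Compare on selected columns only
by_width = df.duplicated(subset=['width'])

print(n_dup)
print(repeats)
print(all_copies)
```

Using keep=False is handy when you want to inspect every copy of a duplicated record before deciding which one to keep.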
Using drop_duplicates():
df1 = pd.DataFrame({'ID': [1.0, 2.0, 1.0, 2.0, 3.0], 'lenget': [5.1, 4.9, 5.1, 4.9, 4.7], 'width': [3.5, 3.0, 3.5, 3.0, 3.2],
                    'lenhet1': [1.4, 1.4, 1.4, 1.4, 1.3], 'width2': [0.2, 0.2, 0.2, 0.2, 0.2], 'Test': ['test', 'test', 'test', 'test', 'test']})
print(df1)
print(df1.drop_duplicates())  # compare all columns, keep the first occurrence
print(df1.drop_duplicates(subset='width'))  # use only the width column to decide
Output:
    ID  lenget  width  lenhet1  width2  Test
0  1.0     5.1    3.5      1.4     0.2  test
1  2.0     4.9    3.0      1.4     0.2  test
2  1.0     5.1    3.5      1.4     0.2  test
3  2.0     4.9    3.0      1.4     0.2  test
4  3.0     4.7    3.2      1.3     0.2  test
    ID  lenget  width  lenhet1  width2  Test
0  1.0     5.1    3.5      1.4     0.2  test
1  2.0     4.9    3.0      1.4     0.2  test
4  3.0     4.7    3.2      1.3     0.2  test
    ID  lenget  width  lenhet1  width2  Test
0  1.0     5.1    3.5      1.4     0.2  test
1  2.0     4.9    3.0      1.4     0.2  test
4  3.0     4.7    3.2      1.3     0.2  test
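Note that the surviving rows keep their original index labels (0, 1, 4), and that drop_duplicates() returns a new DataFrame rather than changing df1. The sketch below, again on a trimmed three-column version of the data with illustrative variable names, shows how the `keep`, `subset` and `ignore_index` parameters change the result; treat it as an illustration, not as the only way to do it.

```python
import pandas as pd

# Trimmed version of the example data (assumption: three columns are enough to illustrate)
df = pd.DataFrame({
    'ID': [1.0, 2.0, 1.0, 2.0, 3.0],
    'lenget': [5.1, 4.9, 5.1, 4.9, 4.7],
    'width': [3.5, 3.0, 3.5, 3.0, 3.2],
})

# Decide on several columns at once; the first occurrence is kept by default
first = df.drop_duplicates(subset=['ID', 'width'])

# Keep the last occurrence of each duplicated row instead
last = df.drop_duplicates(keep='last')

# Drop every row that has a duplicate, leaving only rows that occur once
unique_only = df.drop_duplicates(keep=False)

# Renumber the result 0..n-1 instead of keeping the original labels
renumbered = df.drop_duplicates(ignore_index=True)

print(first.index.tolist())
print(last.index.tolist())
print(unique_only.index.tolist())
print(renumbered.index.tolist())
print(len(df))  # the original DataFrame is left unchanged
```

To keep the result, assign it back, for example df = df.drop_duplicates().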