DataFrame.drop_duplicates(subset=None, keep='first', inplace=False)
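- subset: column label or sequence of labels. By default, all columns are used to identify duplicates.
- keep: which occurrence to keep. Default is 'first'.
    - 'first': keep the first occurrence
    - 'last': keep the last occurrence
    - False: drop all duplicates
- inplace: default is False.
    - True: the original df is modified in place and no new df is returned (the call returns None)
    - False: the original df is left unchanged and a new df is returned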
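
The three `keep` options can be compared side by side. This is a minimal sketch on a hypothetical toy frame (the columns `a` and `f` and their values are made up for illustration), not part of the original session:

```python
import pandas as pd

# Hypothetical toy data: the value 1 appears twice in column "f" (rows 0 and 1).
df = pd.DataFrame({"a": [10, 20, 30, 40], "f": [1, 1, 2, 3]})

first = df.drop_duplicates(subset=["f"], keep="first")  # keeps row 0, drops row 1
last = df.drop_duplicates(subset=["f"], keep="last")    # keeps row 1, drops row 0
none = df.drop_duplicates(subset=["f"], keep=False)     # drops both row 0 and row 1

print(first.index.tolist())  # [0, 2, 3]
print(last.index.tolist())   # [1, 2, 3]
print(none.index.tolist())   # [2, 3]
```

Note that the surviving rows keep their original index labels; the index is not renumbered.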
Example
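When subset contains several column labels, a row is dropped only if its values in all of those columns together match an earlier row. Below, rows 0 and 1 share (f, g) = (1, 1), so row 1 is dropped; once e is added to subset, no two rows match and nothing is dropped.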
In [48]: df
Out[48]:
    a   b   c   d  e  f  g
0  49  75  49  50  1  1  1
1  89  87  27  69  2  1  1
2  41   1  75  99  3  2  1
3   8  19  71   6  4  3  1
4   0  59  92  39  4  4  1

In [49]: dff = df.drop_duplicates(subset=['f', 'g'])

In [50]: dff
Out[50]:
    a   b   c   d  e  f  g
0  49  75  49  50  1  1  1
2  41   1  75  99  3  2  1
3   8  19  71   6  4  3  1
4   0  59  92  39  4  4  1

In [51]: dff2 = df.drop_duplicates(subset=['e', 'f', 'g'])

In [52]: dff2
Out[52]:
    a   b   c   d  e  f  g
0  49  75  49  50  1  1  1
1  89  87  27  69  2  1  1
2  41   1  75  99  3  2  1
3   8  19  71   6  4  3  1
4   0  59  92  39  4  4  1
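
The multi-column comparison above can be reproduced as a standalone script. This is a sketch that rebuilds only the e, f and g columns from the session (a through d are left out because they play no part in the comparison) and uses `DataFrame.duplicated` to show which rows `drop_duplicates` would remove:

```python
import pandas as pd

# Only the e, f and g columns from the example above; a-d are omitted
# because they are not part of any subset used here.
df = pd.DataFrame({
    "e": [1, 2, 3, 4, 4],
    "f": [1, 1, 2, 3, 4],
    "g": [1, 1, 1, 1, 1],
})

# duplicated() marks the rows that drop_duplicates() would remove
# (with the default keep='first').
mask_fg = df.duplicated(subset=["f", "g"])
mask_efg = df.duplicated(subset=["e", "f", "g"])

print(mask_fg.tolist())   # [False, True, False, False, False]
print(mask_efg.tolist())  # [False, False, False, False, False]

dff = df.drop_duplicates(subset=["f", "g"])
dff2 = df.drop_duplicates(subset=["e", "f", "g"])

print(dff.index.tolist())   # [0, 2, 3, 4]
print(dff2.index.tolist())  # [0, 1, 2, 3, 4]
```

Rows 3 and 4 share e = 4 but differ in f, so they count as duplicates only if subset is restricted to columns in which they agree, such as ['e', 'g'].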
The next session shows how keep and inplace behave on a freshly created df. It assumes `import numpy as np` and `import pandas as pd`; the values in columns a through e are random, so they will differ from run to run.

In [26]: df = pd.DataFrame(np.random.randint(0, 100, size=(5, 5)), index=list(range(5)), columns=list('abcde'))

In [27]: df['f'] = [1, 1, 2, 3, 4]

In [28]: df
Out[28]:
    a   b   c   d   e  f
0  42  55  55  39  61  1
1  27  51  26  26  64  1
2  87  11  23   2  77  2
3  82  98  61  15  88  3
4  25  21  47  79   4  4

In [29]: dff = df.drop_duplicates(subset=['f'], keep='first')

In [30]: dff
Out[30]:
    a   b   c   d   e  f
0  42  55  55  39  61  1
2  87  11  23   2  77  2
3  82  98  61  15  88  3
4  25  21  47  79   4  4

In [31]: df
Out[31]:
    a   b   c   d   e  f
0  42  55  55  39  61  1
1  27  51  26  26  64  1
2  87  11  23   2  77  2
3  82  98  61  15  88  3
4  25  21  47  79   4  4

In [34]: new = df.drop_duplicates(subset=['f'], keep='first', inplace=True)

In [35]: new  # no output: with inplace=True the call returns None

In [36]: df
Out[36]:
    a   b   c   d   e  f
0  42  55  55  39  61  1
2  87  11  23   2  77  2
3  82  98  61  15  88  3
4  25  21  47  79   4  4
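
The difference between the two inplace modes can be checked with a few assertions. This is a minimal sketch that reuses only the a and f columns from the session above; the variable names `result` and `ret` are chosen here for illustration:

```python
import pandas as pd

# Columns a and f copied from the session above; the other columns are omitted.
df = pd.DataFrame({"a": [42, 27, 87, 82, 25], "f": [1, 1, 2, 3, 4]})

# inplace=False (the default): a new DataFrame is returned, df is untouched.
result = df.drop_duplicates(subset=["f"], keep="first")
rows_before = len(df)      # 5, df still has every row
rows_in_result = len(result)  # 4, row 1 was dropped in the copy

# inplace=True: df itself is modified and the call returns None.
ret = df.drop_duplicates(subset=["f"], keep="first", inplace=True)
rows_after = len(df)       # 4, row 1 is now gone from df

print(rows_before, rows_in_result, rows_after)  # 5 4 4
print(ret is None)                              # True
```

Because the in-place call returns None, assigning its result (as in `new = df.drop_duplicates(..., inplace=True)`) leaves `new` empty; use the default inplace=False when a return value is needed.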