1. Deduplicating a list
1.1. Deduplicating with a for or while loop
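A minimal sketch of the loop-based approach, which keeps first-seen order (the helper name `dedup` is ours, not from the original notes):

```python
def dedup(items):
    """Return a new list with duplicates removed, keeping first-seen order."""
    seen = set()    # values already encountered
    result = []
    for x in items:
        if x not in seen:
            seen.add(x)
            result.append(x)
    return result

print(dedup([1, 4, 3, 3, 4, 2, 3, 4, 5, 6, 1]))  # [1, 4, 3, 2, 5, 6]
```

Membership tests against a set are O(1) on average, so this runs in roughly linear time, unlike `x not in result`, which would make the loop quadratic.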
1.2. Using the set() built-in
>>> l = [1,4,3,3,4,2,3,4,5,6,1]
>>> type(l)
<class 'list'>
>>> set(l)
{1, 2, 3, 4, 5, 6}
>>> res = list(set(l))
>>> res
[1, 2, 3, 4, 5, 6]
1.3. Using the groupby function from the itertools module
>>> import itertools
>>> li2 = [1,4,3,3,4,2,3,4,5,6,1]
>>> li2.sort() # sort first: groupby only groups adjacent equal elements
>>> it = itertools.groupby(li2)
>>> for k, g in it:
...     print(k)
...
1
2
3
4
5
6
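The loop above can be collapsed into a single comprehension; this is a sketch, not part of the original transcript:

```python
import itertools

li2 = [1, 4, 3, 3, 4, 2, 3, 4, 5, 6, 1]
# groupby merges runs of adjacent equal elements, so the list must be sorted first
res = [k for k, _ in itertools.groupby(sorted(li2))]
print(res)  # [1, 2, 3, 4, 5, 6]
```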
1.4. Using dict.fromkeys() and keys()
>>> li4 = [1,0,3,7,7,5]
>>> {}.fromkeys(li4)
{1: None, 0: None, 3: None, 7: None, 5: None}
>>> {}.fromkeys(li4).keys()
dict_keys([1, 0, 3, 7, 5])
>>> list({}.fromkeys(li4).keys())
[1, 0, 3, 7, 5]
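Since Python 3.7, dicts preserve insertion order, which is why the result above comes back in the order of first appearance. That makes `dict.fromkeys()` a concise, order-preserving, linear-time deduplication on its own:

```python
li4 = [1, 0, 3, 7, 7, 5]
# dict keys are unique and (since Python 3.7) keep insertion order
res = list(dict.fromkeys(li4))
print(res)  # [1, 0, 3, 7, 5]
```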
1.5. Using numpy.unique
For a one-dimensional array or list, np.unique removes the duplicate elements and returns a new array of the unique elements, sorted in ascending order.
- return_index=True: also return the index at which each element of the new array a = [1 2 3 4 5] first appears in the original list A = [1 2 5 3 4 3]
- return_inverse=True: also return, for each element of the original list A = [1 2 5 3 4 3], its index in the new array a = [1 2 3 4 5]
>>> import numpy as np
>>> A = [1, 2, 5, 3, 4, 3]
>>> a, s, p = np.unique(A, return_index=True, return_inverse=True)
>>> print("unique values:", a)
unique values: [1 2 3 4 5]
>>> print("return_index:", s)
return_index: [0 1 3 4 2]
>>> print("return_inverse:", p)
return_inverse: [0 1 4 2 3 2]
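The two index arrays tie a and A together: indexing the original at the first-occurrence indices recovers the unique array, and indexing the unique array with the inverse indices rebuilds the original. A quick check of that round-trip (a sketch, assuming NumPy is installed):

```python
import numpy as np

A = np.array([1, 2, 5, 3, 4, 3])
a, s, p = np.unique(A, return_index=True, return_inverse=True)
print(np.array_equal(A[s], a))  # True: A at the first-occurrence indices gives a
print(np.array_equal(a[p], A))  # True: a[p] reconstructs the original array
```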
2. Deduplicating a DataFrame
2.1. Using unique() to deduplicate a single column
>>> import pandas as pd
>>> import numpy as np
>>> data = {'id':['A','B','C','C','C','A','B','C','A'],'age':[18,20,14,10,50,14,65,14,98]}
>>> data = pd.DataFrame(data)
>>> data.id.unique()
array(['A', 'B', 'C'], dtype=object)
### or
>>> np.unique(data.id)
array(['A', 'B', 'C'], dtype=object)
2.2. Using DataFrame.drop_duplicates() to deduplicate on a single column
>>> data.drop_duplicates(['id'])
id age
0 A 18
1 B 20
2 C 14
2.3. Using DataFrame.drop_duplicates() to deduplicate on multiple columns
>>> data.drop_duplicates(['id','age'])
id age
0 A 18
1 B 20
2 C 14
3 C 10
4 C 50
5 A 14
6 B 65
8 A 98
2.4. Using DataFrame.duplicated() to deduplicate on multiple columns
>>> isduplicated = data.duplicated(['id','age'],keep='first')
>>> data.loc[~isduplicated,:]
id age
0 A 18
1 B 20
2 C 14
3 C 10
4 C 50
5 A 14
6 B 65
8 A 98
>>> data.loc[isduplicated,:]
id age
7 C 14
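Both drop_duplicates() and duplicated() accept a keep parameter ('first', 'last', or False) that controls which of the duplicated rows survives. A small sketch on the same data, assuming pandas is installed:

```python
import pandas as pd

data = pd.DataFrame({'id': ['A', 'B', 'C', 'C', 'C', 'A', 'B', 'C', 'A'],
                     'age': [18, 20, 14, 10, 50, 14, 65, 14, 98]})

# keep='last' retains the final occurrence of each (id, age) pair instead of the first
last = data.drop_duplicates(['id', 'age'], keep='last')
# keep=False drops every row whose (id, age) pair occurs more than once
none = data.drop_duplicates(['id', 'age'], keep=False)

print(last.index.tolist())  # the duplicated pair (C, 14) keeps row 7, not row 2
print(none.index.tolist())  # both rows 2 and 7 are dropped
```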