import numpy as np
import pandas as pd
from pandas import Series,DataFrame
from numpy import nan as NA
Removing duplicate data
Create data that contains duplicates
df1 = DataFrame({
'k1':list('aaabbbbccc'),
'k2':[1,1,2,2,3,3,3,4,2,1]
},index=list('ABCDEFGHIJ'))
df1
k1 k2
A a 1
B a 1
C a 2
D b 2
E b 3
F b 3
G b 3
H c 4
I c 2
J c 1
df1.k1.unique()
array(['a', 'b', 'c'], dtype=object)
xxx.duplicated() returns a boolean Series that marks whether each value repeats an earlier one
df1.k1
A a
B a
C a
D b
E b
F b
G b
H c
I c
J c
Name: k1, dtype: object
df1.k1.duplicated()
A False
B True
C True
D False
E True
F True
G True
H False
I True
J True
Name: k1, dtype: bool
df1.k1[~df1.k1.duplicated()]
A a
D b
H c
Name: k1, dtype: object
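The mask above uses the default keep='first'. As a minimal sketch, assuming the same k1 column rebuilt as a standalone Series, the other two settings of the keep parameter of duplicated() behave as follows:

```python
import pandas as pd

# Same values and labels as df1.k1, rebuilt here so the block runs on its own
k1 = pd.Series(list('aaabbbbccc'), index=list('ABCDEFGHIJ'))

# keep='first' (default): every occurrence after the first is flagged
first_mask = k1.duplicated()
# keep='last': every occurrence before the last is flagged
last_mask = k1.duplicated(keep='last')
# keep=False: every member of a repeated group is flagged
all_mask = k1.duplicated(keep=False)

# Labels of the rows that survive each mask
print(k1[~first_mask].index.tolist())
print(k1[~last_mask].index.tolist())
print(k1[~all_mask].index.tolist())
```

Because every value in k1 occurs more than once, the keep=False mask leaves nothing behind.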
s1 = Series(list(range(1,40))+[41]+list(range(41,80)))
s1
0 1
1 2
2 3
3 4
4 5
..
74 75
75 76
76 77
77 78
78 79
Length: 79, dtype: int64
s1.duplicated()
0 False
1 False
2 False
3 False
4 False
...
74 False
75 False
76 False
77 False
78 False
Length: 79, dtype: bool
s1.duplicated().value_counts()
False 78
True 1
dtype: int64
s1.is_unique
False
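is_unique only says that a repeat exists. One possible way to see which value repeats, sketched here with the same s1, is to select with a keep=False mask so that every copy is shown:

```python
import pandas as pd

s1 = pd.Series(list(range(1, 40)) + [41] + list(range(41, 80)))

# keep=False flags every copy of a repeated value, not only the later ones
repeated = s1[s1.duplicated(keep=False)]
print(repeated)
```

The result holds both positions of the repeated value, which makes it easier to decide which copy to drop.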
Series.drop_duplicates(keep='first', inplace=False)
df1.k1
A a
B a
C a
D b
E b
F b
G b
H c
I c
J c
Name: k1, dtype: object
df1.k1.drop_duplicates()
A a
D b
H c
Name: k1, dtype: object
df1.k1.drop_duplicates(keep='last')
C a
G b
J c
Name: k1, dtype: object
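DataFrame.drop_duplicates(subset=None, keep='first', inplace=False)
1. subset=[col1, col2, ...]: the columns to compare; by default a row is dropped only when all of its columns match an earlier row
2. keep='first': keep the first occurrence
3. inplace=False: return a new object instead of modifying the original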
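The three parameters can be sketched as follows. This is an illustrative example that rebuilds the two-column frame under the name df (a name chosen here, not from the original) and tries the non-default settings:

```python
import pandas as pd

df = pd.DataFrame({'k1': list('aaabbbbccc'),
                   'k2': [1, 1, 2, 2, 3, 3, 3, 4, 2, 1]},
                  index=list('ABCDEFGHIJ'))

# keep='last' keeps the final row of each group of identical rows
last_rows = df.drop_duplicates(keep='last')

# keep=False drops every row that has an identical row elsewhere
no_repeats = df.drop_duplicates(keep=False)

# inplace=True modifies the object and returns None, so work on a copy
work = df.copy()
result = work.drop_duplicates(subset=['k1'], inplace=True)

print(last_rows.index.tolist())
print(no_repeats.index.tolist())
print(result, work.index.tolist())
```

Since inplace=True returns None, assigning its result back to the same variable would lose the data.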
df1
k1 k2
A a 1
B a 1
C a 2
D b 2
E b 3
F b 3
G b 3
H c 4
I c 2
J c 1
df1.drop_duplicates()
k1 k2
A a 1
C a 2
D b 2
E b 3
H c 4
I c 2
J c 1
df1.drop_duplicates(subset=['k1'])
k1 k2
A a 1
D b 2
H c 4
df1.drop_duplicates(subset=['k2'])
k1 k2
A a 1
C a 2
E b 3
H c 4
df1.drop_duplicates(subset=['k1','k2'])
k1 k2
A a 1
C a 2
D b 2
E b 3
H c 4
I c 2
J c 1
Add a column to df1
df1['k3'] = np.arange(10)*10
Insert k0 before k1, with values 0, 100, ..., 900
df1.insert(0,'k0',np.arange(10)*100)
df1
k0 k1 k2 k3
A 0 a 1 0
B 100 a 1 10
C 200 a 2 20
D 300 b 2 30
E 400 b 3 40
F 500 b 3 50
G 600 b 3 60
H 700 c 4 70
I 800 c 2 80
J 900 c 1 90
df1.drop_duplicates()
k0 k1 k2 k3
A 0 a 1 0
B 100 a 1 10
C 200 a 2 20
D 300 b 2 30
E 400 b 3 40
F 500 b 3 50
G 600 b 3 60
H 700 c 4 70
I 800 c 2 80
J 900 c 1 90
df1.drop_duplicates(subset=['k1','k2'] ,ignore_index=True)
k0 k1 k2 k3
0 0 a 1 0
1 200 a 2 20
2 300 b 2 30
3 400 b 3 40
4 700 c 4 70
5 800 c 2 80
6 900 c 1 90
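ignore_index=True renumbers the surviving rows from 0. On pandas versions without this parameter (it was added in pandas 1.0), chaining reset_index(drop=True) should give the same frame. A small sketch, with the four-column frame rebuilt under the hypothetical name df:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'k0': np.arange(10) * 100,
                   'k1': list('aaabbbbccc'),
                   'k2': [1, 1, 2, 2, 3, 3, 3, 4, 2, 1],
                   'k3': np.arange(10) * 10},
                  index=list('ABCDEFGHIJ'))

# Built-in renumbering
a = df.drop_duplicates(subset=['k1', 'k2'], ignore_index=True)

# Equivalent two-step form: drop first, then discard the old labels
b = df.drop_duplicates(subset=['k1', 'k2']).reset_index(drop=True)

print(a.equals(b))
```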
Function mapping
df2 = DataFrame(
{'nickname':list('ABcd'),
'unit':[2,4,6,8]}
)
df2
nickname unit
0 A 2
1 B 4
2 c 6
3 d 8
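Suppose the nicknames A, B, c, d above are abbreviations for four animals, and a dictionary holds the four full names.
Add the four animals as a new column, matching on the first letter.
animal = {'b':'belt','c':'cycle','d':'donkey','a':'ant'}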
df2.nickname.str.lower()
df2.nickname.apply(str.lower)
df2.nickname.apply(lambda x:x.lower())
0 a
1 b
2 c
3 d
Name: nickname, dtype: object
duiying = df2.nickname.apply(lambda x:animal[x.lower()])
duiying
0 ant
1 belt
2 cycle
3 donkey
Name: nickname, dtype: object
df2['animal'] = duiying
df2
nickname unit animal
0 A 2 ant
1 B 4 belt
2 c 6 cycle
3 d 8 donkey
map() works as well: on a Series, both apply() and map() call the function on each value
df2.nickname.map(str.lower)
0 a
1 b
2 c
3 d
Name: nickname, dtype: object
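Series.map() also accepts a dictionary directly, which avoids the lambda. The sketch below assumes the same animal dictionary and adds a hypothetical nickname 'Z' that has no entry, to show what happens to unmatched keys:

```python
import pandas as pd

animal = {'b': 'belt', 'c': 'cycle', 'd': 'donkey', 'a': 'ant'}

# 'Z' is a made-up extra value with no matching key in the dictionary
nickname = pd.Series(list('ABcd') + ['Z'])

# Lower-case first, then look each value up in the dictionary
mapped = nickname.str.lower().map(animal)
print(mapped)
```

With a dictionary, a value that has no key becomes NaN, whereas the lambda form animal[x.lower()] would raise a KeyError for it.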
Replacing values
s2 = Series([1,-999,100,-999,10,-1000,10,1000])
s2
0 1
1 -999
2 100
3 -999
4 10
5 -1000
6 10
7 1000
dtype: int64
Replace -999 with NA
s2.apply(lambda x:NA if x==-999 else x)
0 1.0
1 NaN
2 100.0
3 NaN
4 10.0
5 -1000.0
6 10.0
7 1000.0
dtype: float64
Replace a single value with NA
s2.replace(-999,NA)
0 1.0
1 NaN
2 100.0
3 NaN
4 10.0
5 -1000.0
6 10.0
7 1000.0
dtype: float64
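When the values to blank out are better described by a condition than by a fixed value, Series.mask() is another option. A minimal sketch with the same s2:

```python
import pandas as pd

s2 = pd.Series([1, -999, 100, -999, 10, -1000, 10, 1000])

# mask() puts NaN wherever the condition is True and keeps the rest
masked = s2.mask(s2 == -999)

# A condition can cover several sentinel values at once
masked_all = s2.mask(s2 <= -999)

print(masked)
print(masked_all)
```

As with replace(), introducing NaN converts the integer Series to float64.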
Replace multiple values with a single value
-999 and -1000 both become 6666
s2.replace([-999,-1000],6666)
0 1
1 6666
2 100
3 6666
4 10
5 6666
6 10
7 1000
dtype: int64
Replace multiple values with multiple values
Method 1:
s2.replace([-999,-1000],[6666,7777])
0 1
1 6666
2 100
3 6666
4 10
5 7777
6 10
7 1000
dtype: int64
Method 2:
s2.replace({-999:666 ,-1000:777})
0 1
1 666
2 100
3 666
4 10
5 777
6 10
7 1000
dtype: int64
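The dictionary form can also mix a missing-value marker with ordinary replacements in one call. The following sketch uses the same s2; treating -1000 as 0 is only an illustrative choice:

```python
import numpy as np
import pandas as pd

s2 = pd.Series([1, -999, 100, -999, 10, -1000, 10, 1000])

# -999 becomes missing, -1000 becomes 0 (an arbitrary choice for illustration)
cleaned = s2.replace({-999: np.nan, -1000: 0})

# Count the missing entries, then list what remains
print(cleaned.isna().sum())
print(cleaned.dropna().tolist())
```

After the sentinels are turned into NaN, methods such as isna(), dropna(), and fillna() can handle them uniformly.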