import numpy as np
import pandas as pd
from pandas import Series,DataFrame
I. Missing values in pandas
1. NaN from NumPy
s = Series(['a','b',np.nan,'c','d'])
pd.isnull(s)
0 False
1 False
2 True
3 False
4 False
dtype: bool
2. None from Python
s = Series(['a','b',None,'c','d'])
pd.isnull(s)
0 False
1 False
2 True
3 False
4 False
dtype: bool
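Both markers are reported as missing by pd.isnull. The sketch below is an illustration under assumed names (`s_num` and `mask` are not from the original notes): in a numeric Series, None is stored as NaN on construction, and pd.isna is the same check as pd.isnull under a different name.

```python
import numpy as np
import pandas as pd

# In a numeric Series, None is stored as NaN and the dtype becomes float64.
s_num = pd.Series([1, None, 3])
print(s_num.dtype)

# pd.isna is an alias of pd.isnull; both flag the same positions.
mask = pd.isna(s_num)
print(mask.tolist())
```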
II. Filtering out missing values
1. Series
The dropna() method
s.dropna()
0 a
1 b
3 c
4 d
dtype: object
Boolean mask
s[pd.notnull(s)]
0 a
1 b
3 c
4 d
dtype: object
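The two approaches above give the same result, and both keep the original index labels (note the gap at label 2). A small sketch, with `by_method`, `by_mask` and `renumbered` as illustrative names, that checks the equivalence and shows one way to renumber the index afterwards:

```python
import numpy as np
import pandas as pd

s = pd.Series(['a', 'b', np.nan, 'c', 'd'])

by_method = s.dropna()      # drop missing entries directly
by_mask = s[s.notna()]      # select non-missing entries with a boolean mask
print(by_method.equals(by_mask))

# The surviving rows keep labels 0, 1, 3, 4; reset_index(drop=True) renumbers them.
renumbered = by_method.reset_index(drop=True)
print(renumbered.index.tolist())
```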
2.DataFrame
df = DataFrame([[1,3,5,7],
[2,4,np.nan,8],
[np.nan,np.nan,np.nan,np.nan],
[1,1,np.nan,np.nan]])
Drop rows that contain any missing value
print(df.dropna())
     0    1    2    3
0  1.0  3.0  5.0  7.0
Drop columns that contain any missing value
print(df.dropna(axis=1))
Empty DataFrame
Columns: []
Index: [0, 1, 2, 3]
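The result is empty because row 2 is entirely NaN, so every column contains at least one missing value. A possible workaround, sketched here with the illustrative name `cleaned`, is to remove the all-NaN rows first and only then drop the incomplete columns:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 3, 5, 7],
                   [2, 4, np.nan, 8],
                   [np.nan, np.nan, np.nan, np.nan],
                   [1, 1, np.nan, np.nan]])

# Every column has at least one NaN, which is why dropna(axis=1) removes them all.
print(df.isna().any().tolist())

# Drop the all-NaN row first, then drop columns that still contain NaN.
cleaned = df.dropna(how='all').dropna(axis=1)
print(cleaned)
```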
Drop only the rows in which every value is missing
print(df.dropna(how='all'))
     0    1    2    3
0  1.0  3.0  5.0  7.0
1  2.0  4.0  NaN  8.0
3  1.0  1.0  NaN  NaN
Keep only the rows that have at least thresh non-missing values
print(df.dropna(thresh=3))
     0    1    2    3
0  1.0  3.0  5.0  7.0
1  2.0  4.0  NaN  8.0
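To see what thresh counts, the same filter can be written by hand. This is only a sketch of the idea (the names `valid_per_row` and `manual` are illustrative): count the non-missing values in each row and keep the rows whose count reaches the threshold.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 3, 5, 7],
                   [2, 4, np.nan, 8],
                   [np.nan, np.nan, np.nan, np.nan],
                   [1, 1, np.nan, np.nan]])

# Number of non-missing values in each row.
valid_per_row = df.notna().sum(axis=1)
print(valid_per_row.tolist())

# Keeping rows with at least 3 valid values matches dropna(thresh=3).
manual = df[valid_per_row >= 3]
print(manual.equals(df.dropna(thresh=3)))
```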
III. Filling in missing data
1. Filling with a specified value
Fill every missing value with the same value
print(df.fillna(0))
     0    1    2    3
0  1.0  3.0  5.0  7.0
1  2.0  4.0  0.0  8.0
2  0.0  0.0  0.0  0.0
3  1.0  1.0  0.0  0.0
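Fill different columns with different values
print(df.fillna({3:-1,2:100})) # fill column 3 with -1 and column 2 with 100; columns not listed are left unchanged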
     0    1      2    3
0  1.0  3.0    5.0  7.0
1  2.0  4.0  100.0  8.0
2  NaN  NaN  100.0 -1.0
3  1.0  1.0  100.0 -1.0
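The per-column fill values do not have to be typed by hand. As one illustration (filling with the column mean is an assumption made for this example, not something the notes above prescribe), a Series indexed by column label works the same way as the dict:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 3, 5, 7],
                   [2, 4, np.nan, 8],
                   [np.nan, np.nan, np.nan, np.nan],
                   [1, 1, np.nan, np.nan]])

# df.mean() skips NaN by default and returns one value per column.
column_means = df.mean()

# Each column's NaN entries are replaced by that column's mean.
filled = df.fillna(column_means)
print(filled)
```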
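2. Filling from neighboring values (forward fill)
print(df.ffill()) # same result as fillna(method='ffill'), which is deprecated in recent pandas versions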
     0    1    2    3
0  1.0  3.0  5.0  7.0
1  2.0  4.0  5.0  8.0
2  2.0  4.0  5.0  8.0
3  1.0  1.0  5.0  8.0
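Forward fill copies the last valid value downward; it does not estimate values between known points. The sketch below, using illustrative names `limited` and `interpolated`, contrasts two related options: capping how far a value is carried with limit, and linear interpolation with interpolate().

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 3, 5, 7],
                   [2, 4, np.nan, 8],
                   [np.nan, np.nan, np.nan, np.nan],
                   [1, 1, np.nan, np.nan]])

# Carry each valid value forward by at most one row.
limited = df.ffill(limit=1)
print(limited)

# Linear interpolation down each column: a gap between 2.0 and 1.0 becomes 1.5.
interpolated = df.interpolate()
print(interpolated[[0, 1]])
```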
3. Filling the original data in place instead of returning a new object
df.fillna(0,inplace=True)
print(df)
     0    1    2    3
0  1.0  3.0  5.0  7.0
1  2.0  4.0  0.0  8.0
2  0.0  0.0  0.0  0.0
3  1.0  1.0  0.0  0.0
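An alternative to inplace=True, shown here as a sketch rather than a recommendation from the original notes, is to reassign the result. Without inplace=True, fillna leaves the original object untouched, which the counts below make visible (`missing_before` and `missing_after` are illustrative names).

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 3, 5, 7],
                   [2, 4, np.nan, 8],
                   [np.nan, np.nan, np.nan, np.nan],
                   [1, 1, np.nan, np.nan]])

# fillna returns a new DataFrame; df itself still holds its NaN values.
filled = df.fillna(0)
missing_before = int(df.isna().sum().sum())
missing_after = int(filled.isna().sum().sum())
print(missing_before, missing_after)

# Reassigning the name has the same visible effect as inplace=True.
df = df.fillna(0)
```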