目录
1 缺失值的统计和删除
1.1 缺失信息的统计
isna
或isnull
函数:缺失数据可以使用isna
或isnul
l (两个函数没有区别)来查看每个单元格是否缺失,结合mean
可以计算出每列缺失值的比例- 【例子】
df = pd.read_csv('data/learn_pandas.csv', usecols = ['Grade', 'Name', 'Gender', 'Height', 'Weight', 'Transfer']) df.isna().head() ''' Grade Name Gender Height Weight Transfer 0 False False False False False False 1 False False False False False False 2 False False False False False False 3 False False False True False False 4 False False False False False False ''' df.isna().mean() # 查看缺失的比例 ''' Grade 0.000 Name 0.000 Gender 0.000 Height 0.085 Weight 0.055 Transfer 0.060 dtype: float64 '''
- 想要查看某一列缺失或者非缺失的行,可以利用
Series
上的isna
或者notna
进行布尔索引。- 【例子】查看身高缺失的行
df[df.Height.isna()].head() ''' Grade Name Gender Height Weight Transfer 3 Sophomore Xiaojuan Sun Female NaN 41.0 N 12 Senior Peng You Female NaN 48.0 NaN 26 Junior Yanli You Female NaN 48.0 N 36 Freshman Xiaojuan Qin Male NaN 79.0 Y 60 Freshman Yanpeng Lv Male NaN 65.0 N ''' ```
- 如果想要同时对几个列,检索出全部为缺失或者至少有一个缺失或者没有缺失的行,可以使用
isna, notna
和any, all
的组合。- 【例子】对身高、体重和转系情况这3列分别进行这三种情况的检索
sub_set = df[['Height', 'Weight', 'Transfer']] df[sub_set.isna().all(1)] # 全部缺失 ''' Grade Name Gender Height Weight Transfer 102 Junior Chengli Zhao Male NaN NaN NaN ''' df[sub_set.isna().any(1)].head() # 至少有一个缺失 ''' Grade Name Gender Height Weight Transfer 3 Sophomore Xiaojuan Sun Female NaN 41.0 N 9 Junior Juan Xu Female 164.8 NaN N 12 Senior Peng You Female NaN 48.0 NaN 21 Senior Xiaopeng Shen Male 166.0 62.0 NaN 26 Junior Yanli You Female NaN 48.0 N ''' df[sub_set.notna().all(1)].head() # 没有缺失 ''' Grade Name Gender Height Weight Transfer 0 Freshman Gaopeng Yang Female 158.9 46.0 N 1 Freshman Changqiang You Male 166.5 70.0 N 2 Senior Mei Sun Male 188.9 89.0 N 4 Sophomore Gaojuan You Male 174.0 74.0 N 5 Freshman Xiaoli Qian Female 158.0 51.0 N '''
1.2 缺失信息的删除
dropna
函数:需要根据缺失值的大小、比例或其他特征来进行行样本或列特征的删除- 参数:
轴方向axis
(默认为0,即删除行);
删除方式how
(any和all
);
删除的非缺失值个数阈值thresh
(非缺失值 没有达到这个数量的相应维度会被删除);
备选的删除子集subset
; - 【例子】删除身高体重至少有一个缺失的行
res = df.dropna(how = 'any', subset = ['Height', 'Weight']) res.shape ''' (174, 6) '''
- 【例子】删除超过15个缺失值的列
#身高删除 res=df.dropna(1,thresh=df.shape[0]-15) res.head() ''' Grade Name Gender Weight Transfer 0 Freshman Gaopeng Yang Female 46.0 N 1 Freshman Changqiang You Male 70.0 N 2 Senior Mei Sun Male 89.0 N 3 Sophomore Xiaojuan Sun Female 41.0 N 4 Sophomore Gaojuan You Male 74.0 N '''
- 【例子】上面例子可以使用布尔索引来完成
这部分重点复习下
res = df.loc[df[['Height', 'Weight']].notna().all(1)] res.shape ''' (174, 6) ''' res = df.loc[:, ~(df.isna().sum()>15)] res.head() ''' Grade Name Gender Weight Transfer 0 Freshman Gaopeng Yang Female 46.0 N 1 Freshman Changqiang You Male 70.0 N 2 Senior Mei Sun Male 89.0 N 3 Sophomore Xiaojuan Sun Female 41.0 N 4 Sophomore Gaojuan You Male 74.0 N '''
- 参数:
2 缺失值的填充和插值
2.1 利用fillna进行填充
fillna
函数:- 参数:
value
: 为填充值,可以是标量,也可以是索引到元素的字典映射;
method
:为填充方法,有用前面的元素填充ffill
和用后面的元素填充bfill
两种类型;
limit
:参数表示连续缺失值的最大填充次数 - 【例子】
s = pd.Series([np.nan, 1, np.nan, np.nan, 2, np.nan],list('aaabcd')) ''' a NaN a 1.0 a NaN b NaN c 2.0 d NaN dtype: float64 ''' # 用前面的值向后填充 s.fillna(method='ffill') ''' a NaN a 1.0 a 1.0 b 1.0 c 2.0 d 2.0 dtype: float64 ''' # 连续出现的缺失,最多填充一次 s.fillna(method='ffill', limit=1) ''' a NaN a 1.0 a 1.0 b NaN c 2.0 d 2.0 dtype: float64 ''' # value为标量 s.fillna(s.mean()) ''' a 1.5 a 1.0 a 1.5 b 1.5 c 2.0 d 1.5 dtype: float64 ''' # 通过索引映射填充的值 s.fillna({ 'a': 100, 'd': 200}) #没有设置b的索引因此没有填充 ''' a 100.0 a 1.0 a 100.0 b NaN c 2.0 d 200.0 dtype: float64 '''
- 参数:
- 为了更加合理地填充,需要先进行分组后再操作
- 【例子】根据年级进行身高的均值填充
df.groupby('Grade')['Height'] .transform(lambda x: x.fillna(x.mean())).head() ''' 0 158.900000 1 166.500000 2 188.900000 3 163.075862 4