Chapter 7: Missing Data
import numpy as np
import pandas as pd
I. Counting and Dropping Missing Values
1. Counting Missing Values
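Use isna or isnull (the two are aliases and behave identically) to check whether each cell is missing.
Summing the result and dividing by the number of rows gives the proportion of missing values in each column: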
df = pd.read_csv('data/learn_pandas.csv',
                 usecols=['Grade', 'Name', 'Gender', 'Height', 'Weight', 'Transfer'])
df.isna().head()
   Grade   Name  Gender  Height  Weight  Transfer
0  False  False   False   False   False     False
1  False  False   False   False   False     False
2  False  False   False   False   False     False
3  False  False   False    True   False     False
4  False  False   False   False   False     False
df.isna().sum()/df.shape[0]
Grade 0.000
Name 0.000
Gender 0.000
Height 0.085
Weight 0.055
Transfer 0.060
dtype: float64
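As a minimal sketch of the same calculation on a small hypothetical DataFrame (toy and its values are made up for illustration and are not the learn_pandas.csv data), the mean of a boolean mask gives the same proportions as dividing the sum by the row count, because True counts as 1 and False as 0:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the learn_pandas.csv file used in the text
toy = pd.DataFrame({
    'Height': [158.9, np.nan, 188.9, np.nan],
    'Weight': [46.0, 70.0, np.nan, 74.0],
    'Transfer': ['N', 'N', 'N', 'Y'],
})

# Number of missing cells per column divided by the number of rows
ratio_by_sum = toy.isna().sum() / toy.shape[0]

# The mean of a boolean column is the share of True values
ratio_by_mean = toy.isna().mean()

# isnull is an alias of isna, so both masks are identical
same_mask = toy.isnull().equals(toy.isna())
```

Either form works; the mean version is just a shorter spelling of the same ratio.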
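To view the rows where a particular column is missing or not missing, use isna or notna on the Series
for boolean indexing. For example, to view the rows with a missing height: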
df[df.Height.isna()].head()
        Grade          Name  Gender  Height  Weight Transfer
3   Sophomore  Xiaojuan Sun  Female     NaN    41.0        N
12     Senior      Peng You  Female     NaN    48.0      NaN
26     Junior     Yanli You  Female     NaN    48.0        N
36   Freshman  Xiaojuan Qin    Male     NaN    79.0        Y
60   Freshman    Yanpeng Lv    Male     NaN    65.0        N
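The complementary selection with notna can be sketched the same way. The DataFrame below is hypothetical (made-up names and values), used only to show that the isna and notna masks on one column split the rows into two disjoint groups:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data for illustration only
toy = pd.DataFrame({
    'Name': ['A', 'B', 'C', 'D'],
    'Height': [158.9, np.nan, 188.9, np.nan],
})

# Rows where Height is missing
missing_height = toy[toy['Height'].isna()]

# Rows where Height is present
present_height = toy[toy['Height'].notna()]
```

Every row lands in exactly one of the two results, so their lengths add up to the length of toy.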
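To retrieve, across several columns at once, the rows where all values are missing, at least one is missing,
or none is missing, combine isna and notna with any and all. For example, the three cases for the height, weight, and transfer columns are retrieved as follows: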
sub_set = df[['Height', 'Weight', 'Transfer']]
df[sub_set.isna().all(axis=1)]
      Grade          Name Gender  Height  Weight Transfer
102  Junior  Chengli Zhao   Male     NaN     NaN      NaN
df[sub_set.isna().any(axis=1)].head()
        Grade           Name  Gender  Height  Weight Transfer
3   Sophomore   Xiaojuan Sun  Female     NaN    41.0        N
9      Junior        Juan Xu  Female   164.8     NaN        N
12     Senior       Peng You  Female     NaN    48.0      NaN
21     Senior  Xiaopeng Shen    Male   166.0    62.0      NaN
26     Junior      Yanli You  Female     NaN    48.0        N
df[sub_set.notna().all(axis=1)].head()
       Grade            Name  Gender  Height  Weight Transfer
0   Freshman    Gaopeng Yang  Female   158.9    46.0        N
1   Freshman  Changqiang You    Male   166.5    70.0        N
2     Senior         Mei Sun    Male   188.9    89.0        N
4  Sophomore     Gaojuan You    Male   174.0    74.0        N
5   Freshman     Xiaoli Qian  Female   158.0    51.0        N
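The three row-wise cases can be checked on a small hypothetical DataFrame (toy and its values are assumptions made for illustration). Passing axis=1 by keyword makes any and all reduce across the columns of each row, and recent pandas versions no longer accept the axis positionally for any:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: row 0 is complete, rows 1 and 3 are partly
# missing, and row 2 is missing in every column
toy = pd.DataFrame({
    'Height': [158.9, np.nan, np.nan, 174.0],
    'Weight': [46.0, 41.0, np.nan, np.nan],
    'Transfer': ['N', 'N', np.nan, 'N'],
})

mask = toy.isna()

# Rows where every column is missing
all_missing = toy[mask.all(axis=1)]

# Rows where at least one column is missing
any_missing = toy[mask.any(axis=1)]

# Rows where no column is missing
none_missing = toy[toy.notna().all(axis=1)]
```

The last selection is the complement of the second one, so toy[~mask.any(axis=1)] would return the same rows.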
2. Dropping Missing Values
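Data processing often requires dropping rows (samples) or columns (features) based on the number or proportion
of missing values, or on other characteristics. pandas provides the dropna function for this.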
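A brief sketch of how dropna might be used, on a hypothetical DataFrame (toy and its values are assumptions, not the course data). It relies only on the axis, how, thresh, and subset parameters of DataFrame.dropna; how and thresh are not combined in a single call here, since recent pandas versions reject that:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data for illustration only
toy = pd.DataFrame({
    'Height': [158.9, np.nan, np.nan, 174.0],
    'Weight': [46.0, 41.0, np.nan, np.nan],
    'Transfer': ['N', 'N', np.nan, 'N'],
})

# Drop rows that contain at least one missing value (the default behavior)
rows_any = toy.dropna(how='any')

# Drop only rows in which every value is missing
rows_all = toy.dropna(how='all')

# Drop rows with a missing value in the listed columns only
rows_subset = toy.dropna(subset=['Height'])

# Keep only columns that have at least 3 non-missing values
cols_thresh = toy.dropna(axis=1, thresh=3)
```

Each call returns a new DataFrame and leaves toy unchanged.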