Datawhale Group Learning (Pandas) Task 7: Missing Data

Chapter 7: Missing Data

Counting and Dropping Missing Values

Counting missing information

# Check whether each cell is missing with isna() or isnull()
import numpy as np
import pandas as pd

df = pd.read_csv('data/learn_pandas.csv', usecols=['Grade', 'Name', 'Gender', 'Height', 'Weight', 'Transfer'])
df.isna().head()
   Grade   Name  Gender  Height  Weight  Transfer
0  False  False   False   False   False     False
1  False  False   False   False   False     False
2  False  False   False   False   False     False
3  False  False   False    True   False     False
4  False  False   False   False   False     False
# Proportion of missing values in each column
df.isna().mean()
Grade       0.000
Name        0.000
Gender      0.000
Height      0.085
Weight      0.055
Transfer    0.060
dtype: float64
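Besides the proportion, the number of missing values per column can be counted the same way; a minimal sketch (not in the original notes):

# count of missing values in each column
df.isna().sum()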
# Boolean indexing with isna() or notna()
df[df.Height.isna()].head()
        Grade          Name  Gender  Height  Weight Transfer
3   Sophomore  Xiaojuan Sun  Female     NaN    41.0        N
12     Senior      Peng You  Female     NaN    48.0      NaN
26     Junior     Yanli You  Female     NaN    48.0        N
36   Freshman  Xiaojuan Qin    Male     NaN    79.0        Y
60   Freshman    Yanpeng Lv    Male     NaN    65.0        N
# Operate on several columns at once, e.g. Height / Weight / Transfer
# rows where all three columns are missing
sub_set = df[['Height', 'Weight', 'Transfer']]
df[sub_set.isna().all(axis=1)]
      Grade          Name Gender  Height  Weight Transfer
102  Junior  Chengli Zhao   Male     NaN     NaN      NaN
# rows with at least one missing value
df[sub_set.isna().any(axis=1)].head()
        Grade           Name  Gender  Height  Weight Transfer
3   Sophomore   Xiaojuan Sun  Female     NaN    41.0        N
9      Junior        Juan Xu  Female   164.8     NaN        N
12     Senior       Peng You  Female     NaN    48.0      NaN
21     Senior  Xiaopeng Shen    Male   166.0    62.0      NaN
26     Junior      Yanli You  Female     NaN    48.0        N
# rows with no missing values
df[sub_set.notna().all(axis=1)].head()
       Grade            Name  Gender  Height  Weight Transfer
0   Freshman    Gaopeng Yang  Female   158.9    46.0        N
1   Freshman  Changqiang You    Male   166.5    70.0        N
2     Senior         Mei Sun    Male   188.9    89.0        N
4  Sophomore     Gaojuan You    Male   174.0    74.0        N
5   Freshman     Xiaoli Qian  Female   158.0    51.0        N

Dropping missing information

dropna()

  • axis: the axis to drop along, default 0 (drop rows)
  • how: the drop criterion, either 'any' or 'all'
  • thresh: threshold on the number of non-missing values; a row or column whose non-missing count does not reach this number is dropped (see the sketch after this list)
  • subset: the subset of labels to consider when looking for missing values
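To make thresh concrete, here is a minimal sketch on a hypothetical three-row frame (the variable toy is mine, not from the original notes):

toy = pd.DataFrame({'A': [1, np.nan, np.nan],
                    'B': [1, 2, np.nan],
                    'C': [1, 2, 3]})
toy.dropna(thresh=2)  # keeps rows 0 and 1 (each has at least 2 non-missing values); row 2 has only 1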
# drop rows where Height or Weight has at least one missing value
res = df.dropna(how='any', subset=['Height','Weight'])
res.shape

# the same result with boolean indexing
res = df.loc[df[['Height', 'Weight']].notna().all(axis=1)]

res.head()
       Grade            Name  Gender  Height  Weight Transfer
0   Freshman    Gaopeng Yang  Female   158.9    46.0        N
1   Freshman  Changqiang You    Male   166.5    70.0        N
2     Senior         Mei Sun    Male   188.9    89.0        N
4  Sophomore     Gaojuan You    Male   174.0    74.0        N
5   Freshman     Xiaoli Qian  Female   158.0    51.0        N
# drop columns with more than 15 missing values
# i.e. keep columns with at least shape[0] - 15 non-missing values
res = df.dropna(axis=1, thresh=df.shape[0] - 15)
res.head()

# the same result with boolean indexing
res = df.loc[:, ~(df.isna().sum() > 15)]
       Grade            Name  Gender  Weight Transfer
0   Freshman    Gaopeng Yang  Female    46.0        N
1   Freshman  Changqiang You    Male    70.0        N
2     Senior         Mei Sun    Male    89.0        N
3  Sophomore    Xiaojuan Sun  Female    41.0        N
4  Sophomore     Gaojuan You    Male    74.0        N

Filling and Interpolating Missing Values

Filling with fillna

fillna()

  • value: the fill value; a scalar, or a dict mapping index labels to fill values (a DataFrame example with a column dict follows this list)
  • method: the fill method; 'ffill' propagates the previous value forward, 'bfill' fills with the next value
  • limit: the maximum number of consecutive missing values to fill
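For a DataFrame, the dict passed to value maps column names to fill values instead; a minimal sketch on the current dataset (the name df_filled is my own):

# fill the two numeric columns with their column means; columns not listed are left untouched
df_filled = df.fillna({'Height': df.Height.mean(), 'Weight': df.Weight.mean()})
df_filled[['Height', 'Weight']].isna().sum()  # both counts should now be 0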
s = pd.Series([np.nan, 1, np.nan, np.nan, 2, np.nan], list('aaabcd'))
s
a    NaN
a    1.0
a    NaN
b    NaN
c    2.0
d    NaN
dtype: float64
# forward fill: propagate the previous value onward
s.fillna(method='ffill')
a    NaN
a    1.0
a    1.0
b    1.0
c    2.0
d    2.0
dtype: float64
# fill at most one value in each run of consecutive missing values
s.fillna(method='ffill', limit=1)
a    NaN
a    1.0
a    1.0
b    NaN
c    2.0
d    2.0
dtype: float64
# fill with a scalar (here the series mean)
s.fillna(s.mean())
a    1.5
a    1.0
a    1.5
b    1.5
c    2.0
d    1.5
dtype: float64
# fill via an index-to-value mapping
s.fillna({'a': 100, 'd': 200, 'c': 100})  # position c is not missing, so it is not filled
a    100.0
a      1.0
a    100.0
b      NaN
c      2.0
d    200.0
dtype: float64
# group first, then fill within each group
# fill missing heights with the mean height of the same grade
df.groupby('Grade')['Height'].transform(lambda x:x.fillna(x.mean())).head()
0    158.900000
1    166.500000
2    188.900000
3    163.075862
4    174.000000
Name: Height, dtype: float64
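As a small follow-up sketch (my addition), the group-wise fill can be stored in a new series and checked; every grade in this dataset has at least one observed height, so no missing value should remain:

height_filled = df.groupby('Grade')['Height'].transform(lambda x: x.fillna(x.mean()))
height_filled.isna().sum()  # expected to be 0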

Practice

# The practice series is [1, NaN, 3, NaN, NaN]: fill each isolated NaN with the
# mean of its two neighbours, and leave consecutive NaNs untouched.
s = pd.Series([1, np.nan, 3, np.nan, np.nan])
s.fillna(method='ffill', limit=1)
0    1.0
1    1.0
2    3.0
3    3.0
4    NaN
dtype: float64
s.fillna(method='bfill', limit=1)
0    1.0
1    3.0
2    3.0
3    NaN
4    NaN
dtype: float64
res = pd.DataFrame({'ffill': s.fillna(method='ffill', limit=1),
                    'bfill': s.fillna(method='bfill', limit=1)})

res.mean(axis=1, skipna=False)  # skipna=False: a row that still contains NaN stays NaN
0    1.0
1    2.0
2    3.0
3    NaN
4    NaN
dtype: float64
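A quick check (my addition) that the combined column satisfies the rule of the practice exercise, i.e. isolated NaNs get the mean of their two neighbours and consecutive NaNs stay missing:

target = pd.Series([1.0, 2.0, 3.0, np.nan, np.nan])
res.mean(axis=1, skipna=False).equals(target)  # True; Series.equals treats NaNs in the same positions as equal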

Interpolation functions

s = pd.Series([np.nan, np.nan, 1, np.nan, np.nan, np.nan, 2, np.nan, np.nan])
s.values
# interpolate() performs linear interpolation by default
# limit_direction controls the direction of filling
# limit caps the number of consecutive missing values to interpolate
res = s.interpolate(limit_direction='backward', limit=1)
res.values

# interpolate in both directions
res = s.interpolate(limit_direction='both', limit=1)
res.values
array([ nan, 1.  , 1.  ,  nan,  nan, 1.75, 2.  ,  nan,  nan])
# nearest-neighbour interpolation: a missing value takes the value of the closest non-missing element
s.interpolate(method='nearest')
# index-based interpolation: values are interpolated according to the numeric distance between index labels
s = pd.Series([0,np.nan,10],index=[0,1,10])
s.interpolate(method='index')
0      0.0
1      1.0
10    10.0
dtype: float64
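For contrast, a minimal sketch (my addition) of what the default 'linear' method does on the same series: it ignores the index and treats the points as equally spaced, so the gap is filled with the midpoint of the neighbouring values:

s.interpolate()  # index 1 becomes (0 + 10) / 2 = 5.0 here, not 1.0 as with method='index'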

Nullable Types

Missing value markers and their drawbacks

Properties of nullable types

pd.Series([np.nan, 1], dtype='Int64')
0    <NA>
1       1
dtype: Int64
pd.Series([np.nan, True], dtype='boolean')
0    <NA>
1    True
dtype: boolean
pd.Series([np.nan, 'abcd'], dtype='string')
0    <NA>
1    abcd
dtype: string
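A related sketch (my addition, assuming a pandas version with nullable dtypes, 1.0 or later): convert_dtypes() switches the columns of an existing DataFrame to their nullable counterparts in one call:

df.convert_dtypes().dtypes  # object columns become 'string'; numeric columns get nullable dtypes where available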
df_nan = pd.DataFrame({'category': ['a', 'a', 'b', np.nan, np.nan],
                       'value': [1, 3, 5, 7, 9]})
df_nan
  category  value
0        a      1
1        a      3
2        b      5
3      NaN      7
4      NaN      9
pd.get_dummies(df_nan.category, dummy_na=True)  # dummy_na=True adds a column that flags NaN
   a  b  NaN
0  1  0    0
1  1  0    0
2  0  1    0
3  0  0    1
4  0  0    1
pd.get_dummies(df_nan.category) 
   a  b
0  1  0
1  1  0
2  0  1
3  0  0
4  0  0

Exercises

Correlation test between missing values and categories

df = pd.read_csv('data/missing_chi.csv')
df.head()
    X_1  X_2  y
0   NaN  NaN  0
1   NaN  NaN  0
2   NaN  NaN  0
3  43.0  NaN  0
4   NaN  NaN  0
# Incorrect version: without parentheses, & binds tighter than ==, so the condition is parsed as (isna() & df['y']) == 1
n11 = df[df['X_1'].isna() & df['y']==1].shape[0]
n10 = df[df['X_1'].isna() & df['y']==0].shape[0]
n01 = df[df['X_1'].notna() & df['y']==1].shape[0]
n00 = df[df['X_1'].notna() & df['y']==0].shape[0]


# When I first did this, I left the two conditions without parentheses, which made the counts wrong; it took a long time to track the error down to this point
n11 = df[(df['X_1'].isna()) & (df['y']==1)].shape[0]
n10 = df[(df['X_1'].isna()) & (df['y']==0)].shape[0]
n01 = df[(df['X_1'].notna()) & (df['y']==1)].shape[0]
n00 = df[(df['X_1'].notna()) & (df['y']==0)].shape[0]

# After that I computed each E and F separately, which was very tedious...


# The counting trick in the reference answer is very clever
cat_1 = df.X_1.fillna('NaN').mask(df.X_1.notna()).fillna('NotNaN')  # missing entries become the string 'NaN', non-missing entries become 'NotNaN'
# equivalent one-liner: df.X_1.fillna('NaN').mask(df.X_1.notna(), 'NotNaN')
df_1 = pd.crosstab(cat_1, df.y, margins=True)  # contingency table; margins=True also adds the row and column totals

# compute the chi-square statistic S = sum of (E - F)^2 / F over the four cells,
# where E is the observed count and F the expected count under independence
def compute_S(my_df): 
    S = []
    for i in range(2):
        for j in range(2):
            E = my_df.iat[i, j]                                      # observed count
            F = my_df.iat[i, 2] * my_df.iat[2, j] / my_df.iat[2, 2]  # expected count from the margins
            S.append((E - F)**2 / F)
    return sum(S)

res1 = compute_S(df_1)
from scipy.stats import chi2
chi2.sf(res1, 1)  # p-value of the test (1 degree of freedom)
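As a cross-check (not part of the original answer), scipy's chi2_contingency on the 2x2 table without the continuity correction should give essentially the same statistic and p-value:

from scipy.stats import chi2_contingency
chi2_stat, p_value, dof, expected = chi2_contingency(pd.crosstab(cat_1, df.y), correction=False)
chi2_stat, p_value  # should match (res1, chi2.sf(res1, 1))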

Solving a classification problem with a regression model

I referred to the reference answer for this problem.

Part 1

df = pd.read_excel('data/color.xlsx')
df.head(3)
    X1   X2 Color
0 -2.5  2.8  Blue
1 -1.5  1.8  Blue
2 -0.8  2.8  Blue

from sklearn.neighbors import KNeighborsClassifier
clf = KNeighborsClassifier(n_neighbors=6)
clf.fit(df.iloc[:,:2], df.Color)
clf.predict([[0.8, -0.2]])
>>> array(['Yellow'], dtype=object)
from sklearn.neighbors import KNeighborsRegressor
df_dummies = pd.get_dummies(df.Color)  # one-hot encode the colour labels
stack_list = []
for col in df_dummies.columns:
    clf = KNeighborsRegressor(n_neighbors=6)
    clf.fit(df.iloc[:,:2], df_dummies[col])
    res = clf.predict([[0.8, -0.2]]).reshape(-1,1)
    stack_list.append(res)
stack_list
[array([[0.16666667]]), array([[0.33333333]]), array([[0.5]])]
np.hstack(stack_list)  # np.hstack() concatenates the arrays side by side (column-wise)
code_res = pd.Series(np.hstack(stack_list).argmax(1))  # index of the largest value in each row, i.e. the predicted class code
df_dummies.columns[code_res[0]]
>>> 'Yellow'
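Why the regression route works: a KNN regressor fitted on a one-hot column returns the fraction of the 6 nearest neighbours belonging to that class, so taking the argmax over the three regressions reproduces the classifier's majority vote. A small confirmation sketch (my addition) via predict_proba:

clf = KNeighborsClassifier(n_neighbors=6)
clf.fit(df.iloc[:, :2], df.Color)
clf.predict_proba([[0.8, -0.2]])  # roughly [[0.167, 0.333, 0.5]], the same proportions as stack_list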

Part 2

from sklearn.neighbors import KNeighborsRegressor
df = pd.read_csv('data/audit.csv')
res_df = df.copy()


# one-hot encode the categorical columns, min-max normalise the numeric columns, then concatenate
df = pd.concat([pd.get_dummies(df[['Marital', 'Gender']]),
                df[['Age', 'Income', 'Hours']].apply(lambda x: (x - x.min()) / (x.max() - x.min())),
                df.Employment], axis=1)


X_train = df[df.Employment.notna()]  # training set: rows where Employment is observed
X_test = df[df.Employment.isna()]  # test set: rows where Employment is missing


df_dummies = pd.get_dummies(X_train.Employment)  # one-hot encode the target categories
stack_list = []
for col in df_dummies.columns:
    clf = KNeighborsRegressor(n_neighbors=6)
    clf.fit(X_train.iloc[:, :-1], df_dummies[col])
    res = clf.predict(X_test.iloc[:, :-1]).reshape(-1, 1)
    stack_list.append(res)

code_res = pd.Series(np.hstack(stack_list).argmax(1))  # predicted class code for each test row
cat_res = code_res.replace(dict(zip(range(df_dummies.shape[1]), df_dummies.columns)))  # map codes back to category labels (shape[1] = number of categories)
res_df.loc[res_df.Employment.isna(), 'Employment'] = cat_res.values  # write the predictions into the missing positions to complete the fill
res_df.isna().sum()
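A final cross-check (my addition, not part of the reference answer): filling the same rows with KNeighborsClassifier directly should give essentially the same categories, up to tie-breaking:

from sklearn.neighbors import KNeighborsClassifier
clf = KNeighborsClassifier(n_neighbors=6)
clf.fit(X_train.iloc[:, :-1], X_train.Employment)
pred = clf.predict(X_test.iloc[:, :-1])
(pred == cat_res.values).mean()  # agreement ratio between the two approaches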