Datawhale组队学习(Pandas) task7-缺失数据

最新推荐文章于 2023-10-26 21:37:44 发布

mintminty

最新推荐文章于 2023-10-26 21:37:44 发布

阅读量214

点赞数 1

分类专栏： datawhale组队学习 code

本文链接：https://blog.csdn.net/mintminty/article/details/112135281

版权

datawhale组队学习同时被 2 个专栏收录

14 篇文章 1 订阅

订阅专栏

code

12 篇文章 0 订阅

订阅专栏

第七章缺失数据

缺失值的统计和删除

缺失信息的统计

# 查看每个单元格是否有缺失 isna()或isnull()
df = pd.read_csv('data/learn_pandas.csv',usecols=['Grade', 'Name', 'Gender', 'Height','Weight', 'Transfer'])
df.isna().head()

	Grade	Name	Gender	Height	Weight	Transfer
0	False	False	False	False	False	False
1	False	False	False	False	False	False
2	False	False	False	False	False	False
3	False	False	False	True	False	False
4	False	False	False	False	False	False

# 查看缺失的比例
df.isna().mean()

Grade       0.000
Name        0.000
Gender      0.000
Height      0.085
Weight      0.055
Transfer    0.060
dtype: float64

# 利用isna()或notna()进行布尔索引
df[df.Height.isna()].head()

	Grade	Name	Gender	Height	Weight	Transfer
3	Sophomore	Xiaojuan Sun	Female	NaN	41.0	N
12	Senior	Peng You	Female	NaN	48.0	NaN
26	Junior	Yanli You	Female	NaN	48.0	N
36	Freshman	Xiaojuan Qin	Male	NaN	79.0	Y
60	Freshman	Yanpeng Lv	Male	NaN	65.0	N

# 同时对几个列操作，例如对身高/体重/转系情况这三列操作
# 三列全部缺失
sub_set = df[['Height', 'Weight', 'Transfer']]
df[sub_set.isna().all(1)]

	Grade	Name	Gender	Height	Weight	Transfer
102	Junior	Chengli Zhao	Male	NaN	NaN	NaN

# 至少有一个缺失
df[sub_set.isna().any(1)].head()

	Grade	Name	Gender	Height	Weight	Transfer
3	Sophomore	Xiaojuan Sun	Female	NaN	41.0	N
9	Junior	Juan Xu	Female	164.8	NaN	N
12	Senior	Peng You	Female	NaN	48.0	NaN
21	Senior	Xiaopeng Shen	Male	166.0	62.0	NaN
26	Junior	Yanli You	Female	NaN	48.0	N

# 没有缺失
df[sub_set.notna().all(1)].head()

	Grade	Name	Gender	Height	Weight	Transfer
0	Freshman	Gaopeng Yang	Female	158.9	46.0	N
1	Freshman	Changqiang You	Male	166.5	70.0	N
2	Senior	Mei Sun	Male	188.9	89.0	N
4	Sophomore	Gaojuan You	Male	174.0	74.0	N
5	Freshman	Xiaoli Qian	Female	158.0	51.0	N

缺失信息的删除

dropna()

axis 轴方向，默认0，即删除行
how 删除方式，有any/all两种选择
thresh 删除的非缺失值个数阈值，非缺失值没有达到这个数量的相应维度会被删除
subset 备选的删除子集

# 删除身高体重至少有一个缺失的行
res = df.dropna(how='any', subset=['Height','Weight'])
res.shape

# 用布尔索引完成
res = df.loc[df[['Height', 'Weight']].notna().all(1)]

res.head()

	Grade	Name	Gender	Height	Weight	Transfer
0	Freshman	Gaopeng Yang	Female	158.9	46.0	N
1	Freshman	Changqiang You	Male	166.5	70.0	N
2	Senior	Mei Sun	Male	188.9	89.0	N
4	Sophomore	Gaojuan You	Male	174.0	74.0	N
5	Freshman	Xiaoli Qian	Female	158.0	51.0	N

# 删除超过15个缺失值的列
# 非缺失值个数shape[0]-15
res = df.dropna(1, thresh=df.shape[0]-15)
res.head()

# 用布尔索引完成
res = df.loc[:, ~(df.isna().sum()>15)]

	Grade	Name	Gender	Weight	Transfer
0	Freshman	Gaopeng Yang	Female	46.0	N
1	Freshman	Changqiang You	Male	70.0	N
2	Senior	Mei Sun	Male	89.0	N
3	Sophomore	Xiaojuan Sun	Female	41.0	N
4	Sophomore	Gaojuan You	Male	74.0	N

缺失值的填充和插值

利用fillna填充

fillna()

value 填充值，可以是标量，也可以是索引到元素的字典映射
method 填充方法，ffill 用前面的元素填充，bfill 用后面的元素填充
limit 连续缺失值的最大填充次数

s = pd.Series([np.nan, 1, np.nan, np.nan, 2, np.nan], list('aaabcd'))
s

a    NaN
a    1.0
a    NaN
b    NaN
c    2.0
d    NaN
dtype: float64

# 用前面的值向后填充
s.fillna(method='ffill')

a    NaN
a    1.0
a    1.0
b    1.0
c    2.0
d    2.0
dtype: float64

# 连续出现的缺失，最多填充一次
s.fillna(method='ffill', limit=1)

a    NaN
a    1.0
a    1.0
b    NaN
c    2.0
d    2.0
dtype: float64

# 填充值为标量
s.fillna(s.mean())

a    1.5
a    1.0
a    1.5
b    1.5
c    2.0
d    1.5
dtype: float64

# 通过索引映射填充
s.fillna({'a':100, 'd':200, 'c':100})  # c位置没有缺失，不会被填充

a    100.0
a      1.0
a    100.0
b      NaN
c      2.0
d    200.0
dtype: float64

# 先分组再填充
# 根据年级进行身高的均值填充
df.groupby('Grade')['Height'].transform(lambda x:x.fillna(x.mean())).head()

0    158.900000
1    166.500000
2    188.900000
3    163.075862
4    174.000000
Name: Height, dtype: float64

练一练

s.fillna(method='ffill', limit=1)

0    1.0
1    1.0
2    3.0
3    3.0
4    NaN
dtype: float64

s.fillna(method='bfill', limit=1)

0    1.0
1    3.0
2    3.0
3    NaN
4    NaN
dtype: float64

res = pd.DataFrame({'ffill':s.fillna(method='ffill', limit=1),'bfill':s.fillna(method='bfill', limit=1)})

res.mean(1, False)  # 忽略掉nan

0    1.0
1    2.0
2    3.0
3    NaN
4    NaN
dtype: float64

插值函数

s = pd.Series([np.nan, np.nan, 1, np.nan, np.nan, np.nan, 2, np.nan, np.nan])
s.values

# interpolate 默认为线性插值
# limit_direction 控制方向
# limit 控制最大连续缺失值插值个数
res = s.interpolate(limit_direction='backward', limit=1)
res.values

# 双向插值
res = s.interpolate(limit_direction='both', limit=1)

array([ nan, 1.  , 1.  ,  nan,  nan, 1.75, 2.  ,  nan,  nan])

# 最近邻插值，缺失值的元素和离它最近的非缺失值元素一样
s.interpolate('nearest')

# 索引插值
s = pd.Series([0,np.nan,10],index=[0,1,10])
s.interpolate(method='index')

0      0.0
1      1.0
10    10.0
dtype: float64

Nullable类型

缺失记号及其缺陷

Nullable类型的性质

pd.Series([np.nan, 1], dtype='Int64')

0    <NA>
1       1
dtype: Int64

pd.Series([np.nan, True], dtype='boolean')

0    <NA>
1    True
dtype: boolean

pd.Series([np.nan, 'abcd'], dtype='string')

0    <NA>
1    abcd
dtype: string

df_nan = pd.DataFrame({'category':['a','a','b',np.nan,np.nan],
   ....:                        'value':[1,3,5,7,9]})
df_nan

	category	value
0	a	1
1	a	3
2	b	5
3	NaN	7
4	NaN	9

pd.get_dummies(df_nan.category, dummy_na=True) # dummy_na 增加一列表示nan

	a	b	NaN
0	1	0	0
1	1	0	0
2	0	1	0
3	0	0	1
4	0	0	1

pd.get_dummies(df_nan.category)

	a	b
0	1	0
1	1	0
2	0	1
3	0	0
4	0	0

练习

缺失值与类别的相关性检验

df = pd.read_csv('data/missing_chi.csv')
df.head()

	X_1	X_2
0	NaN	NaN
1	NaN	NaN
2	NaN	NaN
3	43.0	NaN
4	NaN	NaN

# 错误写法
n11 = df[df['X_1'].isna() & df['y']==1].shape[0]
n10 = df[df['X_1'].isna() & df['y']==0].shape[0]
n01 = df[df['X_1'].notna() & df['y']==1].shape[0]
n00 = df[df['X_1'].notna() & df['y']==0].shape[0]

在这里插入图片描述

# 这里在最开始做时，两个条件没有加小括号，导致计算出的结果不对..找了好久的错 发现是这里出了问题
n11 = df[(df['X_1'].isna()) & (df['y']==1)].shape[0]
n10 = df[(df['X_1'].isna()) & (df['y']==0)].shape[0]
n01 = df[(df['X_1'].notna()) & (df['y']==1)].shape[0]
n00 = df[(df['X_1'].notna()) & (df['y']==0)].shape[0]

# 后面就是分别计算E/F，写得非常繁琐...

在这里插入图片描述

# 答案中统计数目的写法太巧妙啦
cat_1 = df.X_1.fillna('NaN').mask(df.X_1.notna()).fillna('NotNaN')  # 控制填充为nan，非空填充为NotNaN
# df.X_1.fillna('NaN').mask(df.X_1.notna(), 'NotNaN')
df_1 = pd.crosstab(cat_1, df.y, margins=True)  # Crosstab 交叉列表取值，同时也会计算边际

# 计算S
def compute_S(my_df): 
    S = []
    for i in range(2):
        for j in range(2):
            E = my_df.iat[i, j]
            F = my_df.iat[i, 2]*my_df.iat[2, j]/my_df.iat[2,2]
            S.append((E-F)**2/F)
    return sum(S)

res1 = compute_S(df_1)
from scipy.stats import chi2
chi2.sf(res1, 1) # 检验p值

用回归模型解决分类问题

这道题参考了答案

第一问

df = pd.read_excel('data/color.xlsx')
df.head(3)
>>>
X1	X2	Color
0	-2.5	2.8	Blue
1	-1.5	1.8	Blue
2	-0.8	2.8	Blue

from sklearn.neighbors import KNeighborsClassifier
clf = KNeighborsClassifier(n_neighbors=6)
clf.fit(df.iloc[:,:2], df.Color)
clf.predict([[0.8, -0.2]])
>>> array(['Yellow'], dtype=object)
from sklearn.neighbors import KNeighborsRegressor
df_dummies = pd.get_dummies(df.Color)  # one-hot 编码
stack_list = []
for col in df_dummies.columns:
    clf = KNeighborsRegressor(n_neighbors=6)
    clf.fit(df.iloc[:,:2], df_dummies[col])
    res = clf.predict([[0.8, -0.2]]).reshape(-1,1)
    stack_list.append(res)
stack_list >>> [array([[0.16666667]]), array([[0.33333333]]), array([[0.5]])]
np.hstack(stack_list)  # np.hstack()是把矩阵进行行连接
code_res = pd.Series(np.hstack(stack_list).argmax(1)) # 返回最大值的索引序号
df_dummies.columns[code_res[0]]
>>> 'Yellow'

第二问

from sklearn.neighbors import KNeighborsRegressor
df = pd.read_csv('data/audit.csv')
res_df = df.copy()

在这里插入图片描述

# onehot编码/最大最小归一化→拼接
df = pd.concat([pd.get_dummies(df[['Marital', 'Gender']]), df[['Age','Income','Hours']].apply(lambda x:(x-x.min())/(x.max()-x.min())), df.Employment],1)

在这里插入图片描述

X_train = df[df.Employment.notna()]  # 训练集
X_test = df[df.Employment.isna()]  # 测试集

在这里插入图片描述

df_dummies = pd.get_dummies(X_train.Employment)  # 独热编码
stack_list = []
for col in df_dummies.columns:
	clf = KNeighborsRegressor(n_neighbors=6)
	clf.fit(X_train.iloc[:,:-1], df_dummies[col])
	res = clf.predict(X_test.iloc[:,:-1]).reshape(-1,1)
	stack_list.append(res)

code_res = pd.Series(np.hstack(stack_list).argmax(1))  # 返回类型编码
cat_res = code_res.replace(dict(zip(list(range(df_dummies.shape[0])),df_dummies.columns)))
res_df.loc[res_df.Employment.isna(), 'Employment'] = cat_res.values  # 将预测值赋给NaN值，完成填充
res_df.isna().sum()