Datawhale pandas Study Notes, Chapter 7: Missing Data

减肥的卡比兽

Article link: https://blog.csdn.net/zzj960321/article/details/112153900

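This article explains how to handle missing data with Python's pandas library: counting, dropping, filling, and interpolating missing values. It covers the use of the fillna() function and the parameters of the interpolation function, such as limit. It also discusses Nullable types, in particular how missing values behave in computation and grouping. Examples and exercises reinforce the material.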

Import the required modules

import numpy as np
import pandas as pd

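I. Counting and Dropping Missing Values

1. Counting missing values

Use isna or isnull (the two are aliases of each other) to check whether each cell is missing. Chaining mean then gives the proportion of missing values in each column: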

df = pd.read_csv('../data/learn_pandas.csv', usecols = ['Grade', 'Name', 'Gender', 'Height', 'Weight', 'Transfer'])
df.isna().head()

	Grade	Name	Gender	Height	Weight	Transfer
0	False	False	False	False	False	False
1	False	False	False	False	False	False
2	False	False	False	False	False	False
3	False	False	False	True	False	False
4	False	False	False	False	False	False

df.isna().mean() # proportion of missing values in each column

Grade       0.000
Name        0.000
Gender      0.000
Height      0.085
Weight      0.055
Transfer    0.060
dtype: float64

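To view the rows in which a given column is missing or not missing, use isna or notna on the Series for boolean indexing. For example, to view the rows where Height is missing: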

df[df.Height.isna()].head()

	Grade	Name	Gender	Height	Weight	Transfer
3	Sophomore	Xiaojuan Sun	Female	NaN	41.0	N
12	Senior	Peng You	Female	NaN	48.0	NaN
26	Junior	Yanli You	Female	NaN	48.0	N
36	Freshman	Xiaojuan Qin	Male	NaN	79.0	Y
60	Freshman	Yanpeng Lv	Male	NaN	65.0	N

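To retrieve, across several columns at once, the rows where all values are missing, at least one is missing, or none is missing, combine isna and notna with any and all. For example, applying these three conditions to the Height, Weight, and Transfer columns:

sub_set = df[['Height', 'Weight', 'Transfer']]
df[sub_set.isna().all(axis=1)] # all three missing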

	Grade	Name	Gender	Height	Weight	Transfer
102	Junior	Chengli Zhao	Male	NaN	NaN	NaN

df[sub_set.isna().any(axis=1)].head() # at least one missing

	Grade	Name	Gender	Height	Weight	Transfer
3	Sophomore	Xiaojuan Sun	Female	NaN	41.0	N
9	Junior	Juan Xu	Female	164.8	NaN	N
12	Senior	Peng You	Female	NaN	48.0	NaN
21	Senior	Xiaopeng Shen	Male	166.0	62.0	NaN
26	Junior	Yanli You	Female	NaN	48.0	N

df[sub_set.notna().all(axis=1)].head() # none missing

	Grade	Name	Gender	Height	Weight	Transfer
0	Freshman	Gaopeng Yang	Female	158.9	46.0	N
1	Freshman	Changqiang You	Male	166.5	70.0	N
2	Senior	Mei Sun	Male	188.9	89.0	N
4	Sophomore	Gaojuan You	Male	174.0	74.0	N
5	Freshman	Xiaoli Qian	Female	158.0	51.0	N

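2. Dropping missing values

Data processing often requires dropping rows (samples) or columns (features) according to the number, proportion, or other characteristics of their missing values. pandas provides the dropna function for this.

The main parameters of dropna are axis, the axis direction (default 0, which drops rows); how, the deletion rule; thresh, a threshold on the number of non-missing values (any row or column along the chosen axis with fewer non-missing values than this is dropped); and subset, the labels to restrict the check to. how accepts either any or all.

For example, drop the rows in which at least one of Height and Weight is missing: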

[Image in the original post; not available]
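As a minimal sketch of the call described above, the snippet below uses a small made-up table in place of learn_pandas.csv (the names and values in `toy` are assumptions for illustration only) and drops every row where Height or Weight is missing:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the tutorial's table; the values are invented
toy = pd.DataFrame({
    'Name': ['A', 'B', 'C', 'D'],
    'Height': [158.9, np.nan, 188.9, np.nan],
    'Weight': [46.0, 70.0, np.nan, np.nan],
})

# how='any' drops a row as soon as one of the subset columns is missing
res = toy.dropna(how='any', subset=['Height', 'Weight'])
print(res)
```

Only row A survives, because it is the only one with both measurements present.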

For example, drop the columns with more than 15 missing values:

[Image in the original post; not available]
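Because thresh counts non-missing values, "more than 15 missing" translates to "fewer than n_rows - 15 present". The sketch below shows this on an invented 20-row table (the column names are hypothetical):

```python
import numpy as np
import pandas as pd

n = 20
toy = pd.DataFrame({
    'full': np.arange(n, dtype=float),           # 0 missing
    'sparse15': [1.0] * 5 + [np.nan] * 15,       # exactly 15 missing
    'sparse16': [1.0] * 4 + [np.nan] * 16,       # 16 missing
})

# Keep only columns with at least n - 15 non-missing values,
# i.e. drop columns that have more than 15 missing values
res = toy.dropna(axis=1, thresh=toy.shape[0] - 15)
print(res.columns.tolist())
```

The column with exactly 15 missing values is kept, which matches the "more than 15" wording.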

Of course, dropna is not strictly necessary: both operations above can also be done with boolean indexing:

[Image in the original post; not available]
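One way to write the boolean-indexing equivalents is sketched here on an invented table (the `Note` column and all values are assumptions), with each result compared against the corresponding dropna call:

```python
import numpy as np
import pandas as pd

n = 20
toy = pd.DataFrame({
    'Height': [np.nan, 170.0] * 10,          # 10 missing (even rows)
    'Weight': [60.0] * 19 + [np.nan],        # 1 missing (last row)
    'Note': [np.nan] * 16 + [1.0] * 4,       # 16 missing
})

# Rows: keep those where Height and Weight are both present
rows_bool = toy.loc[toy[['Height', 'Weight']].notna().all(axis=1)]
rows_drop = toy.dropna(how='any', subset=['Height', 'Weight'])

# Columns: keep those that do not have more than 15 missing values
cols_bool = toy.loc[:, ~(toy.isna().sum() > 15)]
cols_drop = toy.dropna(axis=1, thresh=toy.shape[0] - 15)

print(rows_bool.equals(rows_drop), cols_bool.equals(cols_drop))
```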

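II. Filling and Interpolating Missing Values

1. Filling with fillna

fillna has three commonly used parameters: value, method, and limit. value is the fill value, which can be a scalar or a dict mapping index labels to values; method is the fill method, either ffill (fill with the preceding element) or bfill (fill with the following element); limit is the maximum number of consecutive missing values to fill.

A simple Series illustrates the usage: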

[Image in the original post; not available]
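The sketch below exercises the three parameters on an invented Series. It uses the ffill() method rather than fillna(method='ffill'), on the assumption that the two fill identically and because recent pandas releases deprecate the method argument of fillna:

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, 1, np.nan, np.nan, 2, np.nan], index=list('abcdef'))

forward = s.ffill()                         # each NaN takes the last valid value before it
forward_once = s.ffill(limit=1)             # at most one NaN filled per run of NaNs
by_mean = s.fillna(s.mean())                # scalar value: mean of the valid entries (1.5)
by_label = s.fillna({'a': 100, 'f': 200})   # dict value: fill only the listed labels

print(forward.values)
print(forward_once.values)
print(by_mean.values)
print(by_label.values)
```

The leading NaN at label 'a' survives both forward fills because nothing precedes it.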

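Sometimes a more sensible fill requires grouping first. For example, fill Height with the mean of each Grade:

[Exercise 1]

Fill the missing values of a sequence by the following rule: a missing value that occurs on its own is filled with the mean of its two neighbors, while consecutive missing values are left unfilled. That is, the sequence [1, NaN, 3, NaN, NaN] becomes [1, 2, 3, NaN, NaN]. Implement this with the fillna function. (Hint: use the limit parameter.)

[Solution]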
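A group-wise mean fill can be sketched with groupby and transform. The table below is invented for illustration; only the Grade and Height column names come from the tutorial's dataset:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'Grade': ['Freshman', 'Freshman', 'Freshman', 'Senior', 'Senior'],
    'Height': [160.0, np.nan, 170.0, 180.0, np.nan],
})

# Within each grade, replace missing heights with that grade's mean height
filled = toy.groupby('Grade')['Height'].transform(lambda x: x.fillna(x.mean()))
print(filled.tolist())
```

transform returns a result aligned with the original index, so it can be assigned straight back to a column.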

[Image in the original post; not available]

[END]
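One possible solution, not necessarily the one in the lost screenshot, averages a forward fill and a backward fill that are each limited to one step. The helper name `fill_isolated` is hypothetical, and ffill(limit=1) / bfill(limit=1) stand in for fillna(method=..., limit=1), which recent pandas releases deprecate:

```python
import numpy as np
import pandas as pd

def fill_isolated(s: pd.Series) -> pd.Series:
    """Fill NaNs that stand alone with the mean of their neighbors."""
    forward = s.ffill(limit=1)    # fills only the first NaN of each run
    backward = s.bfill(limit=1)   # fills only the last NaN of each run
    # An isolated NaN is filled in both copies, so the average is the
    # neighbor mean. In a longer run every position stays NaN in at
    # least one copy, so the sum (and the result) remains NaN.
    return (forward + backward) / 2

s = pd.Series([1, np.nan, 3, np.nan, np.nan])
print(fill_isolated(s).values)
```

Non-missing entries are unchanged because both copies hold the same value there.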

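2. Interpolation functions

The documentation of the interpolate function lists many interpolation methods, including a large number from SciPy. Because many of them involve fairly advanced mathematics, only three common and simple cases are discussed here: linear interpolation, nearest-neighbor interpolation, and index interpolation.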

DataFrame.interpolate(method='linear', axis=0, limit=None, inplace=False, limit_direction=None, limit_area=None, downcast=None, **kwargs)

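Besides the interpolation method (linear by default), interpolate has two commonly used parameters similar to those of fillna: limit_direction, which controls the direction, and limit, which controls the maximum number of consecutive missing values to interpolate. The limiting direction defaults to forward, analogous to ffill in fillna's method; for backward or two-sided limiting, specify backward or both.

2.1 Parameter details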

Parameters

1.method:str, default ‘linear’

Interpolation technique to use. One of:

linear: Ignore the index and treat the values as equally spaced. This is the only method supported on MultiIndexes.

time: Works on daily and higher resolution data to interpolate given length of interval.

index, values: use the actual numerical values of the index.

pad: Fill in NaNs using existing values.

nearest, zero, slinear, quadratic, cubic, spline, barycentric, polynomial: Passed to scipy.interpolate.interp1d. These methods use the numerical values of the index. Both ‘polynomial’ and ‘spline’ require that you also specify an order (int), e.g. df.interpolate(method='polynomial', order=5).

krogh, piecewise_polynomial, spline, pchip, akima, cubicspline: Wrappers around the SciPy interpolation methods of similar names. See Notes.

from_derivatives: Refers to scipy.interpolate.BPoly.from_derivatives which replaces ‘piecewise_polynomial’ interpolation method in scipy 0.18.

2.axis:{0 or ‘index’, 1 or ‘columns’, None}, default None

Axis to interpolate along.

3.limit:int, optional

Maximum number of consecutive NaNs to fill. Must be greater than 0.

4.inplace:bool, default False

Update the data in place if possible.

5.limit_direction:{‘forward’, ‘backward’, ‘both’}, optional

Consecutive NaNs will be filled in this direction.

If limit is specified:

If ‘method’ is ‘pad’ or ‘ffill’, ‘limit_direction’ must be ‘forward’.
If ‘method’ is ‘backfill’ or ‘bfill’, ‘limit_direction’ must be ‘backward’.

If ‘limit’ is not specified:

If ‘method’ is ‘backfill’ or ‘bfill’, the default is ‘backward’
else the default is ‘forward’

Changed in version 1.1.0: raises ValueError if limit_direction is ‘forward’ or ‘both’ and method is ‘backfill’ or ‘bfill’. raises ValueError if limit_direction is ‘backward’ or ‘both’ and method is ‘pad’ or ‘ffill’.

6.limit_area:{None, ‘inside’, ‘outside’}, default None

If limit is specified, consecutive NaNs will be filled with this restriction.

None: No fill restriction.
‘inside’: Only fill NaNs surrounded by valid values (interpolate).
‘outside’: Only fill NaNs outside valid values (extrapolate).

7.downcast:optional, ‘infer’ or None, defaults to None

Downcast dtypes if possible.
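The limit_area parameter listed above is not exercised elsewhere in this article, so here is a minimal sketch on an invented Series showing how 'inside' and 'outside' restrict linear interpolation:

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, 1, np.nan, np.nan, 4, np.nan])

# 'inside': fill only NaNs that have valid values on both sides
inside = s.interpolate(limit_area='inside')

# 'outside': fill only leading/trailing NaNs; limit_direction='both'
# is needed here so that the leading NaN is reached as well
outside = s.interpolate(limit_area='outside', limit_direction='both')

print(inside.values)
print(outside.values)
```

With the default linear method the outer NaNs are not extrapolated along a slope; they take the nearest valid edge value.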

2.2 Examples

s = pd.Series([np.nan, np.nan, 1, np.nan, np.nan, np.nan, 2, np.nan, np.nan])
s.values
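#array([nan, nan,  1., nan, nan, nan,  2., nan, nan])  the series before filling

For example, with the default linear method, interpolate with backward and then two-sided limiting, capping the number of consecutive filled values at 1: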

res = s.interpolate(limit_direction='backward', limit=1)
res.values
#array([ nan, 1.  , 1.  ,  nan,  nan, 1.75, 2.  ,  nan,  nan])
res = s.interpolate(limit_direction='both', limit=1)
res.values
#array([ nan, 1.  , 1.  , 1.25,  nan, 1.75, 2.  , 2.  ,  nan])

Without the limit parameter:

[Image in the original post; not available]
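Since the screenshot is lost, the sketch below reconstructs what dropping limit does under the default linear method. Which calls the original image showed is not known; the three directions are an assumption:

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, np.nan, 1, np.nan, np.nan, np.nan, 2, np.nan, np.nan])

# No limit: every NaN reachable in the chosen direction is filled
default = s.interpolate()                           # forward: leading NaNs stay
backward = s.interpolate(limit_direction='backward')  # trailing NaNs stay
both = s.interpolate(limit_direction='both')        # nothing stays

print(default.values)
print(backward.values)
print(both.values)
```

NaNs beyond the first or last valid value take that edge value instead of being extrapolated.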

method='pad':

res = s.interpolate(method='pad')  # same result as s.ffill(); recent pandas versions deprecate method='pad' here
res.values
#array([nan, nan,  1.,  1.,  1.,  1.,  2.,  2.,  2.])

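Traverse the series from the start: once a non-NaN value appears, every following NaN is filled with that value until the next non-NaN value appears.

The second common kind of interpolation is nearest-neighbor interpolation, in which a missing element takes the value of the nearest non-missing element: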

s.interpolate('nearest').values
#array([nan, nan,  1.,  1.,  1.,  2.,  2., nan, nan])

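The third kind is index interpolation, which interpolates linearly according to the index values. For example, construct an index with unequal spacing to demonstrate: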

s = pd.Series([0,np.nan,10],index=[0
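The listing above is cut off in the source, so its index is not known. As a separate sketch, the snippet below uses an index chosen purely for illustration (`[0, 2, 10]` is an assumption, not the article's value) to contrast linear and index interpolation:

```python
import numpy as np
import pandas as pd

# Index values are assumed for illustration; they are unequally spaced
s_idx = pd.Series([0, np.nan, 10], index=[0, 2, 10])

# 'linear' ignores the index and treats the points as equally spaced
linear = s_idx.interpolate()

# 'index' uses the numeric index: 2 lies 2/10 of the way from 0 to 10
by_index = s_idx.interpolate(method='index')

print(linear.values)    # middle value 5.0
print(by_index.values)  # middle value 2.0
```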
