A Summary of Missing-Value Handling in Python pandas

Latest recommended article published 2022-11-25 10:13:37

Maverick pig

Tags: python, data analysis, pandas

Article link: https://blog.csdn.net/MavPig/article/details/115743654

import numpy as np
import pandas as pd

df = pd.DataFrame(dict(age=[5, 6, np.nan],  # np.NaN alias was removed in NumPy 2.0
                    born=[pd.NaT, pd.Timestamp('1939-05-27'),
                          pd.Timestamp('1940-04-25')],
                    name=['Alfred', 'Batman', ''],
                    toy=[None, 'Batmobile', 'Joker']))

'''
   age       born    name        toy
0  5.0        NaT  Alfred       None
1  6.0 1939-05-27  Batman  Batmobile
2  NaN 1940-04-25              Joker
'''

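Checking whether a single value in a DataFrame is missing

In a float column such as age, pandas stores a missing value as np.nan (datetime columns use pd.NaT, and object columns may hold None).
For a single value x, do not test x == np.nan: NaN never compares equal to anything, itself included. Use np.isnan(x) instead.

df['age'][2] == np.nan   # False
df['age'][2] is np.nan   # False
np.isnan(df['age'][2])   # True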
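np.isnan only accepts numeric input, so it cannot be applied to every cell of the example frame. The following is a minimal sketch, not part of the original post, that contrasts it with pd.isna, which also recognises pd.NaT and None. The variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# A float NaN scalar, the same kind of value that df['age'][2] returns
x = np.float64("nan")

eq_result = (x == np.nan)   # NaN is never equal to NaN
self_neq = (x != x)         # the classic "NaN is not equal to itself" test

# pd.isna handles NaN, NaT and None; an empty string is not missing
pd_checks = [pd.isna(x), pd.isna(pd.NaT), pd.isna(None), pd.isna('')]

# np.isnan rejects non-numeric input such as None
try:
    np.isnan(None)
    isnan_accepts_none = True
except TypeError:
    isnan_accepts_none = False

print(eq_result, self_neq, pd_checks, isnan_accepts_none)
```

For scalar checks across mixed column types, pd.isna is therefore usually the safer choice.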

Getting the missing-value status of a DataFrame

df.isna()
# Returns a DataFrame of True/False values with the same shape as df

# Result
     age   born   name    toy
0  False   True  False   True
1  False  False  False  False
2   True  False  False  False
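Note that the empty string in the name column is reported as False: .isna() does not treat '' as missing. If empty strings should count as missing in your data (an assumption that depends on the dataset), one possible approach is to convert them to NaN first. A small sketch on a standalone Series:

```python
import numpy as np
import pandas as pd

# Same values as the 'name' column of the example frame
names = pd.Series(['Alfred', 'Batman', ''], name='name')

before = names.isna().tolist()        # the empty string is not flagged

# Treat empty strings as missing by replacing them with NaN
cleaned = names.replace('', np.nan)
after = cleaned.isna().tolist()       # now the last entry is flagged

print(before, after)
```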

Keeping the rows where a given column is null

This can be done by combining .loc[] with .isna().
First, look at the docstring of df.loc:


``.loc[]`` is primarily label based, but may also be used with a
boolean array.

Allowed inputs are:

- A single label, e.g. ``5`` or ``'a'``, (note that ``5`` is
  interpreted as a *label* of the index, and **never** as an
  integer position along the index).
- A list or array of labels, e.g. ``['a', 'b', 'c']``.
- A slice object with labels, e.g. ``'a':'f'``.

  .. warning:: Note that contrary to usual python slices, **both** the
      start and the stop are included

- A boolean array of the same length as the axis being sliced,
  e.g. ``[True, False, True]``.
- An alignable boolean Series. The index of the key will be aligned before
  masking.
- An alignable Index. The Index of the returned selection will be the input.
- A ``callable`` function with one argument (the calling Series or
  DataFrame) and that returns valid output for indexing (one of the above)

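Starting from the fourth-from-last allowed input (a boolean array; a boolean Series works the same way and is aligned on the index), all that is needed is a boolean mask for the column being checked.
The DataFrame returned by .isna() provides exactly that once the column is selected from it.

# Return the rows of df where age is null
df.loc[df.isna()['age']]

# Result
   age       born name    toy
2  NaN 1940-04-25       Joker

# Note: many DataFrame operations in pandas do not modify the object itself; they only return a new result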
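Because the selection returns a new object, the result has to be assigned to a name if it is needed later. The sketch below rebuilds the example frame and shows that df itself is unchanged. It also uses the callable form of .loc listed in the docstring. The names null_age and null_age_callable are illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(dict(age=[5, 6, np.nan],
                       born=[pd.NaT, pd.Timestamp('1939-05-27'),
                             pd.Timestamp('1940-04-25')],
                       name=['Alfred', 'Batman', ''],
                       toy=[None, 'Batmobile', 'Joker']))

# Boolean-Series mask: rows whose age is missing
null_age = df.loc[df['age'].isna()]

# Equivalent callable form: .loc passes the frame to the function
null_age_callable = df.loc[lambda d: d['age'].isna()]

# df still holds all three rows; only the returned objects are filtered
print(len(df), null_age.index.tolist(), null_age_callable.index.tolist())
```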

Keeping the rows that are not null

Simply use .dropna()

df.dropna(subset=['age'])

# Result
   age       born    name        toy
0  5.0        NaT  Alfred       None
1  6.0 1939-05-27  Batman  Batmobile
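As a rough comparison, the sketch below places dropna(subset=...) next to the equivalent .notna() mask and two other common dropna variants. It rebuilds the example frame so it runs on its own. The variable names are illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(dict(age=[5, 6, np.nan],
                       born=[pd.NaT, pd.Timestamp('1939-05-27'),
                             pd.Timestamp('1940-04-25')],
                       name=['Alfred', 'Batman', ''],
                       toy=[None, 'Batmobile', 'Joker']))

# Drop rows whose age is missing
by_subset = df.dropna(subset=['age'])

# The same selection written as a boolean mask
by_mask = df.loc[df['age'].notna()]

# No arguments: drop every row that has a missing value in any column
complete_rows = df.dropna()

# thresh=3: keep rows with at least 3 non-missing values
at_least_three = df.dropna(thresh=3)

print(by_subset.index.tolist(), by_mask.index.tolist(),
      complete_rows.index.tolist(), at_least_three.index.tolist())
```

Row 2 survives thresh=3 because its empty name string counts as a non-missing value.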

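Counting missing values per column

Exploratory data analysis (EDA) starts with a first look at the data, and the extent of missing values is an important part of that.

The simplest option is the `.info()` method

df.info()

# Result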
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype         
---  ------  --------------  -----         
 0   age     2 non-null      float64       
 1   born    2 non-null      datetime64[ns]
 2   name    3 non-null      object        
 3   toy     2 non-null      object        
dtypes: datetime64[ns](1), float64(1), object(2)
memory usage: 224.0+ bytes

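The number of nulls in each column follows from its non-null count: the total number of entries (3 here) minus the non-null count.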
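The subtraction can also be done in code. A minimal sketch, rebuilding the example frame, that derives the null counts from the non-null counts and compares them with counting the True values of .isna() directly:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(dict(age=[5, 6, np.nan],
                       born=[pd.NaT, pd.Timestamp('1939-05-27'),
                             pd.Timestamp('1940-04-25')],
                       name=['Alfred', 'Batman', ''],
                       toy=[None, 'Batmobile', 'Joker']))

# df.count() gives the non-null count per column, as shown by df.info()
null_from_count = len(df) - df.count()

# Summing the boolean frame counts the True (missing) cells per column
null_from_isna = df.isna().sum()

print(null_from_isna.to_dict())
```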

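Plotting with `.isna()` combined with `.sum()`

# Visualize the fraction of missing values per column
missing = df.isna().sum() / len(df)
missing = missing.sort_values()  # ascending, so the most complete columns come first
missing.plot.bar()

[Figure: missing_plot, bar chart of the missing-value fraction per column]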
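The fraction can also be computed in one step, since the mean of a boolean column is the share of True values. This is a short sketch of that equivalence without the plotting call. The names via_sum and via_mean are illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(dict(age=[5, 6, np.nan],
                       born=[pd.NaT, pd.Timestamp('1939-05-27'),
                             pd.Timestamp('1940-04-25')],
                       name=['Alfred', 'Batman', ''],
                       toy=[None, 'Batmobile', 'Joker']))

# Fraction of missing values per column, computed two ways
via_sum = df.isna().sum() / len(df)
via_mean = df.isna().mean()

# Sorted ascending: the column with no missing values comes first
missing_sorted = via_mean.sort_values()

print(missing_sorted)
```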
