A Summary of Missing-Value Handling in Python pandas

import numpy as np
import pandas as pd

# Use np.nan: the np.NaN alias was removed in NumPy 2.0
df = pd.DataFrame(dict(age=[5, 6, np.nan],
                    born=[pd.NaT, pd.Timestamp('1939-05-27'),
                          pd.Timestamp('1940-04-25')],
                    name=['Alfred', 'Batman', ''],
                    toy=[None, 'Batmobile', 'Joker']))

'''
   age       born    name        toy
0  5.0        NaT  Alfred       None
1  6.0 1939-05-27  Batman  Batmobile
2  NaN 1940-04-25              Joker
'''

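Checking whether a single value in a DataFrame is missing

In pandas, a missing value in a numeric column is stored as np.nan (datetime columns use pd.NaT, and object columns may hold None).
For a single value x, do not test x == np.nan: NaN never compares equal to anything, including itself. Use np.isnan(x) instead.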

df['age'][2] == np.nan   # False: NaN is never equal to anything
df['age'][2] is np.nan   # False: the element is a separate np.float64 object
np.isnan(df['age'][2])   # True
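
np.isnan only accepts numeric input, so it raises a TypeError for None or for strings. As a minimal sketch (the variable names below are illustrative and not from the original post), pd.isna can serve as a scalar check that also recognizes pd.NaT and None:

```python
import numpy as np
import pandas as pd

# One sample value per kind of cell that appears in the example DataFrame
values = [np.nan, pd.NaT, None, "", 5.0]

# pd.isna works on scalars of any type; the empty string is NOT missing
flags = [bool(pd.isna(v)) for v in values]
print(flags)

# np.isnan rejects non-numeric input such as None
try:
    np.isnan(None)
    isnan_accepts_none = True
except TypeError:
    isnan_accepts_none = False
print(isnan_accepts_none)
```

For mixed-type data, pd.isna is therefore usually the safer default.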

Getting the missing-value mask of a DataFrame

df.isna()
# Returns a DataFrame of True/False values with the same shape as df

# Result
     age   born   name    toy
0  False   True  False   True
1  False  False  False  False
2   True  False  False  False
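
Note that the empty string in the name column (row 2) is reported as False: .isna() only flags NaN, NaT and None. If empty strings should also count as missing, one possible approach (a sketch, with the hypothetical name name_missing) is to combine the mask with an explicit comparison:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(dict(age=[5, 6, np.nan],
                       born=[pd.NaT, pd.Timestamp('1939-05-27'),
                             pd.Timestamp('1940-04-25')],
                       name=['Alfred', 'Batman', ''],
                       toy=[None, 'Batmobile', 'Joker']))

# .isna() alone does not treat '' as missing
plain_mask = df['name'].isna()

# Treat empty strings as missing in addition to real NA values
name_missing = df['name'].isna() | df['name'].eq('')
print(name_missing.tolist())
```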


Keeping rows where a given column is missing

Combine .loc[] with .isna() to do this.
First, look at the documentation of df.loc:


``.loc[]`` is primarily label based, but may also be used with a
boolean array.

Allowed inputs are:

- A single label, e.g. ``5`` or ``'a'``, (note that ``5`` is
  interpreted as a *label* of the index, and **never** as an
  integer position along the index).
- A list or array of labels, e.g. ``['a', 'b', 'c']``.
- A slice object with labels, e.g. ``'a':'f'``.

  .. warning:: Note that contrary to usual python slices, **both** the
      start and the stop are included

- A boolean array of the same length as the axis being sliced,
  e.g. ``[True, False, True]``.
- An alignable boolean Series. The index of the key will be aligned before
  masking.
- An alignable Index. The Index of the returned selection will be the input.
- A ``callable`` function with one argument (the calling Series or
  DataFrame) and that returns valid output for indexing (one of the above)


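Start from the boolean-array input (fourth from the last in the list above): all we need is a boolean sequence for the column whose missing values we want to test.
Selecting that column from the DataFrame returned by .isna() gives exactly this. Strictly speaking the result is a boolean Series, which .loc aligns on the index before masking.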

# Return the rows of df where age is missing
df.loc[df.isna()['age']]

# Result
   age       born name    toy
2  NaN 1940-04-25       Joker
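
The same selection can be written in a few equivalent ways. The sketch below (variable names are illustrative) builds the mask from the column directly and also uses the callable form from the .loc documentation quoted above:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(dict(age=[5, 6, np.nan],
                       born=[pd.NaT, pd.Timestamp('1939-05-27'),
                             pd.Timestamp('1940-04-25')],
                       name=['Alfred', 'Batman', ''],
                       toy=[None, 'Batmobile', 'Joker']))

# Mask built from the column itself instead of from the whole DataFrame
by_series = df.loc[df['age'].isna()]

# Plain bracket indexing also accepts a boolean Series
by_brackets = df[df['age'].isna()]

# Callable form: .loc calls the function with df and uses the returned mask
by_callable = df.loc[lambda d: d['age'].isna()]

print(by_series.index.tolist())
```

Building the mask from a single column avoids computing .isna() for every column of a wide DataFrame.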

# Note: many DataFrame operations in pandas do not modify the object itself; they return a new object instead
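
A small sketch of what this means in practice, using .dropna() as the example (the name kept is hypothetical): the original DataFrame is unchanged unless the result is assigned.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(dict(age=[5, 6, np.nan],
                       name=['Alfred', 'Batman', '']))

# The call returns a new DataFrame; df itself keeps all three rows
kept = df.dropna(subset=['age'])
print(len(df), len(kept))

# To keep the change, assign the result (here back to the same name)
df = df.dropna(subset=['age'])
print(len(df))
```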

Keeping rows that are not missing

Use .dropna() directly.

df.dropna(subset=['age'])

# Result
   age       born    name        toy
0  5.0        NaT  Alfred       None
1  6.0 1939-05-27  Batman  Batmobile
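
.dropna() has a few other modes that may be useful here. The following sketch (variable names are illustrative) compares subset, the default any-column behaviour, the thresh parameter, and an equivalent .notna() mask:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(dict(age=[5, 6, np.nan],
                       born=[pd.NaT, pd.Timestamp('1939-05-27'),
                             pd.Timestamp('1940-04-25')],
                       name=['Alfred', 'Batman', ''],
                       toy=[None, 'Batmobile', 'Joker']))

# Drop rows where 'age' is missing
by_subset = df.dropna(subset=['age'])

# Default: drop rows with a missing value in ANY column
any_missing = df.dropna()

# Keep rows that have at least 3 non-missing values
at_least_3 = df.dropna(thresh=3)

# Boolean-mask equivalent of dropna(subset=['age'])
by_mask = df[df['age'].notna()]

print(by_subset.index.tolist(), any_missing.index.tolist(),
      at_least_3.index.tolist())
```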


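Counting missing values per column

Exploratory data analysis (EDA) starts with a first look at the data, and the extent of missing values is an important part of that.

The simplest option is .info()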

df.info()

# Result
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype         
---  ------  --------------  -----         
 0   age     2 non-null      float64       
 1   born    2 non-null      datetime64[ns]
 2   name    3 non-null      object        
 3   toy     2 non-null      object        
dtypes: datetime64[ns](1), float64(1), object(2)
memory usage: 224.0+ bytes

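The number of missing values in each column follows from its non-null count: subtract it from the total number of entries (3 here).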
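
To get the counts directly rather than reading them off the .info() output, one option is to sum the boolean mask. This is a minimal sketch with illustrative variable names:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(dict(age=[5, 6, np.nan],
                       born=[pd.NaT, pd.Timestamp('1939-05-27'),
                             pd.Timestamp('1940-04-25')],
                       name=['Alfred', 'Batman', ''],
                       toy=[None, 'Batmobile', 'Joker']))

# True counts as 1, so summing the mask gives missing values per column
n_missing = df.isna().sum()

# Same numbers derived from the non-null counts that .info() reports
via_count = len(df) - df.count()

print(n_missing.to_dict())
```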

Plotting with .isna() and .sum()

# Visualize the fraction of missing values per column
missing = df.isna().sum() / len(df)
missing = missing.sort_values()
missing.plot.bar()


[Figure: missing_plot, bar chart of the fraction of missing values per column]
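
When a plot is not needed, the same information can be collected in a small table. The sketch below is one way to do it; the column names n_missing and ratio are chosen here for illustration:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(dict(age=[5, 6, np.nan],
                       born=[pd.NaT, pd.Timestamp('1939-05-27'),
                             pd.Timestamp('1940-04-25')],
                       name=['Alfred', 'Batman', ''],
                       toy=[None, 'Batmobile', 'Joker']))

mask = df.isna()

# The mean of a boolean column is the fraction of True values
summary = pd.DataFrame({'n_missing': mask.sum(),
                        'ratio': mask.mean()}).sort_values('ratio')
print(summary)
```

Here mask.mean() gives the same values as df.isna().sum() / len(df) used for the bar chart.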
