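import numpy as np
import pandas as pd

# Sample data: one missing value each in age, born and toy;
# name contains an empty string, which is NOT a missing value.
# np.nan is used because the np.NaN alias was removed in NumPy 2.0.
df = pd.DataFrame(dict(age=[5, 6, np.nan],
                       born=[pd.NaT, pd.Timestamp('1939-05-27'),
                             pd.Timestamp('1940-04-25')],
                       name=['Alfred', 'Batman', ''],
                       toy=[None, 'Batmobile', 'Joker']))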
'''
   age       born    name        toy
0  5.0        NaT  Alfred       None
1  6.0 1939-05-27  Batman  Batmobile
2  NaN 1940-04-25              Joker
'''
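Checking whether a single value in a DataFrame is missing
In a numeric column, pandas stores a missing value as np.nan. For a single value x, the comparison x == np.nan does not work, because NaN never compares equal to anything, including itself. Use the np.isnan(x) function instead.
df['age'][2] == np.nan      # False
df['age'][2] is np.nan      # False
np.isnan(df['age'][2])      # True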
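Note that np.isnan only accepts numeric input, while the sample data also contains None and pd.NaT. The sketch below compares it with pd.isna, which pandas provides for scalars of any type; the list `values` is made up for illustration and is not part of the original example.

```python
import numpy as np
import pandas as pd

# Illustrative scalars: three kinds of "missing", one number, one empty string
values = [np.nan, None, pd.NaT, 5.0, ""]

# pd.isna works on any scalar; the empty string is not treated as missing
flags = [bool(pd.isna(v)) for v in values]
print(flags)

# np.isnan is a numeric ufunc, so passing None raises a TypeError
try:
    np.isnan(None)
    isnan_accepts_none = True
except TypeError:
    isnan_accepts_none = False
print(isnan_accepts_none)
```

When the type of the value is not known in advance, pd.isna is usually the safer choice.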
Getting the missing-value mask of a DataFrame
df.isna()
# Returns a DataFrame of True/False values with the same shape as df
# Result
     age   born   name    toy
0  False   True  False   True
1  False  False  False  False
2   True  False  False  False
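In the result above, the empty string in the name column is reported as False: .isna() only recognizes NaN, None and NaT. If empty strings should count as missing (an assumption about the dataset, not something the original example requires), one possible approach is to mask them first:

```python
import pandas as pd

names = pd.Series(["Alfred", "Batman", ""], name="name")

# The empty string is an ordinary value, so nothing is flagged as missing
before = names.isna().tolist()
print(before)

# Assumption: "" means "no value"; mask() replaces matching entries with NaN
cleaned = names.mask(names == "")
after = cleaned.isna().tolist()
print(after)
```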
Keeping the rows where a given column is null
This can be done by combining .loc[] with the .isna() function.
First, look at the docstring of df.loc:
``.loc[]`` is primarily label based, but may also be used with a
boolean array.
Allowed inputs are:
- A single label, e.g. ``5`` or ``'a'``, (note that ``5`` is
interpreted as a *label* of the index, and **never** as an
integer position along the index).
- A list or array of labels, e.g. ``['a', 'b', 'c']``.
- A slice object with labels, e.g. ``'a':'f'``.
.. warning:: Note that contrary to usual python slices, **both** the
start and the stop are included
- A boolean array of the same length as the axis being sliced,
e.g. ``[True, False, True]``.
- An alignable boolean Series. The index of the key will be aligned before
masking.
- An alignable Index. The Index of the returned selection will be the input.
- A ``callable`` function with one argument (the calling Series or
DataFrame) and that returns valid output for indexing (one of the above)
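Going by the fourth-from-last allowed input (a boolean array of the same length as the axis), all that is needed is a boolean sequence marking where the column of interest is null.
The DataFrame returned by .isna() provides exactly that: selecting the relevant column from it gives the mask.
# Return the rows of df where age is null
df.loc[ df.isna()['age'] ]
# Result
   age       born name    toy
2  NaN 1940-04-25       Joker
# Note: many DataFrame operations in pandas do not modify the object itself; they only return the result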
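As a small sketch of the inputs listed in the docstring, the same selection can be written with a boolean Series, a plain boolean array, or a callable. The two-column frame here is a cut-down version of the sample data, used only to keep the example short.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [5, 6, np.nan],
                   "toy": [None, "Batmobile", "Joker"]})

mask = df["age"].isna()                      # boolean Series aligned with df's index
rows_a = df.loc[mask]                        # alignable boolean Series
rows_b = df.loc[mask.to_numpy()]             # boolean array of the same length
rows_c = df.loc[lambda d: d["age"].isna()]   # callable that returns a mask

# The selection is a new object; df itself still holds all three rows
print(len(rows_a), len(df))
```

Calling .isna() on the single column avoids building the mask for the whole DataFrame, which matters little here but can help on wide tables.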
Keeping the rows that are not null
Use the .dropna() function directly.
df.dropna(subset = ['age'])
# Result
   age       born    name        toy
0  5.0        NaT  Alfred       None
1  6.0 1939-05-27  Batman  Batmobile
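For comparison, here is a sketch of a few other ways to drop rows, using a three-column version of the sample data. The `subset` and `thresh` parameters are part of DataFrame.dropna; the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [5, 6, np.nan],
                   "born": [pd.NaT, pd.Timestamp("1939-05-27"),
                            pd.Timestamp("1940-04-25")],
                   "toy": [None, "Batmobile", "Joker"]})

by_subset = df.dropna(subset=["age"])   # drop rows where age is null
by_mask = df.loc[df["age"].notna()]     # same selection via a boolean mask
complete = df.dropna()                  # drop rows with a null in any column
at_least_two = df.dropna(thresh=2)      # keep rows with at least 2 non-null values

print(by_subset.index.tolist())
print(complete.index.tolist())
print(at_least_two.index.tolist())
```

None of these calls change df; assign the result to a name if it is needed later.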
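Counting missing values in each column
Exploratory data analysis (EDA) starts with a first look at the data, and the extent of missing values is an important part of that.
The simplest option is the .info() function.
df.info()
# Result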
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 age 2 non-null float64
1 born 2 non-null datetime64[ns]
2 name 3 non-null object
3 toy 2 non-null object
dtypes: datetime64[ns](1), float64(1), object(2)
memory usage: 224.0+ bytes
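The number of nulls in each column follows from its non-null count: subtract it from the total number of entries.
Combining .isna() with .sum() and plotting
# Visualize the fraction of missing values per column (plotting requires matplotlib)
missing = df.isna().sum()/len(df)
missing.sort_values(inplace=True)
missing.plot.bar()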
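When a chart is not needed, the same numbers can be read directly. The sketch below rebuilds the sample data and computes the null count per column in two ways, plus the null fraction via .mean(); the variable names are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(dict(age=[5, 6, np.nan],
                       born=[pd.NaT, pd.Timestamp("1939-05-27"),
                             pd.Timestamp("1940-04-25")],
                       name=["Alfred", "Batman", ""],
                       toy=[None, "Batmobile", "Joker"]))

null_counts = df.isna().sum()      # True counts as 1, so the sum is the null count
from_count = len(df) - df.count()  # count() gives non-null entries per column
null_share = df.isna().mean()      # fraction of nulls, same as sum() / len(df)

print(null_counts)
print(null_share.sort_values())
```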