How to check if any value is NaN in a Pandas DataFrame

In Python Pandas there are several ways to check whether a DataFrame contains NaN values. You can use `isnull().any()` to get a boolean result, or `isnull().sum()` to get the total number of NaN values. There are also faster approaches, such as checking the underlying numpy array with `df.isnull().values.any()`. In addition, setting `dropna=False` when taking value counts during EDA shows NaN counts per column, which works well for categorical variables. To find the number of rows that contain at least one NaN value, you can use `df[df.isnull().any(axis=1)].shape[0]`.
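As a quick orientation, here is a minimal sketch of those checks on a small made-up frame (the data is illustrative, not from the original post):

import numpy as np
import pandas as pd

# Toy frame with a single NaN (illustrative data only)
df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [4.0, 5.0, 6.0]})

print(df.isnull().any().any())               # True  -> at least one NaN anywhere
print(df.isnull().sum().sum())               # 1     -> total number of NaN values
print(df.isnull().values.any())              # True  -> same check on the flat numpy array
print(df[df.isnull().any(axis=1)].shape[0])  # 1     -> rows containing at least one NaN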

This post is translated from: How to check if any value is NaN in a Pandas DataFrame

In Python Pandas, what's the best way to check whether a DataFrame has one (or more) NaN values?

I know about the function pd.isnan, but this returns a DataFrame of booleans for each element. This post right here doesn't exactly answer my question either.


#1

Reference: https://stackoom.com/question/1zuA4/如何检查Pandas-DataFrame中的任何值是否为NaN


#2

df.isnull().any().any() should do it.
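A tiny sketch, on made-up data, of how the two `.any()` calls collapse the element-wise mask:

import numpy as np
import pandas as pd

df = pd.DataFrame({"x": [1.0, np.nan], "y": [3.0, 4.0]})  # toy frame

df.isnull()              # element-wise boolean mask
df.isnull().any()        # per-column: x -> True, y -> False
df.isnull().any().any()  # single boolean for the whole frame: True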


#3

You have a couple of options.

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(10,6))
# Make a few areas have NaN values
df.iloc[1:3,1] = np.nan
df.iloc[5,3] = np.nan
df.iloc[7:9,5] = np.nan

Now the data frame looks something like this:

          0         1         2         3         4         5
0  0.520113  0.884000  1.260966 -0.236597  0.312972 -0.196281
1 -0.837552       NaN  0.143017  0.862355  0.346550  0.842952
2 -0.452595       NaN -0.420790  0.456215  1.203459  0.527425
3  0.317503 -0.917042  1.780938 -1.584102  0.432745  0.389797
4 -0.722852  1.704820 -0.113821 -1.466458  0.083002  0.011722
5 -0.622851 -0.251935 -1.498837       NaN  1.098323  0.273814
6  0.329585  0.075312 -0.690209 -3.807924  0.489317 -0.841368
7 -1.123433 -1.187496  1.868894 -2.046456 -0.949718       NaN
8  1.133880 -0.110447  0.050385 -1.158387  0.188222       NaN
9 -0.513741  1.196259  0.704537  0.982395 -0.585040 -1.693810
  • Option 1: df.isnull().any().any() - This returns a boolean value

You know of isnull(), which would return a dataframe like this:

       0      1      2      3      4      5
0  False  False  False  False  False  False
1  False   True  False  False  False  False
2  False   True  False  False  False  False
3  False  False  False  False  False  False
4  False  False  False  False  False  False
5  False  False  False   True  False  False
6  False  False  False  False  False  False
7  False  False  False  False  False   True
8  False  False  False  False  False   True
9  False  False  False  False  False  False

If you make it df.isnull().any(), you can find just the columns that have NaN values:

0    False
1     True
2    False
3     True
4    False
5     True
dtype: bool

One more .any() will tell you if any of the above are True:

> df.isnull().any().any()
True
  • Option 2: df.isnull().sum().sum() - This returns an integer of the total number of NaN values:

This operates the same way as .any().any() does, by first giving a summation of the number of NaN values in a column, then the summation of those values:

df.isnull().sum()
0    0
1    2
2    0
3    1
4    0
5    2
dtype: int64

Finally, to get the total number of NaN values in the DataFrame:

df.isnull().sum().sum()
5

#4

jwilner's response is spot on. I was exploring to see if there's a faster option, since in my experience, summing flat arrays is (strangely) faster than counting. This code seems faster:

df.isnull().values.any()

For example:

In [2]: df = pd.DataFrame(np.random.randn(1000,1000))

In [3]: df[df > 0.9] = np.nan

In [4]: %timeit df.isnull().any().any()
100 loops, best of 3: 14.7 ms per loop

In [5]: %timeit df.isnull().values.sum()
100 loops, best of 3: 2.15 ms per loop

In [6]: %timeit df.isnull().sum().sum()
100 loops, best of 3: 18 ms per loop

In [7]: %timeit df.isnull().values.any()
1000 loops, best of 3: 948 µs per loop

df.isnull().sum().sum() is a bit slower, but of course, it carries additional information -- the number of NaNs.
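For reference, a sketch of the same comparison written against current pandas/numpy (the `pd.np` alias used in the session above has been removed from recent pandas releases):

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(1000, 1000))
df[df > 0.9] = np.nan  # use numpy directly rather than the removed pd.np alias

# Fast existence check: operate on the flat ndarray
has_nan = df.isnull().values.any()     # df.isna().to_numpy().any() is equivalent

# A bit slower, but also yields the count of NaN values
nan_count = int(df.isnull().values.sum())

print(has_nan, nan_count)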


#5

Depending on the type of data you're dealing with, you could also just get the value counts of each column while performing your EDA by setting dropna to False.

for col in df:
    print(df[col].value_counts(dropna=False))

This works well for categorical variables, not so much when you have many unique values.
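A small sketch of what that output looks like on a made-up categorical column:

import numpy as np
import pandas as pd

# Toy categorical column, not from the original answer
s = pd.Series(["red", "red", "red", "blue", np.nan, np.nan])

print(s.value_counts(dropna=False))
# red     3
# NaN     2
# blue    1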


#6

If you need to know how many rows there are with "one or more NaNs":

df.isnull().T.any().T.sum()

Or if you need to pull out these rows and examine them:

nan_rows = df[df.isnull().T.any().T]
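The double transpose is not strictly needed; the same rows can be selected with `axis=1`, as in this sketch on toy data:

import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [np.nan, 5.0, 6.0]})  # toy data

row_mask = df.isnull().any(axis=1)  # True for rows containing at least one NaN
print(row_mask.sum())               # number of such rows -> 2
print(df[row_mask])                 # the rows themselves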