This article is translated from: How to count the NaN values in a column in pandas DataFrame
I have data in which I want to find the number of NaN values, so that if it is less than some threshold, I will drop this column. I looked, but wasn't able to find any function for this. There is value_counts, but it would be slow for me, because most of the values are distinct and I want the count of NaN only.
Post #1
Reference: https://stackoom.com/question/1mD50/如何计算pandas-DataFrame列中的NaN值
Post #2
You could subtract the count of non-NaN values from the total length:
count_nan = len(df) - df.count()
You should time it on your data. For a small Series I got a 3x speed-up in comparison with the isnull solution.
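As a quick sanity check of the subtraction approach, the sketch below compares it with the isna-based count on a small made-up DataFrame. The sample data and the name `count_nan_isna` are assumptions for illustration, not part of the original answer.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data, for illustration only
df = pd.DataFrame({'a': [1, 2, np.nan], 'b': [np.nan, 1, np.nan]})

# count() returns the number of non-NaN values per column,
# so subtracting it from the row count leaves the NaN count
count_nan = len(df) - df.count()

# The isna-based count should agree column by column
count_nan_isna = df.isna().sum()

print(count_nan)
print((count_nan == count_nan_isna).all())
```

Which of the two is faster is likely to depend on the data size and dtypes, so it is worth timing both on your own data.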
Post #3
You can use the isna() method (or its alias isnull(), which is also compatible with older pandas versions < 0.21.0) and then sum to count the NaN values. For one column:
In [1]: s = pd.Series([1,2,3, np.nan, np.nan])
In [4]: s.isna().sum() # or s.isnull().sum() for older pandas versions
Out[4]: 2
For several columns, it also works:
In [5]: df = pd.DataFrame({'a':[1,2,np.nan], 'b':[np.nan,1,np.nan]})
In [6]: df.isna().sum()
Out[6]:
a    1
b    2
dtype: int64
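To connect this back to the question, here is a minimal sketch of using the per-column NaN counts to decide which columns to drop. The sample data, the cutoff `max_nan`, and the direction of the comparison are all assumptions for illustration; flip the condition if your rule is different.

```python
import numpy as np
import pandas as pd

# Hypothetical data and threshold, for illustration only
df = pd.DataFrame({
    'a': [1, 2, np.nan, 4],
    'b': [np.nan, 1, np.nan, np.nan],
    'c': [1, 2, 3, 4],
})
max_nan = 2  # assumed cutoff: columns with more NaNs than this are dropped

# Number of NaN values in each column
nan_counts = df.isna().sum()

# Keep only the columns whose NaN count does not exceed the cutoff
kept = df.loc[:, nan_counts <= max_nan]
print(list(kept.columns))
```

pandas also provides DataFrame.dropna(axis=1, thresh=n), which keeps the columns that have at least n non-NaN values, so the same filter can often be written without computing the counts by hand.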
Post #4
Since pandas 0.14.1, my suggestion here to have a keyword argument in the value_counts method has been implemented:
import numpy as np
import pandas as pd

df = pd.DataFrame({'a':[1,2,np.nan], 'b':[np.nan,1,np.nan]})
for col in df:
    # dropna=False keeps NaN as its own entry in the counts
    print(df[col].value_counts(dropna=False))
2      1
1      1
NaN    1
dtype: int64
NaN    2
1      1
dtype: int64
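If only the NaN count is needed from this output, one possible way to pick it out is sketched below. The sample Series and the variable names are assumptions for illustration. Since value_counts still tallies every distinct value, this is unlikely to be faster than isna().sum() on data with many distinct values.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data, for illustration only
s = pd.Series([1, 2, np.nan, np.nan, 2])

# dropna=False keeps NaN as its own entry in the result
counts = s.value_counts(dropna=False)

# Select the entry whose index label is NaN; sum() gives 0 if there is none
nan_count = int(counts[counts.index.isna()].sum())
print(nan_count)
```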
Post #5
If you are using Jupyter Notebook, how about:
%%timeit
df.isnull().any().any()
or
%timeit df.isnull().values.sum()
Or, are there NaNs anywhere in the data and, if so, where?
df.isnull().any()
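The expressions above answer slightly different questions, which the following sketch spells out on a made-up DataFrame. The sample data and variable names are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data, for illustration only
df = pd.DataFrame({'a': [1, 2, np.nan], 'b': [np.nan, 1, np.nan], 'c': [1, 2, 3]})

# True if at least one NaN exists anywhere in the frame (no count)
has_nan = bool(df.isnull().any().any())

# Total number of NaNs across all columns
total_nan = int(df.isnull().values.sum())

# Names of the columns that contain at least one NaN
cols_with_nan = df.columns[df.isnull().any()].tolist()

# Labels of the rows that contain at least one NaN
rows_with_nan = df.index[df.isnull().any(axis=1)].tolist()

print(has_nan, total_nan, cols_with_nan, rows_with_nan)
```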
Post #6
Based on the most voted answer, we can easily define a function that gives us a dataframe to preview the missing values and the % of missing values in each column:
import pandas as pd

def missing_values_table(df):
    # Number of missing values per column
    mis_val = df.isnull().sum()
    # Percentage of missing values per column
    mis_val_percent = 100 * df.isnull().sum() / len(df)
    # Combine both into one table (columns are named 0 and 1 at this point)
    mis_val_table = pd.concat([mis_val, mis_val_percent], axis=1)
    mis_val_table_ren_columns = mis_val_table.rename(
        columns = {0 : 'Missing Values', 1 : '% of Total Values'})
    # Keep only columns with missing values, sorted by percentage (descending)
    mis_val_table_ren_columns = mis_val_table_ren_columns[
        mis_val_table_ren_columns.iloc[:,1] != 0].sort_values(
        '% of Total Values', ascending=False).round(1)
    print ("Your selected dataframe has " + str(df.shape[1]) + " columns.\n"
        "There are " + str(mis_val_table_ren_columns.shape[0]) +
        " columns that have missing values.")
    return mis_val_table_ren_columns
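For comparison, the same kind of summary can be sketched more compactly by taking the mean of the boolean mask instead of dividing by len(df). The function name `missing_summary` and the sample data are hypothetical. This is a condensed variant, not the function from the answer above.

```python
import numpy as np
import pandas as pd

def missing_summary(df):
    """Hypothetical condensed variant: NaN counts and percentages per column."""
    summary = pd.DataFrame({
        'Missing Values': df.isna().sum(),
        # mean() of a boolean mask is the fraction of True values
        '% of Total Values': (100 * df.isna().mean()).round(1),
    })
    # Keep only columns that actually contain NaNs, largest share first
    summary = summary[summary['Missing Values'] > 0]
    return summary.sort_values('% of Total Values', ascending=False)

# Hypothetical sample data, for illustration only
df = pd.DataFrame({'a': [1, 2, np.nan], 'b': [np.nan, 1, np.nan], 'c': [1, 2, 3]})
result = missing_summary(df)
print(result)
```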