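I have recently been writing a blog series on learning Pandas by following Stack Overflow.
Column address: http://blog.csdn.net/column/details/16726.html
Search Stack Overflow with pandas as the keyword, then sort the results by number of votes: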
https://stackoverflow.com/questions/tagged/pandas?sort=votes&pageSize=15
How to drop rows of Pandas DataFrame whose value in certain columns is NaN - dropping rows that contain NaN
Data preparation
We randomly generate a data set of 10 rows by 3 columns, then set some of the values to NaN.
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(10,3), columns=['col1', 'col2', 'col3'])
df.iloc[::2,0] = np.nan
df.iloc[::4,1] = np.nan
df.iloc[::3,2] = np.nan
print(df)
# col1 col2 col3
# 0 NaN NaN NaN
# 1 -0.498336 -0.960804 0.705309
# 2 NaN -2.120032 2.123329
# 3 0.791883 -0.283840 NaN
# 4 NaN NaN -1.241788
# 5 -0.399644 -0.968515 -1.509056
# 6 NaN 0.897637 NaN
# 7 1.826128 1.015091 -0.497022
# 8 NaN NaN -1.889871
# 9 0.379287 -1.762229 NaN
pandas.notnull
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.notnull.html
It accepts either a Series or a DataFrame.
pandas.notnull is designed as a replacement for np.isfinite / numpy.isnan.
pd.notnull(df['col1'])
# 0 False
# 1 True
# 2 False
# 3 True
# 4 False
# 5 True
# 6 False
# 7 True
# 8 False
# 9 True
# Name: col1, dtype: bool
print(pd.notnull(df))
# col1 col2 col3
# 0 False False False
# 1 True True True
# 2 False True True
# 3 True True False
# 4 False False True
# 5 True True True
# 6 False True False
# 7 True True True
# 8 False False True
# 9 True True False
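The boolean result of pd.notnull can be used directly as a row filter. The sketch below uses small hand-written Series (hypothetical data, not the random frame above) and also shows one point where pd.notnull and np.isfinite disagree: with default pandas settings, an infinite value counts as "not null" but is not finite.

```python
import numpy as np
import pandas as pd

# Hypothetical demo data: one valid number, one NaN, one infinity
s = pd.Series([1.0, np.nan, np.inf])

notnull_mask = pd.notnull(s).tolist()   # [True, False, True]
finite_mask = np.isfinite(s).tolist()   # [True, False, False]

# Use the mask to keep only the non-null entries
kept = s[pd.notnull(s)]
print(kept.index.tolist())              # [0, 2]

# pd.notnull also works on object data such as strings mixed with None
obj = pd.Series(['a', None, 'c'])
obj_mask = pd.notnull(obj).tolist()     # [True, False, True]
print(notnull_mask, finite_mask, obj_mask)
```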
np.isfinite / numpy.isnan
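np.isfinite tests each value and returns True where it is finite (neither NaN nor infinite). By combining the boolean results for different columns, we can select exactly the rows we need.
numpy.isnan tests whether a value is NaN.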
np.isfinite(df['col1'])
# 0 False
# 1 True
# 2 False
# 3 True
# 4 False
# 5 True
# 6 False
# 7 True
# 8 False
# 9 True
# Name: col1, dtype: bool
df1 = df[np.isfinite(df['col1'])]
print(df1)
# col1 col2 col3
# 1 -0.498336 -0.960804 0.705309
# 3 0.791883 -0.283840 NaN
# 5 -0.399644 -0.968515 -1.509056
# 7 1.826128 1.015091 -0.497022
# 9 0.379287 -1.762229 NaN
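Combining the boolean results of several columns, as mentioned above, might look like the following sketch. The frame `demo` is hand-written for illustration (it is not the random frame used in this post) so that the result is reproducible.

```python
import numpy as np
import pandas as pd

# Hypothetical demo data with NaN in different positions
demo = pd.DataFrame({
    'col1': [1.0, np.nan, 3.0, np.nan],
    'col2': [np.nan, 2.0, 6.0, np.nan],
    'col3': [7.0, 8.0, np.nan, np.nan],
})

# Keep rows where BOTH col1 and col2 are finite (element-wise AND)
both = demo[np.isfinite(demo['col1']) & np.isfinite(demo['col2'])]

# Keep rows where AT LEAST ONE of col1, col2 is finite (element-wise OR)
either = demo[np.isfinite(demo['col1']) | np.isfinite(demo['col2'])]

print(both.index.tolist())    # [2]
print(either.index.tolist())  # [0, 1, 2]
```

Note that the element-wise operators `&` and `|` are needed here; the Python keywords `and` / `or` do not work on boolean Series.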
dropna
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.dropna.html
dropna accepts several parameters:
axis : {0 or ‘index’, 1 or ‘columns’}, or tuple/list thereof
Pass tuple or list to drop on multiple axes (this form was deprecated in pandas 0.23; newer releases accept a single axis only)
how : {‘any’, ‘all’}
any : if any NA values are present, drop that label
all : if all values are NA, drop that label
thresh : int, default None
int value : require that many non-NA values
subset : array-like
Labels along other axis to consider, e.g. if you are dropping rows these would be a list of columns to include
inplace : boolean, default False
If True, do operation inplace and return None.
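# By default, rows that contain any NaN are dropped
# (the values below differ from the earlier printout; the random data was presumably regenerated)
print(df.dropna())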
# col1 col2 col3
# 1 1.944899 -1.792510 -0.612904
# 5 -0.609380 1.087689 -1.145582
# 7 -2.045037 1.043837 0.429135
print(df.dropna(how='all'))  # drop only the rows in which every value is NaN
# col1 col2 col3
# 1 1.944899 -1.792510 -0.612904
# 2 NaN 0.780487 -1.239197
# 3 -1.050320 -0.121033 NaN
# 4 NaN NaN -0.537213
# 5 -0.609380 1.087689 -1.145582
# 6 NaN -0.721761 NaN
# 7 -2.045037 1.043837 0.429135
# 8 NaN NaN -0.096989
# 9 1.514520 0.224193 NaN
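The original question asks about NaN in certain columns only, which is what the subset parameter listed above is for. Here is a minimal sketch of subset and thresh on a small hand-written frame (the `frame` data and variable names are made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical demo data; row 4 is complete, row 3 is entirely NaN
frame = pd.DataFrame({
    'col1': [1.0, np.nan, 3.0, np.nan, 5.0],
    'col2': [np.nan, 2.0, 6.0, np.nan, 4.0],
    'col3': [7.0, np.nan, np.nan, np.nan, 9.0],
})

# Drop rows whose col1 is NaN; other columns are ignored
by_col1 = frame.dropna(subset=['col1'])

# Drop rows where col1 OR col2 is NaN (how='any' is the default)
by_col1_col2 = frame.dropna(subset=['col1', 'col2'])

# Drop rows only when col1 AND col2 are both NaN
both_missing = frame.dropna(subset=['col1', 'col2'], how='all')

# Keep rows that have at least 2 non-NaN values
at_least_two = frame.dropna(thresh=2)

print(by_col1.index.tolist())       # [0, 2, 4]
print(by_col1_col2.index.tolist())  # [2, 4]
print(both_missing.index.tolist())  # [0, 1, 2, 4]
print(at_least_two.index.tolist())  # [0, 2, 4]
```

None of these calls modifies `frame`; each returns a new DataFrame, since inplace defaults to False.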
For more details, see the official dropna documentation.