Handling missing data in a pd.DataFrame

pd.notnull(df)

pd.isnull(df)

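df.dropna(axis=0,how='any',inplace=False)   These are the defaults: drop every row that contains any NaN and return a new DataFrame.

df.dropna(axis=1,how='all',inplace=True)   Drop columns that are entirely NaN; inplace=True modifies df itself and returns None.

df.sort_values(by='Z',ascending=False)

In [2]: import pandas as pd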

In [3]: import numpy as np

In [4]: df = pd.DataFrame(np.arange(12).reshape((3,4)),index=list('abc'),columns=list('WXYZ')    )

In [5]: df
Out[5]:
   W  X   Y   Z
a  0  1   2   3
b  4  5   6   7
c  8  9  10  11

In [6]: df[df==0] = np.nan

In [7]: df
Out[7]:
     W  X   Y   Z
a  NaN  1   2   3
b  4.0  5   6   7
c  8.0  9  10  11

In [8]: df.dropna()
Out[8]:
     W  X   Y   Z
b  4.0  5   6   7
c  8.0  9  10  11

In [9]: df.dropna(axis=1)
Out[9]:
   X   Y   Z
a  1   2   3
b  5   6   7
c  9  10  11

In [10]: df.dropna(axis=0,how='all')
Out[10]:
     W  X   Y   Z
a  NaN  1   2   3
b  4.0  5   6   7
c  8.0  9  10  11

In [11]: df.dropna(axis=1,how='all')
Out[11]:
     W  X   Y   Z
a  NaN  1   2   3
b  4.0  5   6   7
c  8.0  9  10  11
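
The session above only varies `axis` and `how`. As a minimal sketch, `dropna` also accepts `subset` (restrict the check to certain columns) and `thresh` (minimum number of non-missing values a row must have to be kept). The frame below rebuilds the same data as the session; the variable names `by_subset` and `by_thresh` are only illustrative.

```python
import numpy as np
import pandas as pd

# Same 3x4 frame as in the session, stored as float so NaN fits without a cast.
df = pd.DataFrame(np.arange(12, dtype=float).reshape((3, 4)),
                  index=list('abc'), columns=list('WXYZ'))
df[df == 0] = np.nan  # row 'a', column 'W' becomes NaN

# Drop a row only if column 'W' is missing in that row.
by_subset = df.dropna(subset=['W'])

# Keep rows that have at least 3 non-missing values.
# Row 'a' has exactly 3, so nothing is dropped here.
by_thresh = df.dropna(thresh=3)

print(by_subset)
print(by_thresh)
```

Both calls return new DataFrames and leave `df` untouched, because `inplace` defaults to `False`.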

Locating missing values with pd.isnull:

In [12]: pd.isnull(df)
Out[12]:
       W      X      Y      Z
a   True  False  False  False
b  False  False  False  False
c  False  False  False  False

In [13]: pd.isnull(df['W'])
Out[13]:
a     True
b    False
c    False
Name: W, dtype: bool

In [14]: df[  pd.isnull(df['W']) ]
Out[14]:
    W  X  Y  Z
a NaN  1  2  3
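
The boolean mask returned by `pd.isnull` can also be aggregated. The sketch below, which assumes the same frame as the session, counts missing values per column and selects every row that has at least one missing value; `na_per_column` and `rows_with_na` are illustrative names.

```python
import numpy as np
import pandas as pd

# Same 3x4 frame as in the session, stored as float so NaN fits without a cast.
df = pd.DataFrame(np.arange(12, dtype=float).reshape((3, 4)),
                  index=list('abc'), columns=list('WXYZ'))
df[df == 0] = np.nan

# True counts as 1 when summed, so this is the number of NaN per column.
na_per_column = df.isnull().sum()

# any(axis=1) is True for a row if any of its columns is missing.
rows_with_na = df[df.isnull().any(axis=1)]

print(na_per_column)
print(rows_with_na)
```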

Filling missing values with fillna:

In [22]: import numpy as np

In [23]: import pandas as pd

In [24]: df = pd.DataFrame(np.arange(12).reshape((3,4)),index=list('abc'),columns=list('WXYZ') )

In [25]: df
Out[25]:
   W  X   Y   Z
a  0  1   2   3
b  4  5   6   7
c  8  9  10  11

In [26]: df[df==0] = np.nan

In [27]: df
Out[27]:
     W  X   Y   Z
a  NaN  1   2   3
b  4.0  5   6   7
c  8.0  9  10  11

In [28]: df.fillna(0)
Out[28]:
     W  X   Y   Z
a  0.0  1   2   3
b  4.0  5   6   7
c  8.0  9  10  11

In [29]: df.fillna(100)
Out[29]:
       W  X   Y   Z
a  100.0  1   2   3
b    4.0  5   6   7
c    8.0  9  10  11

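Missing values are usually not filled with an arbitrary constant. The more common choice is the column's mean or median.

1. For some columns, filling has no real meaning.

2. For other columns, filling is meaningful.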
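
One way to act on the two points above is to pass `fillna` a dict, so that only the listed columns are filled and each gets its own statistic. This is a minimal sketch; the column names `note`, `age` and `income` and their values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 'note' is free text, where a filled value would mean nothing.
df = pd.DataFrame({
    'note':   ['ok', None, 'late'],
    'age':    [20.0, np.nan, 40.0],
    'income': [3000.0, 5000.0, np.nan],
})

# Fill only the numeric columns, each with its own statistic.
# mean() and median() skip NaN by default.
filled = df.fillna({
    'age': df['age'].mean(),
    'income': df['income'].median(),
})

print(filled)
```

Columns that are not keys of the dict, here `note`, are left as they are.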

In [18]: df
Out[18]:
   W  X   Y   Z
a  0  1   2   3
b  4  5   6   7
c  8  9  10  11

In [19]: df.mean()
Out[19]:
W    4.0
X    5.0
Y    6.0
Z    7.0
dtype: float64

In [25]: df.median()
Out[25]:
W    4.0
X    5.0
Y    6.0
Z    7.0
dtype: float64

In [26]: df.loc[0,:]=np.nan

In [27]: df
Out[27]:
     W    X     Y     Z
a  0.0  1.0   2.0   3.0
b  4.0  5.0   6.0   7.0
c  8.0  9.0  10.0  11.0
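0  NaN  NaN   NaN   NaN

Note: loc is label-based. There is no row labelled 0, so df.loc[0,:]=np.nan appends a new row named 0 instead of overwriting the first row. To address the first row by position, use df.iloc[0,:]. The dropna call below removes the extra row again.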

In [28]: df.dropna(inplace=True)

In [29]: df
Out[29]:
     W    X     Y     Z
a  0.0  1.0   2.0   3.0
b  4.0  5.0   6.0   7.0
c  8.0  9.0  10.0  11.0

In [30]: df.loc['a',:] = np.nan

In [31]: df
Out[31]:
     W    X     Y     Z
a  NaN  NaN   NaN   NaN
b  4.0  5.0   6.0   7.0
c  8.0  9.0  10.0  11.0

In [32]: df.fillna( df.mean() )
Out[32]:
     W    X     Y     Z
a  6.0  7.0   8.0   9.0
b  4.0  5.0   6.0   7.0
c  8.0  9.0  10.0  11.0

In [33]: df
Out[33]:
     W    X     Y     Z
a  NaN  NaN   NaN   NaN
b  4.0  5.0   6.0   7.0
c  8.0  9.0  10.0  11.0

In [34]: df.fillna(df.median())
Out[34]:
     W    X     Y     Z
a  6.0  7.0   8.0   9.0
b  4.0  5.0   6.0   7.0
c  8.0  9.0  10.0  11.0

In [35]: df
Out[35]:
     W    X     Y     Z
a  NaN  NaN   NaN   NaN
b  4.0  5.0   6.0   7.0
c  8.0  9.0  10.0  11.0

In [36]: df['W'].fillna(df['W'].mean())
Out[36]:
a    6.0
b    4.0
c    8.0
Name: W, dtype: float64

In [37]: df
Out[37]:
     W    X     Y     Z
a  NaN  NaN   NaN   NaN
b  4.0  5.0   6.0   7.0
c  8.0  9.0  10.0  11.0

In [38]: df['W'].fillna( df['W'].median() )
Out[38]:
a    6.0
b    4.0
c    8.0
Name: W, dtype: float64

In [39]: df
Out[39]:
     W    X     Y     Z
a  NaN  NaN   NaN   NaN
b  4.0  5.0   6.0   7.0
c  8.0  9.0  10.0  11.0

In [40]: df['W'] = 1

In [41]: df
Out[41]:
   W    X     Y     Z
a  1  NaN   NaN   NaN
b  1  5.0   6.0   7.0
c  1  9.0  10.0  11.0

In [42]: df['W'] = df['W'].fillna( df['W'].mean() )

In [43]: df
Out[43]:
   W    X     Y     Z
a  1  NaN   NaN   NaN
b  1  5.0   6.0   7.0
c  1  9.0  10.0  11.0
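
The session above shows that `df` still contains NaN after each bare `fillna` call: `fillna` returns a new object unless the result is assigned back. The sketch below repeats that on the same data, with `still_missing` as an illustrative name.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12, dtype=float).reshape((3, 4)),
                  index=list('abc'), columns=list('WXYZ'))
df.loc['a', :] = np.nan  # row 'a' exists, so this overwrites it

# The result is discarded here, so df is unchanged.
df['W'].fillna(df['W'].mean())
still_missing = df['W'].isnull().sum()

# Assigning the result back is what actually updates the column.
df['W'] = df['W'].fillna(df['W'].mean())

print(still_missing)
print(df)
```

Only column `W` is filled; `X`, `Y` and `Z` keep their NaN in row `a`.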

The difference between loc and iloc on a DataFrame: loc selects by label, iloc selects by integer position.

In [15]: df.iloc[[0,2],:]
Out[15]:
   W  X   Y   Z
a  0  1   2   3
c  8  9  10  11

In [16]: df.loc[:,'W']
Out[16]:
a    0
b    4
c    8
Name: W, dtype: int32

In [17]: df.loc[:,['W','Z']]
Out[17]:
   W   Z
a  0   3
b  4   7
c  8  11

 

 
