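Method | Description
dropna | Filters axis labels based on whether the values for each label contain missing data; a threshold controls how much missing data is tolerated
fillna | Fills missing data with a specified value or with a fill method such as ffill or bfill
isnull | Returns an object of boolean values indicating which values are missing/NA; the result has the same type as the source object
notnull | The negation of isnull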
pandas uses the floating-point value NaN (Not a Number) to represent missing data in both floating-point and non-floating-point arrays
from pandas import Series, DataFrame
import numpy as np
string_data = Series(['aardvark', 'artichoke', np.nan, 'avocado'])
string_data
0 aardvark
1 artichoke
2 NaN
3 avocado
dtype: object
isnull() checks each value for missing data and returns True where the value is NaN
string_data.isnull()
0 False
1 False
2 True
3 False
dtype: bool
Python's built-in None value is also treated as NA
string_data[0] = None
string_data
0 None
1 artichoke
2 NaN
3 avocado
dtype: object
string_data.isnull()
0 True
1 False
2 True
3 False
dtype: bool
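As a small self-contained sketch of the behaviour above (the variable names s, missing, and present are illustrative, not from the original), the following shows that isnull flags both np.nan and None, that notnull is its element-wise negation, and why an equality test cannot be used to find NaN:

```python
import numpy as np
import pandas as pd

# Both np.nan and None count as missing values.
s = pd.Series(['aardvark', None, np.nan, 'avocado'])

missing = s.isnull()    # True where the value is missing
present = s.notnull()   # element-wise negation of isnull

print(missing.tolist())  # [False, True, True, False]
print(present.tolist())  # [True, False, False, True]

# NaN never compares equal to itself, so == cannot detect it.
print(np.nan == np.nan)  # False
```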
Filtering out missing data (dropna)
from numpy import nan as NA
data = Series([1, NA, 3.5, NA, 7])
data
0 1.0
1 NaN
2 3.5
3 NaN
4 7.0
dtype: float64
data.dropna()
0 1.0
2 3.5
4 7.0
dtype: float64
data[data.notnull()]
0 1.0
2 3.5
4 7.0
dtype: float64
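The two outputs above are identical, which suggests that on a Series dropna behaves like boolean indexing with notnull. A minimal check of that equivalence, using illustrative variable names (dropped, masked):

```python
import numpy as np
import pandas as pd

data = pd.Series([1, np.nan, 3.5, np.nan, 7])

dropped = data.dropna()          # remove missing entries
masked = data[data.notnull()]    # keep entries where notnull is True

print(dropped.equals(masked))    # True
print(dropped.index.tolist())    # [0, 2, 4]
print(len(data))                 # 5
```

Note that the surviving values keep their original index labels, and data itself is left unchanged.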
For a DataFrame, dropna by default drops every row that contains at least one NA value
data = DataFrame([[1, 6.5, 3], [1, NA, NA], [NA, NA, NA], [NA, 6.5, 3]])
data
     0    1    2
0  1.0  6.5  3.0
1  1.0  NaN  NaN
2  NaN  NaN  NaN
3  NaN  6.5  3.0
data.dropna()
     0    1    2
0  1.0  6.5  3.0
Parameter how='all'
Drops only the rows in which every value is NA
data.dropna(how='all')
     0    1    2
0  1.0  6.5  3.0
1  1.0  NaN  NaN
3  NaN  6.5  3.0
Parameters axis=1, how='all'
Drops only the columns in which every value is NA
data[4] = NA
data
     0    1    2    4
0  1.0  6.5  3.0  NaN
1  1.0  NaN  NaN  NaN
2  NaN  NaN  NaN  NaN
3  NaN  6.5  3.0  NaN
data.dropna(axis=1, how='all')
     0    1    2
0  1.0  6.5  3.0
1  1.0  NaN  NaN
2  NaN  NaN  NaN
3  NaN  6.5  3.0
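The three dropna variants above can be compared side by side. This is a sketch on the same values as the example, with hypothetical variable names (frame, rows_any, rows_all, cols_all), printing only which labels survive:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame([[1, 6.5, 3],
                      [1, np.nan, np.nan],
                      [np.nan, np.nan, np.nan],
                      [np.nan, 6.5, 3]])

rows_any = frame.dropna()            # default how='any': drop rows with any NA
rows_all = frame.dropna(how='all')   # drop rows that are entirely NA

frame[4] = np.nan                    # add a column that is entirely NA
cols_all = frame.dropna(axis=1, how='all')  # drop columns that are entirely NA

print(rows_any.index.tolist())     # [0]
print(rows_all.index.tolist())     # [0, 1, 3]
print(cols_all.columns.tolist())   # [0, 1, 2]
```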
df = DataFrame(np.random.randn(7, 3))
# .ix was removed in pandas 1.0; .loc selects by label, end label included
df.loc[:4, 1] = NA
df.loc[:2, 2] = NA
df
          0         1         2
0  0.315068       NaN       NaN
1  0.455580       NaN       NaN
2  0.207725       NaN       NaN
3  0.363081       NaN -0.931486
4  0.372784       NaN -1.448125
5  0.571531  2.717854 -1.289343
6 -0.627056 -0.035343 -0.632109
dropna(thresh=n) keeps only the rows (or columns) that have at least n non-NaN values
df.dropna(thresh=1)
          0         1         2
0  0.315068       NaN       NaN
1  0.455580       NaN       NaN
2  0.207725       NaN       NaN
3  0.363081       NaN -0.931486
4  0.372784       NaN -1.448125
5  0.571531  2.717854 -1.289343
6 -0.627056 -0.035343 -0.632109
df.dropna(thresh=2)
          0         1         2
3  0.363081       NaN -0.931486
4  0.372784       NaN -1.448125
5  0.571531  2.717854 -1.289343
6 -0.627056 -0.035343 -0.632109
df.dropna(thresh=3)
          0         1         2
5  0.571531  2.717854 -1.289343
6 -0.627056 -0.035343 -0.632109
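One way to read thresh is as a filter on the count of non-NA values per row. The sketch below uses a small fixed frame (the column names a, b, c and the values are made up for illustration) and compares dropna(thresh=n) with an explicit count-based mask:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1.0, 2.0, 3.0, 4.0],
                   'b': [np.nan, np.nan, 5.0, 6.0],
                   'c': [np.nan, 7.0, np.nan, 8.0]})

# Number of non-NA values in each row.
non_na = df.notnull().sum(axis=1)
print(non_na.tolist())  # [1, 2, 2, 3]

kept = {n: df.dropna(thresh=n) for n in (1, 2, 3)}
for n, rows in kept.items():
    manual = df[non_na >= n]  # same rule written as a boolean mask
    print(n, rows.index.tolist(), rows.equals(manual))
# 1 [0, 1, 2, 3] True
# 2 [1, 2, 3] True
# 3 [3] True
```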
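Filling in missing data (fillna)
By default, fillna returns a new object
Parameter | Description
value | Scalar value or dict-like object used to fill missing values
method | Fill method such as 'ffill' or 'bfill'; the default is None, so either value or method must be supplied
axis | Axis to fill along; default axis=0 (rows)
inplace | Modifies the calling object instead of producing a copy
limit | For forward and backward filling, the maximum number of consecutive values to fill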
df = DataFrame(np.random.randn(7, 3))
# .ix was removed in pandas 1.0; .loc selects by label, end label included
df.loc[:4, 1] = NA
df.loc[:2, 2] = NA
df
          0         1         2
0  0.506939       NaN       NaN
1  0.651995       NaN       NaN
2  0.645202       NaN       NaN
3 -0.019539       NaN  0.026027
4  0.044561       NaN  0.448722
5 -2.916266 -0.503103  1.042819
6 -0.085931  2.058093  1.255430
df.fillna(0)
          0         1         2
0  0.506939  0.000000  0.000000
1  0.651995  0.000000  0.000000
2  0.645202  0.000000  0.000000
3 -0.019539  0.000000  0.026027
4  0.044561  0.000000  0.448722
5 -2.916266 -0.503103  1.042819
6 -0.085931  2.058093  1.255430
df
          0         1         2
0  0.506939       NaN       NaN
1  0.651995       NaN       NaN
2  0.645202       NaN       NaN
3 -0.019539       NaN  0.026027
4  0.044561       NaN  0.448722
5 -2.916266 -0.503103  1.042819
6 -0.085931  2.058093  1.255430
Passing a dict fills each column with its own value
df.fillna({1: 0.5, 2: -1})
          0         1         2
0  0.506939  0.500000 -1.000000
1  0.651995  0.500000 -1.000000
2  0.645202  0.500000 -1.000000
3 -0.019539  0.500000  0.026027
4  0.044561  0.500000  0.448722
5 -2.916266 -0.503103  1.042819
6 -0.085931  2.058093  1.255430
df
          0         1         2
0  0.506939       NaN       NaN
1  0.651995       NaN       NaN
2  0.645202       NaN       NaN
3 -0.019539       NaN  0.026027
4  0.044561       NaN  0.448722
5 -2.916266 -0.503103  1.042819
6 -0.085931  2.058093  1.255430
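A detail worth seeing on fixed data is what happens to columns that are not named in the dict. In this sketch (the column names x, y, z and the values are illustrative assumptions), only y and z are listed, so the NA in x is left alone:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'x': [1.0, np.nan, 3.0],
                   'y': [np.nan, 2.0, np.nan],
                   'z': [np.nan, np.nan, 9.0]})

# Keys are column labels, values are the fill value for that column.
filled = df.fillna({'y': 0.5, 'z': -1})

print(filled['y'].tolist())             # [0.5, 2.0, 0.5]
print(filled['z'].tolist())             # [-1.0, -1.0, 9.0]
print(int(filled['x'].isnull().sum()))  # 1, column x is not in the dict
print(int(df['y'].isnull().sum()))      # 2, the original is unchanged
```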
Parameter inplace=True
fillna returns a new object by default, but it can also modify the existing object in place
_ = df.fillna(0, inplace=True)
df
          0         1         2
0  0.506939  0.000000  0.000000
1  0.651995  0.000000  0.000000
2  0.645202  0.000000  0.000000
3 -0.019539  0.000000  0.026027
4  0.044561  0.000000  0.448722
5 -2.916266 -0.503103  1.042819
6 -0.085931  2.058093  1.255430
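The difference between the copying and the in-place form can be checked by counting the NA values that remain in the original frame. This is a sketch on made-up data; the name copy_filled is illustrative:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'x': [1.0, np.nan], 'y': [np.nan, 4.0]})

copy_filled = df.fillna(0)            # new object, df is untouched
print(int(df.isnull().sum().sum()))   # 2

df.fillna(0, inplace=True)            # modifies df itself
print(int(df.isnull().sum().sum()))   # 0
print(df.equals(copy_filled))         # True
```

Assigning the result of the copying form back to a name (df = df.fillna(0)) is an alternative that gives the same final frame.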
Parameter method
Forward fill (ffill), optionally combined with the limit parameter
df = DataFrame(np.random.randn(6, 3))
# .loc replaces the deprecated .ix indexer; label slices include the end label
df.loc[2:, 1] = NA
df.loc[4:, 2] = NA
df
          0         1         2
0 -1.791661 -1.082456  0.119847
1 -0.998142 -1.814483  0.535919
2 -1.240185       NaN -0.080930
3 -0.472280       NaN  0.267896
4  0.135498       NaN       NaN
5 -0.386448       NaN       NaN
df.fillna(method='ffill')
          0         1         2
0 -1.791661 -1.082456  0.119847
1 -0.998142 -1.814483  0.535919
2 -1.240185 -1.814483 -0.080930
3 -0.472280 -1.814483  0.267896
4  0.135498 -1.814483  0.267896
5 -0.386448 -1.814483  0.267896
df.fillna(method='ffill', limit=2)
          0         1         2
0 -1.791661 -1.082456  0.119847
1 -0.998142 -1.814483  0.535919
2 -1.240185 -1.814483 -0.080930
3 -0.472280 -1.814483  0.267896
4  0.135498       NaN  0.267896
5 -0.386448       NaN  0.267896
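Newer pandas releases deprecate the method argument of fillna, so the sketch below uses the dedicated ffill method instead, which, as far as I know, performs the same forward fill and accepts the same limit argument. The data and the names full and capped are illustrative:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1.0, 2.0, np.nan, np.nan, np.nan, np.nan],
                   'b': [0.1, 0.2, 0.3, 0.4, np.nan, np.nan]})

full = df.ffill()           # carry the last valid value forward without limit
capped = df.ffill(limit=2)  # fill at most 2 consecutive NA values per gap

print(full['a'].tolist())    # [1.0, 2.0, 2.0, 2.0, 2.0, 2.0]
print(capped['a'].tolist())  # [1.0, 2.0, 2.0, 2.0, nan, nan]
print(capped['b'].tolist())  # [0.1, 0.2, 0.3, 0.4, 0.4, 0.4]
```

Column a has a gap of four NA values, so limit=2 leaves the last two unfilled; column b has a gap of exactly two, so it is filled completely.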
data = Series([1, NA, 3.5, NA, 7])
data
0 1.0
1 NaN
2 3.5
3 NaN
4 7.0
dtype: float64
data.mean()
3.8333333333333335
data.fillna(data.mean())
0 1.000000
1 3.833333
2 3.500000
3 3.833333
4 7.000000
dtype: float64
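The same idea extends to a DataFrame: passing the Series returned by mean() fills each column with its own mean. The sketch below repeats the Series case and then applies it to a small made-up frame (columns a and b are illustrative):

```python
import numpy as np
import pandas as pd

data = pd.Series([1, np.nan, 3.5, np.nan, 7])
mean = float(data.mean())     # mean() skips NA values: (1 + 3.5 + 7) / 3
filled = data.fillna(mean)
print(round(mean, 4))         # 3.8333

df = pd.DataFrame({'a': [1.0, np.nan, 3.0],
                   'b': [np.nan, 10.0, 20.0]})
# df.mean() is a Series indexed by column label, so each column
# is filled with its own mean.
col_filled = df.fillna(df.mean())
print(col_filled['a'].tolist())  # [1.0, 2.0, 3.0]
print(col_filled['b'].tolist())  # [15.0, 10.0, 20.0]
```

Filling with the mean leaves the mean of the series unchanged, although it reduces the variance.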