Series and DataFrame: Handling Missing Data (NA)

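Method    Description
dropna    Filters axis labels based on whether the values for each label contain missing data; a threshold controls how much missing data is tolerated
fillna    Fills missing data with a specified value or with a fill method such as ffill or bfill
isnull    Returns an object of boolean values indicating which values are missing/NA; the result has the same type as the source object
notnull   The negation of isnull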

pandas uses the floating-point value NaN (Not a Number) to represent missing data in both floating-point and non-floating-point arrays.
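
Because NaN is a floating-point value that never compares equal to anything, an equality test cannot locate missing entries, which is why a dedicated method such as isnull is needed. A minimal sketch of that behavior (the variable names are illustrative and not from the original post):

```python
import numpy as np
from pandas import Series

s = Series([1.0, np.nan, 3.0])

# np.nan is an ordinary Python float
nan_is_float = isinstance(np.nan, float)

# NaN never compares equal to anything, itself included,
# so an equality test cannot locate missing entries
eq_mask = (s == np.nan).tolist()

# isnull() is the reliable way to find them
null_mask = s.isnull().tolist()

print(nan_is_float)  # True
print(eq_mask)       # [False, False, False]
print(null_mask)     # [False, True, False]
```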

from pandas import Series,DataFrame
import numpy as np
string_data = Series(['aardvark','artichoke',np.nan,'avocado'])
string_data
0     aardvark
1    artichoke
2          NaN
3      avocado
dtype: object

isnull() tests for missing values and returns True wherever the value is NaN.

string_data.isnull()
0    False
1    False
2     True
3    False
dtype: bool

Python's built-in None value is also treated as NA.

string_data[0] = None
string_data
0         None
1    artichoke
2          NaN
3      avocado
dtype: object
string_data.isnull()
0     True
1    False
2     True
3    False
dtype: bool
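
How None is stored depends on the kind of data around it. The sketch below, using made-up values, suggests that None is reported as missing alongside NaN in a Series of strings, and that in a numeric Series it is converted to NaN, which makes the dtype float64:

```python
import numpy as np
from pandas import Series

# In a Series of strings, None and NaN are both reported as missing
text = Series(['a', None, np.nan])
text_mask = text.isnull().tolist()

# In a numeric Series, None is converted to NaN and the dtype becomes float64
num = Series([1, None, 3])
num_dtype = str(num.dtype)
num_mask = num.isnull().tolist()

print(text_mask)   # [False, True, True]
print(num_dtype)   # float64
print(num_mask)    # [False, True, False]
```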

Filtering Out Missing Data (dropna)

from numpy import nan as NA
data = Series([1,NA,3.5,NA,7])
data
0    1.0
1    NaN
2    3.5
3    NaN
4    7.0
dtype: float64
data.dropna()
0    1.0
2    3.5
4    7.0
dtype: float64
data[data.notnull()]
0    1.0
2    3.5
4    7.0
dtype: float64
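
The two expressions above give the same result. A short check, written as a sketch with the same sample values, confirms that both keep identical values under identical index labels and that the original Series is left unchanged:

```python
import numpy as np
from pandas import Series

data = Series([1, np.nan, 3.5, np.nan, 7])

dropped = data.dropna()
masked = data[data.notnull()]

# Both approaches keep the same values under the same index labels
same = dropped.equals(masked)
kept_labels = dropped.index.tolist()

# dropna returns a new object; the original still has all five entries
original_length = len(data)

print(same, kept_labels, original_length)  # True [0, 2, 4] 5
```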

For a DataFrame, dropna by default drops every row that contains at least one NA value.

data = DataFrame([[1,6.5,3],[1,NA,NA],[NA,NA,NA],[NA,6.5,3]])
data
     0    1    2
0  1.0  6.5  3.0
1  1.0  NaN  NaN
2  NaN  NaN  NaN
3  NaN  6.5  3.0
data.dropna()
     0    1    2
0  1.0  6.5  3.0

Passing how='all' drops only the rows in which every value is NA.

data.dropna(how='all')
     0    1    2
0  1.0  6.5  3.0
1  1.0  NaN  NaN
3  NaN  6.5  3.0

Passing axis=1, how='all' drops only the columns in which every value is NA.

data[4]=NA
data
     0    1    2   4
0  1.0  6.5  3.0 NaN
1  1.0  NaN  NaN NaN
2  NaN  NaN  NaN NaN
3  NaN  6.5  3.0 NaN
data.dropna(axis=1,how='all')
     0    1    2
0  1.0  6.5  3.0
1  1.0  NaN  NaN
2  NaN  NaN  NaN
3  NaN  6.5  3.0
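
To compare the how and axis options side by side, the sketch below rebuilds the same frame (including the all-NA column 4) and records the shape of each result. The dictionary name shapes is only for illustration. Once column 4 exists, every row contains an NA, so the default dropna() leaves no rows at all.

```python
import numpy as np
from pandas import DataFrame

NA = np.nan
data = DataFrame([[1, 6.5, 3], [1, NA, NA], [NA, NA, NA], [NA, 6.5, 3]])
data[4] = NA   # a column that is entirely NA

shapes = {
    "original": data.shape,
    "rows, how='any' (default)": data.dropna().shape,
    "rows, how='all'": data.dropna(how='all').shape,
    "columns, how='all'": data.dropna(axis=1, how='all').shape,
}

for label, shape in shapes.items():
    print(label, shape)
# original (4, 4)
# rows, how='any' (default) (0, 4)
# rows, how='all' (3, 4)
# columns, how='all' (4, 3)
```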
df = DataFrame(np.random.randn(7,3))
df.loc[:4,1]=NA   # label-based slice: rows 0 through 4 inclusive (.ix no longer exists in current pandas)
df.loc[:2,2]=NA
df
          0         1         2
0  0.315068       NaN       NaN
1  0.455580       NaN       NaN
2  0.207725       NaN       NaN
3  0.363081       NaN -0.931486
4  0.372784       NaN -1.448125
5  0.571531  2.717854 -1.289343
6 -0.627056 -0.035343 -0.632109

dropna(thresh=n) keeps only the rows (or, with axis=1, the columns) that contain at least n non-NaN values.

df.dropna(thresh=1)
          0         1         2
0  0.315068       NaN       NaN
1  0.455580       NaN       NaN
2  0.207725       NaN       NaN
3  0.363081       NaN -0.931486
4  0.372784       NaN -1.448125
5  0.571531  2.717854 -1.289343
6 -0.627056 -0.035343 -0.632109
df.dropna(thresh=2)
          0         1         2
3  0.363081       NaN -0.931486
4  0.372784       NaN -1.448125
5  0.571531  2.717854 -1.289343
6 -0.627056 -0.035343 -0.632109
df.dropna(thresh=3)
          0         1         2
5  0.571531  2.717854 -1.289343
6 -0.627056 -0.035343 -0.632109
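
One way to read thresh is as a filter on the count of non-NA values per row or column. The sketch below uses a frame of ones instead of random numbers (an assumption made so that the result is reproducible) with the same NA pattern as above. It compares dropna(thresh=n) with an explicit count.

```python
import numpy as np
from pandas import DataFrame

NA = np.nan
df = DataFrame(np.ones((7, 3)))
df.loc[:4, 1] = NA
df.loc[:2, 2] = NA

n = 2
by_thresh = df.dropna(thresh=n)
# Explicit version: count non-NA values per row and keep rows with at least n
by_count = df[df.notnull().sum(axis=1) >= n]
same = by_thresh.equals(by_count)
kept_rows = by_thresh.index.tolist()

# With axis=1 the threshold applies to columns:
# column 0 has 7 non-NA values, column 1 has 2, column 2 has 4
kept_cols = df.dropna(axis=1, thresh=3).columns.tolist()

print(same, kept_rows, kept_cols)  # True [3, 4, 5, 6] [0, 2]
```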

Filling In Missing Data (fillna)

By default, fillna returns a new object.

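Parameter  Description
value      Scalar value or dict-like object used to fill missing values
method     Fill method, such as 'ffill' (forward fill) or 'bfill' (backward fill); deprecated since pandas 2.1 in favor of the ffill() and bfill() methods
axis       Axis to fill along; default axis=0
inplace    Modify the calling object in place instead of producing a copy
limit      For forward and backward filling, the maximum number of consecutive values to fill

df = DataFrame(np.random.randn(7,3))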
df.loc[:4,1]=NA
df.loc[:2,2]=NA
df
          0         1         2
0  0.506939       NaN       NaN
1  0.651995       NaN       NaN
2  0.645202       NaN       NaN
3 -0.019539       NaN  0.026027
4  0.044561       NaN  0.448722
5 -2.916266 -0.503103  1.042819
6 -0.085931  2.058093  1.255430
df.fillna(0)
          0         1         2
0  0.506939  0.000000  0.000000
1  0.651995  0.000000  0.000000
2  0.645202  0.000000  0.000000
3 -0.019539  0.000000  0.026027
4  0.044561  0.000000  0.448722
5 -2.916266 -0.503103  1.042819
6 -0.085931  2.058093  1.255430
df
          0         1         2
0  0.506939       NaN       NaN
1  0.651995       NaN       NaN
2  0.645202       NaN       NaN
3 -0.019539       NaN  0.026027
4  0.044561       NaN  0.448722
5 -2.916266 -0.503103  1.042819
6 -0.085931  2.058093  1.255430

Passing a dict fills each column with a different value.

df.fillna({1:0.5,2:-1})
          0         1         2
0  0.506939  0.500000 -1.000000
1  0.651995  0.500000 -1.000000
2  0.645202  0.500000 -1.000000
3 -0.019539  0.500000  0.026027
4  0.044561  0.500000  0.448722
5 -2.916266 -0.503103  1.042819
6 -0.085931  2.058093  1.255430
df
          0         1         2
0  0.506939       NaN       NaN
1  0.651995       NaN       NaN
2  0.645202       NaN       NaN
3 -0.019539       NaN  0.026027
4  0.044561       NaN  0.448722
5 -2.916266 -0.503103  1.042819
6 -0.085931  2.058093  1.255430
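
Two details of the dict form may be worth checking: columns not named in the dict are left untouched, and a Series indexed by column label (such as the result of df.mean()) can stand in for the dict. The sketch below uses a small hand-written frame, not the random one above, so the numbers are predictable.

```python
import numpy as np
from pandas import DataFrame

NA = np.nan
df = DataFrame({0: [1.0, 2.0, 3.0], 1: [NA, 4.0, 8.0], 2: [NA, NA, 5.0]})

# Columns missing from the dict are left untouched
partial = df.fillna({1: 0.5})
col1_partial = partial[1].tolist()
remaining_na = int(partial[2].isnull().sum())

# df.mean() is a Series indexed by column label, so it acts like the dict
by_mean = df.fillna(df.mean())
col1_mean = by_mean[1].tolist()   # the mean of 4.0 and 8.0 is 6.0
col2_mean = by_mean[2].tolist()   # the only valid value is 5.0

print(col1_partial, remaining_na)  # [0.5, 4.0, 8.0] 2
print(col1_mean, col2_mean)        # [6.0, 4.0, 8.0] [5.0, 5.0, 5.0]
```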

fillna returns a new object by default, but passing inplace=True modifies the existing object in place.

_ = df.fillna(0,inplace=True)
df
          0         1         2
0  0.506939  0.000000  0.000000
1  0.651995  0.000000  0.000000
2  0.645202  0.000000  0.000000
3 -0.019539  0.000000  0.026027
4  0.044561  0.000000  0.448722
5 -2.916266 -0.503103  1.042819
6 -0.085931  2.058093  1.255430

Forward filling with the method parameter, combined with the limit parameter

df = DataFrame(np.random.randn(6,3))
df.loc[2:,1]= NA   # .loc replaces the deprecated .ix indexer
df.loc[4:,2]= NA
df
          0         1         2
0 -1.791661 -1.082456  0.119847
1 -0.998142 -1.814483  0.535919
2 -1.240185       NaN -0.080930
3 -0.472280       NaN  0.267896
4  0.135498       NaN       NaN
5 -0.386448       NaN       NaN
df.ffill()   # forward fill; older pandas spelled this df.fillna(method='ffill')
          0         1         2
0 -1.791661 -1.082456  0.119847
1 -0.998142 -1.814483  0.535919
2 -1.240185 -1.814483 -0.080930
3 -0.472280 -1.814483  0.267896
4  0.135498 -1.814483  0.267896
5 -0.386448 -1.814483  0.267896
df.ffill(limit=2)   # fill at most 2 consecutive NaNs; older pandas: df.fillna(method='ffill',limit=2)
          0         1         2
0 -1.791661 -1.082456  0.119847
1 -0.998142 -1.814483  0.535919
2 -1.240185 -1.814483 -0.080930
3 -0.472280 -1.814483  0.267896
4  0.135498       NaN  0.267896
5 -0.386448       NaN  0.267896
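
Backward filling works the same way in the opposite direction. The sketch below uses a hand-written Series (not from the original post) to show two things: a forward fill cannot fill a leading NaN because no earlier value exists, and limit counts from the nearest valid value in the fill direction.

```python
import numpy as np
from pandas import Series

NA = np.nan
s = Series([NA, 1.0, NA, NA, NA, 5.0])

forward = s.ffill()                 # the leading NaN has nothing before it
forward_limited = s.ffill(limit=2)  # only the first 2 NaNs after 1.0 are filled
backward = s.bfill()                # every NaN takes the next valid value
backward_limited = s.bfill(limit=1) # only the NaN right before a valid value

print(forward.tolist())
print(forward_limited.tolist())
print(backward.tolist())
print(backward_limited.tolist())
```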
data = Series([1,NA,3.5,NA,7])
data
0    1.0
1    NaN
2    3.5
3    NaN
4    7.0
dtype: float64
data.mean()
3.8333333333333335
data.fillna(data.mean())
0    1.000000
1    3.833333
2    3.500000
3    3.833333
4    7.000000
dtype: float64
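
The value 3.8333 comes from the three valid entries only, since mean() skips NaN by default: (1 + 3.5 + 7) / 3. The sketch below checks that arithmetic, and also that filling with the mean leaves the mean of the Series unchanged. The variable names are illustrative.

```python
import numpy as np
from pandas import Series

NA = np.nan
data = Series([1, NA, 3.5, NA, 7])

mean_skipping_na = data.mean()      # NaN entries are skipped by default
manual_mean = (1 + 3.5 + 7) / 3     # the same value computed by hand

filled = data.fillna(mean_skipping_na)
mean_after_fill = filled.mean()     # filling with the mean does not shift it

print(mean_skipping_na, manual_mean, mean_after_fill)
```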