7.1 Handling Missing Data
import pandas as pd
import numpy as np
from numpy import nan as NA
string_data=pd.Series(['aardvark', 'artichoke', np.nan, 'avocado'])
string_data.isnull()
0 False
1 False
2 True
3 False
dtype: bool
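As a small supplementary sketch (not part of the original notes): pandas also treats Python's built-in None as missing, and isna/notna are available as the alias and the complement of isnull. The names s, missing_mask and present_mask are illustrative only.

```python
import numpy as np
import pandas as pd

# Illustrative Series: None and np.nan are both treated as missing values.
s = pd.Series(['aardvark', None, np.nan, 'avocado'])

missing_mask = s.isna()    # isna is an alias of isnull
present_mask = s.notna()   # element-wise complement of isna

print(missing_mask.tolist())
print(int(missing_mask.sum()))  # number of missing entries
```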
1. Filtering Out Missing Data
For a Series, dropna returns a new Series containing only the non-null values and their index labels
string_data.dropna()
0 aardvark
1 artichoke
3 avocado
dtype: object
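For comparison, the same result can be obtained with boolean indexing on notnull(). This is only a sketch to show what dropna does on a Series; the names dropped and filtered are illustrative.

```python
import numpy as np
import pandas as pd

string_data = pd.Series(['aardvark', 'artichoke', np.nan, 'avocado'])

# dropna keeps only the non-missing entries...
dropped = string_data.dropna()
# ...which matches selecting the rows where notnull() is True.
filtered = string_data[string_data.notnull()]

print(dropped.equals(filtered))
```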
For a DataFrame, dropna by default drops every row that contains at least one missing value
data=pd.DataFrame([[1., 6.5, 3.], [1., NA, NA],[NA, NA, NA], [NA, 6.5, 3.]])
data.dropna()
| | 0 | 1 | 2 |
|---|---|---|---|
| 0 | 1.0 | 6.5 | 3.0 |
Passing how='all' drops only the rows in which every value is NA
data.dropna(how='all')
| | 0 | 1 | 2 |
|---|---|---|---|
| 0 | 1.0 | 6.5 | 3.0 |
| 1 | 1.0 | NaN | NaN |
| 3 | NaN | 6.5 | 3.0 |
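Passing axis=1 drops columns instead of rows; here an all-NA column is added first and then removed with how='all'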
data[4]=NA
data
| | 0 | 1 | 2 | 4 |
|---|---|---|---|---|
| 0 | 1.0 | 6.5 | 3.0 | NaN |
| 1 | 1.0 | NaN | NaN | NaN |
| 2 | NaN | NaN | NaN | NaN |
| 3 | NaN | 6.5 | 3.0 | NaN |
data.dropna(axis=1,how='all')
| | 0 | 1 | 2 |
|---|---|---|---|
| 0 | 1.0 | 6.5 | 3.0 |
| 1 | 1.0 | NaN | NaN |
| 2 | NaN | NaN | NaN |
| 3 | NaN | 6.5 | 3.0 |
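The thresh=n parameter keeps only the rows (or columns) that have at least n non-NA values; those with fewer than n non-NA values are dropped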
data.dropna(axis=0,thresh=2)
| | 0 | 1 | 2 | 4 |
|---|---|---|---|---|
| 0 | 1.0 | 6.5 | 3.0 | NaN |
| 3 | NaN | 6.5 | 3.0 | NaN |
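To make the thresh rule concrete, the sketch below counts the non-NA values in each row and keeps the rows whose count is at least 2, which should match dropna(thresh=2). The names non_na_counts, by_mask and by_thresh are illustrative and do not appear in the original notes.

```python
import numpy as np
import pandas as pd

# Same data as above, including the extra all-NA column 4.
data = pd.DataFrame([[1., 6.5, 3.], [1., np.nan, np.nan],
                     [np.nan, np.nan, np.nan], [np.nan, 6.5, 3.]])
data[4] = np.nan

# Count the non-NA values in each row.
non_na_counts = data.notna().sum(axis=1)

# Keep rows with at least 2 non-NA values, written out by hand...
by_mask = data[non_na_counts >= 2]
# ...and via the thresh parameter.
by_thresh = data.dropna(axis=0, thresh=2)

print(non_na_counts.tolist())
print(by_thresh.equals(by_mask))
```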
2. Filling In Missing Values
data.fillna(0)
| | 0 | 1 | 2 | 4 |
|---|---|---|---|---|
| 0 | 1.0 | 6.5 | 3.0 | 0.0 |
| 1 | 1.0 | 0.0 | 0.0 | 0.0 |
| 2 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 0.0 | 6.5 | 3.0 | 0.0 |
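Passing a dict (column label to fill value) fills different columns with different values; columns not listed in the dict are left unchanged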
data.fillna({2: 1, 4: 2})
| | 0 | 1 | 2 | 4 |
|---|---|---|---|---|
| 0 | 1.0 | 6.5 | 3.0 | 2.0 |
| 1 | 1.0 | NaN | 1.0 | 2.0 |
| 2 | NaN | NaN | 1.0 | 2.0 |
| 3 | NaN | 6.5 | 3.0 | 2.0 |
fillna returns a new object by default, but setting inplace=True modifies the original object instead
data.fillna(0,inplace=True)
data
| | 0 | 1 | 2 | 4 |
|---|---|---|---|---|
| 0 | 1.0 | 6.5 | 3.0 | 0.0 |
| 1 | 1.0 | 0.0 | 0.0 | 0.0 |
| 2 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 0.0 | 6.5 | 3.0 | 0.0 |
The method parameter can also be used to fill forward or backward
data=pd.DataFrame(np.random.randn(4,3))
data.iloc[0,1:]=NA
data.iloc[2:,:2]=NA
data
| | 0 | 1 | 2 |
|---|---|---|---|
| 0 | -1.405865 | NaN | NaN |
| 1 | -1.086872 | -0.087551 | 0.666916 |
| 2 | NaN | NaN | -0.037634 |
| 3 | NaN | NaN | -0.362835 |
ffill fills each NA with the preceding value in the same column; limit sets the maximum number of consecutive NA values to fill
data.fillna(method='ffill',limit=1)
| | 0 | 1 | 2 |
|---|---|---|---|
| 0 | -1.405865 | NaN | NaN |
| 1 | -1.086872 | -0.087551 | 0.666916 |
| 2 | -1.086872 | -0.087551 | -0.037634 |
| 3 | NaN | NaN | -0.362835 |
bfill fills each NA with the following value in the same column
data.fillna(method='bfill')
| | 0 | 1 | 2 |
|---|---|---|---|
| 0 | -1.405865 | -0.087551 | 0.666916 |
| 1 | -1.086872 | -0.087551 | 0.666916 |
| 2 | NaN | NaN | -0.037634 |
| 3 | NaN | NaN | -0.362835 |
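Note that newer pandas releases (2.1 and later, as far as I know) deprecate the method argument of fillna in favor of the dedicated ffill and bfill methods. The sketch below shows that spelling on a small fixed DataFrame instead of random data, so the result is reproducible; the column names a, b, c and the variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Fixed illustrative data with the same NA pattern as the example above.
df = pd.DataFrame({'a': [1.0, 2.0, np.nan, np.nan],
                   'b': [np.nan, 5.0, np.nan, np.nan],
                   'c': [np.nan, 8.0, 9.0, 10.0]})

# Forward fill, filling at most one consecutive NA per gap.
forward = df.ffill(limit=1)

# Backward fill; trailing NAs have no following value and stay NA.
backward = df.bfill()

print(forward)
print(backward)
```

Leading NAs are untouched by ffill and trailing NAs are untouched by bfill, which is why the two are sometimes chained when both ends need filling.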
You can also fill missing values with the mean (mean() skips NA values by default)
data[0].fillna(value=data[0].mean())
0 -1.405865
1 -1.086872
2 -1.246369
3 -1.246369
Name: 0, dtype: float64
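The same idea extends to a whole DataFrame: passing the Series returned by mean() to fillna fills each column with that column's own mean. A minimal sketch on fixed illustrative data (the names df and filled are not from the original notes):

```python
import numpy as np
import pandas as pd

# Illustrative data: one NA in each column.
df = pd.DataFrame({'a': [1.0, np.nan, 3.0],
                   'b': [np.nan, 4.0, 8.0]})

# df.mean() is a Series indexed by column label; fillna aligns on it,
# so each column is filled with its own mean.
filled = df.fillna(df.mean())

print(filled)
```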
7.2 Data Transformation
1. Removing Duplicate Data
data=pd.DataFrame({
'k1':['one','two']*3+['two'],'k2':[1,1,2,3,3,