我们知道Pandas中最常用的两种数据类型就是Series 和DataFrame,因此主要学习者两种数据类型的缺失数据的处理方法
拿到一份数据,首先查看数据完整性,了解缺失信息; 然后针对缺失的数据,是要进行填充?还是删除?还是插值?选择合适的方式对拿到的数据进行数据预处理。
1. 了解缺失信息:主要就是用isna和notna方法,对于isna方法,如果为缺失值,则返回True,如果不为缺失值,则返回False;notna方法则反之
import numpy as np
import pandas as pd
df = pd. read_csv( 'E:/data/table_missing.csv' )
df. head( )
School Class ID Gender Address Height Weight Math Physics 0 S_1 C_1 NaN M street_1 173 NaN 34.0 A+ 1 S_1 C_1 NaN F street_2 192 NaN 32.5 B+ 2 S_1 C_1 1103.0 M street_2 186 NaN 87.2 B+ 3 S_1 NaN NaN F street_2 167 81.0 80.4 NaN 4 S_1 C_1 1105.0 NaN street_4 159 64.0 84.8 A-
df[ 'Physics' ] . isna( ) . head( )
0 False
1 False
2 False
3 True
4 False
Name: Physics, dtype: bool
有时候对于一份数据,我们并不关心某一个缺失值,而是关心某个需要分析的属性值的数据完整性。如果该属性数据缺失较少, 可以考虑用合适的插值方法来对数据进行拟合(模拟);如果数据缺失较多,或许该数据数据就不适合应用到我们的分析当中。 可以利用如下方法来了解拿到手的数据的一个完整性情况:
df. isna( ) . sum ( )
School 0
Class 4
ID 6
Gender 7
Address 0
Height 0
Weight 13
Math 5
Physics 4
dtype: int64
df. info( )
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 35 entries, 0 to 34
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 School 35 non-null object
1 Class 31 non-null object
2 ID 29 non-null float64
3 Gender 28 non-null object
4 Address 35 non-null object
5 Height 35 non-null int64
6 Weight 22 non-null float64
7 Math 30 non-null float64
8 Physics 31 non-null object
dtypes: float64(3), int64(1), object(5)
memory usage: 2.6+ KB
2.处理缺失值
2.1 填充
主要用到的方法是:DataFrame.fillna(value = None,method = None,axis = None, inplace = False,limit=None, downcast=None, **kwargs) 其中, value是要用去填充缺失值的值;
df. fillna( 'missing' ) . head( )
School Class ID Gender Address Height Weight Math Physics 0 S_1 C_1 1103.0 M street_1 173 81.0 34.0 A+ 1 S_1 C_1 1103.0 F street_2 192 81.0 32.5 B+ 2 S_1 C_1 1103.0 M street_2 186 81.0 87.2 B+ 3 S_1 C_1 1105.0 F street_2 167 81.0 80.4 A- 4 S_1 C_1 1105.0 M street_4 159 64.0 84.8 A-
df. fillna( method= 'ffill' ) . head( )
School Class ID Gender Address Height Weight Math Physics 0 S_1 C_1 NaN M street_1 173 NaN 34.0 A+ 1 S_1 C_1 NaN F street_2 192 NaN 32.5 B+ 2 S_1 C_1 1103.0 M street_2 186 NaN 87.2 B+ 3 S_1 C_1 1103.0 F street_2 167 81.0 80.4 B+ 4 S_1 C_1 1105.0 F street_4 159 64.0 84.8 A-
df. fillna( method= 'backfill' ) . head( )
School Class ID Gender Address Height Weight Math Physics 0 S_1 C_1 1103.0 M street_1 173 81.0 34.0 A+ 1 S_1 C_1 1103.0 F street_2 192 81.0 32.5 B+ 2 S_1 C_1 1103.0 M street_2 186 81.0 87.2 B+ 3 S_1 C_1 1105.0 F street_2 167 81.0 80.4 A- 4 S_1 C_1 1105.0 M street_4 159 64.0 84.8 A-
2.2 剔除
主要用的是函数:DataFrame.dropna(axis=0, how=‘any’, thresh=None, subset=None, inplace=False) 删除含有空值的行或者列
df. dropna( ) . head( )
School Class ID Gender Address Height Weight Math Physics 5 S_1 C_2 1201.0 M street_5 159 68.0 97.0 A- 6 S_1 C_2 1202.0 F street_4 176 94.0 63.5 B- 12 S_1 C_3 1303.0 M street_7 188 82.0 49.7 B 17 S_2 C_1 2103.0 M street_4 157 61.0 52.5 B- 21 S_2 C_2 2202.0 F street_7 194 77.0 68.5 B+
df. dropna( axis= 1 ) . head( )
School Address Height 0 S_1 street_1 173 1 S_1 street_2 192 2 S_1 street_2 186 3 S_1 street_2 167 4 S_1 street_4 159
df. dropna( how= 'all' ) . head( )
School Class ID Gender Address Height Weight Math Physics 0 S_1 C_1 NaN M street_1 173 NaN 34.0 A+ 1 S_1 C_1 NaN F street_2 192 NaN 32.5 B+ 2 S_1 C_1 1103.0 M street_2 186 NaN 87.2 B+ 3 S_1 NaN NaN F street_2 167 81.0 80.4 NaN 4 S_1 C_1 1105.0 NaN street_4 159 64.0 84.8 A-
df. dropna( thresh= 2 ) . head( )
School Class ID Gender Address Height Weight Math Physics 0 S_1 C_1 NaN M street_1 173 NaN 34.0 A+ 1 S_1 C_1 NaN F street_2 192 NaN 32.5 B+ 2 S_1 C_1 1103.0 M street_2 186 NaN 87.2 B+ 3 S_1 NaN NaN F street_2 167 81.0 80.4 NaN 4 S_1 C_1 1105.0 NaN street_4 159 64.0 84.8 A-
2.3插值
2.3.1 线性插值
s = pd. Series( [ 1 , 10 , 15 , - 5 , - 2 , np. nan, np. nan, 28 ] )
s. plot( )
s. interpolate( ) . plot( )