Data Preprocessing
from IPython.display import Image
%matplotlib inline
Handling Missing Values
Identifying Missing Values in Tabular Data
import pandas as pd
from io import StringIO
import sys
csv_data = \
'''A,B,C,D
1.0,2.0,3.0,4.0
5.0,6.0,,8.0
10.0,11.0,12.0,'''
# Python 2 only: convert the string to unicode before passing it to StringIO
if sys.version_info < (3, 0):
    csv_data = unicode(csv_data)
df = pd.read_csv(StringIO(csv_data))
df
      A     B     C    D
0   1.0   2.0   3.0  4.0
1   5.0   6.0   NaN  8.0
2  10.0  11.0  12.0  NaN
df.isnull().sum()
A    0
B    0
C    1
D    1
dtype: int64
df.columns
Index(['A', 'B', 'C', 'D'], dtype='object')
df.values
array([[ 1.,  2.,  3.,  4.],
       [ 5.,  6., nan,  8.],
       [10., 11., 12., nan]])
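Besides the per-column counts, it can be useful to see what share of each column is missing and which rows are affected. A minimal sketch, rebuilding the same toy table so the block runs on its own (the names `missing_share` and `rows_with_nan` are only illustrative):

```python
import numpy as np
import pandas as pd

# Same toy table as above, rebuilt here so the snippet is self-contained.
df = pd.DataFrame(
    [[1.0, 2.0, 3.0, 4.0],
     [5.0, 6.0, np.nan, 8.0],
     [10.0, 11.0, 12.0, np.nan]],
    columns=['A', 'B', 'C', 'D'],
)

# Fraction of missing entries in each column (mean of a boolean mask).
missing_share = df.isnull().mean()

# Rows that contain at least one missing value.
rows_with_nan = df[df.isnull().any(axis=1)]

print(missing_share)
print(rows_with_nan)
```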
Removing Training Examples or Features with Missing Values
# drop rows that contain at least one missing value
df.dropna(axis=0)
     A    B    C    D
0  1.0  2.0  3.0  4.0
# drop columns that contain at least one missing value
df.dropna(axis=1)
      A     B
0   1.0   2.0
1   5.0   6.0
2  10.0  11.0
# only drop rows where all columns are NaN (none here, so every row is kept)
df.dropna(how='all')
      A     B     C    D
0   1.0   2.0   3.0  4.0
1   5.0   6.0   NaN  8.0
2  10.0  11.0  12.0  NaN
# drop rows that have fewer than 4 non-NaN values
df.dropna(thresh=4)
     A    B    C    D
0  1.0  2.0  3.0  4.0
# only drop rows where NaN appears in column 'C'
df.dropna(subset=['C'])
      A     B     C    D
0   1.0   2.0   3.0  4.0
2  10.0  11.0  12.0  NaN
Dropping the rows or columns that contain missing values is simple, but it can throw away a lot of valuable information.
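To make that cost concrete, one can compare the size of the table before and after dropping. A minimal sketch on the same toy table (rebuilt here so the block is self-contained; `rows_kept` and `cols_kept` are illustrative names):

```python
import numpy as np
import pandas as pd

# Same toy table as above.
df = pd.DataFrame(
    [[1.0, 2.0, 3.0, 4.0],
     [5.0, 6.0, np.nan, 8.0],
     [10.0, 11.0, 12.0, np.nan]],
    columns=['A', 'B', 'C', 'D'],
)

# Number of rows left after dropping every row with a NaN.
rows_kept = len(df.dropna(axis=0))

# Number of columns left after dropping every column with a NaN.
cols_kept = df.dropna(axis=1).shape[1]

print(f"rows kept: {rows_kept} of {len(df)}")
print(f"columns kept: {cols_kept} of {df.shape[1]}")
```

On this table only one of three rows, or two of four columns, survive, even though just two cells are missing.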
Imputing Missing Values
df.values
array([[ 1.,  2.,  3.,  4.],
       [ 5.,  6., nan,  8.],
       [10., 11., 12., nan]])
from sklearn.impute import SimpleImputer
import numpy as np
imr = SimpleImputer(missing_values=np.nan, strategy='mean')
"""
- If "mean", then replace missing values using the mean along
  each column. Can only be used with numeric data.
- If "median", then replace missing values using the median along
  each column. Can only be used with numeric data.
- If "most_frequent", then replace missing using the most frequent
  value along each column. Can be used with strings or numeric data.
- If "constant", then replace missing values with fill_value. Can be
  used with strings or numeric data.
"""
imr = imr.fit(df.values)
imputed_data = imr.transform(df.values)
imputed_data
array([[ 1. ,  2. ,  3. ,  4. ],
       [ 5. ,  6. ,  7.5,  8. ],
       [10. , 11. , 12. ,  6. ]])
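The other strategies listed in the docstring are used the same way. A minimal sketch with `'median'` and `'constant'`, on a NumPy copy of the same table (the fill value 0.0 is an arbitrary choice for illustration):

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Same values as df.values above.
X = np.array([[1.0, 2.0, 3.0, 4.0],
              [5.0, 6.0, np.nan, 8.0],
              [10.0, 11.0, 12.0, np.nan]])

# Replace each NaN with the median of its column.
median_imr = SimpleImputer(missing_values=np.nan, strategy='median')
X_median = median_imr.fit_transform(X)

# Replace each NaN with a fixed value (0.0 is an arbitrary example).
constant_imr = SimpleImputer(missing_values=np.nan, strategy='constant',
                             fill_value=0.0)
X_constant = constant_imr.fit_transform(X)

# statistics_ holds the per-column fill values learned during fit.
print(median_imr.statistics_)
print(X_median)
print(X_constant)
```

On this table the median and mean imputations coincide, because each incomplete column has exactly two observed values.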
A more convenient way to impute is the pandas fillna() method, which takes the fill values (here the column means) as its argument.
df.fillna(df.mean())
      A     B     C    D
0   1.0   2.0   3.0  4.0
1   5.0   6.0   7.5  8.0
2  10.0  11.0  12.0  6.0
The mean here is computed over the non-missing values of each column only: $\text{mean} = \cfrac{\text{sum of non-NaN values}}{\text{number of non-NaN values}}$. For column C, for example, this gives (3.0 + 12.0) / 2 = 7.5.
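The formula can be checked by hand for column C. A small sketch, with the column copied into a standalone Series (the variable names are illustrative):

```python
import numpy as np
import pandas as pd

# Column C of the toy table, including its missing entry.
col_c = pd.Series([3.0, np.nan, 12.0], name='C')

# Manual version of the formula: sum of non-NaN values / count of non-NaN values.
non_nan = col_c.dropna()
manual_mean = non_nan.sum() / len(non_nan)

# pandas skips NaN by default (skipna=True); np.nanmean does the same for arrays.
pandas_mean = col_c.mean()
numpy_mean = np.nanmean(col_c.values)

print(manual_mean, pandas_mean, numpy_mean)
```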
Understanding the scikit-learn API
Image(filename='images/04_01.png', width=400)
Image(filename='images/04_02.png', width=300)
Handling Categorical Data
Ordinal and Nominal Features
import pandas as pd
df = pd.DataFrame([['green', 'M', 10.1, 'class2'],
                   ['red', 'L', 13.5, 'class1'],
                   ['blue', 'XL', 15.3, 'class2']])
df.columns = ['color', 'size', 'price', 'classlabel']
df
   color size  price classlabel
0  green    M   10.1     class2
1    red    L   13.5     class1
2   blue   XL   15.3     class2
Mapping Ordinal Features
Although size looks like a categorical feature, its values have an inherent order (XL > L > M).
size_mapping = {
    'XL': 3,
    'L': 2,
    'M': 1}
df['size'] = df['size'].map(size_mapping)
df
   color  size  price classlabel
0  green     1   10.1     class2
1    red     2   13.5     class1
2   blue     3   15.3     class2
inv_size_mapping = {
    v: k for k, v in size_mapping.items()}
df['size'].map(inv_size_mapping)
0     M
1     L
2    XL
Name: size, dtype: object
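One thing to watch with a hand-written mapping: `Series.map` returns NaN for any value the dictionary does not cover, so a quick check for unmapped values can be worthwhile. A minimal sketch (the extra size 'S' is a hypothetical value added only to trigger the case):

```python
import pandas as pd

size_mapping = {'XL': 3, 'L': 2, 'M': 1}

# 'S' is a hypothetical size that the mapping does not cover.
sizes = pd.Series(['M', 'L', 'XL', 'S'])
encoded = sizes.map(size_mapping)

# Series.map yields NaN for keys missing from the dict,
# so the NaN positions point at the unmapped raw values.
unmapped = sizes[encoded.isnull()].unique()

print(encoded)
print(unmapped)
```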
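Encoding Class Labels
Note that class labels are not ordinal, so it does not matter which integer a particular string label is assigned. We can therefore simply enumerate the labels, starting at 0.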
import numpy as np
class_mapping = {
    label: idx for idx, label in enumerate(np.unique(df['classlabel']))}
class_mapping
{'class1': 0, 'class2': 1}
df['classlabel'] = df['classlabel'].map(class_mapping)
df
   color  size  price  classlabel
0  green     1   10.1           1
1    red     2   13.5           0
2   blue     3   15.3           1
inv_class_mapping = {
    v: k for k, v in class_mapping.items()}
df['classlabel'] = df['classlabel'].map(inv_class_mapping)
df
   color  size  price classlabel
0  green     1   10.1     class2
1    red     2   13.5     class1
2   blue     3   15.3     class2
from sklearn.preprocessing import LabelEncoder
class_le = LabelEncoder()
y = class_le.fit_transform(df['classlabel'].values)
y
array([1, 0, 1])
class_le.inverse_transform(y)
array(['class2', 'class1', 'class2'], dtype=object)
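The fitted encoder also exposes the learned labels through its `classes_` attribute, from which the same dictionary as `class_mapping` can be recovered. A small sketch on a standalone copy of the labels (the names `le` and `mapping` are illustrative):

```python
from sklearn.preprocessing import LabelEncoder

# Standalone copy of the classlabel column.
labels = ['class2', 'class1', 'class2']

le = LabelEncoder()
y = le.fit_transform(labels)

# classes_ lists the original labels in sorted order;
# the label at position i is the one encoded as integer i.
mapping = {str(label): idx for idx, label in enumerate(le.classes_)}

print(le.classes_)
print(mapping)
```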
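One-Hot Encoding Nominal Features
LabelEncoder can also be applied directly to a categorical feature; it assigns integer codes without taking any ordering of the values into account.
However, a learning algorithm will still treat the resulting integer codes as ordered.
One-hot encoding avoids this, which makes it a suitable way to handle unordered (nominal) features.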
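The difference can be seen on the color column alone. The sketch below is only one possible illustration: it uses pandas' `get_dummies` for the one-hot step, which is an assumption of this example and not necessarily the approach taken in the rest of this section.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Standalone copy of the color column.
colors = pd.Series(['green', 'red', 'blue'], name='color')

# Integer codes follow the sorted label order: blue=0, green=1, red=2.
# A model could misread these as blue < green < red.
codes = LabelEncoder().fit_transform(colors)
print(codes)

# One-hot encoding: one indicator column per color, no implied order.
one_hot = pd.get_dummies(colors, prefix='color', dtype=int)
print(one_hot)
```

Each row of `one_hot` contains exactly one 1, in the column that matches its color.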
X = df[ [ 'color' , 'size' , 'price'</