First, import the required packages and load the data.
import numpy as np
import pandas as pd
df = pd.read_csv('train.csv')
df.head()
   PassengerId  Survived  Pclass  Name                                               Sex     Age   SibSp  Parch  Ticket            Fare     Cabin  Embarked
0  1            0         3       Braund, Mr. Owen Harris                            male    22.0  1      0      A/5 21171         7.2500   NaN    S
1  2            1         1       Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0  1      0      PC 17599          71.2833  C85    C
2  3            1         3       Heikkinen, Miss. Laina                             female  26.0  0      0      STON/O2. 3101282  7.9250   NaN    S
3  4            1         1       Futrelle, Mrs. Jacques Heath (Lily May Peel)       female  35.0  1      0      113803            53.1000  C123   S
4  5            0         3       Allen, Mr. William Henry                           male    35.0  0      0      373450            8.0500   NaN    S
Data cleaning
1. Inspecting and handling missing values
The number of missing values in each column can be checked with info() or isnull().
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId 891 non-null int64
Survived 891 non-null int64
Pclass 891 non-null int64
Name 891 non-null object
Sex 891 non-null object
Age 714 non-null float64
SibSp 891 non-null int64
Parch 891 non-null int64
Ticket 891 non-null object
Fare 891 non-null float64
Cabin 204 non-null object
Embarked 889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB
df.isnull().sum()
PassengerId 0
Survived 0
Pclass 0
Name 0
Sex 0
Age 177
SibSp 0
Parch 0
Ticket 0
Fare 0
Cabin 687
Embarked 2
dtype: int64
The results above show that the Age, Cabin and Embarked columns all contain missing values. head() gives a quick look at these three columns.
df[['Age', 'Cabin', 'Embarked']].head(5)
   Age   Cabin  Embarked
0  22.0  NaN    S
1  38.0  C85    C
2  26.0  NaN    S
3  35.0  C123   S
4  35.0  NaN    S
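As a small supplement to the counts above, the share of missing values per column is often easier to compare than the raw count. The sketch below uses a made-up toy frame (not train.csv); the names toy, missing_count and missing_ratio are only illustrative.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for train.csv; the values are made up for illustration
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Embarked": ["S", "C", "S", "S"],
})

# isnull() gives a boolean frame; sum() counts the True values per column
missing_count = toy.isnull().sum()
# mean() of a boolean column is the fraction of missing entries
missing_ratio = toy.isnull().mean()

print(missing_count)
print(missing_ratio)
```

A column that is mostly empty, as Cabin is in the output above, is usually a candidate for dropping or for a dedicated "missing" category rather than for value imputation.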
Missing values can be handled either by filtering them out or by filling them in. Filtering means dropping the rows or columns that contain missing values; filling can use a constant, interpolation, or the median or mean.
df.dropna().head(3)
   PassengerId  Survived  Pclass  Name                                               Sex     Age   SibSp  Parch  Ticket            Fare     Cabin  Embarked
1  2            1         1       Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0  1      0      PC 17599          71.2833  C85    C
3  4            1         1       Futrelle, Mrs. Jacques Heath (Lily May Peel)       female  35.0  1      0      113803            53.1000  C123   S
6  7            0         1       McCarthy, Mr. Timothy J                            male    54.0  0      0      17463             51.8625  E46    S
df.fillna(0).head()
   PassengerId  Survived  Pclass  Name                                               Sex     Age   SibSp  Parch  Ticket            Fare     Cabin  Embarked
0  1            0         3       Braund, Mr. Owen Harris                            male    22.0  1      0      A/5 21171         7.2500   0      S
1  2            1         1       Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0  1      0      PC 17599          71.2833  C85    C
2  3            1         3       Heikkinen, Miss. Laina                             female  26.0  0      0      STON/O2. 3101282  7.9250   0      S
3  4            1         1       Futrelle, Mrs. Jacques Heath (Lily May Peel)       female  35.0  1      0      113803            53.1000  C123   S
4  5            0         3       Allen, Mr. William Henry                           male    35.0  0      0      373450            8.0500   0      S
data = df.fillna({'Age': 0})
data.head()
   PassengerId  Survived  Pclass  Name                                               Sex     Age   SibSp  Parch  Ticket            Fare     Cabin  Embarked
0  1            0         3       Braund, Mr. Owen Harris                            male    22.0  1      0      A/5 21171         7.2500   NaN    S
1  2            1         1       Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0  1      0      PC 17599          71.2833  C85    C
2  3            1         3       Heikkinen, Miss. Laina                             female  26.0  0      0      STON/O2. 3101282  7.9250   NaN    S
3  4            1         1       Futrelle, Mrs. Jacques Heath (Lily May Peel)       female  35.0  1      0      113803            53.1000  C123   S
4  5            0         3       Allen, Mr. William Henry                           male    35.0  0      0      373450            8.0500   NaN    S
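Parameters of fillna:
value: scalar or dict-like object used to fill the missing values
method: fill method, e.g. 'ffill' (forward fill) or 'bfill' (backward fill); deprecated in recent pandas releases in favour of the ffill() and bfill() methods
axis: axis to fill along, default axis=0
inplace: modify the calling object instead of producing a copy
limit: maximum number of consecutive values to fill when filling forward or backward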
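The text mentions filling with the median, the mean, or by interpolation, but only constant filling is shown. A minimal sketch of the other options on a made-up Age series follows; the variable names are illustrative, and which strategy suits the real Age column is a modelling choice this example does not settle.

```python
import numpy as np
import pandas as pd

# Made-up ages with two gaps; not taken from train.csv
ages = pd.Series([22.0, np.nan, 26.0, np.nan, 35.0], name="Age")

filled_const = ages.fillna(0)                # constant fill
filled_median = ages.fillna(ages.median())   # median of 22, 26, 35 is 26
filled_mean = ages.fillna(ages.mean())       # mean of the observed values
filled_interp = ages.interpolate()           # linear interpolation between neighbours
filled_ffill = ages.ffill()                  # carry the last observed value forward

print(filled_median.tolist())
print(filled_interp.tolist())
print(filled_ffill.tolist())
```

Filling with 0 is rarely a sensible age, so the median or mean is usually the safer default for a column like Age.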
2. Inspecting and handling duplicate values
Duplicate rows are inspected mainly with duplicated(). Here the result is empty, so the data contain no fully duplicated rows.
df[df.duplicated()]
   PassengerId  Survived  Pclass  Name  Sex  Age  SibSp  Parch  Ticket  Fare  Cabin  Embarked
The usual way to handle duplicates is to drop them.
df.drop_duplicates().head()
   PassengerId  Survived  Pclass  Name                                               Sex     Age   SibSp  Parch  Ticket            Fare     Cabin  Embarked
0  1            0         3       Braund, Mr. Owen Harris                            male    22.0  1      0      A/5 21171         7.2500   NaN    S
1  2            1         1       Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0  1      0      PC 17599          71.2833  C85    C
2  3            1         3       Heikkinen, Miss. Laina                             female  26.0  0      0      STON/O2. 3101282  7.9250   NaN    S
3  4            1         1       Futrelle, Mrs. Jacques Heath (Lily May Peel)       female  35.0  1      0      113803            53.1000  C123   S
4  5            0         3       Allen, Mr. William Henry                           male    35.0  0      0      373450            8.0500   NaN    S
data.to_csv('test_clear.csv')
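Since the Titanic data happen to contain no duplicate rows, the effect of duplicated() and drop_duplicates() is easier to see on a toy frame. The sketch below uses made-up rows; the subset and keep arguments are standard pandas parameters that the original text does not use.

```python
import pandas as pd

# Made-up rows: row 2 repeats row 0 exactly, row 3 repeats only the name
toy = pd.DataFrame({
    "Name": ["Braund", "Cumings", "Braund", "Braund"],
    "Ticket": ["A/5 21171", "PC 17599", "A/5 21171", "A/5 99999"],
})

# True for every row that repeats an earlier row in all columns
full_dupes = toy.duplicated()
# Restrict the comparison to selected columns
name_dupes = toy.duplicated(subset=["Name"])

# Drop exact duplicates, keeping the first occurrence
deduped = toy.drop_duplicates()
# Drop rows with a repeated Name, keeping the last occurrence instead
deduped_last = toy.drop_duplicates(subset=["Name"], keep="last")

print(full_dupes.tolist())
print(deduped_last)
```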
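Inspecting and processing features
Features can be divided into numerical features and text features; numerical features can be further divided into discrete and continuous ones.
1. Discretization and binning
Binning discretizes continuous data by placing the values of a continuous variable into different "bins" according to some criterion.
Three ways of binning Age are shown here: splitting Age into 5 equal-width age bands labelled with the categories 1 to 5, binning by predefined intervals, and binning by percentiles.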
df['AgeBand'] = pd.cut(df['Age'], 5, labels=['1', '2', '3', '4', '5'])
df.head()
   PassengerId  Survived  Pclass  Name                                               Sex     Age   SibSp  Parch  Ticket            Fare     Cabin  Embarked  AgeBand
0  1            0         3       Braund, Mr. Owen Harris                            male    22.0  1      0      A/5 21171         7.2500   NaN    S         2
1  2            1         1       Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0  1      0      PC 17599          71.2833  C85    C         3
2  3            1         3       Heikkinen, Miss. Laina                             female  26.0  0      0      STON/O2. 3101282  7.9250   NaN    S         2
3  4            1         1       Futrelle, Mrs. Jacques Heath (Lily May Peel)       female  35.0  1      0      113803            53.1000  C123   S         3
4  5            0         3       Allen, Mr. William Henry                           male    35.0  0      0      373450            8.0500   NaN    S         3
df.to_csv('test_ave.csv')
df['AgeBand'] = pd.cut(df['Age'], [0, 5, 15, 30, 50, 80], labels=['1', '2', '3', '4', '5'])
df.head()
   PassengerId  Survived  Pclass  Name                                               Sex     Age   SibSp  Parch  Ticket            Fare     Cabin  Embarked  AgeBand
0  1            0         3       Braund, Mr. Owen Harris                            male    22.0  1      0      A/5 21171         7.2500   NaN    S         3
1  2            1         1       Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0  1      0      PC 17599          71.2833  C85    C         4
2  3            1         3       Heikkinen, Miss. Laina                             female  26.0  0      0      STON/O2. 3101282  7.9250   NaN    S         3
3  4            1         1       Futrelle, Mrs. Jacques Heath (Lily May Peel)       female  35.0  1      0      113803            53.1000  C123   S         4
4  5            0         3       Allen, Mr. William Henry                           male    35.0  0      0      373450            8.0500   NaN    S         4
df.to_csv('test_cut.csv')
# The quantile edges stop at 0.9, so ages above the 90th percentile fall outside every bin and become NaN
df['AgeBand'] = pd.qcut(df['Age'], [0, 0.1, 0.3, 0.5, 0.7, 0.9], labels=['1', '2', '3', '4', '5'])
df.head()
   PassengerId  Survived  Pclass  Name                                               Sex     Age   SibSp  Parch  Ticket            Fare     Cabin  Embarked  AgeBand
0  1            0         3       Braund, Mr. Owen Harris                            male    22.0  1      0      A/5 21171         7.2500   NaN    S         2
1  2            1         1       Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0  1      0      PC 17599          71.2833  C85    C         5
2  3            1         3       Heikkinen, Miss. Laina                             female  26.0  0      0      STON/O2. 3101282  7.9250   NaN    S         3
3  4            1         1       Futrelle, Mrs. Jacques Heath (Lily May Peel)       female  35.0  1      0      113803            53.1000  C123   S         4
4  5            0         3       Allen, Mr. William Henry                           male    35.0  0      0      373450            8.0500   NaN    S         4
df.to_csv('test_pr.csv')
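To make the difference between the three binning calls concrete, here is a minimal sketch on ten made-up ages. The bin edges and labels are chosen only for illustration. It also shows what happens when the quantile list passed to qcut does not end at 1.0: values above the last quantile receive no bin.

```python
import pandas as pd

# Ten made-up ages; not taken from train.csv
ages = pd.Series([2, 10, 20, 30, 40, 50, 60, 70, 80, 90], dtype=float)

# Equal-width bins: the range 2..90 is split into 3 intervals of equal length
equal_width = pd.cut(ages, 3, labels=["1", "2", "3"])

# Explicit edges: intervals (0, 15], (15, 50], (50, 90]
fixed_edges = pd.cut(ages, [0, 15, 50, 90], labels=["1", "2", "3"])

# Quantile bins that stop at the 90th percentile: the largest value gets NaN
quantile_open = pd.qcut(ages, [0, 0.5, 0.9], labels=["1", "2"])

# Quantile bins that end at 1.0 cover every value
quantile_full = pd.qcut(ages, [0, 0.5, 1.0], labels=["1", "2"])

print(equal_width.tolist())
print(fixed_edges.tolist())
print(quantile_open.isna().sum(), quantile_full.isna().sum())
```

If every passenger should receive an age band, the quantile list would need to end at 1.0; whether the original author intended to leave the top 10% unbinned is not stated.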
2. Converting text variables
The categories of a text variable can be listed with value_counts() or unique().
df['Sex'].value_counts()
male 577
female 314
Name: Sex, dtype: int64
df['Cabin'].value_counts()
G6 4
C23 C25 C27 4
B96 B98 4
F33 3
F2 3
C22 C26 3
D 3
E101 3
C125 2
B77 2
C83 2
B18 2
E33 2
F4 2
C124 2
D35 2
B57 B59 B63 B66 2
D26 2
C78 2
B22 2
B58 B60 2
D33 2
C52 2
E25 2
C68 2
B28 2
E44 2
E121 2
B49 2
B51 B53 B55 2
..
A19 1
D46 1
B71 1
E63 1
A14 1
B82 B84 1
C46 1
C85 1
D30 1
D49 1
D21 1
B78 1
B37 1
B73 1
A16 1
C91 1
B69 1
A6 1
B30 1
F G63 1
E58 1
B101 1
C82 1
E46 1
E68 1
D6 1
B102 1
B4 1
T 1
A34 1
Name: Cabin, Length: 147, dtype: int64
df['Embarked'].value_counts()
S 644
C 168
Q 77
Name: Embarked, dtype: int64
df['Sex'].unique()
array(['male', 'female'], dtype=object)
df['Cabin'].unique()
array([nan, 'C85', 'C123', 'E46', 'G6', 'C103', 'D56', 'A6',
'C23 C25 C27', 'B78', 'D33', 'B30', 'C52', 'B28', 'C83', 'F33',
'F G73', 'E31', 'A5', 'D10 D12', 'D26', 'C110', 'B58 B60', 'E101',
'F E69', 'D47', 'B86', 'F2', 'C2', 'E33', 'B19', 'A7', 'C49', 'F4',
'A32', 'B4', 'B80', 'A31', 'D36', 'D15', 'C93', 'C78', 'D35',
'C87', 'B77', 'E67', 'B94', 'C125', 'C99', 'C118', 'D7', 'A19',
'B49', 'D', 'C22 C26', 'C106', 'C65', 'E36', 'C54',
'B57 B59 B63 B66', 'C7', 'E34', 'C32', 'B18', 'C124', 'C91', 'E40',
'T', 'C128', 'D37', 'B35', 'E50', 'C82', 'B96 B98', 'E10', 'E44',
'A34', 'C104', 'C111', 'C92', 'E38', 'D21', 'E12', 'E63', 'A14',
'B37', 'C30', 'D20', 'B79', 'E25', 'D46', 'B73', 'C95', 'B38',
'B39', 'B22', 'C86', 'C70', 'A16', 'C101', 'C68', 'A10', 'E68',
'B41', 'A20', 'D19', 'D50', 'D9', 'A23', 'B50', 'A26', 'D48',
'E58', 'C126', 'B71', 'B51 B53 B55', 'D49', 'B5', 'B20', 'F G63',
'C62 C64', 'E24', 'C90', 'C45', 'E8', 'B101', 'D45', 'C46', 'D30',
'E121', 'D11', 'E77', 'F38', 'B3', 'D6', 'B82 B84', 'D17', 'A36',
'B102', 'B69', 'E49', 'C47', 'D28', 'E17', 'A24', 'C50', 'B42',
'C148'], dtype=object)
df['Embarked'].unique()
array(['S', 'C', 'Q', nan], dtype=object)
For convenience, categorical text is sometimes converted to numeric labels such as 1, 2, 3, 4, 5. Below, the conversion is done with replace(), with map(), and with the LabelEncoder class from sklearn.preprocessing.
df['Sex_num'] = df['Sex'].replace(['male', 'female'], [1, 2])
df.head(3)
   PassengerId  Survived  Pclass  Name                                               Sex     Age   SibSp  Parch  Ticket            Fare     Cabin  Embarked  AgeBand  Sex_num
0  1            0         3       Braund, Mr. Owen Harris                            male    22.0  1      0      A/5 21171         7.2500   NaN    S         2        1
1  2            1         1       Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0  1      0      PC 17599          71.2833  C85    C         5        2
2  3            1         3       Heikkinen, Miss. Laina                             female  26.0  0      0      STON/O2. 3101282  7.9250   NaN    S         3        2
df['Sex_num'] = df['Sex'].map({'male': 1, 'female': 2})
df.head(3)
   PassengerId  Survived  Pclass  Name                                               Sex     Age   SibSp  Parch  Ticket            Fare     Cabin  Embarked  AgeBand  Sex_num
0  1            0         3       Braund, Mr. Owen Harris                            male    22.0  1      0      A/5 21171         7.2500   NaN    S         2        1
1  2            1         1       Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0  1      0      PC 17599          71.2833  C85    C         5        2
2  3            1         3       Heikkinen, Miss. Laina                             female  26.0  0      0      STON/O2. 3101282  7.9250   NaN    S         3        2
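replace() and map() give the same result on Sex because every value has a mapping. They differ when a value is not covered, which the following sketch illustrates on a made-up series containing an extra 'unknown' category (an assumption for illustration; the real Sex column has no such value).

```python
import pandas as pd

# Made-up series with one value that the mapping does not cover
sex = pd.Series(["male", "female", "unknown"])
mapping = {"male": 1, "female": 2}

# replace() leaves values without a mapping untouched
via_replace = sex.replace(mapping)

# map() turns values without a mapping into NaN
via_map = sex.map(mapping)

print(via_replace.tolist())
print(via_map.tolist())
```

map() is therefore the stricter choice: an unexpected category shows up as a missing value instead of silently remaining as text.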
from sklearn.preprocessing import LabelEncoder
for feat in ['Cabin', 'Ticket']:
    # Option 1: build the mapping by hand; missing values stay NaN
    categories = df[feat].dropna().unique()
    label_dict = dict(zip(categories, range(len(categories))))
    df[feat + "_labelEncode"] = df[feat].map(label_dict)
    # Option 2: LabelEncoder; this overwrites the column produced by option 1
    lbl = LabelEncoder()
    df[feat + "_labelEncode"] = lbl.fit_transform(df[feat].astype(str))
df.head(3)
   PassengerId  Survived  Pclass  Name                                               Sex     Age   SibSp  Parch  Ticket            Fare     Cabin  Embarked  AgeBand  Sex_num  Cabin_labelEncode  Ticket_labelEncode  Title
0  1            0         3       Braund, Mr. Owen Harris                            male    22.0  1      0      A/5 21171         7.2500   NaN    S         2        1        147                523                 Mr
1  2            1         1       Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0  1      0      PC 17599          71.2833  C85    C         5        2        81                 596                 Mrs
2  3            1         3       Heikkinen, Miss. Laina                             female  26.0  0      0      STON/O2. 3101282  7.9250   NaN    S         3        2        147                669                 Miss
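LabelEncoder numbers the classes in sorted order, which is why the missing cabins (turned into the text 'nan') receive the highest code, 147, in the output above. The pandas-only sketch below imitates that behaviour on made-up cabin values so the numbering is easy to follow. It is not scikit-learn's implementation, and the names as_text, classes and code_of are illustrative.

```python
import numpy as np
import pandas as pd

# Made-up cabin values with one missing entry
cabin = pd.Series(["C85", np.nan, "C123", "C85"])

# Replace missing values by the placeholder text 'nan', mirroring what
# astype(str) produced in the original code
as_text = cabin.fillna("nan")

# Sorted unique classes: uppercase letters sort before lowercase ones,
# so 'nan' ends up last
classes = sorted(as_text.unique())
code_of = {label: code for code, label in enumerate(classes)}

encoded = as_text.map(code_of)

print(classes)
print(encoded.tolist())
```

Because the codes follow alphabetical order, they carry no meaningful ranking; they are only identifiers for the categories.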
One-hot encoding: one-hot encoding maps the values of a discrete feature into Euclidean space, so that each value of the feature corresponds to one point in that space.
for feat in ['Age', 'Embarked']:
    x = pd.get_dummies(df[feat], prefix=feat)
    df = pd.concat([df, x], axis=1)
df.head()
   PassengerId  Survived  Pclass  Name                                               Sex     Age   SibSp  Parch  Ticket            Fare     ...  Age_65.0  Age_66.0  Age_70.0  Age_70.5  Age_71.0  Age_74.0  Age_80.0  Embarked_C  Embarked_Q  Embarked_S
0  1            0         3       Braund, Mr. Owen Harris                            male    22.0  1      0      A/5 21171         7.2500   ...  0         0         0         0         0         0         0         0           0           1
1  2            1         1       Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0  1      0      PC 17599          71.2833  ...  0         0         0         0         0         0         0         1           0           0
2  3            1         3       Heikkinen, Miss. Laina                             female  26.0  0      0      STON/O2. 3101282  7.9250   ...  0         0         0         0         0         0         0         0           0           1
3  4            1         1       Futrelle, Mrs. Jacques Heath (Lily May Peel)       female  35.0  1      0      113803            53.1000  ...  0         0         0         0         0         0         0         0           0           1
4  5            0         3       Allen, Mr. William Henry                           male    35.0  0      0      373450            8.0500   ...  0         0         0         0         0         0         0         0           0           1
5 rows × 108 columns
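The 108 columns above come largely from encoding every distinct Age value as its own column. On a low-cardinality column the result is easier to read, as in this minimal sketch on made-up Embarked values. The dummy_na argument is a standard get_dummies parameter that the original code does not use.

```python
import numpy as np
import pandas as pd

# Made-up ports of embarkation with one missing entry
toy = pd.DataFrame({"Embarked": ["S", "C", np.nan, "Q"]})

# One indicator column per category; a missing value yields all zeros
dummies = pd.get_dummies(toy["Embarked"], prefix="Embarked")

# dummy_na=True adds an extra indicator column for missing values
with_na = pd.get_dummies(toy["Embarked"], prefix="Embarked", dummy_na=True)

# Append the indicator columns to the original frame
toy = pd.concat([toy, dummies], axis=1)

print(toy)
print(with_na.columns.tolist())
```

For a continuous feature such as Age, one-hot encoding the binned AgeBand column instead of the raw values would keep the number of new columns small.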
3. Extracting features from free text
df['Title'] = df.Name.str.extract(r'([A-Za-z]+)\.', expand=False)
df.head()
   PassengerId  Survived  Pclass  Name                                               Sex     Age   SibSp  Parch  Ticket            Fare     Cabin  Embarked  AgeBand  Sex_num  Cabin_labelEncode  Ticket_labelEncode  Title
0  1            0         3       Braund, Mr. Owen Harris                            male    22.0  1      0      A/5 21171         7.2500   NaN    S         2        1        147                523                 Mr
1  2            1         1       Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0  1      0      PC 17599          71.2833  C85    C         5        2        81                 596                 Mrs
2  3            1         3       Heikkinen, Miss. Laina                             female  26.0  0      0      STON/O2. 3101282  7.9250   NaN    S         3        2        147                669                 Miss
3  4            1         1       Futrelle, Mrs. Jacques Heath (Lily May Peel)       female  35.0  1      0      113803            53.1000  C123   S         4        2        55                 49                  Mrs
4  5            0         3       Allen, Mr. William Henry                           male    35.0  0      0      373450            8.0500   NaN    S         4        1        147                472                 Mr
df.to_csv('test_fin.csv')
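The pattern used for Title captures the first run of letters that is immediately followed by a period, which in these names is the honorific. A minimal sketch on three names taken from the rows shown above follows; the variable names are illustrative.

```python
import pandas as pd

names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Futrelle, Mrs. Jacques Heath (Lily May Peel)",
])

# ([A-Za-z]+) captures a run of letters; \. requires a literal period after it.
# expand=False returns a Series instead of a one-column DataFrame.
titles = names.str.extract(r"([A-Za-z]+)\.", expand=False)

print(titles.tolist())
print(titles.value_counts())
```

Calling value_counts() on the extracted titles is a quick way to spot rare honorifics that may be worth grouping before modelling.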