Hands-on Data Analysis: Chapter 2 Task 1, Data Cleaning and Feature Processing (Course Notes)

Before starting, import numpy and pandas and load the data.

# Load the required libraries
import numpy as np
import pandas as pd
# Load the data train.csv
train = pd.read_csv('train.csv')
train.head()
   PassengerId  Survived  Pclass                                               Name     Sex   Age  SibSp  Parch            Ticket     Fare Cabin Embarked
0            1         0       3                            Braund, Mr. Owen Harris    male  22.0      1      0         A/5 21171   7.2500   NaN        S
1            2         1       1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1      0          PC 17599  71.2833   C85        C
2            3         1       3                             Heikkinen, Miss. Laina  female  26.0      0      0  STON/O2. 3101282   7.9250   NaN        S
3            4         1       1       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1      0            113803  53.1000  C123        S
4            5         0       3                           Allen, Mr. William Henry    male  35.0      0      0            373450   8.0500   NaN        S

2 Chapter 2: Data Cleaning and Feature Processing

The data we get is usually not clean. By "not clean" we mean it contains missing values, outliers, and so on, and needs some processing before we can continue with analysis or modeling. So the first step after getting the data is data cleaning. In this chapter we will learn how to handle missing values, duplicate values, strings, and data conversion, cleaning the data into a form that is ready for analysis or modeling.

2.1 Observing and Handling Missing Values

The data we get often has many missing values. For example, we can see NaN in the Cabin column. Do the other columns have missing values too, and how should these missing values be handled?

2.1.1 Task 1: Observing missing values

(1) Check the number of missing values in each feature.
(2) Check the data in the Age, Cabin, and Embarked columns.
There is more than one way to do each of these, so the more approaches the better.

# Code
# Method 1:
train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
# Code
# Method 2:
train.isnull().sum()
PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64
# >>> df.count()        # count non-missing values per column
# >>> df.count(axis=1)  # count non-missing values per row
# Method 3:
train.shape[0]-train.count()
PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64
# Code
train[['Age','Cabin','Embarked']].head(3)
    Age Cabin Embarked
0  22.0   NaN        S
1  38.0   C85        C
2  26.0   NaN        S
2.1.2 Task 2: Handling missing values

(1) What are the common approaches to handling missing values?

(2) Try handling the missing values in the Age column.

(3) Try different methods to handle the missing values of the whole table at once.

Common approaches to handling missing values:
Hint: useful functions include dropna and fillna.

Approach 1:

Simply drop the rows or columns that contain missing values.

# Example
df = pd.DataFrame({'col1': [2, 3, 5, np.nan, 9, np.nan],
                   'col2': [4, 3, np.nan, 5, 12, 20],
                   'col3': [np.nan, 15, 17, 19, 10, 11]})
df
   col1  col2  col3
0   2.0   4.0   NaN
1   3.0   3.0  15.0
2   5.0   NaN  17.0
3   NaN   5.0  19.0
4   9.0  12.0  10.0
5   NaN  20.0  11.0
# # Drop columns with missing values
# alldrop1 = df.dropna(axis=1)
# Drop rows with missing values
alldrop1 = df.dropna()
alldrop1
   col1  col2  col3
1   3.0   3.0  15.0
4   9.0  12.0  10.0

Approach 2:

Compute a replacement for the missing values, for example filling with a constant, or with the variable's mean or mode.

# Fill with a constant
df1 = df.fillna(500)
df1
    col1   col2   col3
0    2.0    4.0  500.0
1    3.0    3.0   15.0
2    5.0  500.0   17.0
3  500.0    5.0   19.0
4    9.0   12.0   10.0
5  500.0   20.0   11.0
# Fill each column with a different constant via a dict
df2 = df.fillna({'col1': 10, 'col2': 20, 'col3': 30})
df2
# Passing inplace=True modifies the original object directly
# df1.fillna(0, inplace=True)
   col1  col2  col3
0   2.0   4.0  30.0
1   3.0   3.0  15.0
2   5.0  20.0  17.0
3  10.0   5.0  19.0
4   9.0  12.0  10.0
5  10.0  20.0  11.0
# Pass method="..." to change the fill strategy:
df3 = pd.DataFrame(np.random.randint(0, 10, (5, 5)))
df3.iloc[1:4, 3] = np.nan
df3.iloc[2:4, 4] = np.nan
df3
   0  1  2    3    4
0  8  5  9  5.0  3.0
1  8  9  3  NaN  4.0
2  1  4  6  NaN  NaN
3  7  9  5  NaN  NaN
4  6  3  6  8.0  0.0
df3.fillna(method='ffill')  # fill with the previous non-missing value
   0  1  2    3    4
0  8  5  9  5.0  3.0
1  8  9  3  5.0  4.0
2  1  4  6  5.0  4.0
3  7  9  5  5.0  4.0
4  6  3  6  8.0  0.0
df3.fillna(method='bfill', limit=2)  # fill with the next non-missing value
# limit parameter: caps the number of consecutive fills
   0  1  2    3    4
0  8  5  9  5.0  3.0
1  8  9  3  NaN  4.0
2  1  4  6  8.0  0.0
3  7  9  5  8.0  0.0
4  6  3  6  8.0  0.0
# Pass axis=... to change the fill direction:
df3.fillna(method="ffill", limit=1, axis=1)
     0    1    2    3    4
0  8.0  5.0  9.0  5.0  3.0
1  8.0  9.0  3.0  3.0  4.0
2  1.0  4.0  6.0  6.0  NaN
3  7.0  9.0  5.0  5.0  NaN
4  6.0  3.0  6.0  8.0  0.0
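Besides ffill/bfill, pandas can also fill numeric gaps by interpolation with `Series.interpolate`. A minimal sketch on toy data of my own (not from the course):

```python
import numpy as np
import pandas as pd

# A numeric series with an interior gap
s = pd.Series([1.0, np.nan, np.nan, 4.0])

# Linear interpolation fills each gap proportionally between
# the surrounding non-missing values
filled = s.interpolate(method='linear')
print(filled.tolist())  # [1.0, 2.0, 3.0, 4.0]
```

Unlike ffill/bfill, this uses both neighbors, which is often more appropriate for ordered numeric data such as time series.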
# Fill with the group mean
# Uses the groupby method
# https://blog.csdn.net/u010383605/article/details/78879515
df = pd.DataFrame({'code': [1, 2, 3, 4, 5, 6, 7, 8],
                   'value': [np.nan, 5, 7, 8, 9, 10, 11, 12],
                   'value2': [5, np.nan, 7, np.nan, 9, 10, 11, 12],
                   'indstry': ['农业1','农业1','农业1','农业2','农业2','农业4','农业2','农业3']},
                  columns=['code','value','value2','indstry'],
                  index=list('ABCDEFGH'))

# Keep only the columns to be processed
cols = [col for col in df.columns if col not in ['code', 'indstry']]
print(cols)
# Grouping column
gp_col = 'indstry'
# Boolean mask of the NaN positions
df_na = df[cols].isna()
# Means per group
df_mean = df.groupby(gp_col)[cols].mean()

print(df)

# Process each column in turn
for col in cols:
    na_series = df_na[col]
    names = list(df.loc[na_series, gp_col])

    t = df_mean.loc[names, col]
    t.index = df.loc[na_series, col].index

    # Assign by matching index
    df.loc[na_series, col] = t

df
['value', 'value2']
   code  value  value2 indstry
A     1    NaN     5.0     农业1
B     2    5.0     NaN     农业1
C     3    7.0     7.0     农业1
D     4    8.0     NaN     农业2
E     5    9.0     9.0     农业2
F     6   10.0    10.0     农业4
G     7   11.0    11.0     农业2
H     8   12.0    12.0     农业3
   code  value  value2 indstry
A     1    6.0     5.0     农业1
B     2    5.0     6.0     农业1
C     3    7.0     7.0     农业1
D     4    8.0    10.0     农业2
E     5    9.0     9.0     农业2
F     6   10.0    10.0     农业4
G     7   11.0    11.0     农业2
H     8   12.0    12.0     农业3
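The loop above can be written more compactly with `groupby(...).transform('mean')`, which broadcasts each group's mean back to the original index so that `fillna` can align on it. A sketch on toy data of my own (not from the course):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'grp':   ['a', 'a', 'b', 'b'],
                   'value': [1.0, np.nan, 3.0, 5.0]})

# transform('mean') returns a Series the same length as df,
# holding each row's group mean; fillna aligns it by index
df['value'] = df['value'].fillna(df.groupby('grp')['value'].transform('mean'))
print(df['value'].tolist())  # [1.0, 1.0, 3.0, 5.0]
```

This avoids the explicit loop and the manual index reassignment.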

Approach 3:

Build a model around the missing values and fill them with the model's predictions.
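The course text does not give code for this approach. Below is a minimal numpy-only sketch of the idea, fitting a linear model on the complete rows and predicting the missing ones; the toy data and column names are my own, and a real pipeline would use a proper imputation library:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'x': [1.0, 2.0, 3.0, 4.0, 5.0],
                   'y': [2.0, 4.0, np.nan, 8.0, 10.0]})

# Fit y = a*x + b on the rows where y is observed
mask = df['y'].notna()
a, b = np.polyfit(df.loc[mask, 'x'], df.loc[mask, 'y'], deg=1)

# Predict y for the missing rows from the fitted model
df.loc[~mask, 'y'] = a * df.loc[~mask, 'x'] + b
print(df['y'].tolist())  # approximately [2.0, 4.0, 6.0, 8.0, 10.0]
```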

Now for task 1: handling the missing values in the Age column.

# Handle the missing values in the Age column
# Method 1: drop the affected rows
# My own approach:
# 1. First fill with a value that cannot occur, find the row indices holding
#    that value, then delete those rows with drop
train = pd.read_csv('train.csv')
train.head()


train_age1 = train['Age'].isnull().value_counts()
print(train_age1)

train['Age'] = train['Age'].fillna(9999)
train_age_index = train[(train['Age'] == 9999)].index.tolist()  # indices of the affected rows
train = train.drop(train_age_index)

train.info()
# 2. Or simply use dropna
train_age2 = train.dropna(subset=['Age'])
train_age2
False    714
True     177
Name: Age, dtype: int64
<class 'pandas.core.frame.DataFrame'>
Int64Index: 714 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    714 non-null int64
Survived       714 non-null int64
Pclass         714 non-null int64
Name           714 non-null object
Sex            714 non-null object
Age            714 non-null float64
SibSp          714 non-null int64
Parch          714 non-null int64
Ticket         714 non-null object
Fare           714 non-null float64
Cabin          185 non-null object
Embarked       712 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 72.5+ KB
     PassengerId  Survived  Pclass                                               Name     Sex   Age  SibSp  Parch            Ticket     Fare Cabin Embarked
0              1         0       3                            Braund, Mr. Owen Harris    male  22.0      1      0         A/5 21171   7.2500   NaN        S
1              2         1       1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1      0          PC 17599  71.2833   C85        C
2              3         1       3                             Heikkinen, Miss. Laina  female  26.0      0      0  STON/O2. 3101282   7.9250   NaN        S
3              4         1       1       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1      0            113803  53.1000  C123        S
4              5         0       3                           Allen, Mr. William Henry    male  35.0      0      0            373450   8.0500   NaN        S
..           ...       ...     ...                                                ...     ...   ...    ...    ...               ...      ...   ...      ...
885          886         0       3               Rice, Mrs. William (Margaret Norton)  female  39.0      0      5            382652  29.1250   NaN        Q
886          887         0       2                              Montvila, Rev. Juozas    male  27.0      0      0            211536  13.0000   NaN        S
887          888         1       1                       Graham, Miss. Margaret Edith  female  19.0      0      0            112053  30.0000   B42        S
889          890         1       1                              Behr, Mr. Karl Howell    male  26.0      0      0            111369  30.0000  C148        C
890          891         0       3                                Dooley, Mr. Patrick    male  32.0      0      0            370376   7.7500   NaN        Q

714 rows × 12 columns

# Method 2: fill with a simple value
# e.g. fill with 0
train = pd.read_csv('train.csv')
train.head()
# Note: train[train['Age'].isnull()] = 0 would zero out entire rows;
# to fill only the Age column, select it explicitly
train.loc[train['Age'].isnull(), 'Age'] = 0

train.head()

   PassengerId  Survived  Pclass                                               Name     Sex   Age  SibSp  Parch            Ticket     Fare Cabin Embarked
0            1         0       3                            Braund, Mr. Owen Harris    male  22.0      1      0         A/5 21171   7.2500   NaN        S
1            2         1       1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1      0          PC 17599  71.2833   C85        C
2            3         1       3                             Heikkinen, Miss. Laina  female  26.0      0      0  STON/O2. 3101282   7.9250   NaN        S
3            4         1       1       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1      0            113803  53.1000  C123        S
4            5         0       3                           Allen, Mr. William Henry    male  35.0      0      0            373450   8.0500   NaN        S
# Method 3: fill with the mean Age, grouped by sex

train = pd.read_csv('train.csv')
train.head()

# Column to process
cols = ['Age']
# Grouping column: sex
gp_col = 'Sex'
# Boolean mask of the NaN positions
train_na = train[cols].isna()
# Means per group
train_mean = train.groupby(gp_col)[cols].mean()

# Process each column in turn
for col in cols:
    na_series = train_na[col]
    names = list(train.loc[na_series, gp_col])

    t = train_mean.loc[names, col]
    t.index = train.loc[na_series, col].index

    # Assign by matching index
    train.loc[na_series, col] = t
train.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            891 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB

Task 2: use different methods to handle the missing values of the whole table at once.

# Reload the data
train = pd.read_csv('train.csv')
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
# Drop all rows that contain missing values directly
train1=train.dropna()
train1.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 183 entries, 1 to 889
Data columns (total 12 columns):
PassengerId    183 non-null int64
Survived       183 non-null int64
Pclass         183 non-null int64
Name           183 non-null object
Sex            183 non-null object
Age            183 non-null float64
SibSp          183 non-null int64
Parch          183 non-null int64
Ticket         183 non-null object
Fare           183 non-null float64
Cabin          183 non-null object
Embarked       183 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 18.6+ KB
# Simple fill
train2=train.fillna(0).head()
train2
   PassengerId  Survived  Pclass                                               Name     Sex   Age  SibSp  Parch            Ticket     Fare Cabin Embarked
0            1         0       3                            Braund, Mr. Owen Harris    male  22.0      1      0         A/5 21171   7.2500     0        S
1            2         1       1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1      0          PC 17599  71.2833   C85        C
2            3         1       3                             Heikkinen, Miss. Laina  female  26.0      0      0  STON/O2. 3101282   7.9250     0        S
3            4         1       1       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1      0            113803  53.1000  C123        S
4            5         0       3                           Allen, Mr. William Henry    male  35.0      0      0            373450   8.0500     0        S
# Backward fill (use the next valid value)
train3=train.fillna(method='bfill')
train3.head()
   PassengerId  Survived  Pclass                                               Name     Sex   Age  SibSp  Parch            Ticket     Fare Cabin Embarked
0            1         0       3                            Braund, Mr. Owen Harris    male  22.0      1      0         A/5 21171   7.2500   C85        S
1            2         1       1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1      0          PC 17599  71.2833   C85        C
2            3         1       3                             Heikkinen, Miss. Laina  female  26.0      0      0  STON/O2. 3101282   7.9250  C123        S
3            4         1       1       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1      0            113803  53.1000  C123        S
4            5         0       3                           Allen, Mr. William Henry    male  35.0      0      0            373450   8.0500   E46        S
# Fill with the mode; personally I find the mean less reliable here, especially for things like sex
train4=train.fillna(train.mode().iloc[0])
train4.info()
train4.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            891 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          891 non-null object
Embarked       891 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
   PassengerId  Survived  Pclass                                               Name     Sex   Age  SibSp  Parch            Ticket     Fare    Cabin Embarked
0            1         0       3                            Braund, Mr. Owen Harris    male  22.0      1      0         A/5 21171   7.2500  B96 B98        S
1            2         1       1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1      0          PC 17599  71.2833      C85        C
2            3         1       3                             Heikkinen, Miss. Laina  female  26.0      0      0  STON/O2. 3101282   7.9250  B96 B98        S
3            4         1       1       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1      0            113803  53.1000     C123        S
4            5         0       3                           Allen, Mr. William Henry    male  35.0      0      0            373450   8.0500  B96 B98        S

DataFrame.mode(axis=0, numeric_only=False)

There can be more than one mode along an axis, which is why this function returns a DataFrame.

Parameters:

axis : {0 or 'index', 1 or 'columns'}, default 0

0 or 'index': get the mode of each column

1 or 'columns': get the mode of each row

numeric_only : boolean, default False. If True, only compute modes for numeric columns.

Filling with the mode:

df.fillna(df.mode().iloc[0])
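A small demonstration of why `mode()` returns a DataFrame and why `.iloc[0]` is needed, using toy data of my own:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 1, 2, 2],          # two modes: 1 and 2
                   'b': ['x', 'x', 'y', None]})  # single mode: 'x'

# Columns with several modes get several rows; shorter columns are padded with NaN
m = df.mode()
#    a    b
# 0  1    x
# 1  2  NaN

# .iloc[0] takes the first mode of each column, giving one fill value per column
print(m.iloc[0].to_dict())  # {'a': 1, 'b': 'x'}
```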

[Question 1] What parameters do dropna and fillna take, and how are they used?

dropna parameters:

axis: default 0 for rows, 1 for columns

how: {'any', 'all'}, default 'any'. 'any' drops a row/column if it contains any missing value; 'all' drops it only if it is entirely missing

thresh: int, keep only rows with at least that many non-missing values

subset: consider missing values only in the specified columns

inplace: the usual meaning, True modifies the original data in place

fillna parameters:

inplace: True or False

True: modify the original object directly

False: create and modify a copy, leaving the original unchanged (the default)

method: {'pad', 'ffill', 'backfill', 'bfill', None}, default None

pad/ffill: fill with the previous non-missing value

backfill/bfill: fill with the next non-missing value

None: replace missing values with a specified value (the default)

limit: cap the number of consecutive fills

axis: change the fill direction

df = pd.DataFrame({ 'a': [3,5,np.NaN,5,np.NaN,4,3,7],
                    'b': [6,7,8,9,0,2,3,np.NaN],
                    'c': [3,2,1,5,6,5,4,3]})
print(df)
df.dropna(subset=['a'], inplace=True)
print(df)
     a    b  c
0  3.0  6.0  3
1  5.0  7.0  2
2  NaN  8.0  1
3  5.0  9.0  5
4  NaN  0.0  6
5  4.0  2.0  5
6  3.0  3.0  4
7  7.0  NaN  3
     a    b  c
0  3.0  6.0  3
1  5.0  7.0  2
3  5.0  9.0  5
5  4.0  2.0  5
6  3.0  3.0  4
7  7.0  NaN  3
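The `thresh` parameter described above keeps only rows with at least that many non-missing values; a short sketch on a toy frame of my own:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [3, np.nan, np.nan],
                   'b': [6, 7, np.nan],
                   'c': [np.nan, 1, 5]})

# Keep rows that have at least 2 non-NaN values;
# row 2 has only one valid value and is dropped
kept = df.dropna(thresh=2)
print(kept.index.tolist())  # [0, 1]
```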

[Question 2] Why is np.nan better than None for detecting missing values?

After loading, missing values in numeric columns are stored as np.nan with dtype float64, so indexing or comparing with None generally won't match them. When checking for missing values it is better to use np.nan, or better still isnull()/isna().
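A quick check of this behavior (a sketch of my own):

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, None, 3.0])

# None is converted to np.nan in a float column
print(s.dtype)             # float64

# Equality comparison never matches: NaN != NaN by definition
print((s == np.nan).sum())  # 0

# The reliable way is isna()/isnull()
print(s.isna().sum())       # 1
```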

[Reference] https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dropna.html

[Reference] https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.fillna.html

2.2 Observing and Handling Duplicate Values

For one reason or another, the data may contain duplicate values. Do they exist here, and if so, how should they be handled?

2.2.1 Task 1: Check the data for duplicate values
# Code
# Reload the data
train = pd.read_csv('train.csv')

# Method 1:
pd.concat([train.drop_duplicates(), train.drop_duplicates(keep=False)]).drop_duplicates(keep=False)
# Method 2:
train[train.duplicated()]
Empty DataFrame
Columns: [PassengerId, Survived, Pclass, Name, Sex, Age, SibSp, Parch, Ticket, Fare, Cabin, Embarked]
Index: []
2.2.2 Task 2: Handle the duplicate values

(1) What are the ways to handle duplicate values?

(2) Handle the duplicate values in our data.

The more methods the better.

# One way to handle duplicates: drop the duplicated rows directly
# https://www.cnblogs.com/trotl/p/11876292.html
train.drop_duplicates().head()

   PassengerId  Survived  Pclass                                               Name     Sex   Age  SibSp  Parch            Ticket     Fare Cabin Embarked
0            1         0       3                            Braund, Mr. Owen Harris    male  22.0      1      0         A/5 21171   7.2500   NaN        S
1            2         1       1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1      0          PC 17599  71.2833   C85        C
2            3         1       3                             Heikkinen, Miss. Laina  female  26.0      0      0  STON/O2. 3101282   7.9250   NaN        S
3            4         1       1       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1      0            113803  53.1000  C123        S
4            5         0       3                           Allen, Mr. William Henry    male  35.0      0      0            373450   8.0500   NaN        S
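The Titanic data happens to have no duplicate rows, so a toy frame of my own shows the `subset` and `keep` options of `drop_duplicates` more clearly:

```python
import pandas as pd

df = pd.DataFrame({'id':  [1, 1, 2],
                   'val': ['a', 'b', 'b']})

# Default: only fully identical rows count as duplicates, so nothing is dropped
print(len(df.drop_duplicates()))  # 3

# Deduplicate on a subset of columns, keeping the last occurrence of each id
print(df.drop_duplicates(subset=['id'], keep='last')['val'].tolist())  # ['b', 'b']
```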
2.2.3 Task 3: Save the cleaned data in csv format
# Code
train.to_csv('test_clear.csv')


2.3 Observing and Processing Features

Looking at the features, we can roughly divide them into two broad classes:
Numeric features: Survived, Pclass, Age, SibSp, Parch, Fare. Of these, Survived and Pclass are discrete numeric features, while Age, SibSp, Parch, and Fare are continuous numeric features.
Text features: Name, Sex, Cabin, Embarked, Ticket. Of these, Sex, Cabin, Embarked, and Ticket are categorical text features. Numeric features can generally be fed to a model directly, but continuous variables are sometimes discretized for model stability and robustness. Text features usually need to be converted to numeric features before they can be used in modeling.

2.3.1 Task 1: Bin (discretize) the ages

(1) What is binning?

(2) Bin the continuous variable Age into 5 equal-width age bands, labeled with the categories 1-5.

(3) Bin the continuous variable Age into the five age bands [0,5) [5,15) [15,30) [30,50) [50,80), labeled with the categories 1-5.

(4) Bin the continuous variable Age at the 10%, 30%, 50%, 70%, and 90% quantiles into five bands, labeled with the categories 1-5.

(5) Save each of the resulting datasets in csv format.

What binning is:

Data binning (also called discretization or bucketing) is a data preprocessing technique used to reduce the effect of minor observation errors; it groups many continuous values into a smaller number of "bins".

When building a classification model, continuous variables often need to be discretized. After discretization the model becomes more stable and the risk of overfitting decreases. For example, when building an application scorecard with logistic regression as the base model, continuous variables are usually discretized, typically by binning.

# Bin the continuous variable Age into 5 equal-width age bands, labeled 1-5

train['AgeBand'] = pd.cut(train['Age'], 5,labels = ['1','2','3','4','5'])
train.to_csv('test_ave.csv')
train.head()
   PassengerId  Survived  Pclass                                               Name     Sex   Age  SibSp  Parch            Ticket     Fare Cabin Embarked AgeBand
0            1         0       3                            Braund, Mr. Owen Harris    male  22.0      1      0         A/5 21171   7.2500   NaN        S       2
1            2         1       1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1      0          PC 17599  71.2833   C85        C       3
2            3         1       3                             Heikkinen, Miss. Laina  female  26.0      0      0  STON/O2. 3101282   7.9250   NaN        S       2
3            4         1       1       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1      0            113803  53.1000  C123        S       3
4            5         0       3                           Allen, Mr. William Henry    male  35.0      0      0            373450   8.0500   NaN        S       3

# Bin the continuous variable Age into the five bands [0,5) [5,15) [15,30) [30,50) [50,80), labeled 1-5
train['AgeBand'] = pd.cut(train['Age'],[0,5,15,30,50,80],labels = ['1','2','3','4','5'])
train.to_csv('test_cut.csv')
train.head(3)


   PassengerId  Survived  Pclass                                               Name     Sex   Age  SibSp  Parch            Ticket     Fare Cabin Embarked AgeBand
0            1         0       3                            Braund, Mr. Owen Harris    male  22.0      1      0         A/5 21171   7.2500   NaN        S       3
1            2         1       1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1      0          PC 17599  71.2833   C85        C       4
2            3         1       3                             Heikkinen, Miss. Laina  female  26.0      0      0  STON/O2. 3101282   7.9250   NaN        S       3
# Code
# Bin the continuous variable Age at the 10% 30% 50% 70% 90% quantiles into five bands, labeled 1-5
train['AgeBand'] = pd.qcut(train['Age'],[0,0.1,0.3,0.5,0.7,0.9],labels = ['1','2','3','4','5'])
train.to_csv('test_pr.csv')
train.head()


   PassengerId  Survived  Pclass                                               Name     Sex   Age  SibSp  Parch            Ticket     Fare Cabin Embarked AgeBand
0            1         0       3                            Braund, Mr. Owen Harris    male  22.0      1      0         A/5 21171   7.2500   NaN        S       2
1            2         1       1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1      0          PC 17599  71.2833   C85        C       5
2            3         1       3                             Heikkinen, Miss. Laina  female  26.0      0      0  STON/O2. 3101282   7.9250   NaN        S       3
3            4         1       1       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1      0            113803  53.1000  C123        S       4
4            5         0       3                           Allen, Mr. William Henry    male  35.0      0      0            373450   8.0500   NaN        S       4
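Note that `qcut` with quantile edges ending at 0.9 leaves values above the 90th percentile unbinned, so they come out as NaN. A toy check of my own:

```python
import pandas as pd

s = pd.Series(range(1, 11))  # 1..10
band = pd.qcut(s, [0, 0.1, 0.3, 0.5, 0.7, 0.9],
               labels=['1', '2', '3', '4', '5'])

# The maximum value falls above the 0.9 quantile edge and becomes NaN
print(band.isna().sum())  # 1
```

To cover the whole range, end the edge list with 1 (and use six labels).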

[Reference] https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.cut.html

[Reference] https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.qcut.html

2.3.2 Task 2: Converting text variables

(1) Inspect the text variables and their categories.
(2) Encode the text variables Sex, Cabin, and Embarked with the numeric labels 1-5.
(3) Encode the text variables Sex, Cabin, and Embarked with one-hot encoding.

# Inspect the categorical text variables and their categories

# Method 1: value_counts
train['Sex'].value_counts()
train['Cabin'].value_counts()
train['Embarked'].value_counts()
# Method 2: unique
train['Sex'].unique()
train['Sex'].nunique()
S    644
C    168
Q     77
Name: Embarked, dtype: int64
# Convert the categorical text to numeric labels

# Method 1: replace
train['Sex_num'] = train['Sex'].replace(['male','female'],[1,2])
train.head()
# Method 2: map
train['Sex_num'] = train['Sex'].map({'male': 1, 'female': 2})
train.head()
# Method 3: use LabelEncoder from sklearn.preprocessing
from sklearn.preprocessing import LabelEncoder
for feat in ['Cabin', 'Ticket']:
    lbl = LabelEncoder()  
    label_dict = dict(zip(train[feat].unique(), range(train[feat].nunique())))
    train[feat + "_labelEncode"] = train[feat].map(label_dict)
    train[feat + "_labelEncode"] = lbl.fit_transform(train[feat].astype(str))

train.head()
   PassengerId  Survived  Pclass  ...  AgeBand  Sex_num  Cabin_labelEncode  Ticket_labelEncode
0            1         0       3  ...        2        1                147                 523
1            2         1       1  ...        5        2                 81                 596
2            3         1       3  ...        3        2                147                 669
3            4         1       1  ...        4        2                 55                  49
4            5         0       3  ...        4        1                147                 472
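Besides replace/map/LabelEncoder, `pd.factorize` gives integer codes directly; a sketch of my own, not part of the course code:

```python
import pandas as pd

s = pd.Series(['male', 'female', 'female', 'male'])

# factorize returns integer codes plus the array of unique values,
# with codes assigned in order of first appearance
codes, uniques = pd.factorize(s)
print(codes.tolist())  # [0, 1, 1, 0]
print(list(uniques))   # ['male', 'female']
```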
# Convert the categorical text to one-hot encoding

# Method 1: one-hot dummies with pd.get_dummies
for feat in ["Age", "Embarked"]:
#     x = pd.get_dummies(df["Age"] // 6)
#     x = pd.get_dummies(pd.cut(df['Age'],5))
    x = pd.get_dummies(train[feat], prefix=feat)
    train = pd.concat([train, x], axis=1)
    #df[feat] = pd.get_dummies(df[feat], prefix=feat)
train.head()
   PassengerId  Survived  Pclass  ...  Embarked_C  Embarked_Q  Embarked_S
0            1         0       3  ...           0           0           1
1            2         1       1  ...           1           0           0
2            3         1       3  ...           0           0           1
3            4         1       1  ...           0           0           1
4            5         0       3  ...           0           0           1

5 rows × 107 columns
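The commented-out lines in the block above hint at a better practice: one-hot encoding raw ages produces one column per distinct age (hence the 107 columns here), so it is usually better to bin first and then encode the bins. A sketch on toy data of my own:

```python
import pandas as pd

ages = pd.Series([22.0, 38.0, 26.0, 35.0, 35.0], name='Age')

# Bin into coarse ranges first, then one-hot encode the bins:
# 4 distinct ages would give 4 dummy columns, but 2 bins give only 2.
# The bin edges and labels here are arbitrary, for illustration only.
bands = pd.cut(ages, [0, 30, 80], labels=['young', 'old'])
dummies = pd.get_dummies(bands, prefix='Age')
print(dummies.columns.tolist())  # ['Age_young', 'Age_old']
print(dummies.shape)             # (5, 2)
```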

2.3.3 Task 3: Extract a Title feature from the plain-text Name feature (titles are things like Mr, Miss, Mrs)
train['Title'] = train.Name.str.extract(r'([A-Za-z]+)\.', expand=False)
train.head()


   PassengerId  Survived  Pclass  ...  Embarked_Q  Embarked_S  Title
0            1         0       3  ...           0           1     Mr
1            2         1       1  ...           0           0    Mrs
2            3         1       3  ...           0           1   Miss
3            4         1       1  ...           0           1    Mrs
4            5         0       3  ...           0           1     Mr

5 rows × 108 columns

# Save the final cleaned data
train.to_csv('test_fin.csv')
