开始之前,导入numpy、pandas包和数据
#加载所需的库
import numpy as np
import pandas as pd
#加载数据train.csv
train=pd.read_csv('train.csv')
train.head()
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
2 第二章:数据清洗及特征处理
我们拿到的数据通常是不干净的,所谓的不干净,就是数据中有缺失值,有一些异常点等,需要经过一定的处理才能继续做后面的分析或建模,所以拿到数据的第一步是进行数据清洗,本章我们将学习缺失值、重复值、字符串和数据转换等操作,将数据清洗成可以分析或建模的亚子。
2.1 缺失值观察与处理
我们拿到的数据经常会有很多缺失值,比如我们可以看到Cabin列存在NaN,那其他列还有没有缺失值,这些缺失值要怎么处理呢
2.1.1 任务一:缺失值观察
(1) 请查看每个特征缺失值个数
(2) 请查看Age, Cabin, Embarked列的数据
以上方式都有多种方式,所以大家多多益善
#写入代码
#方法1:
train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId 891 non-null int64
Survived 891 non-null int64
Pclass 891 non-null int64
Name 891 non-null object
Sex 891 non-null object
Age 714 non-null float64
SibSp 891 non-null int64
Parch 891 non-null int64
Ticket 891 non-null object
Fare 891 non-null float64
Cabin 204 non-null object
Embarked 889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
#写入代码
#方法二
train.isnull().sum()
PassengerId 0
Survived 0
Pclass 0
Name 0
Sex 0
Age 177
SibSp 0
Parch 0
Ticket 0
Fare 0
Cabin 687
Embarked 2
dtype: int64
#>>> df.count() #按照列来数非缺失值的个数
#>>> df.count(axis=1) #按照行来数非缺失值的个数
#方法3:
train.shape[0]-train.count()
PassengerId 0
Survived 0
Pclass 0
Name 0
Sex 0
Age 177
SibSp 0
Parch 0
Ticket 0
Fare 0
Cabin 687
Embarked 2
dtype: int64
#写入代码
train[['Age','Cabin','Embarked']].head(3)
Age | Cabin | Embarked | |
---|---|---|---|
0 | 22.0 | NaN | S |
1 | 38.0 | C85 | C |
2 | 26.0 | NaN | S |
2.1.2 任务二:对缺失值进行处理
(1)处理缺失值一般有几种思路
(2) 请尝试对Age列的数据的缺失值进行处理
(3) 请尝试使用不同的方法直接对整张表的缺失值进行处理
处理缺失值的一般思路:
提醒:可使用的函数有—>dropna函数与fillna函数
思路一:
对有缺失值的行和列简单的删除;
#举例
df=pd.DataFrame({'col1':[2,3,5,np.nan,9,np.nan],
'col2':[4,3,np.nan,5,12,20],
'col3':[np.nan,15,17,19,10,11]})
df
col1 | col2 | col3 | |
---|---|---|---|
0 | 2.0 | 4.0 | NaN |
1 | 3.0 | 3.0 | 15.0 |
2 | 5.0 | NaN | 17.0 |
3 | NaN | 5.0 | 19.0 |
4 | 9.0 | 12.0 | 10.0 |
5 | NaN | 20.0 | 11.0 |
# #删除有缺失的列
# alldrop1=df.dropna(axis=1)
#删除有缺失的行
alldrop1=df.dropna()
alldrop1
col1 | col2 | col3 | |
---|---|---|---|
1 | 3.0 | 3.0 | 15.0 |
4 | 9.0 | 12.0 | 10.0 |
思路二:
对缺失值进行计算,如用常数、变量的平均值或众数进行填充;
#用常数填充
df1=df.fillna(500)
df1
col1 | col2 | col3 | |
---|---|---|---|
0 | 2.0 | 4.0 | 500.0 |
1 | 3.0 | 3.0 | 15.0 |
2 | 5.0 | 500.0 | 17.0 |
3 | 500.0 | 5.0 | 19.0 |
4 | 9.0 | 12.0 | 10.0 |
5 | 500.0 | 20.0 | 11.0 |
#用字典填充不同的常数
df2=df.fillna({'col1':10,'col2':20,'col3':30})
df2
#如果传入inplace=True可以直接修改原对象
#df1.fillna(0,inplace=True)
col1 | col2 | col3 | |
---|---|---|---|
0 | 2.0 | 4.0 | 30.0 |
1 | 3.0 | 3.0 | 15.0 |
2 | 5.0 | 20.0 | 17.0 |
3 | 10.0 | 5.0 | 19.0 |
4 | 9.0 | 12.0 | 10.0 |
5 | 10.0 | 20.0 | 11.0 |
#传入method=” “改变插值方式:
df3=pd.DataFrame(np.random.randint(0,10,(5,5)))
df3.iloc[1:4,3]=np.nan
df3.iloc[2:4,4]=np.nan
df3
0 | 1 | 2 | 3 | 4 | |
---|---|---|---|---|---|
0 | 8 | 5 | 9 | 5.0 | 3.0 |
1 | 8 | 9 | 3 | NaN | 4.0 |
2 | 1 | 4 | 6 | NaN | NaN |
3 | 7 | 9 | 5 | NaN | NaN |
4 | 6 | 3 | 6 | 8.0 | 0.0 |
df3.fillna(method='ffill')#用前面的值来填充
0 | 1 | 2 | 3 | 4 | |
---|---|---|---|---|---|
0 | 8 | 5 | 9 | 5.0 | 3.0 |
1 | 8 | 9 | 3 | 5.0 | 4.0 |
2 | 1 | 4 | 6 | 5.0 | 4.0 |
3 | 7 | 9 | 5 | 5.0 | 4.0 |
4 | 6 | 3 | 6 | 8.0 | 0.0 |
df3.fillna(method='bfill',limit=2)#用下一个非缺失值填充该缺失值
#limit参数:限制填充个数
0 | 1 | 2 | 3 | 4 | |
---|---|---|---|---|---|
0 | 8 | 5 | 9 | 5.0 | 3.0 |
1 | 8 | 9 | 3 | NaN | 4.0 |
2 | 1 | 4 | 6 | 8.0 | 0.0 |
3 | 7 | 9 | 5 | 8.0 | 0.0 |
4 | 6 | 3 | 6 | 8.0 | 0.0 |
#传入axis=” “修改填充方向:
df3.fillna(method="ffill",limit=1,axis=1)
0 | 1 | 2 | 3 | 4 | |
---|---|---|---|---|---|
0 | 8.0 | 5.0 | 9.0 | 5.0 | 3.0 |
1 | 8.0 | 9.0 | 3.0 | 3.0 | 4.0 |
2 | 1.0 | 4.0 | 6.0 | 6.0 | NaN |
3 | 7.0 | 9.0 | 5.0 | 5.0 | NaN |
4 | 6.0 | 3.0 | 6.0 | 8.0 | 0.0 |
#用平均数填充
#用到了groupby方法
#https://blog.csdn.net/u010383605/article/details/78879515
df = pd.DataFrame({'code':[1,2,3,4,5,6,7,8],
'value':[np.nan,5,7,8,9,10,11,12],
'value2':[5,np.nan,7,np.nan,9,10,11,12],
'indstry':['农业1','农业1','农业1','农业2','农业2','农业4','农业2','农业3']},
columns=['code','value','value2','indstry'],
index=list('ABCDEFGH'))
# 只留下需要处理的列
cols = [col for col in df.columns if col not in['code','indstry']]
print(cols)
# 分组的列
gp_col = 'indstry'
# 查询nan的列
df_na = df[cols].isna()
# 根据分组计算平均值
df_mean = df.groupby(gp_col)[cols].mean()
print(df)
# 依次处理每一列
for col in cols:
na_series = df_na[col]
names = list(df.loc[na_series,gp_col])
t = df_mean.loc[names,col]
t.index = df.loc[na_series,col].index
# 相同的index进行赋值
df.loc[na_series,col] = t
df
['value', 'value2']
code value value2 indstry
A 1 NaN 5.0 农业1
B 2 5.0 NaN 农业1
C 3 7.0 7.0 农业1
D 4 8.0 NaN 农业2
E 5 9.0 9.0 农业2
F 6 10.0 10.0 农业4
G 7 11.0 11.0 农业2
H 8 12.0 12.0 农业3
code | value | value2 | indstry | |
---|---|---|---|---|
A | 1 | 6.0 | 5.0 | 农业1 |
B | 2 | 5.0 | 6.0 | 农业1 |
C | 3 | 7.0 | 7.0 | 农业1 |
D | 4 | 8.0 | 10.0 | 农业2 |
E | 5 | 9.0 | 9.0 | 农业2 |
F | 6 | 10.0 | 10.0 | 农业4 |
G | 7 | 11.0 | 11.0 | 农业2 |
H | 8 | 12.0 | 12.0 | 农业3 |
思路三:
在缺失值周围建立模型,然后用模型提供的值进行填充。
下面进行任务处理1——对Age列的数据的缺失值进行处理
#对Age列的数据的缺失值进行处理
#方法一:删除所在行
#个人思路:
#1.先填充一个不可能出现的值,然后找到这个值的所在行索引,利用drop进行删除
train=pd.read_csv('train.csv')
train.head()
train_age1=train['Age'].isnull().value_counts()
print(train_age1)
train['Age']=train['Age'].fillna(9999)
train_age_index=train[(train['Age']==9999)].index.tolist()#找到所在行的索引
train=train.drop(train_age_index)
train.info()
#2.直接用dropna删除
train_age2=train.dropna(subset=['Age'])
train_age2
False 714
True 177
Name: Age, dtype: int64
<class 'pandas.core.frame.DataFrame'>
Int64Index: 714 entries, 0 to 890
Data columns (total 12 columns):
PassengerId 714 non-null int64
Survived 714 non-null int64
Pclass 714 non-null int64
Name 714 non-null object
Sex 714 non-null object
Age 714 non-null float64
SibSp 714 non-null int64
Parch 714 non-null int64
Ticket 714 non-null object
Fare 714 non-null float64
Cabin 185 non-null object
Embarked 712 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 72.5+ KB
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
885 | 886 | 0 | 3 | Rice, Mrs. William (Margaret Norton) | female | 39.0 | 0 | 5 | 382652 | 29.1250 | NaN | Q |
886 | 887 | 0 | 2 | Montvila, Rev. Juozas | male | 27.0 | 0 | 0 | 211536 | 13.0000 | NaN | S |
887 | 888 | 1 | 1 | Graham, Miss. Margaret Edith | female | 19.0 | 0 | 0 | 112053 | 30.0000 | B42 | S |
889 | 890 | 1 | 1 | Behr, Mr. Karl Howell | male | 26.0 | 0 | 0 | 111369 | 30.0000 | C148 | C |
890 | 891 | 0 | 3 | Dooley, Mr. Patrick | male | 32.0 | 0 | 0 | 370376 | 7.7500 | NaN | Q |
714 rows × 12 columns
#方法二:简单填充数值
#比如填充0
train=pd.read_csv('train.csv')
train.head()
train[train['Age'].isnull()]=0
train.head()
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
#方法三:填充age的平均值,按男女进行分组
train=pd.read_csv('train.csv')
train.head()
# 需要处理的列
cols = ['Age']
# 分组的列:按男女进行分组
gp_col = 'Sex'
# 查询nan的列
train_na = train[cols].isna()
# 根据分组计算平均值
train_mean = train.groupby(gp_col)[cols].mean()
# 依次处理每一列
for col in cols:
na_series = train_na[col]
names = list(train.loc[na_series,gp_col])
t = train_mean.loc[names,col]
t.index = train.loc[na_series,col].index
# 相同的index进行赋值
train.loc[na_series,col] = t
train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId 891 non-null int64
Survived 891 non-null int64
Pclass 891 non-null int64
Name 891 non-null object
Sex 891 non-null object
Age 891 non-null float64
SibSp 891 non-null int64
Parch 891 non-null int64
Ticket 891 non-null object
Fare 891 non-null float64
Cabin 204 non-null object
Embarked 889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
任务处理2——使用不同的方法直接对整张表的缺失值进行处理
#再导入一遍
train=pd.read_csv('train.csv')
train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId 891 non-null int64
Survived 891 non-null int64
Pclass 891 non-null int64
Name 891 non-null object
Sex 891 non-null object
Age 714 non-null float64
SibSp 891 non-null int64
Parch 891 non-null int64
Ticket 891 non-null object
Fare 891 non-null float64
Cabin 204 non-null object
Embarked 889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
#直接删除有缺失值的行
train1=train.dropna()
train1.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 183 entries, 1 to 889
Data columns (total 12 columns):
PassengerId 183 non-null int64
Survived 183 non-null int64
Pclass 183 non-null int64
Name 183 non-null object
Sex 183 non-null object
Age 183 non-null float64
SibSp 183 non-null int64
Parch 183 non-null int64
Ticket 183 non-null object
Fare 183 non-null float64
Cabin 183 non-null object
Embarked 183 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 18.6+ KB
#简单填充
train2=train.fillna(0).head()
train2
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | 0 | S |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | 0 | S |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | 0 | S |
#向下填充
train3=train.fillna(method='bfill')
train3.head()
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | C85 | S |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | C123 | S |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | E46 | S |
#利用众数进行填充,因为个人觉得平均数好像不太靠谱,尤其是性别啥的
train4=train.fillna(train.mode().iloc[0])
train4.info()
train4.head()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId 891 non-null int64
Survived 891 non-null int64
Pclass 891 non-null int64
Name 891 non-null object
Sex 891 non-null object
Age 891 non-null float64
SibSp 891 non-null int64
Parch 891 non-null int64
Ticket 891 non-null object
Fare 891 non-null float64
Cabin 891 non-null object
Embarked 891 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | B96 B98 | S |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | B96 B98 | S |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | B96 B98 | S |
DataFrame.mode(axis=0, numeric_only=False)[source]
轴上可能存在多个众数,这也是为何此函数会返回一个 dataframe
参数:
axis : {0 or ‘index’, 1 or ‘columns’}, default 0
0 or ‘index’ : 获得列的众数
1 or ‘columns’ : 获得行的众数
numeric_only : boolean, default False 该项为 True 则只对数字列进行众数计算
众数填充的程序:
df.fillna(df.mode().iloc[0])
【思考1】dropna和fillna有哪些参数,分别如何使用呢?
dropna参数:
axis: default 0指行,1为列
how: {‘any’, ‘all’}, default ‘any’指带缺失值的所有行;'all’指清除全是缺失值的
thresh: int,保留含有int个非空值的行
subset: 对特定的列进行缺失值删除处理
inplace: 这个很常见,True表示直接在原数据上更改
fillna()函数参数:
inplace参数的取值:True、False
True:直接修改原对象
False:创建一个副本,修改副本,原对象不变(缺省默认)
method参数的取值 : {‘pad’, ‘ffill’,‘backfill’, ‘bfill’, None}, default None
pad/ffill:用前一个非缺失值去填充该缺失值
backfill/bfill:用下一个非缺失值填充该缺失值
None:指定一个值去替换缺失值(缺省默认这种方式)
limit参数:限制填充个数
axis参数:修改填充方向
df = pd.DataFrame({ 'a': [3,5,np.NaN,5,np.NaN,4,3,7],
'b': [6,7,8,9,0,2,3,np.NaN],
'c': [3,2,1,5,6,5,4,3]})
print(df)
df.dropna( subset=['a'], inplace=True )
print(df)
a b c
0 3.0 6.0 3
1 5.0 7.0 2
2 NaN 8.0 1
3 5.0 9.0 5
4 NaN 0.0 6
5 4.0 2.0 5
6 3.0 3.0 4
7 7.0 NaN 3
a b c
0 3.0 6.0 3
1 5.0 7.0 2
3 5.0 9.0 5
5 4.0 2.0 5
6 3.0 3.0 4
7 7.0 NaN 3
【思考2】检索空缺值用np.nan要比用None好,这是为什么?
数值列读取数据后,空缺值的数据类型为float64所以用None一般索引不到,比较的时候最好用np.nan
【参考】https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dropna.html
【参考】https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.fillna.html
2.2 重复值观察与处理
由于这样那样的原因,数据中会不会存在重复值呢,如果存在要怎样处理呢
2.2.1 任务一:请查看数据中的重复值
#写入代码
#再导入一遍
train=pd.read_csv('train.csv')
#方法1:
pd.concat([train.drop_duplicates(),train.drop_duplicates(keep=False)]).drop_duplicates(keep=False)
#方法2:
train[train.duplicated()]
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
---|
2.2.2 任务二:对重复值进行处理
(1)重复值有哪些处理方式呢?
(2)处理我们数据的重复值
方法多多益善
#重复值有哪些处理方式:直接删除重复的行
#https://www.cnblogs.com/trotl/p/11876292.html
train.drop_duplicates().head()
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
2.2.3 任务三:将前面清洗的数据保存为csv格式
#写入代码
train.to_csv('test_clear.csv')
2.3 特征观察与处理
我们对特征进行一下观察,可以把特征大概分为两大类:
数值型特征:Survived ,Pclass, Age ,SibSp, Parch, Fare,其中Survived, Pclass为离散型数值特征,Age,SibSp, Parch, Fare为连续型数值特征
文本型特征:Name, Sex, Cabin,Embarked, Ticket,其中Sex, Cabin, Embarked, Ticket为类别型文本特征,数值型特征一般可以直接用于模型的训练,但有时候为了模型的稳定性及鲁棒性会对连续变量进行离散化。文本型特征往往需要转换成数值型特征才能用于建模分析。
2.3.1 任务一:对年龄进行分箱(离散化)处理
(1) 分箱操作是什么?
(2) 将连续变量Age平均分箱成5个年龄段,并分别用类别变量12345表示
(3) 将连续变量Age划分为[0,5) [5,15) [15,30) [30,50) [50,80)五个年龄段,并分别用类别变量12345表示
(4) 将连续变量Age按10% 30% 50 70% 90%五个年龄段,并用分类变量12345表示
(5) 将上面的获得的数据分别进行保存,保存为csv格式
分箱操作是什么:
数据分箱(也称为离散分箱或分段)是一种数据预处理技术,用于减少次要观察误差的影响,是一种将多个连续值分组为较少数量的“分箱”的方法。
一般在建立分类模型时,需要对连续变量离散化,特征离散化后,模型会更稳定,降低了模型过拟合的风险。比如在建立申请评分卡模型时用logsitic作为基模型就需要对连续变量进行离散化,离散化通常采用分箱法。分箱的有以下重要性及其优势:
#将连续变量Age平均分箱成5个年龄段,并分别用类别变量12345表示
train['AgeBand'] = pd.cut(train['Age'], 5,labels = ['1','2','3','4','5'])
train.to_csv('test_ave.csv')
train.head()
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | AgeBand | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S | 2 |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C | 3 |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S | 2 |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S | 3 |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S | 3 |
#将连续变量Age划分为[0,5) [5,15) [15,30) [30,50) [50,80)五个年龄段,并分别用类别变量12345表示
train['AgeBand'] = pd.cut(train['Age'],[0,5,15,30,50,80],labels = ['1','2','3','4','5'])
train.to_csv('test_cut.csv')
train.head(3)
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | AgeBand | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S | 3 |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C | 4 |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S | 3 |
#写入代码
#将连续变量Age按10% 30% 50 70% 90%五个年龄段,并用分类变量12345表示
train['AgeBand'] = pd.qcut(train['Age'],[0,0.1,0.3,0.5,0.7,0.9],labels = ['1','2','3','4','5'])
train.to_csv('test_pr.csv')
train.head()
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | AgeBand | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S | 2 |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C | 5 |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S | 3 |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S | 4 |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S | 4 |
【参考】https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.cut.html
【参考】https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.qcut.html
2.3.2 任务二:对文本变量进行转换
(1) 查看文本变量名及种类
(2) 将文本变量Sex, Cabin ,Embarked用数值变量12345表示
(3) 将文本变量Sex, Cabin, Embarked用one-hot编码表示
#查看类别文本变量名及种类
#方法一: value_counts
train['Sex'].value_counts()
train['Cabin'].value_counts()
train['Embarked'].value_counts()
#方法二: unique
train['Sex'].unique()
train['Sex'].nunique()
S 644
C 168
Q 77
Name: Embarked, dtype: int64
#将类别文本转换为12345
#方法一: replace
train['Sex_num'] = train['Sex'].replace(['male','female'],[1,2])
train.head()
#方法二: map
train['Sex_num'] = train['Sex'].map({'male': 1, 'female': 2})
train.head()
#方法三: 使用sklearn.preprocessing的LabelEncoder
from sklearn.preprocessing import LabelEncoder
for feat in ['Cabin', 'Ticket']:
lbl = LabelEncoder()
label_dict = dict(zip(train[feat].unique(), range(train[feat].nunique())))
train[feat + "_labelEncode"] = train[feat].map(label_dict)
train[feat + "_labelEncode"] = lbl.fit_transform(train[feat].astype(str))
train.head()
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | AgeBand | Sex_num | Cabin_labelEncode | Ticket_labelEncode | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S | 2 | 1 | 147 | 523 |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C | 5 | 2 | 81 | 596 |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S | 3 | 2 | 147 | 669 |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S | 4 | 2 | 55 | 49 |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S | 4 | 1 | 147 | 472 |
#将类别文本转换为one-hot编码
#方法一: OneHotEncoder
for feat in ["Age", "Embarked"]:
# x = pd.get_dummies(df["Age"] // 6)
# x = pd.get_dummies(pd.cut(df['Age'],5))
x = pd.get_dummies(train[feat], prefix=feat)
train = pd.concat([train, x], axis=1)
#df[feat] = pd.get_dummies(df[feat], prefix=feat)
train.head()
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | ... | Age_65.0 | Age_66.0 | Age_70.0 | Age_70.5 | Age_71.0 | Age_74.0 | Age_80.0 | Embarked_C | Embarked_Q | Embarked_S | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
5 rows × 107 columns
2.3.3 任务三:从纯文本Name特征里提取出Titles的特征(所谓的Titles就是Mr,Miss,Mrs等)
train['Title'] = train.Name.str.extract('([A-Za-z]+)\.', expand=False)
train.head()
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | ... | Age_66.0 | Age_70.0 | Age_70.5 | Age_71.0 | Age_74.0 | Age_80.0 | Embarked_C | Embarked_Q | Embarked_S | Title | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | Mr |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | Mrs |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | Miss |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | Mrs |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | Mr |
5 rows × 108 columns
#保存最终你完成的已经清理好的数据
train.to_csv('test_fin.csv')