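Kaggle (01): The Titanic Problem

The Titanic problem: a classic and genuinely fun Kaggle case study

Everyone knows the story of "Jack and Rose": the luxury liner went down and everyone fled in panic, but there were not enough lifeboats for everybody. The first officer gave the order, "lady and kid first!", so who was rescued was not really random; it followed a ranking based on each passenger's background.

The training and test data contain personal information about passengers, and the training data also records whether each one survived. The task is to build a suitable model from this and predict whether the other passengers survived.

Yes, this is a binary classification problem, and many classification algorithms can solve it.

A look at the data

As usual, we load the data with pandas.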

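# This IPython notebook records my approach to the Kaggle Titanic problem and how I worked through it
%matplotlib inline
import matplotlib.pyplot as plt
plt.rcParams['font.sans-serif']=['SimHei']  # a font with CJK glyphs, so Chinese plot labels render
plt.rcParams['axes.unicode_minus']=False  # keep the minus sign readable with that font

import pandas as pd  # data analysis
import numpy as np  # scientific computing
from pandas import Series,DataFrame


Step 1: Load the data and get to know it

data_train = pd.read_csv("Train.csv")  # read the training set from a local file
data_train.columns  # list the attribute columns in the data
#data_train[data_train.Cabin.notnull()]['Survived'].value_counts()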
Index([u'PassengerId', u'Survived', u'Pclass', u'Name', u'Sex', u'Age',
       u'SibSp', u'Parch', u'Ticket', u'Fare', u'Cabin', u'Embarked'],
      dtype='object')

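The fields are roughly as follows:

PassengerId => passenger ID

Survived => whether the passenger was rescued (1 = rescued, 0 = not)

Pclass => passenger class (1st/2nd/3rd)

Name => passenger name

Sex => sex

Age => age

SibSp => number of siblings/spouses aboard

Parch => number of parents/children aboard

Ticket => ticket information

Fare => ticket price

Cabin => cabin

Embarked => port of embarkation

Being as lazy as I am, I will obviously let pandas tell us a few things first.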

data_train.info()  # show each column's dtype (numeric or object) and how many non-null values it has
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB

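What does the output above tell us? The training data covers 891 passengers in total, but unfortunately some attributes are incomplete, for example:

  • Age is recorded for only 714 passengers
  • Cabin is known for only 204 passengers

That is still not much information. To peek at the actual values, we can use the method below to get distribution statistics for the numeric columns (some attributes, such as the name, are text, and others, such as the port of embarkation, are categorical; the function below does not show those).

data_train.describe()  # summary statistics for the numeric columns, giving a rough picture of their distributions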
       PassengerId    Survived      Pclass         Age       SibSp       Parch        Fare
count   891.000000  891.000000  891.000000  714.000000  891.000000  891.000000  891.000000
mean    446.000000    0.383838    2.308642   29.699118    0.523008    0.381594   32.204208
std     257.353842    0.486592    0.836071   14.526497    1.102743    0.806057   49.693429
min       1.000000    0.000000    1.000000    0.420000    0.000000    0.000000    0.000000
25%     223.500000    0.000000    2.000000   20.125000    0.000000    0.000000    7.910400
50%     446.000000    0.000000    3.000000   28.000000    0.000000    0.000000   14.454200
75%     668.500000    1.000000    3.000000   38.000000    1.000000    0.000000   31.000000
max     891.000000    1.000000    3.000000   80.000000    8.000000    6.000000  512.329200

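The mean row tells us that a fraction of about 0.383838 of the passengers (roughly 38%) were rescued in the end, that 2nd/3rd class passengers outnumber 1st class passengers, that the average passenger age is about 29.7 (rows with no recorded age are skipped in this calculation), and so on...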
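To make the reading of the mean row concrete, here is a minimal sketch on a made-up six-passenger frame (the values are hypothetical, not taken from Train.csv). It shows that the mean of a 0/1 column is the fraction of ones, and that pandas skips missing values when averaging.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): 6 passengers, 2 of them with unknown age
toy = pd.DataFrame({
    "Survived": [1, 0, 0, 1, 0, 0],
    "Age": [22.0, 38.0, np.nan, 35.0, np.nan, 54.0],
})

# The mean of a 0/1 column is the fraction of rows equal to 1
survival_rate = toy["Survived"].mean()

# Series.mean() skips NaN by default, so only the 4 known ages are averaged
mean_age = toy["Age"].mean()
known_ages = toy["Age"].count()

print(survival_rate, mean_age, known_ages)
```

The same reasoning is why the Age column above reports a count of 714 while the other columns report 891.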

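  • "Understanding your data is so important!"
  • "Understanding your data is so important!"
  • "Understanding your data is so important!"

Slogan chanted. The simple summary above is not much use on its own, though, so we need to analyze the data in finer detail.

Let's see how each attribute, alone or in combination, relates to the final Survived value.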

Step 2: Visualize the data to understand it better

import matplotlib.pyplot as plt
fig = plt.figure(figsize=(18,9))
fig.set(alpha=0.2)  # set the figure's alpha (transparency) parameter

# Five panels: survival counts, class counts, age vs. survival, age density by class, and passengers per port
plt.subplot2grid((2,3),(0,0))             # several small plots inside one large figure; this is row 1, column 1
data_train.Survived.value_counts().plot(kind='bar')  # bar chart of those who were rescued vs. those who were not
plt.title(u"Survival (1 = rescued)")  # title of this panel
plt.ylabel(u"Passengers")  

plt.subplot2grid((2,3),(0,1))  # row 1, column 2
data_train.Pclass.value_counts().plot(kind="bar")  # bar chart of passenger class counts
plt.ylabel(u"Passengers")
plt.title(u"Passenger class distribution")

plt.subplot2grid((2,3),(0,2))  # row 1, column 3
plt.scatter(data_train.Survived, data_train.Age)  # scatter plot of survival against age
plt.ylabel(u"Age")                         # y axis label
plt.grid(True, which='major', axis='y')  # major grid lines on the y axis (the old b= keyword is no longer accepted by recent matplotlib)
plt.title(u"Survival by age (1 = rescued)")


plt.subplot2grid((2,3),(1,0), colspan=2)  # row 2, spanning columns 1 and 2
data_train.Age[data_train.Pclass == 1].plot(kind='kde')   # kernel density estimate of the ages of 1st class passengers
data_train.Age[data_train.Pclass == 2].plot(kind='kde')
data_train.Age[data_train.Pclass == 3].plot(kind='kde')
plt.xlabel(u"Age")  # x axis label
plt.ylabel(u"Density") 
plt.title(u"Age distribution by passenger class")
plt.legend((u'1st class', u'2nd class',u'3rd class'),loc='best')  # legend for the three curves


plt.subplot2grid((2,3),(1,2))  # row 2, column 3
data_train.Embarked.value_counts().plot(kind='bar')
plt.title(u"Passengers by port of embarkation")
plt.ylabel(u"Passengers")  
plt.show()

[Figure: five-panel overview: survival counts, passenger class counts, survival vs. age scatter, age density by class, passengers per port of embarkation]

That gives us the figure above.

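Bingo, a picture is much easier to read than numbers. From the figure we can see:

  • a little over 300 people were rescued, fewer than half;
  • 3rd class passengers are very numerous; the ages of both victims and survivors seem to span a wide range;
  • the three classes show a broadly similar overall age trend; in 2nd/3rd class, passengers in their early twenties are the most common, while in 1st class it is those around 40 (→_→ which seems to fit how wealth is distributed over age, ahem, ignore me, I'm just rambling);
  • passenger counts by port of embarkation decrease in the order S, C, Q, and S has far more than the other two ports.

At this point a few ideas may come to mind:

  1. Cabin/passenger class may be related to wealth and status, so the chance of rescue may differ between classes.
  2. Age must also affect the chance of rescue; after all, as mentioned earlier, the first officer said "women and children first".
  3. Is there a link with the port of embarkation? Perhaps passengers boarding at different ports differ in background and status?

Talk is cheap and speculation gets us nowhere. Let's dutifully do some more counting and look at how these attribute values are distributed.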

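# Survival by passenger class
Survived_0 = data_train.Pclass[data_train.Survived == 0].value_counts()  # class counts among passengers who were not rescued
Survived_1 = data_train.Pclass[data_train.Survived == 1].value_counts()  # class counts among passengers who were rescued
df=pd.DataFrame({u'Not rescued':Survived_0, u'Rescued':Survived_1})  # build a DataFrame from the two Series
df.plot(kind='bar', stacked=True)   # stacked bar chart; df.plot creates its own figure
plt.title(u"Survival by passenger class")
plt.xlabel(u"Passenger class") 
plt.ylabel(u"Passengers") 

plt.show()
df

[Figure: stacked bar chart of rescued and not-rescued counts for each passenger class]

   Not rescued  Rescued
1           80      136
2           97       87
3          372      119

That gives the chart above.

Tsk, sure enough: money and status affect which class you travel in, and in turn the chance of being rescued ←_←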

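Ahem, I digress. My point is that passengers in class 1 were clearly much more likely to be rescued. This is definitely a feature that influences the final outcome.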
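The claim can be quantified directly from the counts in the table above. The following is a small sketch that only re-uses those six numbers; the column names `not_rescued` and `rescued` are labels chosen here for illustration.

```python
import pandas as pd

# Counts copied from the table above (rows are Pclass 1, 2, 3)
counts = pd.DataFrame(
    {"not_rescued": [80, 97, 372], "rescued": [136, 87, 119]},
    index=pd.Index([1, 2, 3], name="Pclass"),
)

# Rescue rate per class = rescued / (rescued + not rescued)
rate = counts["rescued"] / counts.sum(axis=1)
print(rate.round(3))
```

With these counts the rate falls from about 63% in 1st class to about 47% in 2nd class and about 24% in 3rd class.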

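# Survival by port of embarkation
# Split the port of embarkation by whether the passenger was rescued
Survived_0 = data_train.Embarked[data_train.Survived == 0].value_counts() 
Survived_1 = data_train.Embarked[data_train.Survived == 1].value_counts()
df=pd.DataFrame({u'Not rescued':Survived_0, u'Rescued':Survived_1})  # combine the two split Series into one DataFrame
df.plot(kind='bar', stacked=True)  # df.plot creates its own figure
plt.title(u"Survival by port of embarkation")
plt.xlabel(u"Port of embarkation") 
plt.ylabel(u"Passengers") 

plt.show()
df

[Figure: stacked bar chart of rescued and not-rescued counts for each port of embarkation]

   Not rescued  Rescued
S          427      217
C           75       93
Q           47       30

Nothing obvious stands out here...

Well then, let's look at sex.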

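# Survival by sex
# Split the Survived column by sex
Survived_m = data_train.Survived[data_train.Sex == 'male'].value_counts()
Survived_f = data_train.Survived[data_train.Sex == 'female'].value_counts()
df=pd.DataFrame({u'Female':Survived_f, u'Male':Survived_m})  # combine the two split Series into one DataFrame
df.plot(kind='bar', stacked=True)  # df.plot creates its own figure
plt.title(u"Survival by sex")
plt.xlabel(u"Survived (0 = not rescued, 1 = rescued)") 
plt.ylabel(u"Passengers")
plt.show()
df

[Figure: stacked bar chart of female and male counts among passengers who were not rescued (0) and rescued (1)]

   Female  Male
0      81   468
1     233   109

Our foreign friends clearly respected the ladies; "lady first" was practiced well. Sex should without doubt go into the final model as an important feature.

Now for a more detailed version.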

# Now look at survival by sex within each cabin class
fig=plt.figure(figsize=(12,5))
fig.set(alpha=0.65)  # figure transparency; it does not matter much
fig.suptitle(u"Survival by cabin class and sex")  # figure-level title; plt.title here would create an extra empty axes
# value_counts() orders the bars by count, so the tick labels below are written by hand
# Women in 1st/2nd class
ax1=fig.add_subplot(141)
data_train.Survived[(data_train.Sex == 'female') & (data_train.Pclass != 3)].value_counts().plot(kind='bar', label="female highclass", color='#FA2479')
ax1.set_xticklabels([u"Rescued", u"Not rescued"], rotation=0)
ax1.legend([u"Female / 1st-2nd class"], loc='best')
# Women in 3rd class
ax2=fig.add_subplot(142, sharey=ax1)
data_train.Survived[(data_train.Sex == 'female') & (data_train.Pclass == 3)].value_counts().plot(kind='bar', label='female, low class', color='pink')
ax2.set_xticklabels([u"Not rescued", u"Rescued"], rotation=0)
plt.legend([u"Female / 3rd class"], loc='best')
# Men in 1st/2nd class
ax3=fig.add_subplot(143, sharey=ax1)
data_train.Survived[(data_train.Sex == 'male') & (data_train.Pclass != 3)].value_counts().plot(kind='bar', label='male, high class',color='lightblue')
ax3.set_xticklabels([u"Not rescued", u"Rescued"], rotation=0)
plt.legend([u"Male / 1st-2nd class"], loc='best')
# Men in 3rd class
ax4=fig.add_subplot(144, sharey=ax1)
data_train.Survived[(data_train.Sex == 'male') & (data_train.Pclass == 3)].value_counts().plot(kind='bar', label='male low class', color='steelblue')
ax4.set_xticklabels([u"Not rescued", u"Rescued"], rotation=0)
plt.legend([u"Male / 3rd class"], loc='best')

plt.show()

[Figure: four bar charts of rescued vs. not-rescued counts: women in 1st/2nd class, women in 3rd class, men in 1st/2nd class, men in 3rd class]

What about siblings/spouses and parents/children?
Do large families have an advantage?

g = data_train.groupby(['SibSp','Survived'])  # group by the pair (SibSp, Survived)
df = pd.DataFrame(g.count()['PassengerId'])   
df
                PassengerId
SibSp Survived
0     0                 398
      1                 210
1     0                  97
      1                 112
2     0                  15
      1                  13
3     0                  12
      1                   4
4     0                  15
      1                   3
5     0                   5
8     0                   7
g = data_train.groupby(['Parch','Survived'])
df = pd.DataFrame(g.count()['PassengerId'])
df
                PassengerId
Parch Survived
0     0                 445
      1                 233
1     0                  53
      1                  65
2     0                  40
      1                  40
3     0                   2
      1                   3
4     0                   4
5     0                   4
      1                   1
6     0                   1

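OK, I can't see any especially obvious pattern (I worry about my own IQ...), so let's keep these as candidate features and set them aside for now.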
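Raw counts are hard to compare across groups of very different size, so one way to look again is to turn the first table above into a rescue rate per SibSp value. This is only a sketch that re-enters the printed counts by hand; it is not part of the original notebook.

```python
import pandas as pd

# (SibSp, Survived) -> passenger count, copied from the first table above
sibsp_counts = pd.Series({
    (0, 0): 398, (0, 1): 210,
    (1, 0): 97,  (1, 1): 112,
    (2, 0): 15,  (2, 1): 13,
    (3, 0): 12,  (3, 1): 4,
    (4, 0): 15,  (4, 1): 3,
    (5, 0): 5,
    (8, 0): 7,
})
sibsp_counts.index.names = ["SibSp", "Survived"]

# One row per SibSp value, one column per outcome; missing combinations become 0
table = sibsp_counts.unstack("Survived", fill_value=0)

# Rescue rate = rescued / all passengers with that SibSp value
rate = table[1] / table.sum(axis=1)
print(rate.round(3))
```

With these counts, passengers with one sibling/spouse aboard were rescued at about 54%, against about 35% for those with none, while the groups with SibSp of 3 or more contain only a handful of passengers each, so their rates should be read with caution.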

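Let's look at the ticket.

Ticket is the ticket number; it should be unique and have little to do with the final outcome, so we leave it out of the features under consideration.

Cabin has a value for only 204 passengers; let's first look at its distribution.

# Ticket is the ticket number; it should be unique and have little to do with the final outcome, so it is left out of the features
# Cabin has a value for only 204 passengers; first look at its distribution
data_train.Cabin.value_counts()  # count the occurrences of each Cabin value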
C23 C25 C27        4
G6                 4
B96 B98            4
D                  3
C22 C26            3
E101               3
F2                 3
F33                3
B57 B59 B63 B66    2
C68                2
B58 B60            2
E121               2
D20                2
E8                 2
E44                2
B77                2
C65                2
D26                2
E24                2
E25                2
B20                2
C93                2
D33                2
E67                2
D35                2
D36                2
C52                2
F4                 2
C125               2
C124               2
                  ..
F G63              1
A6                 1
D45                1
D6                 1
D56                1
C101               1
C54                1
D28                1
D37                1
B102               1
D30                1
E17                1
E58                1
F E69              1
D10 D12            1
E50                1
A14                1
C91                1
A16                1
B38                1
B39                1
C95                1
B78                1
B79                1
C99                1
B37                1
A19                1
E12                1
A7                 1
D15                1
Name: Cabin, Length: 147, dtype: int64

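Scattered in twos and threes... hardly concentrated at all... Let's guess: maybe the leading letter (A, B, C, D, E) is the deck and the number after it is the room number? ...OK, I'm just making this up, don't take it seriously...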
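If one wanted to test that guess, the leading letter can be pulled out with the pandas string accessor. This is a hypothetical sketch: the "deck" interpretation is only the guess made above, and the sample values are a handful of Cabin strings copied from the listing plus one missing entry.

```python
import pandas as pd

# A few Cabin values copied from the listing above, plus one missing entry
cabin = pd.Series(["C23 C25 C27", "G6", "B96 B98", "F G63", "E101", None])

# Hypothetical "deck" feature: the first character of the Cabin string;
# missing cabins stay missing
deck = cabin.str[0]
print(deck.value_counts(dropna=False))
```

Values such as "F G63" show that the first character is not always unambiguous, so this would need a closer look before being used as a feature.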

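The real problem is that this pesky Cabin attribute should count as categorical, yet it already has many missing values and is this scattered on top of that, so it is bound to be troublesome... My first impression is that if we treat it directly as a categorical feature it is too sparse, and each one-hot column would probably get hardly any weight. Given how many values are missing, how about first using whether Cabin is missing as a condition (although the missing information may not mean the cabin was never recorded, maybe the record was simply lost, so this is not necessarily appropriate), and looking at Survived at this coarse has-Cabin/no-Cabin granularity.

# The Cabin value counts are too scattered: the vast majority of Cabin values appear only once. Adding it as a categorical feature probably would not help
# So let's see how the presence or absence of this value relates to the survival distribution
Survived_cabin = data_train.Survived[pd.notnull(data_train.Cabin)].value_counts()
Survived_nocabin = data_train.Survived[pd.isnull(data_train.Cabin)].value_counts()
df=pd.DataFrame({u'No':Survived_nocabin, u'Yes':Survived_cabin}).transpose()
df.plot(kind='bar', stacked=True)  # df.plot creates its own figure
plt.title(u"Survival by presence of a Cabin record")
plt.xlabel(u"Cabin record present") 
plt.ylabel(u"Passengers")
plt.show()
df

# Passengers with a Cabin record seem to survive at a higher rate, so first try splitting this value into two classes (has Cabin / no Cabin) and add it to the categorical features in a moment

[Figure: stacked bar chart of not-rescued (0) and rescued (1) counts for passengers without and with a Cabin record]

       0    1
No   481  206
Yes   68  136

Passengers with a Cabin record appear more likely to have been rescued (136 of 204, about 67%, versus 206 of 687, about 30%). Let's leave it at that for now.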

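Let's start with the most conspicuous attributes: yes, Cabin and Age. Their missing data would affect the next steps far too much.

First Cabin: for now, as just discussed, we turn this attribute into two types, Yes and No, according to whether a Cabin value is present.

Then Age:

When values are missing, there are a few common ways to handle them:

  1. If the samples with missing values make up a very high share of the total, we may simply drop the attribute; adding it as a feature could bring in noise and hurt the final result.
  2. If a moderate number of samples have missing values and the attribute is not continuous (for example a categorical attribute), treat NaN as a new category and add it to the categorical feature.
  3. If a moderate number of samples have missing values and the attribute is continuous, we sometimes pick a step (for age here, say every 2 or 3 years), discretize the attribute, and then add NaN as one more type among the categories.
  4. In some cases not that many values are missing, and we can try to fit them from the values we do have and fill them in.

    In this example the last two approaches should both be workable. Let's try fitting and filling first (although we do not have much background information to fit on, so this is not necessarily a great choice).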
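For comparison, approach 3 from the list above (discretize, then give missing values their own category) could look like the sketch below. It is not what the notebook does next; the toy ages, the 3-year step, and the `age_lo_hi` label format are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Toy ages (hypothetical values), two of them missing
age = pd.Series([0.42, 4.0, 22.0, 23.5, 38.0, np.nan, 80.0, np.nan])

# Bin edges every 3 years from 0 up to 81; the step is an arbitrary choice
edges = np.arange(0, 84, 3)

# labels=False returns the bin index of each age (NaN stays NaN); bins are [lo, hi)
bin_index = pd.cut(age, bins=edges, right=False, labels=False)

def bin_label(i):
    # Missing ages get their own category instead of being imputed
    if pd.isnull(i):
        return "Unknown"
    return "age_%d_%d" % (edges[int(i)], edges[int(i) + 1])

age_cat = bin_index.map(bin_label)
print(age_cat.value_counts())
```

The resulting string column can then be one-hot encoded like any other categorical attribute.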

Here we use RandomForest from scikit-learn to fit the missing age data.

Step 3: Preprocess the data

## Handle missing values

from sklearn.ensemble import RandomForestRegressor
 
### Fill in the missing Age values with a RandomForestRegressor
def set_missing_ages(df):
    
    # Take the existing numeric features and feed them to a Random Forest Regressor
    age_df = df[['Age','Fare', 'Parch', 'SibSp', 'Pclass']]

    # Split the passengers into those with a known age and those with an unknown age
    known_age = age_df[age_df.Age.notnull()].values  # rows used as the training set (.as_matrix() is deprecated)
    unknown_age = age_df[age_df.Age.isnull()].values  # rows whose age is to be predicted

    # y is the target: age
    y = known_age[:, 0]   

    # X holds the feature values
    X = known_age[:, 1:]  # the remaining 4 columns are the training features

    # Fit a RandomForestRegressor
    rfr = RandomForestRegressor(random_state=0, n_estimators=2000, n_jobs=-1)
    rfr.fit(X, y)
    
    # Use the fitted model to predict the unknown ages
    predictedAges = rfr.predict(unknown_age[:, 1:]) 
    
    # Fill the originally missing entries with the predictions
    df.loc[ (df.Age.isnull()), 'Age' ] = predictedAges   
    
    return df, rfr

# Binarize the Cabin attribute
def set_Cabin_type(df):
    df.loc[ (df.Cabin.notnull()), 'Cabin' ] = "Yes"   
    df.loc[ (df.Cabin.isnull()), 'Cabin' ] = "No"
    return df

data_train, rfr = set_missing_ages(data_train)
data_train = set_Cabin_type(data_train)
data_train
PassengerIdSurvivedPclassNameSexAgeSibSpParchTicketFareCabinEmbarked
0103Braund, Mr. Owen Harrismale22.00000010A/5 211717.2500NoS
1211Cumings, Mrs. John Bradley (Florence Briggs Th...female38.00000010PC 1759971.2833YesC
2313Heikkinen, Miss. Lainafemale26.00000000STON/O2. 31012827.9250NoS
3411Futrelle, Mrs. Jacques Heath (Lily May Peel)female35.0000001011380353.1000YesS
4503Allen, Mr. William Henrymale35.000000003734508.0500NoS
5603Moran, Mr. Jamesmale23.838953003308778.4583NoQ
6701McCarthy, Mr. Timothy Jmale54.000000001746351.8625YesS
7803Palsson, Master. Gosta Leonardmale2.0000003134990921.0750NoS
8913Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)female27.0000000234774211.1333NoS
91012Nasser, Mrs. Nicholas (Adele Achem)female14.0000001023773630.0708NoC
101113Sandstrom, Miss. Marguerite Rutfemale4.00000011PP 954916.7000YesS
111211Bonnell, Miss. Elizabethfemale58.0000000011378326.5500YesS
121303Saundercock, Mr. William Henrymale20.00000000A/5. 21518.0500NoS
131403Andersson, Mr. Anders Johanmale39.0000001534708231.2750NoS
141503Vestrom, Miss. Hulda Amanda Adolfinafemale14.000000003504067.8542NoS
151612Hewlett, Mrs. (Mary D Kingcome)female55.0000000024870616.0000NoS
161703Rice, Master. Eugenemale2.0000004138265229.1250NoQ
171812Williams, Mr. Charles Eugenemale32.0664930024437313.0000NoS
181903Vander Planke, Mrs. Julius (Emelia Maria Vande...female31.0000001034576318.0000NoS
192013Masselmani, Mrs. Fatimafemale29.5182050026497.2250NoC
202102Fynney, Mr. Joseph Jmale35.0000000023986526.0000NoS
212212Beesley, Mr. Lawrencemale34.0000000024869813.0000YesS
222313McGowan, Miss. Anna "Annie"female15.000000003309238.0292NoQ
232411Sloper, Mr. William Thompsonmale28.0000000011378835.5000YesS
242503Palsson, Miss. Torborg Danirafemale8.0000003134990921.0750NoS
252613Asplund, Mrs. Carl Oscar (Selma Augusta Emilia...female38.0000001534707731.3875NoS
262703Emir, Mr. Farred Chehabmale29.5182050026317.2250NoC
272801Fortune, Mr. Charles Alexandermale19.0000003219950263.0000YesS
282913O'Dwyer, Miss. Ellen "Nellie"female22.380113003309597.8792NoQ
293003Todoroff, Mr. Laliomale27.947206003492167.8958NoS
.......................................
86186202Giles, Mr. Frederick Edwardmale21.000000102813411.5000NoS
86286311Swift, Mrs. Frederick Joel (Margaret Welles Ba...female48.000000001746625.9292YesS
86386403Sage, Miss. Dorothy Edith "Dolly"female10.86986782CA. 234369.5500NoS
86486502Gill, Mr. John Williammale24.0000000023386613.0000NoS
86586612Bystrom, Mrs. (Karolina)female42.0000000023685213.0000NoS
86686712Duran y More, Miss. Asuncionfemale27.00000010SC/PARIS 214913.8583NoC
86786801Roebling, Mr. Washington Augustus IImale31.00000000PC 1759050.4958YesS
86886903van Melkebeke, Mr. Philemonmale25.977889003457779.5000NoS
86987013Johnson, Master. Harold Theodormale4.0000001134774211.1333NoS
87087103Balkic, Mr. Cerinmale26.000000003492487.8958NoS
87187211Beckwith, Mrs. Richard Leonard (Sallie Monypeny)female47.000000111175152.5542YesS
87287301Carlsson, Mr. Frans Olofmale33.000000006955.0000YesS
87387403Vander Cruyssen, Mr. Victormale47.000000003457659.0000NoS
87487512Abelson, Mrs. Samuel (Hannah Wizosky)female28.00000010P/PP 338124.0000NoC
87587613Najib, Miss. Adele Kiamie "Jane"female15.0000000026677.2250NoC
87687703Gustafsson, Mr. Alfred Ossianmale20.0000000075349.8458NoS
87787803Petroff, Mr. Nedeliomale19.000000003492127.8958NoS
87887903Laleff, Mr. Kristomale27.947206003492177.8958NoS
87988011Potter, Mrs. Thomas Jr (Lily Alexenia Wilson)female56.000000011176783.1583YesC
88088112Shelley, Mrs. William (Imanita Parrish Hall)female25.0000000123043326.0000NoS
88188203Markun, Mr. Johannmale33.000000003492577.8958NoS
88288303Dahlberg, Miss. Gerda Ulrikafemale22.00000000755210.5167NoS
88388402Banfield, Mr. Frederick Jamesmale28.00000000C.A./SOTON 3406810.5000NoS
88488503Sutehall, Mr. Henry Jrmale25.00000000SOTON/OQ 3920767.0500NoS
88588603Rice, Mrs. William (Margaret Norton)female39.0000000538265229.1250NoQ
88688702Montvila, Rev. Juozasmale27.0000000021153613.0000NoS
88788811Graham, Miss. Margaret Edithfemale19.0000000011205330.0000YesS
88888903Johnston, Miss. Catherine Helen "Carrie"female16.19395012W./C. 660723.4500NoS
88989011Behr, Mr. Karl Howellmale26.0000000011136930.0000YesC
89089103Dooley, Mr. Patrickmale32.000000003703767.7500NoQ

891 rows × 12 columns

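Because logistic regression needs all input features to be numeric, we usually one-hot encode the categorical features first.

What is one-hot encoding? An example:

Take Embarked: it is originally a single attribute whose value can be 'S', 'C', or 'Q'. We expand it into three attributes, 'Embarked_C', 'Embarked_S', and 'Embarked_Q'.

  • A row whose Embarked was S gets 1 under 'Embarked_S' and 0 under 'Embarked_C' and 'Embarked_Q'
  • A row whose Embarked was C gets 1 under 'Embarked_C' and 0 under 'Embarked_S' and 'Embarked_Q'
  • A row whose Embarked was Q gets 1 under 'Embarked_Q' and 0 under 'Embarked_C' and 'Embarked_S'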
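The three rules above amount to comparing every row against every category. A minimal NumPy sketch of that idea, on four made-up rows (this only illustrates the encoding; it is not the pandas call used in the notebook):

```python
import numpy as np

# Toy Embarked column (hypothetical rows) and a fixed category order
embarked = np.array(["S", "C", "Q", "S"])
categories = np.array(["C", "Q", "S"])

# Compare every row with every category: entry (i, j) is 1 when row i equals category j
one_hot = (embarked[:, None] == categories[None, :]).astype(int)

for name, column in zip(categories, one_hot.T):
    print("Embarked_" + name, column)
```

Each row contains exactly one 1, which is where the name "one-hot" comes from.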

We use pandas' get_dummies to do this and concatenate the result onto the original data_train, as shown below.

# Logistic regression needs all input features to be numeric,
# so we first one-hot encode the categorical features
# Take Cabin as an example: it is one attribute whose value can be 'Yes' or 'No'; we expand it into two attributes, 'Cabin_Yes' and 'Cabin_No'
# Rows whose Cabin was 'Yes' get 1 under 'Cabin_Yes' and 0 under 'Cabin_No'
# Rows whose Cabin was 'No' get 0 under 'Cabin_Yes' and 1 under 'Cabin_No'
# We use pandas' get_dummies for this and concatenate the result onto the original data_train, as shown below
# Handle the categorical columns
dummies_Cabin = pd.get_dummies(data_train['Cabin'], prefix= 'Cabin')  # one-hot encode the Cabin attribute

dummies_Embarked = pd.get_dummies(data_train['Embarked'], prefix= 'Embarked')

dummies_Sex = pd.get_dummies(data_train['Sex'], prefix= 'Sex')

dummies_Pclass = pd.get_dummies(data_train['Pclass'], prefix= 'Pclass')

df = pd.concat([data_train, dummies_Cabin, dummies_Embarked, dummies_Sex, dummies_Pclass], axis=1)  # concatenate the DataFrames column-wise
df.drop(['Pclass', 'Name', 'Sex', 'Ticket', 'Cabin', 'Embarked'], axis=1, inplace=True)
df
PassengerIdSurvivedAgeSibSpParchFareCabin_NoCabin_YesEmbarked_CEmbarked_QEmbarked_SSex_femaleSex_malePclass_1Pclass_2Pclass_3
01022.000000107.25001000101001
12138.0000001071.28330110010100
23126.000000007.92501000110001
34135.0000001053.10000100110100
45035.000000008.05001000101001
56023.838953008.45831001001001
67054.0000000051.86250100101100
7802.0000003121.07501000101001
89127.0000000211.13331000110001
910114.0000001030.07081010010010
101114.0000001116.70000100110001
1112158.0000000026.55000100110100
1213020.000000008.05001000101001
1314039.0000001531.27501000101001
1415014.000000007.85421000110001
1516155.0000000016.00001000110010
161702.0000004129.12501001001001
1718132.0664930013.00001000101010
1819031.0000001018.00001000110001
1920129.518205007.22501010010001
2021035.0000000026.00001000101010
2122134.0000000013.00000100101010
2223115.000000008.02921001010001
2324128.0000000035.50000100101100
242508.0000003121.07501000110001
2526138.0000001531.38751000110001
2627029.518205007.22501010001001
2728019.00000032263.00000100101100
2829122.380113007.87921001010001
2930027.947206007.89581000101001
...................................................
861862021.0000001011.50001000101010
862863148.0000000025.92920100110100
863864010.8698678269.55001000110001
864865024.0000000013.00001000101010
865866142.0000000013.00001000110010
866867127.0000001013.85831010010010
867868031.0000000050.49580100101100
868869025.977889009.50001000101001
86987014.0000001111.13331000101001
870871026.000000007.89581000101001
871872147.0000001152.55420100110100
872873033.000000005.00000100101100
873874047.000000009.00001000101001
874875128.0000001024.00001010010010
875876115.000000007.22501010010001
876877020.000000009.84581000101001
877878019.000000007.89581000101001
878879027.947206007.89581000101001
879880156.0000000183.15830110010100
880881125.0000000126.00001000110010
881882033.000000007.89581000101001
882883022.0000000010.51671000110001
883884028.0000000010.50001000101010
884885025.000000007.05001000101001
885886039.0000000529.12501001010001
886887027.0000000013.00001000101010
887888119.0000000030.00000100110100
888889016.1939501223.45001000110001
889890126.0000000030.00000110001100
890891032.000000007.75001001001001

891 rows × 16 columns

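We still need some more processing. Look closely at the Age and Fare attributes: the range of their values across passengers is enormous!! If you know logistic regression and gradient descent, you will know that large differences in scale between features do tens of thousands of points of damage to convergence speed, or even prevent convergence altogether! (╬▔皿▔)... So we first use the preprocessing module in scikit-learn to scale these two. The scaling used here is standardization: each feature is shifted and rescaled to mean 0 and unit variance, so most values land within a few units of 0 (they are not strictly confined to [-1, 1]; one Fare_scaled value in the output below is 4.647001).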
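As a worked example of what standardization computes, the sketch below applies the formula z = (x - mean) / std by hand with NumPy to six Fare values taken from rows printed earlier in this post. Because the mean and standard deviation come from these six values only, the results differ from the Fare_scaled column of the full data set.

```python
import numpy as np

# Six Fare values taken from rows printed earlier in this post
fare = np.array([7.25, 71.2833, 7.925, 53.1, 8.05, 263.0])

# Standardization: subtract the mean, divide by the standard deviation
mean = fare.mean()
std = fare.std()  # population standard deviation (ddof=0), the convention StandardScaler also uses
fare_scaled = (fare - mean) / std

print(fare_scaled.round(3))
print(fare_scaled.mean().round(6), fare_scaled.std().round(6))
```

The scaled values have mean 0 and standard deviation 1, but the largest fare still maps to a value above 1, which is why the result is not bounded by [-1, 1].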

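# Next comes more preprocessing: standardize the features whose values vary over a wide range (mean 0, variance 1)
# This can speed up the convergence of logistic regression
# Handle the numeric columns
import sklearn.preprocessing as preprocessing
scaler = preprocessing.StandardScaler()  # standardizes a feature to mean 0 and variance 1 (it does not make the data normally distributed)
age_scale_param = scaler.fit(np.array(df["Age"]).reshape((-1,1)))  # the input must be a 2-D array of shape (n_samples, 1), not a Series; fit() returns the scaler itself
df['Age_scaled'] = scaler.fit_transform(np.array(df["Age"]).reshape((-1,1)), age_scale_param)  # the second argument is taken as y and ignored; fit_transform re-fits on Age
fare_scale_param = scaler.fit(np.array(df["Fare"]).reshape((-1,1)))
df['Fare_scaled'] = scaler.fit_transform(np.array(df["Fare"]).reshape((-1,1)), fare_scale_param)
df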
PassengerIdSurvivedAgeSibSpParchFareCabin_NoCabin_YesEmbarked_CEmbarked_QEmbarked_SSex_femaleSex_malePclass_1Pclass_2Pclass_3Age_scaledFare_scaled
01022.000000107.25001000101001-0.561380-0.502445
12138.0000001071.283301100101000.6131710.786845
23126.000000007.92501000110001-0.267742-0.488854
34135.0000001053.100001001101000.3929420.420730
45035.000000008.050010001010010.392942-0.486337
56023.838953008.45831001001001-0.426384-0.478116
67054.0000000051.862501001011001.7877220.395814
7802.0000003121.07501000101001-2.029569-0.224083
89127.0000000211.13331000110001-0.194333-0.424256
910114.0000001030.07081010010010-1.148655-0.042956
101114.0000001116.70000100110001-1.882750-0.312172
1112158.0000000026.550001001101002.081359-0.113846
1213020.000000008.05001000101001-0.708199-0.486337
1314039.0000001531.275010001010010.686580-0.018709
1415014.000000007.85421000110001-1.148655-0.490280
1516155.0000000016.000010001100101.861131-0.326267
161702.0000004129.12501001001001-2.029569-0.061999
1718132.0664930013.000010001010100.177595-0.386671
1819031.0000001018.000010001100010.099305-0.285997
1920129.518205007.22501010010001-0.009473-0.502949
2021035.0000000026.000010001010100.392942-0.124920
2122134.0000000013.000001001010100.319533-0.386671
2223115.000000008.02921001010001-1.075246-0.486756
2324128.0000000035.50000100101100-0.1209240.066360
242508.0000003121.07501000110001-1.589112-0.224083
2526138.0000001531.387510001100010.613171-0.016444
2627029.518205007.22501010001001-0.009473-0.502949
2728019.00000032263.00000100101100-0.7816084.647001
2829122.380113007.87921001010001-0.533476-0.489776
2930027.947206007.89581000101001-0.124799-0.489442
.........................................................
861862021.0000001011.50001000101010-0.634790-0.416873
862863148.0000000025.929201001101001.347265-0.126345
863864010.8698678269.55001000110001-1.3784370.751946
864865024.0000000013.00001000101010-0.414561-0.386671
865866142.0000000013.000010001100100.906808-0.386671
866867127.0000001013.85831010010010-0.194333-0.369389
867868031.0000000050.495801001011000.0993050.368295
868869025.977889009.50001000101001-0.269366-0.457142
86987014.0000001111.13331000101001-1.882750-0.424256
870871026.000000007.89581000101001-0.267742-0.489442
871872147.0000001152.554201001101001.2738560.409741
872873033.000000005.000001001011000.246124-0.547748
873874047.000000009.000010001010011.273856-0.467209
874875128.0000001024.00001010010010-0.120924-0.165189
875876115.000000007.22501010010001-1.075246-0.502949
876877020.000000009.84581000101001-0.708199-0.450180
877878019.000000007.89581000101001-0.781608-0.489442
878879027.947206007.89581000101001-0.124799-0.489442
879880156.0000000183.158301100101001.9345401.025945
880881125.0000000126.00001000110010-0.341152-0.124920
881882033.000000007.895810001010010.246124-0.489442
882883022.0000000010.51671000110001-0.561380-0.436671
883884028.0000000010.50001000101010-0.120924-0.437007
884885025.000000007.05001000101001-0.341152-0.506472
885886039.0000000529.125010010100010.686580-0.061999
886887027.0000000013.00001000101010-0.194333-0.386671
887888119.0000000030.00000100110100-0.781608-0.044381
888889016.1939501223.45001000110001-0.987599-0.176263
889890126.0000000030.00000110001100-0.267742-0.044381
890891032.000000007.750010010010010.172714-0.492378

891 rows × 18 columns

We take out the feature columns we need, convert them to NumPy format, and build a model with LogisticRegression from scikit-learn.

Step 4: Train a baseline model

# Take out the feature columns we need, convert them to NumPy format, and model with scikit-learn's LogisticRegression
from sklearn import linear_model

train_df = df.filter(regex='Survived|Age_.*|SibSp|Parch|Fare_.*|Cabin_.*|Embarked_.*|Sex_.*|Pclass_.*')  # select the processed feature columns
train_np = train_df.values  # convert the DataFrame to a NumPy array for training (.as_matrix() is deprecated)
# train_np
# y is the Survived outcome
y_final = train_np[:, 0]  # column 0 holds the survival label

# X holds the feature values
X_final = train_np[:, 1:]

# Fit a LogisticRegression model
clf = linear_model.LogisticRegression(C=1.0, penalty='l1', solver='liblinear', tol=1e-6)  # LR with an L1 penalty; the liblinear solver supports L1
clf.fit(X_final, y_final)
    
clf
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l1', random_state=None, solver='liblinear', tol=1e-06,
          verbose=0, warm_start=False)
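To show what the L1 penalty in this configuration tends to do, here is a small self-contained sketch on synthetic data (the data, the seed, and the smaller `C=0.1` are assumptions chosen for illustration; the Titanic features are not used). Only one of the five features actually drives the label.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic data (hypothetical): 200 samples, 5 features, only feature 0 drives the label
rng = np.random.RandomState(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + 0.1 * rng.normal(size=200) > 0).astype(int)

# penalty='l1' needs a solver that supports it, such as liblinear
model = LogisticRegression(C=0.1, penalty="l1", solver="liblinear", tol=1e-6)
model.fit(X, y)

print(model.coef_.round(3))                  # one weight per feature
print(model.predict_proba(X[:3]).round(3))   # class probabilities for the first 3 rows
```

An L1 penalty tends to shrink the weights of uninformative features toward zero, so the weight on feature 0 should dominate here; a smaller `C` means stronger regularization.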

Next we apply the same operations to the test set as to the training set.

data_test = pd.read_csv("test.csv")
## Handle missing values
data_test.loc[ (data_test.Fare.isnull()), 'Fare' ] = 0  # fill missing Fare values with 0
# Then apply the same feature transformations to data_test as to data_train
# First fill in the missing ages with the same RandomForestRegressor model
tmp_df = data_test[['Age','Fare', 'Parch', 'SibSp', 'Pclass']]
null_age = tmp_df[data_test.Age.isnull()].values  # rows with a missing age, as a NumPy array for the model (.as_matrix() is deprecated)
# Predict the age from the feature columns X and fill it in
X = null_age[:, 1:]  # the other 4 numeric attributes of the rows whose age is missing
predictedAges = rfr.predict(X)  # fill in using the RF model fitted on the training set
data_test.loc[ (data_test.Age.isnull()), 'Age' ] = predictedAges

## Handle the categorical columns (one-hot)
data_test = set_Cabin_type(data_test)
dummies_Cabin = pd.get_dummies(data_test['Cabin'], prefix= 'Cabin')
dummies_Embarked = pd.get_dummies(data_test['Embarked'], prefix= 'Embarked')
dummies_Sex = pd.get_dummies(data_test['Sex'], prefix= 'Sex')
dummies_Pclass = pd.get_dummies(data_test['Pclass'], prefix= 'Pclass')

df_test = pd.concat([data_test, dummies_Cabin, dummies_Embarked, dummies_Sex, dummies_Pclass], axis=1)
df_test.drop(['Pclass', 'Name', 'Sex', 'Ticket', 'Cabin', 'Embarked'], axis=1, inplace=True)
## Scale the numeric columns
# Note: fit_transform re-estimates the mean and standard deviation on the test set (its second argument is ignored);
# calling transform on scalers fitted to the training columns would reuse the training statistics instead
df_test['Age_scaled'] = scaler.fit_transform(np.array(df_test['Age']).reshape(-1,1), age_scale_param)
df_test['Fare_scaled'] = scaler.fit_transform(np.array(df_test['Fare']).reshape(-1,1), fare_scale_param)
df_test
PassengerIdAgeSibSpParchFareCabin_NoCabin_YesEmbarked_CEmbarked_QEmbarked_SSex_femaleSex_malePclass_1Pclass_2Pclass_3Age_scaledFare_scaled
089234.500000007.829210010010010.307521-0.496637
189347.000000107.000010001100011.256241-0.511497
289462.000000009.687510010010102.394706-0.463335
389527.000000008.66251000101001-0.261711-0.481704
489622.0000001112.28751000110001-0.641199-0.416740
589714.000000009.22501000101001-1.248380-0.471623
689830.000000007.62921001010001-0.034018-0.500221
789926.0000001129.00001000101010-0.337609-0.117238
890018.000000007.22921010010001-0.944790-0.507390
990121.0000002024.15001000101001-0.717097-0.204154
1090227.947206007.89581000101001-0.189820-0.495444
1190346.0000000026.000010001011001.180344-0.171000
1290423.0000001082.26670100110100-0.5653010.837349
1390563.0000001026.000010001010102.470603-0.171000
1490647.0000001061.175001001101001.2562410.459367
1590724.0000001027.72081010010010-0.489404-0.140162
1690835.0000000012.350010010010100.345470-0.415620
1790921.000000007.22501010001001-0.717097-0.507465
1891027.000000107.92501000110001-0.261711-0.494920
1991145.000000007.225010100100011.104446-0.507465
2091255.0000001059.400010100011001.8634220.427557
219139.000000013.17081000101001-1.627868-0.580120
2291452.3143110031.683310001101001.659585-0.069151
2391521.0000000161.37921010001100-0.7170970.463026
2491648.00000013262.375001100101001.3321394.065049
2591750.0000001014.500010001010011.483934-0.377090
2691822.0000000161.97920110010100-0.6411990.473779
2791922.500000007.22501010001001-0.603250-0.507465
2892041.0000000030.500001001011000.800856-0.090356
2992123.4596832021.67921010001001-0.530413-0.248433
......................................................
388128021.000000007.75001001001001-0.717097-0.498056
38912816.0000003121.07501000101001-1.855561-0.259261
390128223.0000000093.50000100101100-0.5653011.038659
391128351.0000000139.400001001101001.5598320.069140
392128413.0000000220.25001000101001-1.324278-0.274045
393128547.0000000010.500010001010101.256241-0.448774
394128629.0000003122.02501000101001-0.109916-0.242236
395128718.0000001060.00000100110100-0.9447900.438310
396128824.000000007.25001001001001-0.489404-0.507017
397128948.0000001179.200001100101001.3321390.782391
398129022.000000007.77501000101001-0.641199-0.497608
399129131.000000007.733310010010010.041880-0.498356
400129230.00000000164.86670100110100-0.0340182.317614
401129338.0000001021.000010001010100.573163-0.260605
402129422.0000000159.40001010010100-0.6411990.427557
403129517.0000000047.10001000101100-1.0206870.207130
404129643.0000001027.720801100011000.952651-0.140162
405129720.0000000013.86250110001010-0.792994-0.388515
406129823.0000001010.50001000101010-0.565301-0.448774
407129950.00000011211.500001100011001.4839343.153324
408130019.895581007.72081001010001-0.800919-0.498580
40913013.0000001113.77501000110001-2.083254-0.390083
410130235.295824007.750010010100010.367922-0.498056
411130337.0000001090.000001010101000.4972650.975936
412130428.000000007.77501000110001-0.185813-0.497608
413130530.705727008.050010001010010.019545-0.492680
414130639.00000000108.900001100101000.6490611.314641
415130738.500000007.250010001010010.611112-0.507017
416130830.705727008.050010001010010.019545-0.492680
417130925.7833771122.35831010001001-0.354050-0.236263

418 rows × 17 columns
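In this data set the test frame ends up with the same dummy columns as the training frame, but calling get_dummies separately on two frames does not guarantee that in general. The sketch below, on two hypothetical toy frames, shows one way to guard against a category that is absent from the test split by reindexing to the training columns.

```python
import pandas as pd

# Hypothetical toy frames: this test split happens to contain no 'Q' passengers
train = pd.DataFrame({"Embarked": ["S", "C", "Q", "S"]})
test = pd.DataFrame({"Embarked": ["C", "S"]})

train_dummies = pd.get_dummies(train["Embarked"], prefix="Embarked")
test_dummies = pd.get_dummies(test["Embarked"], prefix="Embarked")

# Reorder/extend the test columns to match training; absent categories become all-zero columns
test_aligned = test_dummies.reindex(columns=train_dummies.columns, fill_value=0)

print(list(test_dummies.columns))
print(list(test_aligned.columns))
```

After the reindex, the test features have the same columns in the same order as the ones the model was trained on.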

test = df_test.filter(regex='Age_.*|SibSp|Parch|Fare_.*|Cabin_.*|Embarked_.*|Sex_.*|Pclass_.*')  # select the processed feature columns
predictions = clf.predict(test)  # predict the outcomes
result = pd.DataFrame({'PassengerId':data_test['PassengerId'].values, 'Survived':predictions.astype(np.int32)})  # pair each prediction with its PassengerId (.as_matrix() is deprecated)
result.to_csv("logistic_regression_predictions.csv", index=False)  # save the results for submission
pd.read_csv("logistic_regression_predictions.csv")
PassengerIdSurvived
08920
18930
28940
38950
48961
58970
68981
78990
89001
99010
109020
119030
129041
139050
149061
159071
169080
179090
189101
199111
209120
219130
229141
239150
249161
259170
269181
279190
289200
299210
.........
38812800
38912810
39012821
39112831
39212840
39312850
39412860
39512871
39612880
39712891
39812900
39912910
40012921
40112930
40212941
40312950
40412960
40512971
40612980
40712990
40813001
40913011
41013021
41113031
41213041
41313050
41413061
41513070
41613080
41713090

418 rows × 2 columns

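0.76555. Well, that is a decent result. After all, this is only a baseline system built from a quick analysis.

Step 5: Optimizing the model

First we need to determine which state the current model is in (underfitting or overfitting).

A very likely problem is that as we keep doing feature engineering, we produce more and more features. Training on them fits the training set better and better, but the model may gradually lose its ability to generalize and then perform poorly on the data to be predicted. That is overfitting.

Looking at it from another angle, if the model performs poorly on the data to be predicted, the cause may be underfitting rather than overfitting: the model does not fit even the training set particularly well.

Hmm, how to explain underfitting and overfitting? Let's put it this way:

  1. Overfitting is like that classmate who studies math by rote: every problem the teacher covered is memorized word for word, so when the teacher sets the same problem again, the exact answer comes out in no time. But math exams always contain new problems, so the grades are not great.
  2. Underfitting is like, ahem, a weak student at about the blogger's level, who cannot even remember the exercises the teacher went through. Even the weekly review quiz, which reuses the same problems, goes badly, so you can imagine how the exam goes.

In machine learning, overfitting and underfitting call for different remedies.

For overfitting, the following strategies usually help:

  • Do feature selection: pick a good subset of the features to train on
  • Provide more data, which compensates for bias in the original sample and lets the learned model be more accurate

For underfitting, we usually need more features and a more complex model to improve accuracy.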

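The well-known learning curve helps us determine which state the model is in. We put the number of samples on the x-axis and the error rate on the training set and on the cross-validation set on the y-axis. The two states are overfitting (high variance) and underfitting (high bias):

[Two figures: learning curves for the high-variance (overfitting) case and the high-bias (underfitting) case]

We can also replace the error rate with accuracy (the score) to get another form of the learning curve; this is what sklearn does.

Back to our problem: we use learning_curve from scikit-learn to tell which state our model is in. As an example, let's plot the learning curve of the baseline model we obtained first.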

## Use the learning curve to judge which state the model is in
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import learning_curve  # sklearn.learning_curve in releases before 0.18

# Get training_score and cv_score from sklearn's learning_curve, then draw the learning curve with matplotlib
def plot_learning_curve(estimator, title, X, y, ylim=None, cv=None, n_jobs=1, 
                        train_sizes=np.linspace(.05, 1., 20), verbose=0, plot=True):
    """
    Plot the learning curve of a model on the data.
    Parameters
    ----------
    estimator : the classifier to use.
    title : title of the figure.
    X : input features, numpy array
    y : input target vector
    ylim : tuple (ymin, ymax) setting the lowest and highest points of the y-axis
    cv : number of folds for cross-validation; one fold is the cv set and the other n-1 folds are used for training
         (None uses scikit-learn's default: 3 folds in older releases, 5 since 0.22)
    n_jobs : number of parallel jobs (default 1)
    """
    train_sizes, train_scores, test_scores = learning_curve(
        estimator, X, y, cv=cv, n_jobs=n_jobs, train_sizes=train_sizes, verbose=verbose)
    
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)
    
    if plot:
        plt.figure()
        plt.title(title)
        if ylim is not None:
            plt.ylim(*ylim)
        plt.xlabel(u"Number of training samples")
        plt.ylabel(u"Score")
        plt.grid()
    
        # Shaded bands show one standard deviation around each mean score
        plt.fill_between(train_sizes, train_scores_mean - train_scores_std, train_scores_mean + train_scores_std, 
                         alpha=0.1, color="b")
        plt.fill_between(train_sizes, test_scores_mean - test_scores_std, test_scores_mean + test_scores_std, 
                         alpha=0.1, color="r")
        plt.plot(train_sizes, train_scores_mean, 'o-', color="b", label=u"Score on training set")
        plt.plot(train_sizes, test_scores_mean, 'o-', color="r", label=u"Score on cross-validation set")
    
        plt.legend(loc="best")
        
        plt.draw()
        plt.show()
    
    # Midpoint and width of the band between the two curves at the largest training size
    midpoint = ((train_scores_mean[-1] + train_scores_std[-1]) + (test_scores_mean[-1] - test_scores_std[-1])) / 2
    diff = (train_scores_mean[-1] + train_scores_std[-1]) - (test_scores_mean[-1] - test_scores_std[-1])
    return midpoint, diff

plot_learning_curve(clf, u"Learning curve", X_final, y_final)

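[Figure: learning curve of the baseline model, showing the training-set score and the cross-validation score against the number of training samples]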

(0.80656968448540245, 0.018258876711338634)


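On real data the learning curve is not as smooth as the textbook version, but we can roughly see that the score curves on the training set and the cross-validation set behave as expected.

Judging from the current curve, our model is not overfitting (overfitting typically shows a high score on the training set and a much lower score on the cross-validation set, with a large gap in between). So we can do some more feature engineering and add new or combined features to the model.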
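
As a reading aid for the tuple printed by the cell above: plot_learning_curve returns the midpoint and the width of the band between the upper edge of the training score (mean + std) and the lower edge of the cross-validation score (mean - std) at the largest training size. Inverting those two definitions recovers the two edges. The helper name `band_edges` is introduced here only for illustration.

```python
def band_edges(midpoint, diff):
    """Invert midpoint = (upper + lower) / 2 and diff = upper - lower."""
    upper = midpoint + diff / 2.0   # train_scores_mean[-1] + train_scores_std[-1]
    lower = midpoint - diff / 2.0   # test_scores_mean[-1] - test_scores_std[-1]
    return upper, lower

# Values printed by the notebook cell above
upper, lower = band_edges(0.80656968448540245, 0.018258876711338634)
print("train upper edge: %.4f" % upper)
print("cv lower edge:    %.4f" % lower)
```

With these inputs the two edges come out at roughly 0.816 and 0.797, a gap of under two percentage points, which is consistent with the reading that the baseline is not overfitting.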

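Next, let's see how to improve the baseline system.

There are still some features we can dig into:

  1. For example, the Name and Ticket attributes were dropped entirely (well, at the start we had no straightforward way to handle attributes whose value is different in almost every record)
  2. For example, come to think of it, fitting the missing ages is not necessarily very reliable in itself
  3. Also, everyday experience says that children and the elderly probably get more care. Seen that way, age as a continuous value with a single fixed coefficient cannot express that both ends are favored, so discretizing age into ranges and treating it as a categorical attribute may be more suitable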
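
Item 3 above can be sketched with pandas. This is a minimal illustration rather than the post's own code: the bin edges, the labels and the toy ages are all assumptions chosen for the example.

```python
import pandas as pd

# Toy ages, hypothetical values for illustration only
toy = pd.DataFrame({"Age": [2.0, 9.0, 25.0, 40.0, 63.0, 71.0]})

# Assumed bin edges and labels; the post does not prescribe any
bins = [0, 12, 60, 120]
labels = ["child", "adult", "elderly"]
toy["Age_bucket"] = pd.cut(toy["Age"], bins=bins, labels=labels)

# One-hot encode the buckets so each age range gets its own coefficient
age_dummies = pd.get_dummies(toy["Age_bucket"], prefix="Age_bucket")
toy = pd.concat([toy, age_dummies], axis=1)
print(toy)
```

With a linear model, separate columns let the youngest and oldest ranges receive their own weights instead of sharing one slope.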

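So how do we know which parts can be improved, and which ideas are promising?

Yes:

Do cross validation!

Do cross validation!

Do cross validation!

Important things get said three times!!!

Because test.csv has no Survived field (fine, that goes without saying, it is exactly what we need to predict), we cannot use that data to evaluate how well our algorithm does in this scenario...

Usually we do cross validation like this: split train.csv into two parts, train the model on one part, and check how well the predictions do on the other part. (Strictly speaking this is a single hold-out split; k-fold cross validation repeats it over several splits.)

We can use scikit-learn's cross-validation utilities for this.

Before that, let's look at the coefficients of the model we have now, because the sign and magnitude of each coefficient indicate how that feature influences the final decision.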

pd.DataFrame({"columns":list(train_df.columns)[1:], "coef":list(clf.coef_.T)})  # use the LR model's coefficients to gauge feature importance
     coef  columns
0    [-0.34423548326]  SibSp
1    [-0.104915808836]  Parch
2    [0.0]  Cabin_No
3    [0.902107533438]  Cabin_Yes
4    [0.0]  Embarked_C
5    [0.0]  Embarked_Q
6    [-0.417263127613]  Embarked_S
7    [1.95657020854]  Sex_female
8    [-0.677421170681]  Sex_male
9    [0.341159711576]  Pclass_1
10   [0.0]  Pclass_2
11   [-1.1941300472]  Pclass_3
12   [-0.523766573778]  Age_scaled
13   [0.0844349202536]  Fare_scaled

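A positive coefficient raises the predicted probability of survival and a negative one lowers it.

Let's first look at the features whose weights have a very large absolute value in our model:

  • Sex: being female greatly raises the final survival probability, while being male pulls it down considerably.
  • Pclass: first-class passengers have a higher survival probability, while third class pulls it down sharply.
  • Having a Cabin value raises the survival probability considerably (there seems to be a hint of something here. In fact, the Survived distribution by presence of a Cabin record near the top shows that some passengers with a Cabin record also died, so we probably have not dug into this attribute enough).
  • Age has a negative coefficient, meaning that in our model the younger the passenger, the higher the priority for rescue (we should go back to the raw data to check whether this is reasonable).
  • Embarkation port S pulls the survival probability down considerably, while the other two ports have no effect at all (this is actually very strange, because the earlier charts did not show a particularly low survival rate for port S, so we might try dropping the embarkation port feature).
  • Fare has a small positive coefficient (this does not mean the feature matters little. Perhaps we have not refined it enough; for example, maybe we should discretize it and then split it by passenger class?).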
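
To make the reading of the signs and magnitudes above more concrete, one option is to exponentiate the coefficients: in logistic regression, exp(coef) is the factor by which the odds of survival are multiplied when the feature goes up by one unit, all else held fixed. The sketch below copies a few values from the coefficient table; the variable names are illustrative.

```python
import numpy as np

# Coefficients copied from the coefficient table above
coefs = {
    "Sex_female": 1.95657020854,
    "Sex_male": -0.677421170681,
    "Cabin_Yes": 0.902107533438,
    "Pclass_3": -1.1941300472,
}

# exp(coef) is the multiplicative change in the odds of survival
odds_ratios = {name: float(np.exp(value)) for name, value in coefs.items()}
for name, ratio in sorted(odds_ratios.items(), key=lambda kv: -kv[1]):
    print("%-10s odds x %.2f" % (name, ratio))
```

Because the model was fit with an L1 penalty on one-hot columns, these ratios describe this particular fit rather than a causal effect.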

OK, observation done. We now have some ideas, but how do we know which optimizations are promising?

Right, we rely on cross validation.

from sklearn.model_selection import cross_val_score, train_test_split  # sklearn.cross_validation in releases before 0.18

# A quick look at the scores
clf = linear_model.LogisticRegression(C=1.0, penalty='l1', solver='liblinear', tol=1e-6)  # the l1 penalty needs the liblinear solver
all_data = df.filter(regex='Survived|Age_.*|SibSp|Parch|Fare_.*|Cabin_.*|Embarked_.*|Sex_.*|Pclass_.*')
X = all_data.values[:,1:]
y = all_data.values[:,0]
print(cross_val_score(clf, X, y, cv=5))


# Split the data
split_train, split_cv = train_test_split(df, test_size=0.3, random_state=0)  # split into a training set and an evaluation set
train_df = split_train.filter(regex='Survived|Age_.*|SibSp|Parch|Fare_.*|Cabin_.*|Embarked_.*|Sex_.*|Pclass_.*')
# Build the model
clf = linear_model.LogisticRegression(C=1.0, penalty='l1', solver='liblinear', tol=1e-6)
clf.fit(train_df.values[:,1:], train_df.values[:,0])   # train the model

# Predict on the cross validation data
cv_df = split_cv.filter(regex='Survived|Age_.*|SibSp|Parch|Fare_.*|Cabin_.*|Embarked_.*|Sex_.*|Pclass_.*')
predictions = clf.predict(cv_df.values[:,1:])  # get the predictions
# split_cv[predictions != cv_df.values[:,0]].drop(axis = 0)     # drop the misclassified samples
[ 0.81564246  0.81564246  0.78651685  0.78651685  0.81355932]

# Pull out the misclassified cases and look at them in the original DataFrame
#split_cv['PredictResult'] = predictions
origin_data_train = pd.read_csv("Train.csv")
bad_cases = origin_data_train.loc[origin_data_train['PassengerId'].isin(split_cv[predictions != cv_df.values[:,0]]['PassengerId'].values)]
bad_cases
     PassengerId  Survived  Pclass  Name  Sex  Age  SibSp  Parch  Ticket  Fare  Cabin  Embarked
14   15  0  3  Vestrom, Miss. Hulda Amanda Adolfina  female  14.00  0  0  350406  7.8542  NaN  S
49   50  0  3  Arnold-Franchi, Mrs. Josef (Josefine Franchi)  female  18.00  1  0  349237  17.8000  NaN  S
55   56  1  1  Woolner, Mr. Hugh  male  NaN  0  0  19947  35.5000  C52  S
65   66  1  3  Moubarek, Master. Gerios  male  NaN  1  1  2661  15.2458  NaN  C
68   69  1  3  Andersson, Miss. Erna Alexandra  female  17.00  4  2  3101281  7.9250  NaN  S
85   86  1  3  Backstrom, Mrs. Karl Alfred (Maria Mathilda Gu...  female  33.00  3  0  3101278  15.8500  NaN  S
113  114  0  3  Jussila, Miss. Katriina  female  20.00  1  0  4136  9.8250  NaN  S
140  141  0  3  Boulos, Mrs. Joseph (Sultana)  female  NaN  0  2  2678  15.2458  NaN  C
204  205  1  3  Cohen, Mr. Gurshon "Gus"  male  18.00  0  0  A/5 3540  8.0500  NaN  S
240  241  0  3  Zabour, Miss. Thamine  female  NaN  1  0  2665  14.4542  NaN  C
251  252  0  3  Strom, Mrs. Wilhelm (Elna Matilda Persson)  female  29.00  1  1  347054  10.4625  G6  S
261  262  1  3  Asplund, Master. Edvin Rojj Felix  male  3.00  4  2  347077  31.3875  NaN  S
264  265  0  3  Henry, Miss. Delia  female  NaN  0  0  382649  7.7500  NaN  Q
267  268  1  3  Persson, Mr. Ernst Ulrik  male  25.00  1  0  347083  7.7750  NaN  S
271  272  1  3  Tornquist, Mr. William Henry  male  25.00  0  0  LINE  0.0000  NaN  S
279  280  1  3  Abbott, Mrs. Stanton (Rosa Hunt)  female  35.00  1  1  C.A. 2673  20.2500  NaN  S
283  284  1  3  Dorking, Mr. Edward Arthur  male  19.00  0  0  A/5. 10482  8.0500  NaN  S
293  294  0  3  Haas, Miss. Aloisia  female  24.00  0  0  349236  8.8500  NaN  S
298  299  1  1  Saalfeld, Mr. Adolphe  male  NaN  0  0  19988  30.5000  C106  S
301  302  1  3  McCoy, Mr. Bernard  male  NaN  2  0  367226  23.2500  NaN  Q
312  313  0  2  Lahtinen, Mrs. William (Anna Sylfven)  female  26.00  1  1  250651  26.0000  NaN  S
338  339  1  3  Dahl, Mr. Karl Edwart  male  45.00  0  0  7598  8.0500  NaN  S
362  363  0  3  Barbara, Mrs. (Catherine David)  female  45.00  0  1  2691  14.4542  NaN  C
390  391  1  1  Carter, Mr. William Ernest  male  36.00  1  2  113760  120.0000  B96 B98  S
402  403  0  3  Jussila, Miss. Mari Aina  female  21.00  1  0  4137  9.8250  NaN  S
447  448  1  1  Seward, Mr. Frederic Kimber  male  34.00  0  0  113794  26.5500  NaN  S
474  475  0  3  Strandberg, Miss. Ida Sofia  female  22.00  0  0  7553  9.8375  NaN  S
483  484  1  3  Turkula, Mrs. (Hedwig)  female  63.00  0  0  4134  9.5875  NaN  S
489  490  1  3  Coutts, Master. Eden Leslie "Neville"  male  9.00  1  1  C.A. 37671  15.9000  NaN  S
501  502  0  3  Canavan, Miss. Mary  female  21.00  0  0  364846  7.7500  NaN  Q
503  504  0  3  Laitinen, Miss. Kristina Sofia  female  37.00  0  0  4135  9.5875  NaN  S
505  506  0  1  Penasco y Castellana, Mr. Victor de Satode  male  18.00  1  0  PC 17758  108.9000  C65  C
564  565  0  3  Meanwell, Miss. (Marion Ogden)  female  NaN  0  0  SOTON/O.Q. 392087  8.0500  NaN  S
567  568  0  3  Palsson, Mrs. Nils (Alma Cornelia Berglund)  female  29.00  0  4  349909  21.0750  NaN  S
570  571  1  2  Harris, Mr. George  male  62.00  0  0  S.W./PP 752  10.5000  NaN  S
587  588  1  1  Frolicher-Stehli, Mr. Maxmillian  male  60.00  1  1  13567  79.2000  B41  C
642  643  0  3  Skoog, Miss. Margit Elizabeth  female  2.00  3  2  347088  27.9000  NaN  S
643  644  1  3  Foo, Mr. Choong  male  NaN  0  0  1601  56.4958  NaN  S
647  648  1  1  Simonius-Blumer, Col. Oberst Alfons  male  56.00  0  0  13213  35.5000  A26  C
654  655  0  3  Hegarty, Miss. Hanora "Nora"  female  18.00  0  0  365226  6.7500  NaN  Q
680  681  0  3  Peters, Miss. Katie  female  NaN  0  0  330935  8.1375  NaN  Q
712  713  1  1  Taylor, Mr. Elmer Zebley  male  48.00  1  0  19996  52.0000  C126  S
740  741  1  1  Hawksford, Mr. Walter James  male  NaN  0  0  16988  30.0000  D45  S
762  763  1  3  Barah, Mr. Hanna Assi  male  20.00  0  0  2663  7.2292  NaN  C
788  789  1  3  Dean, Master. Bertram Vere  male  1.00  1  2  C.A. 2315  20.5750  NaN  S
803  804  1  3  Thomas, Master. Assad Alexander  male  0.42  0  1  2625  8.5167  NaN  C
838  839  1  3  Chip, Mr. Chang  male  32.00  0  0  1601  56.4958  NaN  S
839  840  1  1  Marechal, Mr. Pierre  male  NaN  0  0  11774  29.7000  C47  C
852  853  0  3  Boulos, Miss. Nourelain  female  9.00  1  1  2678  15.2458  NaN  C
882  883  0  3  Dahlberg, Miss. Gerda Ulrika  female  22.00  0  0  7552  10.5167  NaN  S

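Looking at the bad cases, let's examine the samples we mispredicted: which features are the problem, and where is our processing still too coarse?

Here are some optimizations that might be worth trying:

  • Instead of the current fitted imputation for Age, fill missing values with the average age of passengers who share the same title in their name ("Mr", "Mrs", "Miss" and so on).
  • Instead of treating Age as a continuous attribute, discretize it with a fixed step into a categorical feature.
  • Refine Cabin: for records with a Cabin value, split it into the leading letter (my guess is that it encodes location or deck) and the trailing number (probably the room number; interestingly, if you look closely at the raw data, larger values seem to go with a somewhat higher chance of survival).
  • Pclass and Sex are so important that we try building a combined attribute from them, which is another kind of refinement.
  • Add a separate Child field: 1 when Age<=12, otherwise 0 (look at the data, children really do get high priority).
  • If the name contains "Mrs" and Parch>1, we guess she may be a mother, who should also have a higher chance of survival. So add a Mother field, 1 in this case and 0 otherwise.
  • Consider dropping the port of embarkation first (Q and C carry no weight anyway, and S is a bit odd).
  • Add SibSp (siblings/spouses), Parch and the passenger themselves into a Family_size field (large families may affect the outcome).
  • Name is an attribute we have not touched so far. We can do some simple processing, for example unifying men whose names carry certain words ('Capt', 'Don', 'Major', 'Sir') under one Title, and likewise for women.

Keep digging and you may think of more things to refine. I'll stop the list here; we can then use the "train_df" and "cv_df" we have on hand to test whether these feature engineering tricks actually help.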
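
Several of the ideas listed above can be sketched in a few lines of pandas. This is a minimal illustration under stated assumptions, not the post's own implementation: the toy rows are made up, the bucket name "Officer" is hypothetical, and the regular expressions are one possible way to pull out the title and the cabin parts.

```python
import pandas as pd

# Toy rows using the Titanic column names; the values are made up for illustration
toy = pd.DataFrame({
    "Name": ["Smith, Mrs. Anna", "Smith, Master. Tom", "Jones, Mr. Bob", "Lee, Major. Sam"],
    "Age": [30.0, 4.0, 45.0, 52.0],
    "SibSp": [1, 1, 0, 0],
    "Parch": [2, 2, 0, 0],
    "Cabin": ["C85", None, None, "B96 B98"],
})

# Child: 1 when Age <= 12, otherwise 0
toy["Child"] = (toy["Age"] <= 12).astype(int)

# Mother: name contains "Mrs" and Parch > 1
toy["Mother"] = (toy["Name"].str.contains("Mrs") & (toy["Parch"] > 1)).astype(int)

# Family_size: siblings/spouses + parents/children + the passenger
toy["Family_size"] = toy["SibSp"] + toy["Parch"] + 1

# Title: the word between the comma and the first period
toy["Title"] = toy["Name"].str.extract(r",\s*([^.]+)\.", expand=False).str.strip()
# Assumed grouping: fold the rare male titles into one hypothetical bucket
toy["Title"] = toy["Title"].replace(
    {"Capt": "Officer", "Don": "Officer", "Major": "Officer", "Sir": "Officer"})

# Cabin: leading letter and first room number, NaN when there is no record
toy["Cabin_letter"] = toy["Cabin"].str.extract(r"^([A-Za-z])", expand=False)
toy["Cabin_number"] = pd.to_numeric(toy["Cabin"].str.extract(r"(\d+)", expand=False))
print(toy)
```

Whether any of these columns helps is exactly what the hold-out split is for: add one, refit, and compare the score on cv_df.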

data_train[data_train['Name'].str.contains("Major")]
     PassengerId  Survived  Pclass  Name  Sex  Age  SibSp  Parch  Ticket  Fare  Cabin  Embarked
449  450  1  1  Peuchen, Major. Arthur Godfrey  male  52  0  0  113786  30.50  Yes  S
536  537  0  1  Butt, Major. Archibald Willingham  male  45  0  0  113050  26.55  Yes  S
data_train = pd.read_csv("Train.csv")
data_train['Sex_Pclass'] = data_train.Sex + "_" + data_train.Pclass.map(str)

from sklearn.ensemble import RandomForestRegressor
 
### Fill in the missing Age values with a RandomForestRegressor
def set_missing_ages(df):
    
    # Take the existing numeric features and feed them to the Random Forest Regressor
    age_df = df[['Age','Fare', 'Parch', 'SibSp', 'Pclass']]

    # Split the passengers into those with a known age and those without
    known_age = age_df[age_df.Age.notnull()].values
    unknown_age = age_df[age_df.Age.isnull()].values

    # y is the target age
    y = known_age[:, 0]

    # X is the feature values
    X = known_age[:, 1:]

    # Fit the RandomForestRegressor
    rfr = RandomForestRegressor(random_state=0, n_estimators=2000, n_jobs=-1)
    rfr.fit(X, y)
    
    # Predict the unknown ages with the fitted model
    predictedAges = rfr.predict(unknown_age[:, 1::])
    
    # Fill the missing values with the predictions
    df.loc[ (df.Age.isnull()), 'Age' ] = predictedAges 
    
    return df, rfr

def set_Cabin_type(df):
    df.loc[ (df.Cabin.notnull()), 'Cabin' ] = "Yes"
    df.loc[ (df.Cabin.isnull()), 'Cabin' ] = "No"
    return df

data_train, rfr = set_missing_ages(data_train)
data_train = set_Cabin_type(data_train)

dummies_Cabin = pd.get_dummies(data_train['Cabin'], prefix= 'Cabin')
dummies_Embarked = pd.get_dummies(data_train['Embarked'], prefix= 'Embarked')
dummies_Sex = pd.get_dummies(data_train['Sex'], prefix= 'Sex')
dummies_Pclass = pd.get_dummies(data_train['Pclass'], prefix= 'Pclass')
dummies_Sex_Pclass = pd.get_dummies(data_train['Sex_Pclass'], prefix= 'Sex_Pclass')


df = pd.concat([data_train, dummies_Cabin, dummies_Embarked, dummies_Sex, dummies_Pclass, dummies_Sex_Pclass], axis=1)
df.drop(['Pclass', 'Name', 'Sex', 'Ticket', 'Cabin', 'Embarked', 'Sex_Pclass'], axis=1, inplace=True)
import sklearn.preprocessing as preprocessing
scaler = preprocessing.StandardScaler()
age_scale_param = scaler.fit(np.array(df['Age']).reshape(-1,1))
df['Age_scaled'] = scaler.fit_transform(np.array(df['Age']).reshape(-1,1), age_scale_param)
fare_scale_param = scaler.fit(np.array(df['Fare']).reshape(-1,1))
df['Fare_scaled'] = scaler.fit_transform(np.array(df['Fare']).reshape(-1,1), fare_scale_param)

from sklearn import linear_model

train_df = df.filter(regex='Survived|Age_.*|SibSp|Parch|Fare_.*|Cabin_.*|Embarked_.*|Sex_.*|Pclass.*')
train_np = train_df.values

# y is the Survived outcome
y = train_np[:, 0]

# X is the feature values
X = train_np[:, 1:]

# Fit the LogisticRegression model
clf = linear_model.LogisticRegression(C=1.0, penalty='l1', solver='liblinear', tol=1e-6)
clf.fit(X, y)
clf
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l1', random_state=None, solver='liblinear', tol=1e-06,
          verbose=0, warm_start=False)
data_test = pd.read_csv("test.csv")
data_test.loc[ (data_test.Fare.isnull()), 'Fare' ] = 0
data_test['Sex_Pclass'] = data_test.Sex + "_" + data_test.Pclass.map(str)
# Next, apply the same feature transformations to test_data as to train_data
# First fill in the missing ages with the same RandomForestRegressor model
tmp_df = data_test[['Age','Fare', 'Parch', 'SibSp', 'Pclass']]
null_age = tmp_df[data_test.Age.isnull()].values
# Predict the ages from the feature values X and fill them in
X = null_age[:, 1:]
predictedAges = rfr.predict(X)
data_test.loc[ (data_test.Age.isnull()), 'Age' ] = predictedAges

data_test = set_Cabin_type(data_test)
dummies_Cabin = pd.get_dummies(data_test['Cabin'], prefix= 'Cabin')
dummies_Embarked = pd.get_dummies(data_test['Embarked'], prefix= 'Embarked')
dummies_Sex = pd.get_dummies(data_test['Sex'], prefix= 'Sex')
dummies_Pclass = pd.get_dummies(data_test['Pclass'], prefix= 'Pclass')
dummies_Sex_Pclass = pd.get_dummies(data_test['Sex_Pclass'], prefix= 'Sex_Pclass')


df_test = pd.concat([data_test, dummies_Cabin, dummies_Embarked, dummies_Sex, dummies_Pclass, dummies_Sex_Pclass], axis=1)
df_test.drop(['Pclass', 'Name', 'Sex', 'Ticket', 'Cabin', 'Embarked', 'Sex_Pclass'], axis=1, inplace=True)
# Note: fit_transform refits the scaler on the test set (its second argument is ignored);
# calling transform with the training-set statistics is the more usual practice
df_test['Age_scaled'] = scaler.fit_transform(np.array(df_test['Age']).reshape(-1,1), age_scale_param)
df_test['Fare_scaled'] = scaler.fit_transform(np.array(df_test['Fare']).reshape(-1,1), fare_scale_param)
df_test
     PassengerId  Age  SibSp  Parch  Fare  Cabin_No  Cabin_Yes  Embarked_C  Embarked_Q  Embarked_S  ...  Pclass_2  Pclass_3  Sex_Pclass_female_1  Sex_Pclass_female_2  Sex_Pclass_female_3  Sex_Pclass_male_1  Sex_Pclass_male_2  Sex_Pclass_male_3  Age_scaled  Fare_scaled
0    892  34.500000  0  0  7.8292  1  0  0  1  0  ...  0  1  0  0  0  0  0  1  0.307521  -0.496637
1    893  47.000000  1  0  7.0000  1  0  0  0  1  ...  0  1  0  0  1  0  0  0  1.256241  -0.511497
2    894  62.000000  0  0  9.6875  1  0  0  1  0  ...  1  0  0  0  0  0  1  0  2.394706  -0.463335
3    895  27.000000  0  0  8.6625  1  0  0  0  1  ...  0  1  0  0  0  0  0  1  -0.261711  -0.481704
4    896  22.000000  1  1  12.2875  1  0  0  0  1  ...  0  1  0  0  1  0  0  0  -0.641199  -0.416740
5    897  14.000000  0  0  9.2250  1  0  0  0  1  ...  0  1  0  0  0  0  0  1  -1.248380  -0.471623
6    898  30.000000  0  0  7.6292  1  0  0  1  0  ...  0  1  0  0  1  0  0  0  -0.034018  -0.500221
7    899  26.000000  1  1  29.0000  1  0  0  0  1  ...  1  0  0  0  0  0  1  0  -0.337609  -0.117238
8    900  18.000000  0  0  7.2292  1  0  1  0  0  ...  0  1  0  0  1  0  0  0  -0.944790  -0.507390
9    901  21.000000  2  0  24.1500  1  0  0  0  1  ...  0  1  0  0  0  0  0  1  -0.717097  -0.204154
10   902  27.947206  0  0  7.8958  1  0  0  0  1  ...  0  1  0  0  0  0  0  1  -0.189820  -0.495444
11   903  46.000000  0  0  26.0000  1  0  0  0  1  ...  0  0  0  0  0  1  0  0  1.180344  -0.171000
12   904  23.000000  1  0  82.2667  0  1  0  0  1  ...  0  0  1  0  0  0  0  0  -0.565301  0.837349
13   905  63.000000  1  0  26.0000  1  0  0  0  1  ...  1  0  0  0  0  0  1  0  2.470603  -0.171000
14   906  47.000000  1  0  61.1750  0  1  0  0  1  ...  0  0  1  0  0  0  0  0  1.256241  0.459367
15   907  24.000000  1  0  27.7208  1  0  1  0  0  ...  1  0  0  1  0  0  0  0  -0.489404  -0.140162
16   908  35.000000  0  0  12.3500  1  0  0  1  0  ...  1  0  0  0  0  0  1  0  0.345470  -0.415620
17   909  21.000000  0  0  7.2250  1  0  1  0  0  ...  0  1  0  0  0  0  0  1  -0.717097  -0.507465
18   910  27.000000  1  0  7.9250  1  0  0  0  1  ...  0  1  0  0  1  0  0  0  -0.261711  -0.494920
19   911  45.000000  0  0  7.2250  1  0  1  0  0  ...  0  1  0  0  1  0  0  0  1.104446  -0.507465
20   912  55.000000  1  0  59.4000  1  0  1  0  0  ...  0  0  0  0  0  1  0  0  1.863422  0.427557
21   913  9.000000  0  1  3.1708  1  0  0  0  1  ...  0  1  0  0  0  0  0  1  -1.627868  -0.580120
22   914  52.314311  0  0  31.6833  1  0  0  0  1  ...  0  0  1  0  0  0  0  0  1.659585  -0.069151
23   915  21.000000  0  1  61.3792  1  0  1  0  0  ...  0  0  0  0  0  1  0  0  -0.717097  0.463026
24   916  48.000000  1  3  262.3750  0  1  1  0  0  ...  0  0  1  0  0  0  0  0  1.332139  4.065049
25   917  50.000000  1  0  14.5000  1  0  0  0  1  ...  0  1  0  0  0  0  0  1  1.483934  -0.377090
26   918  22.000000  0  1  61.9792  0  1  1  0  0  ...  0  0  1  0  0  0  0  0  -0.641199  0.473779
27   919  22.500000  0  0  7.2250  1  0  1  0  0  ...  0  1  0  0  0  0  0  1  -0.603250  -0.507465
28   920  41.000000  0  0  30.5000  0  1  0  0  1  ...  0  0  0  0  0  1  0  0  0.800856  -0.090356
29   921  23.459683  2  0  21.6792  1  0  1  0  0  ...  0  1  0  0  0  0  0  1  -0.530413  -0.248433
...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...
388  1280  21.000000  0  0  7.7500  1  0  0  1  0  ...  0  1  0  0  0  0  0  1  -0.717097  -0.498056
389  1281  6.000000  3  1  21.0750  1  0  0  0  1  ...  0  1  0  0  0  0  0  1  -1.855561  -0.259261
390  1282  23.000000  0  0  93.5000  0  1  0  0  1  ...  0  0  0  0  0  1  0  0  -0.565301  1.038659
391  1283  51.000000  0  1  39.4000  0  1  0  0  1  ...  0  0  1  0  0  0  0  0  1.559832  0.069140
392  1284  13.000000  0  2  20.2500  1  0  0  0  1  ...  0  1  0  0  0  0  0  1  -1.324278  -0.274045
393  1285  47.000000  0  0  10.5000  1  0  0  0  1  ...  1  0  0  0  0  0  1  0  1.256241  -0.448774
394  1286  29.000000  3  1  22.0250  1  0  0  0  1  ...  0  1  0  0  0  0  0  1  -0.109916  -0.242236
395  1287  18.000000  1  0  60.0000  0  1  0  0  1  ...  0  0  1  0  0  0  0  0  -0.944790  0.438310
396  1288  24.000000  0  0  7.2500  1  0  0  1  0  ...  0  1  0  0  0  0  0  1  -0.489404  -0.507017
397  1289  48.000000  1  1  79.2000  0  1  1  0  0  ...  0  0  1  0  0  0  0  0  1.332139  0.782391
398  1290  22.000000  0  0  7.7750  1  0  0  0  1  ...  0  1  0  0  0  0  0  1  -0.641199  -0.497608
399  1291  31.000000  0  0  7.7333  1  0  0  1  0  ...  0  1  0  0  0  0  0  1  0.041880  -0.498356
400  1292  30.000000  0  0  164.8667  0  1  0  0  1  ...  0  0  1  0  0  0  0  0  -0.034018  2.317614
401  1293  38.000000  1  0  21.0000  1  0  0  0  1  ...  1  0  0  0  0  0  1  0  0.573163  -0.260605
402  1294  22.000000  0  1  59.4000  1  0  1  0  0  ...  0  0  1  0  0  0  0  0  -0.641199  0.427557
403  1295  17.000000  0  0  47.1000  1  0  0  0  1  ...  0  0  0  0  0  1  0  0  -1.020687  0.207130
404  1296  43.000000  1  0  27.7208  0  1  1  0  0  ...  0  0  0  0  0  1  0  0  0.952651  -0.140162
405  1297  20.000000  0  0  13.8625  0  1  1  0  0  ...  1  0  0  0  0  0  1  0  -0.792994  -0.388515
406  1298  23.000000  1  0  10.5000  1  0  0  0  1  ...  1  0  0  0  0  0  1  0  -0.565301  -0.448774
407  1299  50.000000  1  1  211.5000  0  1  1  0  0  ...  0  0  0  0  0  1  0  0  1.483934  3.153324
408  1300  19.895581  0  0  7.7208  1  0  0  1  0  ...  0  1  0  0  1  0  0  0  -0.800919  -0.498580
409  1301  3.000000  1  1  13.7750  1  0  0  0  1  ...  0  1  0  0  1  0  0  0  -2.083254  -0.390083
410  1302  35.295824  0  0  7.7500  1  0  0  1  0  ...  0  1  0  0  1  0  0  0  0.367922  -0.498056
411  1303  37.000000  1  0  90.0000  0  1  0  1  0  ...  0  0  1  0  0  0  0  0  0.497265  0.975936
412  1304  28.000000  0  0  7.7750  1  0  0  0  1  ...  0  1  0  0  1  0  0  0  -0.185813  -0.497608
413  1305  30.705727  0  0  8.0500  1  0  0  0  1  ...  0  1  0  0  0  0  0  1  0.019545  -0.492680
414  1306  39.000000  0  0  108.9000  0  1  1  0  0  ...  0  0  1  0  0  0  0  0  0.649061  1.314641
415  1307  38.500000  0  0  7.2500  1  0  0  0  1  ...  0  1  0  0  0  0  0  1  0.611112  -0.507017
416  1308  30.705727  0  0  8.0500  1  0  0  0  1  ...  0  1  0  0  0  0  0  1  0.019545  -0.492680
417  1309  25.783377  1  1  22.3583  1  0  1  0  0  ...  0  1  0  0  0  0  0  1  -0.354050  -0.236263

418 rows × 23 columns

test = df_test.filter(regex='Age_.*|SibSp|Parch|Fare_.*|Cabin_.*|Embarked_.*|Sex_.*|Pclass.*')
predictions = clf.predict(test)
result = pd.DataFrame({'PassengerId':data_test['PassengerId'].values, 'Survived':predictions.astype(np.int32)})
result.to_csv("logistic_regression_predictions2.csv", index=False)

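In the later stages, the usual way to improve a model is model ensembling.

Let me first explain what model ensembling is, with a few examples for intuition.

You have all seen quiz shows where a contestant asks the studio audience for help: the audience votes, and the answer with the most votes becomes the contestant's answer. Everyone has their own judgment, and in the end we trust that the answer lies with the majority.

A more down-to-earth example. You are on good terms with the math whiz in your class and "imitate" his homework every time, so in the vast majority of cases, when he is right, you are right too. Then one day he has a muddled moment, his hand slips and he writes a wrong number, and... well, you can only be wrong along with him.

Now another scenario. You are on good terms with five math whizzes in your class. Each time you take all their homework, compare, and then "do it yourself". If one of them slips up one day but the other four get it right, you will surely trust the answer of the other four, right?

The simplest form of model ensembling is roughly this. For a classification problem, say, when we have a bunch of classifiers trained on the same dataset (for example logistic regression, SVM, KNN, random forest, neural network), we let each of them make its own decision, tally the votes, and take the result with the most votes as the final answer.

Bingo, problem solved.

Model ensembling can mitigate the overfitting that arises during training fairly well, which helps improve accuracy to some extent.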
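
The voting idea described above can be sketched with NumPy alone. This is a minimal illustration: the function name `majority_vote` and the five prediction rows are hypothetical, and a tie (possible with an even number of classifiers) is resolved to 0 here by assumption.

```python
import numpy as np

def majority_vote(predictions):
    """Hard voting for 0/1 labels: rows are classifiers, columns are samples."""
    predictions = np.asarray(predictions)
    votes_for_one = predictions.sum(axis=0)
    # A sample is labeled 1 when more than half of the classifiers say 1
    return (votes_for_one * 2 > predictions.shape[0]).astype(int)

# Hypothetical outputs of five classifiers on four passengers
preds = [
    [1, 0, 1, 0],
    [1, 0, 0, 0],
    [1, 1, 1, 0],
    [0, 0, 1, 0],
    [1, 0, 1, 1],
]
print(majority_vote(preds))
```

Each classifier is wrong on at least one passenger here, yet the combined vote follows the majority on every column, which is the "five math whizzes" scenario in code.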

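Back to our problem. So far we have only covered logistic regression. If we still want to use this ensembling idea to improve our results, what should we do?

Since we have no choice of model at this point, let's work on the data instead. Think about it: if the model overfits, it must be because it fit our training set too closely, right?

So let's simply not use the whole training set. Each time we take a subset of the training set and train on it. This way, although we use the same machine learning algorithm, the resulting models differ. And because no single subset is complete, even if overfitting occurs, it is overfitting to a subset rather than to the whole dataset, so combining the models may help the final result. Yes, this is the commonly used Bagging.

We use Bagging from scikit-learn to implement this idea. The process is very simple; the code is as follows:

Step 6: Model ensembling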

from sklearn.ensemble import BaggingRegressor

train_df = df.filter(regex='Survived|Age_.*|SibSp|Parch|Fare_.*|Cabin_.*|Embarked_.*|Sex_.*|Pclass.*|Mother|Child|Family|Title')
train_np = train_df.values

# y is the Survived outcome
y = train_np[:, 0]

# X is the feature values
X = train_np[:, 1:]

# Fit the BaggingRegressor
# Note: BaggingRegressor averages the 0/1 predictions of its estimators, and astype(np.int32) below
# truncates that average, so a passenger is marked 1 only when every estimator predicts 1
clf = linear_model.LogisticRegression(C=1.0, penalty='l1', solver='liblinear', tol=1e-6)
bagging_clf = BaggingRegressor(clf, n_estimators=10, max_samples=0.8, max_features=1.0, bootstrap=True, bootstrap_features=False, n_jobs=-1)
bagging_clf.fit(X, y)

test = df_test.filter(regex='Age_.*|SibSp|Parch|Fare_.*|Cabin_.*|Embarked_.*|Sex_.*|Pclass.*|Mother|Child|Family|Title')
predictions = bagging_clf.predict(test)
result = pd.DataFrame({'PassengerId':data_test['PassengerId'].values, 'Survived':predictions.astype(np.int32)})
result.to_csv("logistic_regression_predictions2.csv", index=False)
result
     PassengerId  Survived
0    892  0
1    893  0
2    894  0
3    895  0
4    896  0
5    897  0
6    898  1
7    899  0
8    900  1
9    901  0
10   902  0
11   903  0
12   904  1
13   905  0
14   906  1
15   907  1
16   908  0
17   909  0
18   910  0
19   911  0
20   912  0
21   913  0
22   914  1
23   915  0
24   916  1
25   917  0
26   918  1
27   919  0
28   920  0
29   921  0
...  ...  ...
388  1280  0
389  1281  0
390  1282  0
391  1283  1
392  1284  0
393  1285  0
394  1286  0
395  1287  1
396  1288  0
397  1289  1
398  1290  0
399  1291  0
400  1292  1
401  1293  0
402  1294  1
403  1295  0
404  1296  0
405  1297  0
406  1298  0
407  1299  0
408  1300  1
409  1301  1
410  1302  0
411  1303  1
412  1304  0
413  1305  0
414  1306  1
415  1307  0
416  1308  0
417  1309  0

418 rows × 2 columns
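
The post bags the logistic regression with BaggingRegressor. As an alternative sketch, and explicitly a swapped-in class rather than the post's method, scikit-learn's BaggingClassifier combines the bagged estimators by averaging their predicted class probabilities (or by voting when the base estimator has no predict_proba), so no integer cast is needed afterwards. The synthetic data from make_classification is only a stand-in for the Titanic features, and the names `X_demo` and `y_demo` are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the Titanic features, for illustration only
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

# Same base model settings as in the post, with the solver made explicit
base = LogisticRegression(C=1.0, penalty='l1', solver='liblinear', tol=1e-6)

# Each of the 10 estimators is trained on a bootstrap sample of 80% of the rows
bagged = BaggingClassifier(base, n_estimators=10, max_samples=0.8,
                           bootstrap=True, random_state=0)

# Compare the single model and the bagged ensemble with 5-fold cross validation
single_scores = cross_val_score(base, X_demo, y_demo, cv=5)
bagged_scores = cross_val_score(bagged, X_demo, y_demo, cv=5)
print("single model: %.3f" % single_scores.mean())
print("bagged model: %.3f" % bagged_scores.mean())
```

On a small, nearly linear problem the two scores are usually close; bagging tends to matter more when the base model is unstable.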

Below is code that tackles this problem with a different classifier:

import numpy as np
import pandas as pd
from patsy import dmatrices
import joblib  # shipped as sklearn.externals.joblib in old scikit-learn releases
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
# These helpers lived in sklearn.cross_validation and sklearn.grid_search before scikit-learn 0.18
from sklearn.model_selection import cross_val_score, GridSearchCV, train_test_split, StratifiedShuffleSplit
from sklearn import preprocessing
from sklearn.metrics import classification_report

##Read configuration parameters

train_file="train.csv"
MODEL_PATH="./"
test_file="test.csv"
SUBMISSION_PATH="./"
seed= 0

print(train_file, seed)

# Print the scores of the best parameter settings
def report(cv_results, n_top=3):
    top = np.argsort(cv_results['rank_test_score'])[:n_top]
    for i, idx in enumerate(top):
        print("Model with rank: {0}".format(i + 1))
        print("Mean validation score: {0:.3f} (std: {1:.3f})".format(
              cv_results['mean_test_score'][idx],
              cv_results['std_test_score'][idx]))
        print("Parameters: {0}".format(cv_results['params'][idx]))
        print("")

# Clean and process the data
def substrings_in_string(big_string, substrings):
    # Return the first entry of substrings found in big_string, or NaN if none matches
    for substring in substrings:
        if substring in big_string:
            return substring
    print(big_string)
    return np.nan

le = preprocessing.LabelEncoder()   
enc=preprocessing.OneHotEncoder()

def clean_and_munge_data(df):
    # Handle missing values: treat a fare of 0 as missing (filled per class further down)
    df.Fare = df.Fare.map(lambda x: np.nan if x==0 else x)
    # Process the name to create the Title field
    title_list=['Mrs', 'Mr', 'Master', 'Miss', 'Major', 'Rev',
                'Dr', 'Ms', 'Mlle','Col', 'Capt', 'Mme', 'Countess',
                'Don', 'Jonkheer']
    df['Title']=df['Name'].map(lambda x: substrings_in_string(x, title_list))

    # Map the special titles onto Mr, Mrs, Miss and Master
    def replace_titles(x):
        title=x['Title']
        if title in ['Mr','Don', 'Major', 'Capt', 'Jonkheer', 'Rev', 'Col']:
            return 'Mr'
        elif title in ['Master']:
            return 'Master'
        elif title in ['Countess', 'Mme','Mrs']:
            return 'Mrs'
        elif title in ['Mlle', 'Ms','Miss']:
            return 'Miss'
        elif title =='Dr':
            # The Sex column holds lowercase 'male'/'female'
            if x['Sex']=='male':
                return 'Mr'
            else:
                return 'Mrs'
        elif title =='':
            if x['Sex']=='male':
                return 'Master'
            else:
                return 'Miss'
        else:
            return title

    df['Title']=df.apply(replace_titles, axis=1)

    # See whether the family is large enough, ahem
    df['Family_Size']=df['SibSp']+df['Parch']
    df['Family']=df['SibSp']*df['Parch']


    # Fill missing fares with the median fare of the same passenger class
    df.loc[ (df.Fare.isnull())&(df.Pclass==1),'Fare'] =np.median(df[df['Pclass'] == 1]['Fare'].dropna())
    df.loc[ (df.Fare.isnull())&(df.Pclass==2),'Fare'] =np.median( df[df['Pclass'] == 2]['Fare'].dropna())
    df.loc[ (df.Fare.isnull())&(df.Pclass==3),'Fare'] = np.median(df[df['Pclass'] == 3]['Fare'].dropna())

    df['Gender'] = df['Sex'].map( {'female': 0, 'male': 1} ).astype(int)

    # Fill missing ages with the mean age of passengers sharing the same title
    df['AgeFill']=df['Age']
    mean_ages = np.zeros(4)
    mean_ages[0]=np.average(df[df['Title'] == 'Miss']['Age'].dropna())
    mean_ages[1]=np.average(df[df['Title'] == 'Mrs']['Age'].dropna())
    mean_ages[2]=np.average(df[df['Title'] == 'Mr']['Age'].dropna())
    mean_ages[3]=np.average(df[df['Title'] == 'Master']['Age'].dropna())
    df.loc[ (df.Age.isnull()) & (df.Title == 'Miss') ,'AgeFill'] = mean_ages[0]
    df.loc[ (df.Age.isnull()) & (df.Title == 'Mrs') ,'AgeFill'] = mean_ages[1]
    df.loc[ (df.Age.isnull()) & (df.Title == 'Mr') ,'AgeFill'] = mean_ages[2]
    df.loc[ (df.Age.isnull()) & (df.Title == 'Master') ,'AgeFill'] = mean_ages[3]

    # Discretize age into categories (object dtype so that string labels can be assigned)
    df['AgeCat']=df['AgeFill'].astype(object)
    df.loc[ (df.AgeFill<=10) ,'AgeCat'] = 'child'
    df.loc[ (df.AgeFill>60),'AgeCat'] = 'aged'
    df.loc[ (df.AgeFill>10) & (df.AgeFill <=30) ,'AgeCat'] = 'adult'
    df.loc[ (df.AgeFill>30) & (df.AgeFill <=60) ,'AgeCat'] = 'senior'

    df.Embarked = df.Embarked.fillna('S')


    # Compute the mask once; otherwise the second assignment would overwrite every row
    has_cabin = df.Cabin.notnull()
    df.loc[ ~has_cabin,'Cabin'] = 0.5
    df.loc[ has_cabin,'Cabin'] = 1.5

    df['Fare_Per_Person']=df['Fare']/(df['Family_Size']+1)

    #Age times class

    df['AgeClass']=df['AgeFill']*df['Pclass']
    df['ClassFare']=df['Pclass']*df['Fare_Per_Person']


    df['HighLow']=df['Pclass'].astype(object)
    df.loc[ (df.Fare_Per_Person<8) ,'HighLow'] = 'Low'
    df.loc[ (df.Fare_Per_Person>=8) ,'HighLow'] = 'High'



    # Label-encode the categorical columns
    le.fit(df['Sex'] )
    x_sex=le.transform(df['Sex'])
    df['Sex']=x_sex.astype(float)

    le.fit( df['Ticket'])
    x_Ticket=le.transform( df['Ticket'])
    df['Ticket']=x_Ticket.astype(float)

    le.fit(df['Title'])
    x_title=le.transform(df['Title'])
    df['Title'] =x_title.astype(float)

    le.fit(df['HighLow'])
    x_hl=le.transform(df['HighLow'])
    df['HighLow']=x_hl.astype(float)


    le.fit(df['AgeCat'])
    x_age=le.transform(df['AgeCat'])
    df['AgeCat'] =x_age.astype(float)

    le.fit(df['Embarked'])
    x_emb=le.transform(df['Embarked'])
    df['Embarked']=x_emb.astype(float)

    df = df.drop(['PassengerId','Name','Age','Cabin'], axis=1) #remove Name, Age, Cabin and PassengerId


    return df

# Read the data
traindf=pd.read_csv(train_file)
## Clean the data
df=clean_and_munge_data(traindf)
########################################formula################################
 
formula_ml='Survived~Pclass+C(Title)+Sex+C(AgeCat)+Fare_Per_Person+Fare+Family_Size' 

y_train, x_train = dmatrices(formula_ml, data=df, return_type='dataframe')
y_train = np.asarray(y_train).ravel()
print(y_train.shape, x_train.shape)

## Choose the training and test sets
X_train, X_test, Y_train, Y_test = train_test_split(x_train, y_train, test_size=0.2,random_state=seed)
# Initialize the classifier
# Current scikit-learn requires min_samples_split >= 2 and spells the old max_features='auto' as 'sqrt'
clf=RandomForestClassifier(n_estimators=500, criterion='entropy', max_depth=5, min_samples_split=2,
  min_samples_leaf=1, max_features='sqrt',    bootstrap=False, oob_score=False, n_jobs=1, random_state=seed,
  verbose=0)

### Grid search for the best parameters
param_grid = dict( )
## Create the classification pipeline
pipeline=Pipeline([ ('clf',clf) ])
grid_search = GridSearchCV(pipeline, param_grid=param_grid, verbose=3, scoring='accuracy',
    cv=StratifiedShuffleSplit(n_splits=10, test_size=0.2, random_state=seed)).fit(X_train, Y_train)
# Score the results
print("Best score: %0.3f" % grid_search.best_score_)
print(grid_search.best_estimator_)
report(grid_search.cv_results_)
 
print('-----grid search end------------')
print('on all train set')
scores = cross_val_score(grid_search.best_estimator_, x_train, y_train,cv=3,scoring='accuracy')
print(scores.mean(), scores)
print('on test set')
scores = cross_val_score(grid_search.best_estimator_, X_test, Y_test,cv=3,scoring='accuracy')
print(scores.mean(), scores)

# Score the results

print(classification_report(Y_train, grid_search.best_estimator_.predict(X_train) ))
print('test data')
print(classification_report(Y_test, grid_search.best_estimator_.predict(X_test) ))

model_file=MODEL_PATH+'model-rf.pkl'
joblib.dump(grid_search.best_estimator_, model_file)
/Users/MLS/Downloads/train.csv 0
(891,) (891, 12)
Fitting 10 folds for each of 1 candidates, totalling 10 fits
[CV]  ................................................................
[CV] ....................................... , score=0.860140 -   0.4s
[CV]  ................................................................
[CV] ....................................... , score=0.832168 -   0.4s
[CV]  ................................................................
[CV] ....................................... , score=0.818182 -   0.4s
[CV]  ................................................................
[CV] ....................................... , score=0.839161 -   0.4s
[CV]  ................................................................
[CV] ....................................... , score=0.811189 -   0.5s
[CV]  ................................................................
[CV] ....................................... , score=0.874126 -   0.4s
[CV]  ................................................................
[CV] ....................................... , score=0.811189 -   0.4s
[CV]  ................................................................
[CV] ....................................... , score=0.783217 -   0.4s
[CV]  ................................................................
[CV] ....................................... , score=0.825175 -   0.4s
[CV]  ................................................................
[CV] ....................................... , score=0.839161 -   0.4s

[Parallel(n_jobs=1)]: Done   1 jobs       | elapsed:    0.4s
[Parallel(n_jobs=1)]: Done  10 out of  10 | elapsed:    4.1s finished



Best score: 0.829
Pipeline(steps=[('clf', RandomForestClassifier(bootstrap=False, class_weight=None,
            criterion='entropy', max_depth=5, max_features='auto',
            max_leaf_nodes=None, min_samples_leaf=1, min_samples_split=1,
            min_weight_fraction_leaf=0.0, n_estimators=500, n_jobs=1,
            oob_score=False, random_state=0, verbose=0, warm_start=False))])
Model with rank: 1
Mean validation score: 0.829 (std: 0.025)
Parameters: {}

-----grid search end------------
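As a sanity check on the log above, the ten per-fold scores can be averaged by hand. The sketch below copies the scores from the `[CV]` lines and uses only the standard library; the variable names are illustrative and do not come from the original notebook. The result suggests that the reported `std` is the population standard deviation (dividing by n), which is what `np.std` computes by default.

```python
from statistics import mean, pstdev

# Per-fold accuracy copied from the [CV] log lines above (10 fits).
fold_scores = [
    0.860140, 0.832168, 0.818182, 0.839161, 0.811189,
    0.874126, 0.811189, 0.783217, 0.825175, 0.839161,
]

cv_mean = mean(fold_scores)
cv_std = pstdev(fold_scores)  # population std: divide by n, not n - 1

# Same layout as the "Mean validation score" line in the log.
print("Mean validation score: %.3f (std: %.3f)" % (cv_mean, cv_std))
```

With only one parameter combination in the grid (`Parameters: {}`), the best score and the mean validation score are the same number, so this one average accounts for both lines.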
on all train set
0.826038159371 [ 0.81144781  0.83501684  0.83164983]
on test set
0.782203389831 [ 0.76666667  0.78333333  0.79661017]
             precision    recall  f1-score   support

        0.0       0.86      0.90      0.88       439
        1.0       0.83      0.75      0.79       273

avg / total       0.85      0.85      0.85       712

test data
             precision    recall  f1-score   support

        0.0       0.86      0.87      0.86       110
        1.0       0.79      0.77      0.78        69

avg / total       0.83      0.83      0.83       179
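The `avg / total` row in these reports appears to be a support-weighted average of the per-class rows. Here is a minimal sketch using the rounded test-set values printed above, so the result is only approximate; `weighted_avg` is a hypothetical helper, not a scikit-learn function.

```python
# Per-class rows from the "test data" report:
# label -> (precision, recall, f1-score, support)
rows = {
    0.0: (0.86, 0.87, 0.86, 110),
    1.0: (0.79, 0.77, 0.78, 69),
}


def weighted_avg(rows, col):
    """Average column `col` over classes, weighted by each class's support."""
    total = sum(r[3] for r in rows.values())
    return sum(r[col] * r[3] for r in rows.values()) / total


avg_row = [round(weighted_avg(rows, c), 2) for c in range(3)]
total_support = sum(r[3] for r in rows.values())
print(avg_row, total_support)
```

Because the inputs are already rounded to two decimals, repeating this on the training-set report can differ from the printed row in the last digit.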

['/Users/MLS/Downloads/model-rf.pkl',
 '/Users/MLS/Downloads/model-rf.pkl_01.npy',
 '/Users/MLS/Downloads/model-rf.pkl_02.npy',
 '/Users/MLS/Downloads/model-rf.pkl_03.npy',
 '/Users/MLS/Downloads/model-rf.pkl_04.npy',
 '/Users/MLS/Downloads/model-rf.pkl_05.npy',
 '/Users/MLS/Downloads/model-rf.pkl_06.npy',
 '/Users/MLS/Downloads/model-rf.pkl_07.npy',
 '/Users/MLS/Downloads/model-rf.pkl_08.npy',
 '/Users/MLS/Downloads/model-rf.pkl_09.npy',
 '/Users/MLS/Downloads/model-rf.pkl_10.npy',
 '/Users/MLS/Downloads/model-rf.pkl_11.npy',
 '/Users/MLS/Downloads/model-rf.pkl_12.npy',
 '/Users/MLS/Downloads/model-rf.pkl_13.npy',
 '/Users/MLS/Downloads/model-rf.pkl_14.npy',
 '/Users/MLS/Downloads/model-rf.pkl_15.npy',
 '/Users/MLS/Downloads/model-rf.pkl_16.npy',
 '/Users/MLS/Downloads/model-rf.pkl_17.npy',
 '/Users/MLS/Downloads/model-rf.pkl_18.npy',
 '/Users/MLS/Downloads/model-rf.pkl_19.npy',
 '/Users/MLS/Downloads/model-rf.pkl_20.npy',
 '/Users/MLS/Downloads/model-rf.pkl_21.npy',
 '/Users/MLS/Downloads/model-rf.pkl_22.npy',
 '/Users/MLS/Downloads/model-rf.pkl_23.npy',
 '/Users/MLS/Downloads/model-rf.pkl_24.npy',
 '/Users/MLS/Downloads/model-rf.pkl_25.npy',
 '/Users/MLS/Downloads/model-rf.pkl_26.npy',
 '/Users/MLS/Downloads/model-rf.pkl_27.npy',
 '/Users/MLS/Downloads/model-rf.pkl_28.npy',
 '/Users/MLS/Downloads/model-rf.pkl_29.npy',
 '/Users/MLS/Downloads/model-rf.pkl_30.npy',
 '/Users/MLS/Downloads/model-rf.pkl_31.npy',
 '/Users/MLS/Downloads/model-rf.pkl_32.npy',
 '/Users/MLS/Downloads/model-rf.pkl_33.npy',
 '/Users/MLS/Downloads/model-rf.pkl_34.npy',
 '/Users/MLS/Downloads/model-rf.pkl_35.npy',
 '/Users/MLS/Downloads/model-rf.pkl_36.npy',
 '/Users/MLS/Downloads/model-rf.pkl_37.npy',
 '/Users/MLS/Downloads/model-rf.pkl_38.npy',
 '/Users/MLS/Downloads/model-rf.pkl_39.npy',
 '/Users/MLS/Downloads/model-rf.pkl_40.npy',
 '/Users/MLS/Downloads/model-rf.pkl_41.npy',
 '/Users/MLS/Downloads/model-rf.pkl_42.npy',
 '/Users/MLS/Downloads/model-rf.pkl_43.npy',
 '/Users/MLS/Downloads/model-rf.pkl_44.npy',
 '/Users/MLS/Downloads/model-rf.pkl_45.npy',
 '/Users/MLS/Downloads/model-rf.pkl_46.npy',
 '/Users/MLS/Downloads/model-rf.pkl_47.npy',
 '/Users/MLS/Downloads/model-rf.pkl_48.npy',
 '/Users/MLS/Downloads/model-rf.pkl_49.npy',
 '/Users/MLS/Downloads/model-rf.pkl_50.npy',
 '/Users/MLS/Downloads/model-rf.pkl_51.npy',
 '/Users/MLS/Downloads/model-rf.pkl_52.npy',
 '/Users/MLS/Downloads/model-rf.pkl_53.npy',
 '/Users/MLS/Downloads/model-rf.pkl_54.npy',
 '/Users/MLS/Downloads/model-rf.pkl_55.npy',
 '/Users/MLS/Downloads/model-rf.pkl_56.npy',
 '/Users/MLS/Downloads/model-rf.pkl_57.npy',
 '/Users/MLS/Downloads/model-rf.pkl_58.npy',
 '/Users/MLS/Downloads/model-rf.pkl_59.npy',
 '/Users/MLS/Downloads/model-rf.pkl_60.npy',
 '/Users/MLS/Downloads/model-rf.pkl_61.npy',
 '/Users/MLS/Downloads/model-rf.pkl_62.npy',
 '/Users/MLS/Downloads/model-rf.pkl_63.npy',
 '/Users/MLS/Downloads/model-rf.pkl_64.npy',
 '/Users/MLS/Downloads/model-rf.pkl_65.npy',
 '/Users/MLS/Downloads/model-rf.pkl_66.npy',
 '/Users/MLS/Downloads/model-rf.pkl_67.npy',
 '/Users/MLS/Downloads/model-rf.pkl_68.npy',
 '/Users/MLS/Downloads/model-rf.pkl_69.npy',
 '/Users/MLS/Downloads/model-rf.pkl_70.npy',
 '/Users/MLS/Downloads/model-rf.pkl_71.npy',
 '/Users/MLS/Downloads/model-rf.pkl_72.npy',
 '/Users/MLS/Downloads/model-rf.pkl_73.npy',
 '/Users/MLS/Downloads/model-rf.pkl_74.npy',
 '/Users/MLS/Downloads/model-rf.pkl_75.npy',
 '/Users/MLS/Downloads/model-rf.pkl_76.npy',
 '/Users/MLS/Downloads/model-rf.pkl_77.npy',
 '/Users/MLS/Downloads/model-rf.pkl_78.npy',
 '/Users/MLS/Downloads/model-rf.pkl_79.npy',
 '/Users/MLS/Downloads/model-rf.pkl_80.npy',
 '/Users/MLS/Downloads/model-rf.pkl_81.npy',
 '/Users/MLS/Downloads/model-rf.pkl_82.npy',
 '/Users/MLS/Downloads/model-rf.pkl_83.npy',
 '/Users/MLS/Downloads/model-rf.pkl_84.npy',
 '/Users/MLS/Downloads/model-rf.pkl_85.npy',
 '/Users/MLS/Downloads/model-rf.pkl_86.npy',
 '/Users/MLS/Downloads/model-rf.pkl_87.npy',
 '/Users/MLS/Downloads/model-rf.pkl_88.npy',
 '/Users/MLS/Downloads/model-rf.pkl_89.npy',
 '/Users/MLS/Downloads/model-rf.pkl_90.npy',
 '/Users/MLS/Downloads/model-rf.pkl_91.npy',
 '/Users/MLS/Downloads/model-rf.pkl_92.npy',
 '/Users/MLS/Downloads/model-rf.pkl_93.npy',
 '/Users/MLS/Downloads/model-rf.pkl_94.npy',
 '/Users/MLS/Downloads/model-rf.pkl_95.npy',
 '/Users/MLS/Downloads/model-rf.pkl_96.npy',
 '/Users/MLS/Downloads/model-rf.pkl_97.npy',
 '/Users/MLS/Downloads/model-rf.pkl_98.npy',
 '/Users/MLS/Downloads/model-rf.pkl_99.npy',
 '/Users/MLS/Downloads/model-rf.pkl_100.npy',
 '/Users/MLS/Downloads/model-rf.pkl_101.npy',
 '/Users/MLS/Downloads/model-rf.pkl_102.npy',
 '/Users/MLS/Downloads/model-rf.pkl_103.npy',
 '/Users/MLS/Downloads/model-rf.pkl_104.npy',
 '/Users/MLS/Downloads/model-rf.pkl_105.npy',
 '/Users/MLS/Downloads/model-rf.pkl_106.npy',
 '/Users/MLS/Downloads/model-rf.pkl_107.npy',
 '/Users/MLS/Downloads/model-rf.pkl_108.npy',
 '/Users/MLS/Downloads/model-rf.pkl_109.npy',
 '/Users/MLS/Downloads/model-rf.pkl_110.npy',
 '/Users/MLS/Downloads/model-rf.pkl_111.npy',
 '/Users/MLS/Downloads/model-rf.pkl_112.npy',
 '/Users/MLS/Downloads/model-rf.pkl_113.npy',
 '/Users/MLS/Downloads/model-rf.pkl_114.npy',
 '/Users/MLS/Downloads/model-rf.pkl_115.npy',
 '/Users/MLS/Downloads/model-rf.pkl_116.npy',
 '/Users/MLS/Downloads/model-rf.pkl_117.npy',
 '/Users/MLS/Downloads/model-rf.pkl_118.npy',
 '/Users/MLS/Downloads/model-rf.pkl_119.npy',
 '/Users/MLS/Downloads/model-rf.pkl_120.npy',
 '/Users/MLS/Downloads/model-rf.pkl_121.npy',
 '/Users/MLS/Downloads/model-rf.pkl_122.npy',
 '/Users/MLS/Downloads/model-rf.pkl_123.npy',
 '/Users/MLS/Downloads/model-rf.pkl_124.npy',
 '/Users/MLS/Downloads/model-rf.pkl_125.npy',
 '/Users/MLS/Downloads/model-rf.pkl_126.npy',
 '/Users/MLS/Downloads/model-rf.pkl_127.npy',
 '/Users/MLS/Downloads/model-rf.pkl_128.npy',
 '/Users/MLS/Downloads/model-rf.pkl_129.npy',
 '/Users/MLS/Downloads/model-rf.pkl_130.npy',
 '/Users/MLS/Downloads/model-rf.pkl_131.npy',
 '/Users/MLS/Downloads/model-rf.pkl_132.npy',
 '/Users/MLS/Downloads/model-rf.pkl_133.npy',
 '/Users/MLS/Downloads/model-rf.pkl_134.npy',
 '/Users/MLS/Downloads/model-rf.pkl_135.npy',
 '/Users/MLS/Downloads/model-rf.pkl_136.npy',
 '/Users/MLS/Downloads/model-rf.pkl_137.npy',
 '/Users/MLS/Downloads/model-rf.pkl_138.npy',
 '/Users/MLS/Downloads/model-rf.pkl_139.npy',
 '/Users/MLS/Downloads/model-rf.pkl_140.npy',
 '/Users/MLS/Downloads/model-rf.pkl_141.npy',
 '/Users/MLS/Downloads/model-rf.pkl_142.npy',
 '/Users/MLS/Downloads/model-rf.pkl_143.npy',
 '/Users/MLS/Downloads/model-rf.pkl_144.npy',
 '/Users/MLS/Downloads/model-rf.pkl_145.npy',
 '/Users/MLS/Downloads/model-rf.pkl_146.npy',
 '/Users/MLS/Downloads/model-rf.pkl_147.npy',
 '/Users/MLS/Downloads/model-rf.pkl_148.npy',
 '/Users/MLS/Downloads/model-rf.pkl_149.npy',
 '/Users/MLS/Downloads/model-rf.pkl_150.npy',
 '/Users/MLS/Downloads/model-rf.pkl_151.npy',
 '/Users/MLS/Downloads/model-rf.pkl_152.npy',
 '/Users/MLS/Downloads/model-rf.pkl_153.npy',
 '/Users/MLS/Downloads/model-rf.pkl_154.npy',
 '/Users/MLS/Downloads/model-rf.pkl_155.npy',
 '/Users/MLS/Downloads/model-rf.pkl_156.npy',
 '/Users/MLS/Downloads/model-rf.pkl_157.npy',
 '/Users/MLS/Downloads/model-rf.pkl_158.npy',
 '/Users/MLS/Downloads/model-rf.pkl_159.npy',
 '/Users/MLS/Downloads/model-rf.pkl_160.npy',
 '/Users/MLS/Downloads/model-rf.pkl_161.npy',
 '/Users/MLS/Downloads/model-rf.pkl_162.npy',
 '/Users/MLS/Downloads/model-rf.pkl_163.npy',
 '/Users/MLS/Downloads/model-rf.pkl_164.npy',
 '/Users/MLS/Downloads/model-rf.pkl_165.npy',
 '/Users/MLS/Downloads/model-rf.pkl_166.npy',
 '/Users/MLS/Downloads/model-rf.pkl_167.npy',
 '/Users/MLS/Downloads/model-rf.pkl_168.npy',
 '/Users/MLS/Downloads/model-rf.pkl_169.npy',
 '/Users/MLS/Downloads/model-rf.pkl_170.npy',
 '/Users/MLS/Downloads/model-rf.pkl_171.npy',
 '/Users/MLS/Downloads/model-rf.pkl_172.npy',
 '/Users/MLS/Downloads/model-rf.pkl_173.npy',
 '/Users/MLS/Downloads/model-rf.pkl_174.npy',
 '/Users/MLS/Downloads/model-rf.pkl_175.npy',
 '/Users/MLS/Downloads/model-rf.pkl_176.npy',
 '/Users/MLS/Downloads/model-rf.pkl_177.npy',
 '/Users/MLS/Downloads/model-rf.pkl_178.npy',
 '/Users/MLS/Downloads/model-rf.pkl_179.npy',
 '/Users/MLS/Downloads/model-rf.pkl_180.npy',
 '/Users/MLS/Downloads/model-rf.pkl_181.npy',
 '/Users/MLS/Downloads/model-rf.pkl_182.npy',
 '/Users/MLS/Downloads/model-rf.pkl_183.npy',
 '/Users/MLS/Downloads/model-rf.pkl_184.npy',
 '/Users/MLS/Downloads/model-rf.pkl_185.npy',
 '/Users/MLS/Downloads/model-rf.pkl_186.npy',
 '/Users/MLS/Downloads/model-rf.pkl_187.npy',
 '/Users/MLS/Downloads/model-rf.pkl_188.npy',
 '/Users/MLS/Downloads/model-rf.pkl_189.npy',
 '/Users/MLS/Downloads/model-rf.pkl_190.npy',
 '/Users/MLS/Downloads/model-rf.pkl_191.npy',
 '/Users/MLS/Downloads/model-rf.pkl_192.npy',
 '/Users/MLS/Downloads/model-rf.pkl_193.npy',
 '/Users/MLS/Downloads/model-rf.pkl_194.npy',
 '/Users/MLS/Downloads/model-rf.pkl_195.npy',
 '/Users/MLS/Downloads/model-rf.pkl_196.npy',
 '/Users/MLS/Downloads/model-rf.pkl_197.npy',
 '/Users/MLS/Downloads/model-rf.pkl_198.npy',
 '/Users/MLS/Downloads/model-rf.pkl_199.npy',
 '/Users/MLS/Downloads/model-rf.pkl_200.npy',
 '/Users/MLS/Downloads/model-rf.pkl_201.npy',
 '/Users/MLS/Downloads/model-rf.pkl_202.npy',
 '/Users/MLS/Downloads/model-rf.pkl_203.npy',
 '/Users/MLS/Downloads/model-rf.pkl_204.npy',
 '/Users/MLS/Downloads/model-rf.pkl_205.npy',
 '/Users/MLS/Downloads/model-rf.pkl_206.npy',
 '/Users/MLS/Downloads/model-rf.pkl_207.npy',
 '/Users/MLS/Downloads/model-rf.pkl_208.npy',
 '/Users/MLS/Downloads/model-rf.pkl_209.npy',
 '/Users/MLS/Downloads/model-rf.pkl_210.npy',
 '/Users/MLS/Downloads/model-rf.pkl_211.npy',
 '/Users/MLS/Downloads/model-rf.pkl_212.npy',
 '/Users/MLS/Downloads/model-rf.pkl_213.npy',
 '/Users/MLS/Downloads/model-rf.pkl_214.npy',
 '/Users/MLS/Downloads/model-rf.pkl_215.npy',
 '/Users/MLS/Downloads/model-rf.pkl_216.npy',
 '/Users/MLS/Downloads/model-rf.pkl_217.npy',
 '/Users/MLS/Downloads/model-rf.pkl_218.npy',
 '/Users/MLS/Downloads/model-rf.pkl_219.npy',
 '/Users/MLS/Downloads/model-rf.pkl_220.npy',
 '/Users/MLS/Downloads/model-rf.pkl_221.npy',
 '/Users/MLS/Downloads/model-rf.pkl_222.npy',
 '/Users/MLS/Downloads/model-rf.pkl_223.npy',
 '/Users/MLS/Downloads/model-rf.pkl_224.npy',
 '/Users/MLS/Downloads/model-rf.pkl_225.npy',
 '/Users/MLS/Downloads/model-rf.pkl_226.npy',
 '/Users/MLS/Downloads/model-rf.pkl_227.npy',
 '/Users/MLS/Downloads/model-rf.pkl_228.npy',
 '/Users/MLS/Downloads/model-rf.pkl_229.npy',
 '/Users/MLS/Downloads/model-rf.pkl_230.npy',
 '/Users/MLS/Downloads/model-rf.pkl_231.npy',
 '/Users/MLS/Downloads/model-rf.pkl_232.npy',
 '/Users/MLS/Downloads/model-rf.pkl_233.npy',
 '/Users/MLS/Downloads/model-rf.pkl_234.npy',
 '/Users/MLS/Downloads/model-rf.pkl_235.npy',
 '/Users/MLS/Downloads/model-rf.pkl_236.npy',
 '/Users/MLS/Downloads/model-rf.pkl_237.npy',
 '/Users/MLS/Downloads/model-rf.pkl_238.npy',
 '/Users/MLS/Downloads/model-rf.pkl_239.npy',
 '/Users/MLS/Downloads/model-rf.pkl_240.npy',
 '/Users/MLS/Downloads/model-rf.pkl_241.npy',
 '/Users/MLS/Downloads/model-rf.pkl_242.npy',
 '/Users/MLS/Downloads/model-rf.pkl_243.npy',
 '/Users/MLS/Downloads/model-rf.pkl_244.npy',
 '/Users/MLS/Downloads/model-rf.pkl_245.npy',
 '/Users/MLS/Downloads/model-rf.pkl_246.npy',
 '/Users/MLS/Downloads/model-rf.pkl_247.npy',
 '/Users/MLS/Downloads/model-rf.pkl_248.npy',
 '/Users/MLS/Downloads/model-rf.pkl_249.npy',
 '/Users/MLS/Downloads/model-rf.pkl_250.npy',
 '/Users/MLS/Downloads/model-rf.pkl_251.npy',
 '/Users/MLS/Downloads/model-rf.pkl_252.npy',
 '/Users/MLS/Downloads/model-rf.pkl_253.npy',
 '/Users/MLS/Downloads/model-rf.pkl_254.npy',
 '/Users/MLS/Downloads/model-rf.pkl_255.npy',
 '/Users/MLS/Downloads/model-rf.pkl_256.npy',
 '/Users/MLS/Downloads/model-rf.pkl_257.npy',
 '/Users/MLS/Downloads/model-rf.pkl_258.npy',
 '/Users/MLS/Downloads/model-rf.pkl_259.npy',
 '/Users/MLS/Downloads/model-rf.pkl_260.npy',
 '/Users/MLS/Downloads/model-rf.pkl_261.npy',
 '/Users/MLS/Downloads/model-rf.pkl_262.npy',
 '/Users/MLS/Downloads/model-rf.pkl_263.npy',
 '/Users/MLS/Downloads/model-rf.pkl_264.npy',
 '/Users/MLS/Downloads/model-rf.pkl_265.npy',
 '/Users/MLS/Downloads/model-rf.pkl_266.npy',
 '/Users/MLS/Downloads/model-rf.pkl_267.npy',
 '/Users/MLS/Downloads/model-rf.pkl_268.npy',
 '/Users/MLS/Downloads/model-rf.pkl_269.npy',
 '/Users/MLS/Downloads/model-rf.pkl_270.npy',
 '/Users/MLS/Downloads/model-rf.pkl_271.npy',
 '/Users/MLS/Downloads/model-rf.pkl_272.npy',
 '/Users/MLS/Downloads/model-rf.pkl_273.npy',
 '/Users/MLS/Downloads/model-rf.pkl_274.npy',
 '/Users/MLS/Downloads/model-rf.pkl_275.npy',
 '/Users/MLS/Downloads/model-rf.pkl_276.npy',
 '/Users/MLS/Downloads/model-rf.pkl_277.npy',
 '/Users/MLS/Downloads/model-rf.pkl_278.npy',
 '/Users/MLS/Downloads/model-rf.pkl_279.npy',
 '/Users/MLS/Downloads/model-rf.pkl_280.npy',
 '/Users/MLS/Downloads/model-rf.pkl_281.npy',
 '/Users/MLS/Downloads/model-rf.pkl_282.npy',
 '/Users/MLS/Downloads/model-rf.pkl_283.npy',
 '/Users/MLS/Downloads/model-rf.pkl_284.npy',
 '/Users/MLS/Downloads/model-rf.pkl_285.npy',
 '/Users/MLS/Downloads/model-rf.pkl_286.npy',
 '/Users/MLS/Downloads/model-rf.pkl_287.npy',
 '/Users/MLS/Downloads/model-rf.pkl_288.npy',
 '/Users/MLS/Downloads/model-rf.pkl_289.npy',
 '/Users/MLS/Downloads/model-rf.pkl_290.npy',
 '/Users/MLS/Downloads/model-rf.pkl_291.npy',
 '/Users/MLS/Downloads/model-rf.pkl_292.npy',
 '/Users/MLS/Downloads/model-rf.pkl_293.npy',
 '/Users/MLS/Downloads/model-rf.pkl_294.npy',
 '/Users/MLS/Downloads/model-rf.pkl_295.npy',
 '/Users/MLS/Downloads/model-rf.pkl_296.npy',
 '/Users/MLS/Downloads/model-rf.pkl_297.npy',
 '/Users/MLS/Downloads/model-rf.pkl_298.npy',
 '/Users/MLS/Downloads/model-rf.pkl_299.npy',
 '/Users/MLS/Downloads/model-rf.pkl_300.npy',
 '/Users/MLS/Downloads/model-rf.pkl_301.npy',
 '/Users/MLS/Downloads/model-rf.pkl_302.npy',
 '/Users/MLS/Downloads/model-rf.pkl_303.npy',
 '/Users/MLS/Downloads/model-rf.pkl_304.npy',
 '/Users/MLS/Downloads/model-rf.pkl_305.npy',
 '/Users/MLS/Downloads/model-rf.pkl_306.npy',
 '/Users/MLS/Downloads/model-rf.pkl_307.npy',
 '/Users/MLS/Downloads/model-rf.pkl_308.npy',
 '/Users/MLS/Downloads/model-rf.pkl_309.npy',
 '/Users/MLS/Downloads/model-rf.pkl_310.npy',
 '/Users/MLS/Downloads/model-rf.pkl_311.npy',
 '/Users/MLS/Downloads/model-rf.pkl_312.npy',
 '/Users/MLS/Downloads/model-rf.pkl_313.npy',
 '/Users/MLS/Downloads/model-rf.pkl_314.npy',
 '/Users/MLS/Downloads/model-rf.pkl_315.npy',
 '/Users/MLS/Downloads/model-rf.pkl_316.npy',
 '/Users/MLS/Downloads/model-rf.pkl_317.npy',
 '/Users/MLS/Downloads/model-rf.pkl_318.npy',
 '/Users/MLS/Downloads/model-rf.pkl_319.npy',
 '/Users/MLS/Downloads/model-rf.pkl_320.npy',
 '/Users/MLS/Downloads/model-rf.pkl_321.npy',
 '/Users/MLS/Downloads/model-rf.pkl_322.npy',
 '/Users/MLS/Downloads/model-rf.pkl_323.npy',
 '/Users/MLS/Downloads/model-rf.pkl_324.npy',
 '/Users/MLS/Downloads/model-rf.pkl_325.npy',
 '/Users/MLS/Downloads/model-rf.pkl_326.npy',
 '/Users/MLS/Downloads/model-rf.pkl_327.npy',
 '/Users/MLS/Downloads/model-rf.pkl_328.npy',
 '/Users/MLS/Downloads/model-rf.pkl_329.npy',
 '/Users/MLS/Downloads/model-rf.pkl_330.npy',
 '/Users/MLS/Downloads/model-rf.pkl_331.npy',
 '/Users/MLS/Downloads/model-rf.pkl_332.npy',
 '/Users/MLS/Downloads/model-rf.pkl_333.npy',
 '/Users/MLS/Downloads/model-rf.pkl_334.npy',
 '/Users/MLS/Downloads/model-rf.pkl_335.npy',
 '/Users/MLS/Downloads/model-rf.pkl_336.npy',
 '/Users/MLS/Downloads/model-rf.pkl_337.npy',
 '/Users/MLS/Downloads/model-rf.pkl_338.npy',
 '/Users/MLS/Downloads/model-rf.pkl_339.npy',
 '/Users/MLS/Downloads/model-rf.pkl_340.npy',
 '/Users/MLS/Downloads/model-rf.pkl_341.npy',
 '/Users/MLS/Downloads/model-rf.pkl_342.npy',
 '/Users/MLS/Downloads/model-rf.pkl_343.npy',
 '/Users/MLS/Downloads/model-rf.pkl_344.npy',
 '/Users/MLS/Downloads/model-rf.pkl_345.npy',
 '/Users/MLS/Downloads/model-rf.pkl_346.npy',
 '/Users/MLS/Downloads/model-rf.pkl_347.npy',
 '/Users/MLS/Downloads/model-rf.pkl_348.npy',
 '/Users/MLS/Downloads/model-rf.pkl_349.npy',
 '/Users/MLS/Downloads/model-rf.pkl_350.npy',
 '/Users/MLS/Downloads/model-rf.pkl_351.npy',
 '/Users/MLS/Downloads/model-rf.pkl_352.npy',
 '/Users/MLS/Downloads/model-rf.pkl_353.npy',
 '/Users/MLS/Downloads/model-rf.pkl_354.npy',
 '/Users/MLS/Downloads/model-rf.pkl_355.npy',
 '/Users/MLS/Downloads/model-rf.pkl_356.npy',
 '/Users/MLS/Downloads/model-rf.pkl_357.npy',
 '/Users/MLS/Downloads/model-rf.pkl_358.npy',
 '/Users/MLS/Downloads/model-rf.pkl_359.npy',
 '/Users/MLS/Downloads/model-rf.pkl_360.npy',
 '/Users/MLS/Downloads/model-rf.pkl_361.npy',
 '/Users/MLS/Downloads/model-rf.pkl_362.npy',
 '/Users/MLS/Downloads/model-rf.pkl_363.npy',
 '/Users/MLS/Downloads/model-rf.pkl_364.npy',
 '/Users/MLS/Downloads/model-rf.pkl_365.npy',
 '/Users/MLS/Downloads/model-rf.pkl_366.npy',
 '/Users/MLS/Downloads/model-rf.pkl_367.npy',
 '/Users/MLS/Downloads/model-rf.pkl_368.npy',
 '/Users/MLS/Downloads/model-rf.pkl_369.npy',
 '/Users/MLS/Downloads/model-rf.pkl_370.npy',
 '/Users/MLS/Downloads/model-rf.pkl_371.npy',
 '/Users/MLS/Downloads/model-rf.pkl_372.npy',
 '/Users/MLS/Downloads/model-rf.pkl_373.npy',
 '/Users/MLS/Downloads/model-rf.pkl_374.npy',
 '/Users/MLS/Downloads/model-rf.pkl_375.npy',
 '/Users/MLS/Downloads/model-rf.pkl_376.npy',
 '/Users/MLS/Downloads/model-rf.pkl_377.npy',
 '/Users/MLS/Downloads/model-rf.pkl_378.npy',
 '/Users/MLS/Downloads/model-rf.pkl_379.npy',
 '/Users/MLS/Downloads/model-rf.pkl_380.npy',
 '/Users/MLS/Downloads/model-rf.pkl_381.npy',
 '/Users/MLS/Downloads/model-rf.pkl_382.npy',
 '/Users/MLS/Downloads/model-rf.pkl_383.npy',
 '/Users/MLS/Downloads/model-rf.pkl_384.npy',
 '/Users/MLS/Downloads/model-rf.pkl_385.npy',
 '/Users/MLS/Downloads/model-rf.pkl_386.npy',
 '/Users/MLS/Downloads/model-rf.pkl_387.npy',
 '/Users/MLS/Downloads/model-rf.pkl_388.npy',
 '/Users/MLS/Downloads/model-rf.pkl_389.npy',
 '/Users/MLS/Downloads/model-rf.pkl_390.npy',
 '/Users/MLS/Downloads/model-rf.pkl_391.npy',
 '/Users/MLS/Downloads/model-rf.pkl_392.npy',
 '/Users/MLS/Downloads/model-rf.pkl_393.npy',
 '/Users/MLS/Downloads/model-rf.pkl_394.npy',
 '/Users/MLS/Downloads/model-rf.pkl_395.npy',
 '/Users/MLS/Downloads/model-rf.pkl_396.npy',
 '/Users/MLS/Downloads/model-rf.pkl_397.npy',
 '/Users/MLS/Downloads/model-rf.pkl_398.npy',
 '/Users/MLS/Downloads/model-rf.pkl_399.npy',
 '/Users/MLS/Downloads/model-rf.pkl_400.npy',
 '/Users/MLS/Downloads/model-rf.pkl_401.npy',
 '/Users/MLS/Downloads/model-rf.pkl_402.npy',
 '/Users/MLS/Downloads/model-rf.pkl_403.npy',
 '/Users/MLS/Downloads/model-rf.pkl_404.npy',
 '/Users/MLS/Downloads/model-rf.pkl_405.npy',
 '/Users/MLS/Downloads/model-rf.pkl_406.npy',
 '/Users/MLS/Downloads/model-rf.pkl_407.npy',
 '/Users/MLS/Downloads/model-rf.pkl_408.npy',
 '/Users/MLS/Downloads/model-rf.pkl_409.npy',
 '/Users/MLS/Downloads/model-rf.pkl_410.npy',
 '/Users/MLS/Downloads/model-rf.pkl_411.npy',
 '/Users/MLS/Downloads/model-rf.pkl_412.npy',
 '/Users/MLS/Downloads/model-rf.pkl_413.npy',
 '/Users/MLS/Downloads/model-rf.pkl_414.npy',
 '/Users/MLS/Downloads/model-rf.pkl_415.npy',
 '/Users/MLS/Downloads/model-rf.pkl_416.npy',
 '/Users/MLS/Downloads/model-rf.pkl_417.npy',
 '/Users/MLS/Downloads/model-rf.pkl_418.npy',
 '/Users/MLS/Downloads/model-rf.pkl_419.npy',
 '/Users/MLS/Downloads/model-rf.pkl_420.npy',
 '/Users/MLS/Downloads/model-rf.pkl_421.npy',
 '/Users/MLS/Downloads/model-rf.pkl_422.npy',
 '/Users/MLS/Downloads/model-rf.pkl_423.npy',
 '/Users/MLS/Downloads/model-rf.pkl_424.npy',
 '/Users/MLS/Downloads/model-rf.pkl_425.npy',
 '/Users/MLS/Downloads/model-rf.pkl_426.npy',
 '/Users/MLS/Downloads/model-rf.pkl_427.npy',
 '/Users/MLS/Downloads/model-rf.pkl_428.npy',
 '/Users/MLS/Downloads/model-rf.pkl_429.npy',
 '/Users/MLS/Downloads/model-rf.pkl_430.npy',
 '/Users/MLS/Downloads/model-rf.pkl_431.npy',
 '/Users/MLS/Downloads/model-rf.pkl_432.npy',
 '/Users/MLS/Downloads/model-rf.pkl_433.npy',
 '/Users/MLS/Downloads/model-rf.pkl_434.npy',
 '/Users/MLS/Downloads/model-rf.pkl_435.npy',
 '/Users/MLS/Downloads/model-rf.pkl_436.npy',
 '/Users/MLS/Downloads/model-rf.pkl_437.npy',
 '/Users/MLS/Downloads/model-rf.pkl_438.npy',
 '/Users/MLS/Downloads/model-rf.pkl_439.npy',
 '/Users/MLS/Downloads/model-rf.pkl_440.npy',
 '/Users/MLS/Downloads/model-rf.pkl_441.npy',
 '/Users/MLS/Downloads/model-rf.pkl_442.npy',
 '/Users/MLS/Downloads/model-rf.pkl_443.npy',
 '/Users/MLS/Downloads/model-rf.pkl_444.npy',
 '/Users/MLS/Downloads/model-rf.pkl_445.npy',
 '/Users/MLS/Downloads/model-rf.pkl_446.npy',
 '/Users/MLS/Downloads/model-rf.pkl_447.npy',
 '/Users/MLS/Downloads/model-rf.pkl_448.npy',
 '/Users/MLS/Downloads/model-rf.pkl_449.npy',
 '/Users/MLS/Downloads/model-rf.pkl_450.npy',
 '/Users/MLS/Downloads/model-rf.pkl_451.npy',
 '/Users/MLS/Downloads/model-rf.pkl_452.npy',
 '/Users/MLS/Downloads/model-rf.pkl_453.npy',
 '/Users/MLS/Downloads/model-rf.pkl_454.npy',
 '/Users/MLS/Downloads/model-rf.pkl_455.npy',
 '/Users/MLS/Downloads/model-rf.pkl_456.npy',
 '/Users/MLS/Downloads/model-rf.pkl_457.npy',
 '/Users/MLS/Downloads/model-rf.pkl_458.npy',
 '/Users/MLS/Downloads/model-rf.pkl_459.npy',
 '/Users/MLS/Downloads/model-rf.pkl_460.npy',
 '/Users/MLS/Downloads/model-rf.pkl_461.npy',
 '/Users/MLS/Downloads/model-rf.pkl_462.npy',
 '/Users/MLS/Downloads/model-rf.pkl_463.npy',
 '/Users/MLS/Downloads/model-rf.pkl_464.npy',
 '/Users/MLS/Downloads/model-rf.pkl_465.npy',
 '/Users/MLS/Downloads/model-rf.pkl_466.npy',
 '/Users/MLS/Downloads/model-rf.pkl_467.npy',
 '/Users/MLS/Downloads/model-rf.pkl_468.npy',
 '/Users/MLS/Downloads/model-rf.pkl_469.npy',
 '/Users/MLS/Downloads/model-rf.pkl_470.npy',
 '/Users/MLS/Downloads/model-rf.pkl_471.npy',
 '/Users/MLS/Downloads/model-rf.pkl_472.npy',
 '/Users/MLS/Downloads/model-rf.pkl_473.npy',
 '/Users/MLS/Downloads/model-rf.pkl_474.npy',
 '/Users/MLS/Downloads/model-rf.pkl_475.npy',
 '/Users/MLS/Downloads/model-rf.pkl_476.npy',
 '/Users/MLS/Downloads/model-rf.pkl_477.npy',
 '/Users/MLS/Downloads/model-rf.pkl_478.npy',
 '/Users/MLS/Downloads/model-rf.pkl_479.npy',
 '/Users/MLS/Downloads/model-rf.pkl_480.npy',
 '/Users/MLS/Downloads/model-rf.pkl_481.npy',
 '/Users/MLS/Downloads/model-rf.pkl_482.npy',
 '/Users/MLS/Downloads/model-rf.pkl_483.npy',
 '/Users/MLS/Downloads/model-rf.pkl_484.npy',
 '/Users/MLS/Downloads/model-rf.pkl_485.npy',
 '/Users/MLS/Downloads/model-rf.pkl_486.npy',
 '/Users/MLS/Downloads/model-rf.pkl_487.npy',
 '/Users/MLS/Downloads/model-rf.pkl_488.npy',
 '/Users/MLS/Downloads/model-rf.pkl_489.npy',
 '/Users/MLS/Downloads/model-rf.pkl_490.npy',
 '/Users/MLS/Downloads/model-rf.pkl_491.npy',
 '/Users/MLS/Downloads/model-rf.pkl_492.npy',
 '/Users/MLS/Downloads/model-rf.pkl_493.npy',
 '/Users/MLS/Downloads/model-rf.pkl_494.npy',
 '/Users/MLS/Downloads/model-rf.pkl_495.npy',
 '/Users/MLS/Downloads/model-rf.pkl_496.npy',
 '/Users/MLS/Downloads/model-rf.pkl_497.npy',
 '/Users/MLS/Downloads/model-rf.pkl_498.npy',
 '/Users/MLS/Downloads/model-rf.pkl_499.npy',
 '/Users/MLS/Downloads/model-rf.pkl_500.npy',
 '/Users/MLS/Downloads/model-rf.pkl_501.npy',
 '/Users/MLS/Downloads/model-rf.pkl_502.npy',
 '/Users/MLS/Downloads/model-rf.pkl_503.npy',
 '/Users/MLS/Downloads/model-rf.pkl_504.npy',
 '/Users/MLS/Downloads/model-rf.pkl_505.npy',
 '/Users/MLS/Downloads/model-rf.pkl_506.npy',
 '/Users/MLS/Downloads/model-rf.pkl_507.npy',
 '/Users/MLS/Downloads/model-rf.pkl_508.npy',
 '/Users/MLS/Downloads/model-rf.pkl_509.npy',
 '/Users/MLS/Downloads/model-rf.pkl_510.npy',
 '/Users/MLS/Downloads/model-rf.pkl_511.npy',
 '/Users/MLS/Downloads/model-rf.pkl_512.npy',
 '/Users/MLS/Downloads/model-rf.pkl_513.npy',
 '/Users/MLS/Downloads/model-rf.pkl_514.npy',
 '/Users/MLS/Downloads/model-rf.pkl_515.npy',
 '/Users/MLS/Downloads/model-rf.pkl_516.npy',
 '/Users/MLS/Downloads/model-rf.pkl_517.npy',
 '/Users/MLS/Downloads/model-rf.pkl_518.npy',
 '/Users/MLS/Downloads/model-rf.pkl_519.npy',
 '/Users/MLS/Downloads/model-rf.pkl_520.npy',
 '/Users/MLS/Downloads/model-rf.pkl_521.npy',
 '/Users/MLS/Downloads/model-rf.pkl_522.npy',
 '/Users/MLS/Downloads/model-rf.pkl_523.npy',
 '/Users/MLS/Downloads/model-rf.pkl_524.npy',
 '/Users/MLS/Downloads/model-rf.pkl_525.npy',
 '/Users/MLS/Downloads/model-rf.pkl_526.npy',
 '/Users/MLS/Downloads/model-rf.pkl_527.npy',
 '/Users/MLS/Downloads/model-rf.pkl_528.npy',
 '/Users/MLS/Downloads/model-rf.pkl_529.npy',
 '/Users/MLS/Downloads/model-rf.pkl_530.npy',
 '/Users/MLS/Downloads/model-rf.pkl_531.npy',
 '/Users/MLS/Downloads/model-rf.pkl_532.npy',
 '/Users/MLS/Downloads/model-rf.pkl_533.npy',
 '/Users/MLS/Downloads/model-rf.pkl_534.npy',
 '/Users/MLS/Downloads/model-rf.pkl_535.npy',
 '/Users/MLS/Downloads/model-rf.pkl_536.npy',
 '/Users/MLS/Downloads/model-rf.pkl_537.npy',
 '/Users/MLS/Downloads/model-rf.pkl_538.npy',
 '/Users/MLS/Downloads/model-rf.pkl_539.npy',
 '/Users/MLS/Downloads/model-rf.pkl_540.npy',
 '/Users/MLS/Downloads/model-rf.pkl_541.npy',
 '/Users/MLS/Downloads/model-rf.pkl_542.npy',
 '/Users/MLS/Downloads/model-rf.pkl_543.npy',
 '/Users/MLS/Downloads/model-rf.pkl_544.npy',
 '/Users/MLS/Downloads/model-rf.pkl_545.npy',
 '/Users/MLS/Downloads/model-rf.pkl_546.npy',
 '/Users/MLS/Downloads/model-rf.pkl_547.npy',
 '/Users/MLS/Downloads/model-rf.pkl_548.npy',
 '/Users/MLS/Downloads/model-rf.pkl_549.npy',
 '/Users/MLS/Downloads/model-rf.pkl_550.npy',
 '/Users/MLS/Downloads/model-rf.pkl_551.npy',
 '/Users/MLS/Downloads/model-rf.pkl_552.npy',
 '/Users/MLS/Downloads/model-rf.pkl_553.npy',
 '/Users/MLS/Downloads/model-rf.pkl_554.npy',
 '/Users/MLS/Downloads/model-rf.pkl_555.npy',
 '/Users/MLS/Downloads/model-rf.pkl_556.npy',
 '/Users/MLS/Downloads/model-rf.pkl_557.npy',
 '/Users/MLS/Downloads/model-rf.pkl_558.npy',
 '/Users/MLS/Downloads/model-rf.pkl_559.npy',
 '/Users/MLS/Downloads/model-rf.pkl_560.npy',
 '/Users/MLS/Downloads/model-rf.pkl_561.npy',
 '/Users/MLS/Downloads/model-rf.pkl_562.npy',
 '/Users/MLS/Downloads/model-rf.pkl_563.npy',
 '/Users/MLS/Downloads/model-rf.pkl_564.npy',
 '/Users/MLS/Downloads/model-rf.pkl_565.npy',
 '/Users/MLS/Downloads/model-rf.pkl_566.npy',
 '/Users/MLS/Downloads/model-rf.pkl_567.npy',
 '/Users/MLS/Downloads/model-rf.pkl_568.npy',
 '/Users/MLS/Downloads/model-rf.pkl_569.npy',
 '/Users/MLS/Downloads/model-rf.pkl_570.npy',
 '/Users/MLS/Downloads/model-rf.pkl_571.npy',
 '/Users/MLS/Downloads/model-rf.pkl_572.npy',
 '/Users/MLS/Downloads/model-rf.pkl_573.npy',
 '/Users/MLS/Downloads/model-rf.pkl_574.npy',
 '/Users/MLS/Downloads/model-rf.pkl_575.npy',
 '/Users/MLS/Downloads/model-rf.pkl_576.npy',
 '/Users/MLS/Downloads/model-rf.pkl_577.npy',
 '/Users/MLS/Downloads/model-rf.pkl_578.npy',
 '/Users/MLS/Downloads/model-rf.pkl_579.npy',
 '/Users/MLS/Downloads/model-rf.pkl_580.npy',
 '/Users/MLS/Downloads/model-rf.pkl_581.npy',
 '/Users/MLS/Downloads/model-rf.pkl_582.npy',
 '/Users/MLS/Downloads/model-rf.pkl_583.npy',
 '/Users/MLS/Downloads/model-rf.pkl_584.npy',
 '/Users/MLS/Downloads/model-rf.pkl_585.npy',
 '/Users/MLS/Downloads/model-rf.pkl_586.npy',
 '/Users/MLS/Downloads/model-rf.pkl_587.npy',
 '/Users/MLS/Downloads/model-rf.pkl_588.npy',
 '/Users/MLS/Downloads/model-rf.pkl_589.npy',
 '/Users/MLS/Downloads/model-rf.pkl_590.npy',
 '/Users/MLS/Downloads/model-rf.pkl_591.npy',
 '/Users/MLS/Downloads/model-rf.pkl_592.npy',
 '/Users/MLS/Downloads/model-rf.pkl_593.npy',
 '/Users/MLS/Downloads/model-rf.pkl_594.npy',
 '/Users/MLS/Downloads/model-rf.pkl_595.npy',
 '/Users/MLS/Downloads/model-rf.pkl_596.npy',
 '/Users/MLS/Downloads/model-rf.pkl_597.npy',
 '/Users/MLS/Downloads/model-rf.pkl_598.npy',
 '/Users/MLS/Downloads/model-rf.pkl_599.npy',
 '/Users/MLS/Downloads/model-rf.pkl_600.npy',
 '/Users/MLS/Downloads/model-rf.pkl_601.npy',
 '/Users/MLS/Downloads/model-rf.pkl_602.npy',
 '/Users/MLS/Downloads/model-rf.pkl_603.npy',
 '/Users/MLS/Downloads/model-rf.pkl_604.npy',
 '/Users/MLS/Downloads/model-rf.pkl_605.npy',
 '/Users/MLS/Downloads/model-rf.pkl_606.npy',
 '/Users/MLS/Downloads/model-rf.pkl_607.npy',
 '/Users/MLS/Downloads/model-rf.pkl_608.npy',
 '/Users/MLS/Downloads/model-rf.pkl_609.npy',
 '/Users/MLS/Downloads/model-rf.pkl_610.npy',
 '/Users/MLS/Downloads/model-rf.pkl_611.npy',
 '/Users/MLS/Downloads/model-rf.pkl_612.npy',
 '/Users/MLS/Downloads/model-rf.pkl_613.npy',
 '/Users/MLS/Downloads/model-rf.pkl_614.npy',
 '/Users/MLS/Downloads/model-rf.pkl_615.npy',
 '/Users/MLS/Downloads/model-rf.pkl_616.npy',
 '/Users/MLS/Downloads/model-rf.pkl_617.npy',
 '/Users/MLS/Downloads/model-rf.pkl_618.npy',
 '/Users/MLS/Downloads/model-rf.pkl_619.npy',
 '/Users/MLS/Downloads/model-rf.pkl_620.npy',
 '/Users/MLS/Downloads/model-rf.pkl_621.npy',
 '/Users/MLS/Downloads/model-rf.pkl_622.npy',
 '/Users/MLS/Downloads/model-rf.pkl_623.npy',
 '/Users/MLS/Downloads/model-rf.pkl_624.npy',
 '/Users/MLS/Downloads/model-rf.pkl_625.npy',
 '/Users/MLS/Downloads/model-rf.pkl_626.npy',
 '/Users/MLS/Downloads/model-rf.pkl_627.npy',
 '/Users/MLS/Downloads/model-rf.pkl_628.npy',
 '/Users/MLS/Downloads/model-rf.pkl_629.npy',
 '/Users/MLS/Downloads/model-rf.pkl_630.npy',
 '/Users/MLS/Downloads/model-rf.pkl_631.npy',
 '/Users/MLS/Downloads/model-rf.pkl_632.npy',
 '/Users/MLS/Downloads/model-rf.pkl_633.npy',
 '/Users/MLS/Downloads/model-rf.pkl_634.npy',
 '/Users/MLS/Downloads/model-rf.pkl_635.npy',
 '/Users/MLS/Downloads/model-rf.pkl_636.npy',
 '/Users/MLS/Downloads/model-rf.pkl_637.npy',
 '/Users/MLS/Downloads/model-rf.pkl_638.npy',
 '/Users/MLS/Downloads/model-rf.pkl_639.npy',
 '/Users/MLS/Downloads/model-rf.pkl_640.npy',
 '/Users/MLS/Downloads/model-rf.pkl_641.npy',
 '/Users/MLS/Downloads/model-rf.pkl_642.npy',
 '/Users/MLS/Downloads/model-rf.pkl_643.npy',
 '/Users/MLS/Downloads/model-rf.pkl_644.npy',
 '/Users/MLS/Downloads/model-rf.pkl_645.npy',
 '/Users/MLS/Downloads/model-rf.pkl_646.npy',
 '/Users/MLS/Downloads/model-rf.pkl_647.npy',
 '/Users/MLS/Downloads/model-rf.pkl_648.npy',
 '/Users/MLS/Downloads/model-rf.pkl_649.npy',
 '/Users/MLS/Downloads/model-rf.pkl_650.npy',
 '/Users/MLS/Downloads/model-rf.pkl_651.npy',
 '/Users/MLS/Downloads/model-rf.pkl_652.npy',
 '/Users/MLS/Downloads/model-rf.pkl_653.npy',
 '/Users/MLS/Downloads/model-rf.pkl_654.npy',
 '/Users/MLS/Downloads/model-rf.pkl_655.npy',
 '/Users/MLS/Downloads/model-rf.pkl_656.npy',
 '/Users/MLS/Downloads/model-rf.pkl_657.npy',
 '/Users/MLS/Downloads/model-rf.pkl_658.npy',
 '/Users/MLS/Downloads/model-rf.pkl_659.npy',
 '/Users/MLS/Downloads/model-rf.pkl_660.npy',
 '/Users/MLS/Downloads/model-rf.pkl_661.npy',
 '/Users/MLS/Downloads/model-rf.pkl_662.npy',
 '/Users/MLS/Downloads/model-rf.pkl_663.npy',
 '/Users/MLS/Downloads/model-rf.pkl_664.npy',
 '/Users/MLS/Downloads/model-rf.pkl_665.npy',
 '/Users/MLS/Downloads/model-rf.pkl_666.npy',
 '/Users/MLS/Downloads/model-rf.pkl_667.npy',
 '/Users/MLS/Downloads/model-rf.pkl_668.npy',
 '/Users/MLS/Downloads/model-rf.pkl_669.npy',
 '/Users/MLS/Downloads/model-rf.pkl_670.npy',
 '/Users/MLS/Downloads/model-rf.pkl_671.npy',
 '/Users/MLS/Downloads/model-rf.pkl_672.npy',
 '/Users/MLS/Downloads/model-rf.pkl_673.npy',
 '/Users/MLS/Downloads/model-rf.pkl_674.npy',
 '/Users/MLS/Downloads/model-rf.pkl_675.npy',
 '/Users/MLS/Downloads/model-rf.pkl_676.npy',
 '/Users/MLS/Downloads/model-rf.pkl_677.npy',
 '/Users/MLS/Downloads/model-rf.pkl_678.npy',
 '/Users/MLS/Downloads/model-rf.pkl_679.npy',
 '/Users/MLS/Downloads/model-rf.pkl_680.npy',
 '/Users/MLS/Downloads/model-rf.pkl_681.npy',
 '/Users/MLS/Downloads/model-rf.pkl_682.npy',
 '/Users/MLS/Downloads/model-rf.pkl_683.npy',
 '/Users/MLS/Downloads/model-rf.pkl_684.npy',
 '/Users/MLS/Downloads/model-rf.pkl_685.npy',
 '/Users/MLS/Downloads/model-rf.pkl_686.npy',
 '/Users/MLS/Downloads/model-rf.pkl_687.npy',
 '/Users/MLS/Downloads/model-rf.pkl_688.npy',
 '/Users/MLS/Downloads/model-rf.pkl_689.npy',
 '/Users/MLS/Downloads/model-rf.pkl_690.npy',
 '/Users/MLS/Downloads/model-rf.pkl_691.npy',
 '/Users/MLS/Downloads/model-rf.pkl_692.npy',
 '/Users/MLS/Downloads/model-rf.pkl_693.npy',
 '/Users/MLS/Downloads/model-rf.pkl_694.npy',
 '/Users/MLS/Downloads/model-rf.pkl_695.npy',
 '/Users/MLS/Downloads/model-rf.pkl_696.npy',
 '/Users/MLS/Downloads/model-rf.pkl_697.npy',
 '/Users/MLS/Downloads/model-rf.pkl_698.npy',
 '/Users/MLS/Downloads/model-rf.pkl_699.npy',
 '/Users/MLS/Downloads/model-rf.pkl_700.npy',
 '/Users/MLS/Downloads/model-rf.pkl_701.npy',
 '/Users/MLS/Downloads/model-rf.pkl_702.npy',
 '/Users/MLS/Downloads/model-rf.pkl_703.npy',
 '/Users/MLS/Downloads/model-rf.pkl_704.npy',
 '/Users/MLS/Downloads/model-rf.pkl_705.npy',
 '/Users/MLS/Downloads/model-rf.pkl_706.npy',
 '/Users/MLS/Downloads/model-rf.pkl_707.npy',
 '/Users/MLS/Downloads/model-rf.pkl_708.npy',
 '/Users/MLS/Downloads/model-rf.pkl_709.npy',
 '/Users/MLS/Downloads/model-rf.pkl_710.npy',
 '/Users/MLS/Downloads/model-rf.pkl_711.npy',
 '/Users/MLS/Downloads/model-rf.pkl_712.npy',
 '/Users/MLS/Downloads/model-rf.pkl_713.npy',
 '/Users/MLS/Downloads/model-rf.pkl_714.npy',
 '/Users/MLS/Downloads/model-rf.pkl_715.npy',
 '/Users/MLS/Downloads/model-rf.pkl_716.npy',
 '/Users/MLS/Downloads/model-rf.pkl_717.npy',
 '/Users/MLS/Downloads/model-rf.pkl_718.npy',
 '/Users/MLS/Downloads/model-rf.pkl_719.npy',
 '/Users/MLS/Downloads/model-rf.pkl_720.npy',
 '/Users/MLS/Downloads/model-rf.pkl_721.npy',
 '/Users/MLS/Downloads/model-rf.pkl_722.npy',
 '/Users/MLS/Downloads/model-rf.pkl_723.npy',
 '/Users/MLS/Downloads/model-rf.pkl_724.npy',
 '/Users/MLS/Downloads/model-rf.pkl_725.npy',
 '/Users/MLS/Downloads/model-rf.pkl_726.npy',
 '/Users/MLS/Downloads/model-rf.pkl_727.npy',
 '/Users/MLS/Downloads/model-rf.pkl_728.npy',
 '/Users/MLS/Downloads/model-rf.pkl_729.npy',
 '/Users/MLS/Downloads/model-rf.pkl_730.npy',
 '/Users/MLS/Downloads/model-rf.pkl_731.npy',
 '/Users/MLS/Downloads/model-rf.pkl_732.npy',
 '/Users/MLS/Downloads/model-rf.pkl_733.npy',
 '/Users/MLS/Downloads/model-rf.pkl_734.npy',
 '/Users/MLS/Downloads/model-rf.pkl_735.npy',
 '/Users/MLS/Downloads/model-rf.pkl_736.npy',
 '/Users/MLS/Downloads/model-rf.pkl_737.npy',
 '/Users/MLS/Downloads/model-rf.pkl_738.npy',
 '/Users/MLS/Downloads/model-rf.pkl_739.npy',
 '/Users/MLS/Downloads/model-rf.pkl_740.npy',
 '/Users/MLS/Downloads/model-rf.pkl_741.npy',
 '/Users/MLS/Downloads/model-rf.pkl_742.npy',
 '/Users/MLS/Downloads/model-rf.pkl_743.npy',
 '/Users/MLS/Downloads/model-rf.pkl_744.npy',
 '/Users/MLS/Downloads/model-rf.pkl_745.npy',
 '/Users/MLS/Downloads/model-rf.pkl_746.npy',
 '/Users/MLS/Downloads/model-rf.pkl_747.npy',
 '/Users/MLS/Downloads/model-rf.pkl_748.npy',
 '/Users/MLS/Downloads/model-rf.pkl_749.npy',
 '/Users/MLS/Downloads/model-rf.pkl_750.npy',
 '/Users/MLS/Downloads/model-rf.pkl_751.npy',
 '/Users/MLS/Downloads/model-rf.pkl_752.npy',
 '/Users/MLS/Downloads/model-rf.pkl_753.npy',
 '/Users/MLS/Downloads/model-rf.pkl_754.npy',
 '/Users/MLS/Downloads/model-rf.pkl_755.npy',
 '/Users/MLS/Downloads/model-rf.pkl_756.npy',
 '/Users/MLS/Downloads/model-rf.pkl_757.npy',
 '/Users/MLS/Downloads/model-rf.pkl_758.npy',
 '/Users/MLS/Downloads/model-rf.pkl_759.npy',
 '/Users/MLS/Downloads/model-rf.pkl_760.npy',
 '/Users/MLS/Downloads/model-rf.pkl_761.npy',
 '/Users/MLS/Downloads/model-rf.pkl_762.npy',
 '/Users/MLS/Downloads/model-rf.pkl_763.npy',
 '/Users/MLS/Downloads/model-rf.pkl_764.npy',
 '/Users/MLS/Downloads/model-rf.pkl_765.npy',
 '/Users/MLS/Downloads/model-rf.pkl_766.npy',
 '/Users/MLS/Downloads/model-rf.pkl_767.npy',
 '/Users/MLS/Downloads/model-rf.pkl_768.npy',
 '/Users/MLS/Downloads/model-rf.pkl_769.npy',
 '/Users/MLS/Downloads/model-rf.pkl_770.npy',
 '/Users/MLS/Downloads/model-rf.pkl_771.npy',
 '/Users/MLS/Downloads/model-rf.pkl_772.npy',
 '/Users/MLS/Downloads/model-rf.pkl_773.npy',
 '/Users/MLS/Downloads/model-rf.pkl_774.npy',
 '/Users/MLS/Downloads/model-rf.pkl_775.npy',
 '/Users/MLS/Downloads/model-rf.pkl_776.npy',
 '/Users/MLS/Downloads/model-rf.pkl_777.npy',
 '/Users/MLS/Downloads/model-rf.pkl_778.npy',
 '/Users/MLS/Downloads/model-rf.pkl_779.npy',
 '/Users/MLS/Downloads/model-rf.pkl_780.npy',
 '/Users/MLS/Downloads/model-rf.pkl_781.npy',
 '/Users/MLS/Downloads/model-rf.pkl_782.npy',
 '/Users/MLS/Downloads/model-rf.pkl_783.npy',
 '/Users/MLS/Downloads/model-rf.pkl_784.npy',
 '/Users/MLS/Downloads/model-rf.pkl_785.npy',
 '/Users/MLS/Downloads/model-rf.pkl_786.npy',
 '/Users/MLS/Downloads/model-rf.pkl_787.npy',
 '/Users/MLS/Downloads/model-rf.pkl_788.npy',
 '/Users/MLS/Downloads/model-rf.pkl_789.npy',
 '/Users/MLS/Downloads/model-rf.pkl_790.npy',
 '/Users/MLS/Downloads/model-rf.pkl_791.npy',
 '/Users/MLS/Downloads/model-rf.pkl_792.npy',
 '/Users/MLS/Downloads/model-rf.pkl_793.npy',
 '/Users/MLS/Downloads/model-rf.pkl_794.npy',
 '/Users/MLS/Downloads/model-rf.pkl_795.npy',
 '/Users/MLS/Downloads/model-rf.pkl_796.npy',
 '/Users/MLS/Downloads/model-rf.pkl_797.npy',
 '/Users/MLS/Downloads/model-rf.pkl_798.npy',
 '/Users/MLS/Downloads/model-rf.pkl_799.npy',
 '/Users/MLS/Downloads/model-rf.pkl_800.npy',
 '/Users/MLS/Downloads/model-rf.pkl_801.npy',
 '/Users/MLS/Downloads/model-rf.pkl_802.npy',
 '/Users/MLS/Downloads/model-rf.pkl_803.npy',
 '/Users/MLS/Downloads/model-rf.pkl_804.npy',
 '/Users/MLS/Downloads/model-rf.pkl_805.npy',
 '/Users/MLS/Downloads/model-rf.pkl_806.npy',
 '/Users/MLS/Downloads/model-rf.pkl_807.npy',
 '/Users/MLS/Downloads/model-rf.pkl_808.npy',
 '/Users/MLS/Downloads/model-rf.pkl_809.npy',
 '/Users/MLS/Downloads/model-rf.pkl_810.npy',
 '/Users/MLS/Downloads/model-rf.pkl_811.npy',
 '/Users/MLS/Downloads/model-rf.pkl_812.npy',
 '/Users/MLS/Downloads/model-rf.pkl_813.npy',
 '/Users/MLS/Downloads/model-rf.pkl_814.npy',
 '/Users/MLS/Downloads/model-rf.pkl_815.npy',
 '/Users/MLS/Downloads/model-rf.pkl_816.npy',
 '/Users/MLS/Downloads/model-rf.pkl_817.npy',
 '/Users/MLS/Downloads/model-rf.pkl_818.npy',
 '/Users/MLS/Downloads/model-rf.pkl_819.npy',
 '/Users/MLS/Downloads/model-rf.pkl_820.npy',
 '/Users/MLS/Downloads/model-rf.pkl_821.npy',
 '/Users/MLS/Downloads/model-rf.pkl_822.npy',
 '/Users/MLS/Downloads/model-rf.pkl_823.npy',
 '/Users/MLS/Downloads/model-rf.pkl_824.npy',
 '/Users/MLS/Downloads/model-rf.pkl_825.npy',
 '/Users/MLS/Downloads/model-rf.pkl_826.npy',
 '/Users/MLS/Downloads/model-rf.pkl_827.npy',
 '/Users/MLS/Downloads/model-rf.pkl_828.npy',
 '/Users/MLS/Downloads/model-rf.pkl_829.npy',
 '/Users/MLS/Downloads/model-rf.pkl_830.npy',
 '/Users/MLS/Downloads/model-rf.pkl_831.npy',
 '/Users/MLS/Downloads/model-rf.pkl_832.npy',
 '/Users/MLS/Downloads/model-rf.pkl_833.npy',
 '/Users/MLS/Downloads/model-rf.pkl_834.npy',
 '/Users/MLS/Downloads/model-rf.pkl_835.npy',
 '/Users/MLS/Downloads/model-rf.pkl_836.npy',
 '/Users/MLS/Downloads/model-rf.pkl_837.npy',
 '/Users/MLS/Downloads/model-rf.pkl_838.npy',
 '/Users/MLS/Downloads/model-rf.pkl_839.npy',
 '/Users/MLS/Downloads/model-rf.pkl_840.npy',
 '/Users/MLS/Downloads/model-rf.pkl_841.npy',
 '/Users/MLS/Downloads/model-rf.pkl_842.npy',
 '/Users/MLS/Downloads/model-rf.pkl_843.npy',
 '/Users/MLS/Downloads/model-rf.pkl_844.npy',
 '/Users/MLS/Downloads/model-rf.pkl_845.npy',
 '/Users/MLS/Downloads/model-rf.pkl_846.npy',
 '/Users/MLS/Downloads/model-rf.pkl_847.npy',
 '/Users/MLS/Downloads/model-rf.pkl_848.npy',
 '/Users/MLS/Downloads/model-rf.pkl_849.npy',
 '/Users/MLS/Downloads/model-rf.pkl_850.npy',
 '/Users/MLS/Downloads/model-rf.pkl_851.npy',
 '/Users/MLS/Downloads/model-rf.pkl_852.npy',
 '/Users/MLS/Downloads/model-rf.pkl_853.npy',
 '/Users/MLS/Downloads/model-rf.pkl_854.npy',
 '/Users/MLS/Downloads/model-rf.pkl_855.npy',
 '/Users/MLS/Downloads/model-rf.pkl_856.npy',
 '/Users/MLS/Downloads/model-rf.pkl_857.npy',
 '/Users/MLS/Downloads/model-rf.pkl_858.npy',
 '/Users/MLS/Downloads/model-rf.pkl_859.npy',
 '/Users/MLS/Downloads/model-rf.pkl_860.npy',
 '/Users/MLS/Downloads/model-rf.pkl_861.npy',
 '/Users/MLS/Downloads/model-rf.pkl_862.npy',
 '/Users/MLS/Downloads/model-rf.pkl_863.npy',
 '/Users/MLS/Downloads/model-rf.pkl_864.npy',
 '/Users/MLS/Downloads/model-rf.pkl_865.npy',
 '/Users/MLS/Downloads/model-rf.pkl_866.npy',
 '/Users/MLS/Downloads/model-rf.pkl_867.npy',
 '/Users/MLS/Downloads/model-rf.pkl_868.npy',
 '/Users/MLS/Downloads/model-rf.pkl_869.npy',
 '/Users/MLS/Downloads/model-rf.pkl_870.npy',
 '/Users/MLS/Downloads/model-rf.pkl_871.npy',
 '/Users/MLS/Downloads/model-rf.pkl_872.npy',
 '/Users/MLS/Downloads/model-rf.pkl_873.npy',
 '/Users/MLS/Downloads/model-rf.pkl_874.npy',
 '/Users/MLS/Downloads/model-rf.pkl_875.npy',
 '/Users/MLS/Downloads/model-rf.pkl_876.npy',
 '/Users/MLS/Downloads/model-rf.pkl_877.npy',
 '/Users/MLS/Downloads/model-rf.pkl_878.npy',
 '/Users/MLS/Downloads/model-rf.pkl_879.npy',
 '/Users/MLS/Downloads/model-rf.pkl_880.npy',
 '/Users/MLS/Downloads/model-rf.pkl_881.npy',
 '/Users/MLS/Downloads/model-rf.pkl_882.npy',
 '/Users/MLS/Downloads/model-rf.pkl_883.npy',
 '/Users/MLS/Downloads/model-rf.pkl_884.npy',
 '/Users/MLS/Downloads/model-rf.pkl_885.npy',
 '/Users/MLS/Downloads/model-rf.pkl_886.npy',
 '/Users/MLS/Downloads/model-rf.pkl_887.npy',
 '/Users/MLS/Downloads/model-rf.pkl_888.npy',
 '/Users/MLS/Downloads/model-rf.pkl_889.npy',
 '/Users/MLS/Downloads/model-rf.pkl_890.npy',
 '/Users/MLS/Downloads/model-rf.pkl_891.npy',
 '/Users/MLS/Downloads/model-rf.pkl_892.npy',
 '/Users/MLS/Downloads/model-rf.pkl_893.npy',
 '/Users/MLS/Downloads/model-rf.pkl_894.npy',
 '/Users/MLS/Downloads/model-rf.pkl_895.npy',
 '/Users/MLS/Downloads/model-rf.pkl_896.npy',
 '/Users/MLS/Downloads/model-rf.pkl_897.npy',
 '/Users/MLS/Downloads/model-rf.pkl_898.npy',
 '/Users/MLS/Downloads/model-rf.pkl_899.npy',
 '/Users/MLS/Downloads/model-rf.pkl_900.npy',
 '/Users/MLS/Downloads/model-rf.pkl_901.npy',
 '/Users/MLS/Downloads/model-rf.pkl_902.npy',
 '/Users/MLS/Downloads/model-rf.pkl_903.npy',
 '/Users/MLS/Downloads/model-rf.pkl_904.npy',
 '/Users/MLS/Downloads/model-rf.pkl_905.npy',
 '/Users/MLS/Downloads/model-rf.pkl_906.npy',
 '/Users/MLS/Downloads/model-rf.pkl_907.npy',
 '/Users/MLS/Downloads/model-rf.pkl_908.npy',
 '/Users/MLS/Downloads/model-rf.pkl_909.npy',
 '/Users/MLS/Downloads/model-rf.pkl_910.npy',
 '/Users/MLS/Downloads/model-rf.pkl_911.npy',
 '/Users/MLS/Downloads/model-rf.pkl_912.npy',
 '/Users/MLS/Downloads/model-rf.pkl_913.npy',
 '/Users/MLS/Downloads/model-rf.pkl_914.npy',
 '/Users/MLS/Downloads/model-rf.pkl_915.npy',
 '/Users/MLS/Downloads/model-rf.pkl_916.npy',
 '/Users/MLS/Downloads/model-rf.pkl_917.npy',
 '/Users/MLS/Downloads/model-rf.pkl_918.npy',
 '/Users/MLS/Downloads/model-rf.pkl_919.npy',
 '/Users/MLS/Downloads/model-rf.pkl_920.npy',
 '/Users/MLS/Downloads/model-rf.pkl_921.npy',
 '/Users/MLS/Downloads/model-rf.pkl_922.npy',
 '/Users/MLS/Downloads/model-rf.pkl_923.npy',
 '/Users/MLS/Downloads/model-rf.pkl_924.npy',
 '/Users/MLS/Downloads/model-rf.pkl_925.npy',
 '/Users/MLS/Downloads/model-rf.pkl_926.npy',
 '/Users/MLS/Downloads/model-rf.pkl_927.npy',
 '/Users/MLS/Downloads/model-rf.pkl_928.npy',
 '/Users/MLS/Downloads/model-rf.pkl_929.npy',
 '/Users/MLS/Downloads/model-rf.pkl_930.npy',
 '/Users/MLS/Downloads/model-rf.pkl_931.npy',
 '/Users/MLS/Downloads/model-rf.pkl_932.npy',
 '/Users/MLS/Downloads/model-rf.pkl_933.npy',
 '/Users/MLS/Downloads/model-rf.pkl_934.npy',
 '/Users/MLS/Downloads/model-rf.pkl_935.npy',
 '/Users/MLS/Downloads/model-rf.pkl_936.npy',
 '/Users/MLS/Downloads/model-rf.pkl_937.npy',
 '/Users/MLS/Downloads/model-rf.pkl_938.npy',
 '/Users/MLS/Downloads/model-rf.pkl_939.npy',
 '/Users/MLS/Downloads/model-rf.pkl_940.npy',
 '/Users/MLS/Downloads/model-rf.pkl_941.npy',
 '/Users/MLS/Downloads/model-rf.pkl_942.npy',
 '/Users/MLS/Downloads/model-rf.pkl_943.npy',
 '/Users/MLS/Downloads/model-rf.pkl_944.npy',
 '/Users/MLS/Downloads/model-rf.pkl_945.npy',
 '/Users/MLS/Downloads/model-rf.pkl_946.npy',
 '/Users/MLS/Downloads/model-rf.pkl_947.npy',
 '/Users/MLS/Downloads/model-rf.pkl_948.npy',
 '/Users/MLS/Downloads/model-rf.pkl_949.npy',
 '/Users/MLS/Downloads/model-rf.pkl_950.npy',
 '/Users/MLS/Downloads/model-rf.pkl_951.npy',
 '/Users/MLS/Downloads/model-rf.pkl_952.npy',
 '/Users/MLS/Downloads/model-rf.pkl_953.npy',
 '/Users/MLS/Downloads/model-rf.pkl_954.npy',
 '/Users/MLS/Downloads/model-rf.pkl_955.npy',
 '/Users/MLS/Downloads/model-rf.pkl_956.npy',
 '/Users/MLS/Downloads/model-rf.pkl_957.npy',
 '/Users/MLS/Downloads/model-rf.pkl_958.npy',
 '/Users/MLS/Downloads/model-rf.pkl_959.npy',
 '/Users/MLS/Downloads/model-rf.pkl_960.npy',
 '/Users/MLS/Downloads/model-rf.pkl_961.npy',
 '/Users/MLS/Downloads/model-rf.pkl_962.npy',
 '/Users/MLS/Downloads/model-rf.pkl_963.npy',
 '/Users/MLS/Downloads/model-rf.pkl_964.npy',
 '/Users/MLS/Downloads/model-rf.pkl_965.npy',
 '/Users/MLS/Downloads/model-rf.pkl_966.npy',
 '/Users/MLS/Downloads/model-rf.pkl_967.npy',
 '/Users/MLS/Downloads/model-rf.pkl_968.npy',
 '/Users/MLS/Downloads/model-rf.pkl_969.npy',
 '/Users/MLS/Downloads/model-rf.pkl_970.npy',
 '/Users/MLS/Downloads/model-rf.pkl_971.npy',
 '/Users/MLS/Downloads/model-rf.pkl_972.npy',
 '/Users/MLS/Downloads/model-rf.pkl_973.npy',
 '/Users/MLS/Downloads/model-rf.pkl_974.npy',
 '/Users/MLS/Downloads/model-rf.pkl_975.npy',
 '/Users/MLS/Downloads/model-rf.pkl_976.npy',
 '/Users/MLS/Downloads/model-rf.pkl_977.npy',
 '/Users/MLS/Downloads/model-rf.pkl_978.npy',
 '/Users/MLS/Downloads/model-rf.pkl_979.npy',
 '/Users/MLS/Downloads/model-rf.pkl_980.npy',
 '/Users/MLS/Downloads/model-rf.pkl_981.npy',
 '/Users/MLS/Downloads/model-rf.pkl_982.npy',
 '/Users/MLS/Downloads/model-rf.pkl_983.npy',
 '/Users/MLS/Downloads/model-rf.pkl_984.npy',
 '/Users/MLS/Downloads/model-rf.pkl_985.npy',
 '/Users/MLS/Downloads/model-rf.pkl_986.npy',
 '/Users/MLS/Downloads/model-rf.pkl_987.npy',
 '/Users/MLS/Downloads/model-rf.pkl_988.npy',
 '/Users/MLS/Downloads/model-rf.pkl_989.npy',
 '/Users/MLS/Downloads/model-rf.pkl_990.npy',
 '/Users/MLS/Downloads/model-rf.pkl_991.npy',
 '/Users/MLS/Downloads/model-rf.pkl_992.npy',
 '/Users/MLS/Downloads/model-rf.pkl_993.npy',
 '/Users/MLS/Downloads/model-rf.pkl_994.npy',
 '/Users/MLS/Downloads/model-rf.pkl_995.npy',
 '/Users/MLS/Downloads/model-rf.pkl_996.npy',
 '/Users/MLS/Downloads/model-rf.pkl_997.npy',
 '/Users/MLS/Downloads/model-rf.pkl_998.npy',
 '/Users/MLS/Downloads/model-rf.pkl_999.npy',
 ...]

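Summary:

General steps for machine learning on structured data:

  • 1. Read the raw data from disk and keep a backup copy (the data is usually loaded into a DataFrame).
  • 2. Work out what each attribute in the raw data means, paying particular attention to which attributes are categorical and which are numeric (check with df.info() and df.describe()).
  • 3. Preprocess the data:
      • Missing values: fill them in, drop them, or predict them with a model.
      • Numeric data: if an attribute's values are much larger than those of the other attributes, or its own range is wide, normalize it; you can also try converting it to categorical data.
      • Categorical data: apply one-hot encoding.
  • 4. Feature engineering:
      • Feature generation: feature combination, feature extraction.
      • Feature selection: embedded, wrapper, and filter methods.
  • 5. Train a baseline model.
  • 6. Evaluate the state of the model.
  • 7. Optimize the model: cross-validation, hyperparameter selection, and so on.
  • 8. Model ensembling: combine the models optimized in the previous steps.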
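
The steps above can be sketched end to end in a few lines. This is only a minimal illustration, not the exact code used earlier in this notebook: the data is a small synthetic frame with made-up values that borrows a few Titanic column names, the median/mode imputation and the choice of LogisticRegression as the baseline are assumptions made for brevity, and feature engineering and ensembling (steps 4 and 8) are left out.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Step 1: a synthetic stand-in for the raw data (made-up values, not the Kaggle file).
rng = np.random.default_rng(0)
n = 200
raw = pd.DataFrame({
    "Pclass": rng.integers(1, 4, size=n),
    "Sex": rng.choice(["male", "female"], size=n),
    "Age": rng.uniform(1, 70, size=n),
    "Fare": rng.uniform(5, 300, size=n),
    "Embarked": rng.choice(["S", "C", "Q"], size=n),
})
# Synthetic label: mostly determined by sex, flipped for about 20% of rows.
raw["Survived"] = ((raw["Sex"] == "female") ^ (rng.random(n) < 0.2)).astype(int)
# Knock out some values to imitate the missing data seen in the real set.
raw.loc[rng.choice(n, size=40, replace=False), "Age"] = np.nan
raw.loc[rng.choice(n, size=3, replace=False), "Embarked"] = np.nan

df = raw.copy()  # work on a copy so `raw` stays untouched as the backup

# Step 2: df.info() and df.describe() would be inspected here.

# Step 3: preprocessing.
df["Age"] = df["Age"].fillna(df["Age"].median())                 # fill numeric gaps
df["Embarked"] = df["Embarked"].fillna(df["Embarked"].mode()[0])  # fill categorical gaps
df = pd.get_dummies(df, columns=["Pclass", "Sex", "Embarked"], dtype=float)  # one-hot

# Steps 5 to 7: baseline model evaluated with 5-fold cross-validation.
X = df.drop(columns="Survived")
y = df["Survived"]
# The scaler sits inside the pipeline so it is refit on each training fold.
# It scales every column, including the one-hot ones, which is harmless here.
baseline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scores = cross_val_score(baseline, X, y, cv=5)
print("CV accuracy: %.3f +/- %.3f" % (scores.mean(), scores.std()))
```

For brevity the missing values are filled on the whole frame before cross-validation; in a stricter setup the imputation would also go inside the pipeline so that each fold only sees statistics from its own training part.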