Machine Learning in Action - titanic_demo02

Titanic Top 4% with ensemble modeling

13/07/2017
  • 1 Introduction
  • 2 Load and check data
    • 2.1 Load data
    • 2.2 Outlier detection
    • 2.3 Joining train and test set
    • 2.4 Check for null and missing values
  • 3 Feature analysis
    • 3.1 Numerical values
    • 3.2 Categorical values
  • 4 Filling missing Values
    • 4.1 Age
  • 5 Feature engineering
    • 5.1 Name/Title
    • 5.2 Family Size
    • 5.3 Cabin
    • 5.4 Ticket
  • 6 Modeling
    • 6.1 Simple modeling
      • 6.1.1 Cross validate models
      • 6.1.2 Hyperparameter tuning for best models
      • 6.1.3 Plot learning curves
      • 6.1.4 Feature importance of the tree based classifiers
    • 6.2 Ensemble modeling
      • 6.2.1 Combining models
    • 6.3 Prediction
      • 6.3.1 Predict and Submit results

1. Introduction

This is my first kernel at Kaggle. I chose the Titanic competition because it is a good way to introduce feature engineering and ensemble modeling. First, I will present some feature analyses, then I'll focus on feature engineering. The last part concerns modeling and predicting survival on the Titanic using a voting procedure.

This script follows three main parts:

  • Feature analysis
  • Feature engineering
  • Modeling
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

from collections import Counter
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier, ExtraTreesClassifier, VotingClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV, cross_val_score, StratifiedKFold, learning_curve
sns.set(style="white",context="notebook",palette="deep")

2. Load and check data

2.1 Load data

# Load data
##### Load train and Test set

train = pd.read_csv("/work/johnson_folder/biggamesData/titanic/train.csv")
test = pd.read_csv("/work/johnson_folder/biggamesData/titanic/test.csv")
IDtest = test["PassengerId"]

2.2 Outlier detection

#outlier detection
def detect_outlier(df,n,features):
    """
    Takes a dataframe df of features and returns a list of the indices corresponding to the observations containing more than n outliers according to the Tukey method
    """
    outlier_indices = []
    
    #iterate over features(columns)
    for col in features:
        #1st quartile(25%)
        Q1 = np.percentile(df[col],25)
        # 3rd quartile (75%)
        Q3 = np.percentile(df[col],75)
        # Interquartile range (IQR)
        IQR = Q3 - Q1
        # outlier step
        outlier_step = 1.5 * IQR
        # Determine a list of indices of outliers for feature col
        outlier_list_col = df[(df[col]<Q1-outlier_step)|(df[col]>Q3+outlier_step)].index
        #append the found outlier indices for col to the list of outlier indices
        outlier_indices.extend(outlier_list_col)
        
    #select observations containing more than 2 outliers
    outlier_indices = Counter(outlier_indices)
    multiple_outliers = list(k for k,v in outlier_indices.items() if v>n)
    
    return multiple_outliers


#detect outliers from Age,SibSp,Parch and Fare
Outliers_to_drop = detect_outlier(train,2,["Age","SibSp","Parch","Fare"])

Since outliers can have a dramatic effect on the prediction (especially for regression problems), I chose to manage them.

I used the Tukey method (Tukey JW., 1977) to detect outliers. It defines an interquartile range (IQR) between the 1st and 3rd quartiles of the distribution values; an outlier is a row that has a feature value outside the range (Q1 - 1.5*IQR, Q3 + 1.5*IQR).

I decided to detect outliers in the numerical features (Age, SibSp, Parch and Fare), and then considered as outliers the rows that have at least two outlying numerical values.
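As a quick, hedged illustration of these Tukey fences (reusing the logic of detect_outlier above, shown for a single column, here Fare):

# Hedged sketch: Tukey fences for one column, using the same definition as detect_outlier
Q1, Q3 = np.percentile(train["Fare"].dropna(), [25, 75])
IQR = Q3 - Q1
lower, upper = Q1 - 1.5 * IQR, Q3 + 1.5 * IQR
print("Fare fences: [%.2f, %.2f]" % (lower, upper))
print("Rows flagged on Fare alone:", ((train["Fare"] < lower) | (train["Fare"] > upper)).sum())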

train.loc[Outliers_to_drop] # Show the outliers rows
     PassengerId  Survived  Pclass  Name                               Sex     Age   SibSp  Parch  Ticket    Fare    Cabin        Embarked
27   28           0         1       Fortune, Mr. Charles Alexander     male    19.0  3      2      19950     263.00  C23 C25 C27  S
88   89           1         1       Fortune, Miss. Mabel Helen         female  23.0  3      2      19950     263.00  C23 C25 C27  S
159  160          0         3       Sage, Master. Thomas Henry         male    NaN   8      2      CA. 2343  69.55   NaN          S
180  181          0         3       Sage, Miss. Constance Gladys       female  NaN   8      2      CA. 2343  69.55   NaN          S
201  202          0         3       Sage, Mr. Frederick                male    NaN   8      2      CA. 2343  69.55   NaN          S
324  325          0         3       Sage, Mr. George John Jr           male    NaN   8      2      CA. 2343  69.55   NaN          S
341  342          1         1       Fortune, Miss. Alice Elizabeth     female  24.0  3      2      19950     263.00  C23 C25 C27  S
792  793          0         3       Sage, Miss. Stella Anna            female  NaN   8      2      CA. 2343  69.55   NaN          S
846  847          0         3       Sage, Mr. Douglas Bullen           male    NaN   8      2      CA. 2343  69.55   NaN          S
863  864          0         3       Sage, Miss. Dorothy Edith "Dolly"  female  NaN   8      2      CA. 2343  69.55   NaN          S
# Drop outliers
train = train.drop(Outliers_to_drop,axis=0).reset_index(drop=True)

2.3 Joining train and test set

#join train and test datasets in order to obtain the same number of features during categorical conversion
train_len = len(train)
dataset = pd.concat([train,test],axis=0).reset_index(drop=True)

I join the train and test datasets so that the categorical conversions produce the same set of feature columns for both.
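A small hedged sketch (toy data, not the Titanic columns) of why encoding the two sets together matters: if a category appears in only one of the frames, separate dummy encodings produce mismatched columns.

# Hedged toy example: dummy-encoding separately vs. jointly (made-up frames)
toy_train = pd.DataFrame({"Embarked": ["S", "C", "S"]})
toy_test = pd.DataFrame({"Embarked": ["Q", "S"]})        # "Q" never seen in toy_train
print(pd.get_dummies(toy_train).columns.tolist())        # ['Embarked_C', 'Embarked_S']
print(pd.get_dummies(toy_test).columns.tolist())         # ['Embarked_Q', 'Embarked_S']
toy_all = pd.concat([toy_train, toy_test], axis=0).reset_index(drop=True)
print(pd.get_dummies(toy_all).columns.tolist())          # ['Embarked_C', 'Embarked_Q', 'Embarked_S']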

2.4 Check for null and missing values

# Fill empty and missing values with NaN
dataset = dataset.fillna(np.nan)

# Check for null values
dataset.isnull().sum()
Age             256
Cabin          1007
Embarked          2
Fare              1
Name              0
Parch             0
PassengerId       0
Pclass            0
Sex               0
SibSp             0
Survived        418
Ticket            0
dtype: int64
train.describe()
       PassengerId    Survived      Pclass         Age       SibSp       Parch        Fare
count   881.000000  881.000000  881.000000  711.000000  881.000000  881.000000  881.000000
mean    446.713961    0.385925    2.307605   29.731603    0.455165    0.363224   31.121566
std     256.617021    0.487090    0.835055   14.547835    0.871571    0.791839   47.996249
min       1.000000    0.000000    1.000000    0.420000    0.000000    0.000000    0.000000
25%     226.000000    0.000000    2.000000   20.250000    0.000000    0.000000    7.895800
50%     448.000000    0.000000    3.000000   28.000000    0.000000    0.000000   14.454200
75%     668.000000    1.000000    3.000000   38.000000    1.000000    0.000000   30.500000
max     891.000000    1.000000    3.000000   80.000000    5.000000    6.000000  512.329200

3. Feature analysis

3.1 Numerical values

# Correlation matrix between the numerical features and Survived
g = sns.heatmap(train[["Survived","SibSp","Parch","Age","Fare"]].corr(),annot=True, fmt = ".2f", cmap = "coolwarm")

[Figure: output_17_0.png]

From the heatmap above, only Fare shows a notable correlation with Survived, but that does not mean the other features are useless.

#explore sibsp feature vs Survived 
g = sns.factorplot(x="SibSp",y="Survived",data=train,kind="bar", size = 6 , 
palette = "muted")
g.despine(left=True)
g = g.set_ylabels("survival probability")

[Figure: output_19_0.png]

It seems that passengers having a lot of siblings/spouses have less chance to survive.

Single passengers (0 SibSp) or passengers with one or two siblings/spouses (SibSp 1 or 2) have more chance to survive.

This observation is quite interesting; we can consider a new feature describing these categories (see feature engineering).

Parch

# Explore Parch feature vs Survived
g  = sns.factorplot(x="Parch",y="Survived",data=train,kind="bar", size = 6 , 
palette = "muted")
g.despine(left=True)
g = g.set_ylabels("survival probability")

[Figure: output_22_0.png]

Small families have more chance to survive than single passengers (Parch 0), medium families (Parch 3, 4) and large families (Parch 5, 6).

Be careful: there is a large standard deviation in the survival of passengers with 3 parents/children.
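To make that caution concrete, here is a small hedged check that prints the mean, standard deviation and count of Survived per Parch value:

# Hedged sketch: survival mean and spread per Parch value
print(train[["Parch", "Survived"]].groupby("Parch")["Survived"].agg(["mean", "std", "count"]))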

#Age
g = sns.FacetGrid(train,col="Survived")
g = g.map(sns.distplot,"Age")

[Figure: output_24_0.png]

The Age distribution seems to be a tailed distribution, maybe a Gaussian distribution.

We notice that the age distributions are not the same in the survived and not-survived subpopulations. Indeed, there is a peak corresponding to young passengers who survived. We also see that passengers between 60 and 80 survived less.

So even if "Age" is not strongly correlated with "Survived", we can see that there are age categories of passengers that have more or less chance to survive.

It seems that very young passengers have more chance to survive.

# Explore Age distribution
g = sns.kdeplot(train["Age"][(train["Survived"] == 0) & (train["Age"].notnull())], color="Red", shade = True)
g = sns.kdeplot(train["Age"][(train["Survived"] == 1) & (train["Age"].notnull())], ax =g, color="Blue", shade= True)
g.set_xlabel("Age")
g.set_ylabel("Frequency")
g = g.legend(["Not Survived","Survived"])

[Figure: output_26_0.png]

#Fill Fare missing values with the median value
dataset["Fare"] = dataset["Fare"].fillna(dataset["Fare"].median())
# Explore Fare distribution 
g = sns.distplot(dataset["Fare"], color="m", label="Skewness : %.2f"%(dataset["Fare"].skew()))
g = g.legend(loc="best")

[Figure: output_28_0.png]

# Apply a log transformation to Fare to reduce skewness
dataset['Fare'] = dataset['Fare'].map(lambda x:np.log(x) if x>0 else 0)
g = sns.distplot(dataset["Fare"], color="b", label="Skewness : %.2f"%(dataset["Fare"].skew()))
g = g.legend(loc="best")

[Figure: output_30_0.png]

3.2 Categorical values

Sex

g = sns.barplot(x="Sex",y="Survived",data=train)
g = g.set_ylabel("Survival prob")

[Figure: output_32_0.png]

train[['Sex','Survived']].groupby("Sex").mean()
        Survived
Sex
female  0.747573
male    0.190559

It is clear that males have less chance to survive than females.

So Sex might play an important role in the prediction of survival.

For those who have seen the Titanic movie (1997), I am sure, we all remember this sentence during the evacuation : “Women and children first”.

#pclass
g = sns.factorplot(x="Pclass",y="Survived",data=train,kind="bar",size=6,
palette = "muted")
g.despine(left=True)
g = g.set_ylabels("survival probability")

[Figure: output_35_0.png]

# Explore Pclass vs Survived by Sex
g = sns.factorplot(x="Pclass", y="Survived", hue="Sex", data=train,
                   size=6, kind="bar", palette="muted")
g.despine(left=True)
g = g.set_ylabels("survival probability")

[Figure: output_36_0.png]

The passenger survival is not the same in the 3 classes. First class passengers have more chance to survive than second class and third class passengers.

This trend is conserved when we look at both male and female passengers.

## Embarked
dataset["Embarked"].isnull().sum()
2
#Fill Embarked nan values of dataset set with 'S' most frequent value
dataset["Embarked"] = dataset["Embarked"].fillna("S")
# Explore Embarked vs Survived 
g = sns.factorplot(x="Embarked", y="Survived",  data=train,
                   size=6, kind="bar", palette="muted")
g.despine(left=True)
g = g.set_ylabels("survival probability")

[Figure: output_40_0.png]

It seems that passengers coming from Cherbourg (C) have more chance to survive.

My hypothesis is that the proportion of first class passengers is higher for those who embarked at Cherbourg than at Queenstown (Q) or Southampton (S).

Let’s see the Pclass distribution vs Embarked

# Explore Pclass vs Embarked 
g = sns.factorplot("Pclass", col="Embarked",  data=train,
                   size=6, kind="count", palette="muted")
g.despine(left=True)
g = g.set_ylabels("Count")

[Figure: output_42_0.png]

Indeed, the third class is the most frequent for passengers coming from Southampton (S) and Queenstown (Q), whereas Cherbourg passengers are mostly in first class, which has the highest survival rate.

At this point, I can't explain why first class has a higher survival rate. My hypothesis is that first class passengers were prioritised during the evacuation due to their influence.

4. Filling missing Values

4.1 Age

As we saw, the Age column contains 256 missing values in the whole dataset.

Since there are subpopulations that have more chance to survive (children for example), it is preferable to keep the Age feature and to impute the missing values.

To address this problem, I looked at the features most correlated with Age (Sex, Parch, Pclass and SibSp).

# Explore Age vs Sex, Parch , Pclass and SibSP
g = sns.factorplot(y="Age",x="Sex",data=dataset,kind="box")
g = sns.factorplot(y="Age",x="Sex",hue="Pclass", data=dataset,kind="box")
g = sns.factorplot(y="Age",x="Parch", data=dataset,kind="box")
g = sns.factorplot(y="Age",x="SibSp", data=dataset,kind="box")

[Figure: output_45_0.png]

[Figure: output_45_1.png]

[Figure: output_45_2.png]

[Figure: output_45_3.png]

Age distribution seems to be the same in Male and Female subpopulations, so Sex is not informative to predict Age.

However, 1st class passengers are older than 2nd class passengers, who are in turn older than 3rd class passengers.

Moreover, the more parents/children a passenger has, the older he is, and the more siblings/spouses a passenger has, the younger he is.

#convert sex into categorical value 0 for male 1 for female
dataset["Sex"] = dataset["Sex"].map({"male":0,"female":1})
g = sns.heatmap(dataset[["Age","Sex","SibSp","Parch","Pclass"]].corr(),cmap="BrBG",annot=True)

[Figure: output_48_0.png]

The correlation map confirms the factorplot observations, except for Parch. Age is not correlated with Sex, but is negatively correlated with Pclass, Parch and SibSp.

In the boxplot of Age as a function of Parch, Age grows with the number of parents/children, yet the overall correlation is negative.

So I decided to use SibSp, Parch and Pclass to impute the missing ages.

The strategy is to fill Age with the median age of similar rows according to Pclass, Parch and SibSp.

# filling missing value of Age
## fill Age with the median age of similar rows according to Pclass,Parch and SibSp
# index of Nan age rows

index_NaN_age = list(dataset["Age"][dataset["Age"].isnull()].index)

for i in index_NaN_age :
    age_med = dataset["Age"].median()
    age_pred = dataset["Age"][((dataset['SibSp'] == dataset.iloc[i]["SibSp"]) & (dataset['Parch'] == dataset.iloc[i]["Parch"]) & (dataset['Pclass'] == dataset.iloc[i]["Pclass"]))].median()
    # use .loc assignment to avoid pandas chained-assignment issues
    if not np.isnan(age_pred) :
        dataset.loc[i, 'Age'] = age_pred
    else :
        dataset.loc[i, 'Age'] = age_med

g = sns.factorplot(x="Survived", y = "Age",data = train, kind="box")
g = sns.factorplot(x="Survived", y = "Age",data = train, kind="violin")

[Figure: output_51_0.png]

[Figure: output_51_1.png]

5. Feature engineering

5.1 Name/Title

# get title from name
dataset_title = [i.split(",")[1].split(".")[0].strip() for i in dataset["Name"]]
dataset['Title'] = pd.Series(dataset_title)
dataset['Title'].head()
0      Mr
1     Mrs
2    Miss
3     Mrs
4      Mr
Name: Title, dtype: object
g = sns.countplot(x="Title",data=dataset)
g = plt.setp(g.get_xticklabels(), rotation=45) 

[Figure: output_54_0.png]

There are 17 titles in the dataset; most of them are very rare, so we can group them into 4 categories.

#convert to categorical values Title 
dataset["Title"] = dataset["Title"].replace(['Lady', 'the Countess','Countess','Capt', 'Col','Don', 'Dr', 'Major', 'Rev', 'Sir', 'Jonkheer', 'Dona'], 'Rare')
dataset["Title"] = dataset["Title"].map({"Master":0, "Miss":1, "Ms" : 1 , "Mme":1, "Mlle":1, "Mrs":1, "Mr":2, "Rare":3})
dataset["Title"] = dataset["Title"].astype(int)
g = sns.countplot(dataset["Title"])
g = g.set_xticklabels(["Master","Miss/Ms/Mme/Mlle/Mrs","Mr","Rare"])

[Figure: output_57_0.png]

g = sns.factorplot(x="Title",y="Survived",data=dataset,kind="bar")
g = g.set_xticklabels(["Master","Miss-Mrs","Mr","Rare"])
g = g.set_ylabels("survival probability")

[Figure: output_58_0.png]

# Drop Name variable
dataset.drop(labels = ["Name"], axis = 1, inplace = True)

5.2 Family size

We can imagine that large families had more difficulty evacuating, looking for their sisters/brothers/parents during the evacuation. So I chose to create an "Fsize" (family size) feature, which is the sum of SibSp, Parch and 1 (to include the passenger).

# Create a family size descriptor from SibSp and Parch
dataset["Fsize"] = dataset["SibSp"] + dataset["Parch"] + 1
g = sns.factorplot(x="Fsize",y="Survived",data = dataset)
g = g.set_ylabels("Survival Probability")

[Figure: output_62_0.png]

# Create new feature of family size
dataset['Single'] = dataset['Fsize'].map(lambda s: 1 if s == 1 else 0)
dataset['SmallF'] = dataset['Fsize'].map(lambda s: 1 if  s == 2  else 0)
dataset['MedF'] = dataset['Fsize'].map(lambda s: 1 if 3 <= s <= 4 else 0)
dataset['LargeF'] = dataset['Fsize'].map(lambda s: 1 if s >= 5 else 0)
g = sns.factorplot(x="Single",y="Survived",data=dataset,kind="bar")
g = g.set_ylabels("Survival Probability")
g = sns.factorplot(x="SmallF",y="Survived",data=dataset,kind="bar")
g = g.set_ylabels("Survival Probability")
g = sns.factorplot(x="MedF",y="Survived",data=dataset,kind="bar")
g = g.set_ylabels("Survival Probability")
g = sns.factorplot(x="LargeF",y="Survived",data=dataset,kind="bar")
g = g.set_ylabels("Survival Probability")

[Figure: output_64_0.png]

[Figure: output_64_1.png]

[Figure: output_64_2.png]

[Figure: output_64_3.png]

# convert to indicator values Title and Embarked 
dataset = pd.get_dummies(dataset, columns = ["Title"])
dataset = pd.get_dummies(dataset, columns = ["Embarked"], prefix="Em")
5.3 Cabin

dataset["Cabin"][dataset["Cabin"].notnull()].head()
1      C85
3     C123
6      E46
10      G6
11    C103
Name: Cabin, dtype: object
# Replace the Cabin value by its deck letter (first character), or 'X' if missing
dataset["Cabin"] = pd.Series(i[0] if not pd.isnull(i) else "X" for i in dataset["Cabin"])

g = sns.countplot(dataset["Cabin"],order=['A','B','C','D','E','F','G','T','X'])

[Figure: output_67_0.png]

g = sns.factorplot(y="Survived",x="Cabin",data=dataset,kind="bar",order=['A','B','C','D','E','F','G','T','X'])
g = g.set_ylabels("Survival Probability")

[Figure: output_68_0.png]

dataset = pd.get_dummies(dataset, columns = ["Cabin"],prefix="Cabin")
5.4 Ticket

# Ticket
## Treat Ticket by extracting the ticket prefix. When there is no prefix it returns X. 

Ticket = []
for i in list(dataset.Ticket):
    if not i.isdigit() :
        Ticket.append(i.replace(".","").replace("/","").strip().split(' ')[0]) #Take prefix
    else:
        Ticket.append("X")
        
dataset["Ticket"] = Ticket
dataset["Ticket"].head()
0        A5
1        PC
2    STONO2
3         X
4         X
Name: Ticket, dtype: object
dataset = pd.get_dummies(dataset, columns = ["Ticket"], prefix="T")
# Create categorical values for Pclass
dataset["Pclass"] = dataset["Pclass"].astype("category")
dataset = pd.get_dummies(dataset, columns = ["Pclass"],prefix="Pc")
# Drop useless variables 
dataset.drop(labels = ["PassengerId"], axis = 1, inplace = True)

6. Modeling

## Separate train dataset and test dataset

train = dataset[:train_len]
test = dataset[train_len:]
test.drop(labels=["Survived"],axis = 1,inplace=True)
## Separate train features and label 

train["Survived"] = train["Survived"].astype(int)

Y_train = train["Survived"]

X_train = train.drop(labels = ["Survived"],axis = 1)

6.1 Simple modeling

6.1.1 Cross validate models

I compared 10 popular classifiers and evaluated the mean accuracy of each of them with a stratified k-fold cross-validation procedure.

  • SVC
  • Decision Tree
  • AdaBoost
  • Random Forest
  • Extra Trees
  • Gradient Boosting
  • Multi-layer perceptron (neural network)
  • KNN
  • Logistic regression
  • Linear Discriminant Analysis
# Cross validate model with Kfold stratified cross val
kfold = StratifiedKFold(n_splits=10)
# Modeling step Test differents algorithms 
random_state = 2
classifiers = []
classifiers.append(SVC(random_state=random_state))
classifiers.append(DecisionTreeClassifier(random_state=random_state))
classifiers.append(AdaBoostClassifier(DecisionTreeClassifier(random_state=random_state),random_state=random_state,learning_rate=0.1))
classifiers.append(RandomForestClassifier(random_state=random_state))
classifiers.append(ExtraTreesClassifier(random_state=random_state))
classifiers.append(GradientBoostingClassifier(random_state=random_state))
classifiers.append(MLPClassifier(random_state=random_state))
classifiers.append(KNeighborsClassifier())
classifiers.append(LogisticRegression(random_state = random_state))
classifiers.append(LinearDiscriminantAnalysis())

cv_results = []
for classifier in classifiers :
    cv_results.append(cross_val_score(classifier, X_train, y = Y_train, scoring = "accuracy", cv = kfold, n_jobs=4))

cv_means = []
cv_std = []
for cv_result in cv_results:
    cv_means.append(cv_result.mean())
    cv_std.append(cv_result.std())

cv_res = pd.DataFrame({"CrossValMeans":cv_means,"CrossValerrors": cv_std,"Algorithm":["SVC","DecisionTree","AdaBoost",
"RandomForest","ExtraTrees","GradientBoosting","MultipleLayerPerceptron","KNeighboors","LogisticRegression","LinearDiscriminantAnalysis"]})

g = sns.barplot("CrossValMeans","Algorithm",data = cv_res, palette="Set3",orient = "h",**{'xerr':cv_std})
g.set_xlabel("Mean Accuracy")
g = g.set_title("Cross validation scores")

[Figure: output_79_0.png]

6.1.2 Hyperparameter tuning for best models

I performed a grid search optimization for the AdaBoost, ExtraTrees, RandomForest, GradientBoosting and SVC classifiers.

I set the "n_jobs" parameter to 4 since I have 4 CPU cores. The computation time is clearly reduced.

But be careful, this step can take a long time; it took me 15 min in total on 4 cores.
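Since these searches are slow, one option (not part of the original kernel) is to cache the fitted search objects with joblib so they are not recomputed on every run; a minimal hedged sketch, where fit_or_load and the file name are hypothetical:

# Hedged sketch: fit a GridSearchCV once and cache it on disk (hypothetical helper)
import joblib

def fit_or_load(search, X, y, path):
    try:
        return joblib.load(path)          # reload a previously fitted search
    except FileNotFoundError:
        search.fit(X, y)                  # fit once, then cache
        joblib.dump(search, path)
        return search

# e.g. gsadaDTC = fit_or_load(gsadaDTC, X_train, Y_train, "gsadaDTC.joblib")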

### META MODELING  WITH ADABOOST, RF, EXTRATREES and GRADIENTBOOSTING

# Adaboost
DTC = DecisionTreeClassifier()

adaDTC = AdaBoostClassifier(DTC, random_state=7)

ada_param_grid = {"base_estimator__criterion" : ["gini", "entropy"],
              "base_estimator__splitter" :   ["best", "random"],
              "algorithm" : ["SAMME","SAMME.R"],
              "n_estimators" :[1,2],
              "learning_rate":  [0.0001, 0.001, 0.01, 0.1, 0.2, 0.3,1.5]}

gsadaDTC = GridSearchCV(adaDTC,param_grid = ada_param_grid, cv=kfold, scoring="accuracy", n_jobs= 4, verbose = 1)

gsadaDTC.fit(X_train,Y_train)

ada_best = gsadaDTC.best_estimator_

Fitting 10 folds for each of 112 candidates, totalling 1120 fits


[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done 348 tasks      | elapsed:    1.5s
[Parallel(n_jobs=4)]: Done 1120 out of 1120 | elapsed:    4.3s finished
#ExtraTrees 
ExtC = ExtraTreesClassifier()


## Search grid for optimal parameters
ex_param_grid = {"max_depth": [None],
              "max_features": [1, 3, 10],
              "min_samples_split": [2, 3, 10],
              "min_samples_leaf": [1, 3, 10],
              "bootstrap": [False],
              "n_estimators" :[100,300],
              "criterion": ["gini"]}


gsExtC = GridSearchCV(ExtC,param_grid = ex_param_grid, cv=kfold, scoring="accuracy", n_jobs= 4, verbose = 1)

gsExtC.fit(X_train,Y_train)

ExtC_best = gsExtC.best_estimator_

# Best score
gsExtC.best_score_

Fitting 10 folds for each of 54 candidates, totalling 540 fits


[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done  42 tasks      | elapsed:    3.5s
[Parallel(n_jobs=4)]: Done 192 tasks      | elapsed:   11.3s
[Parallel(n_jobs=4)]: Done 442 tasks      | elapsed:   28.4s
[Parallel(n_jobs=4)]: Done 540 out of 540 | elapsed:   34.1s finished





0.8308740068104427
# RFC Parameters tunning 
RFC = RandomForestClassifier()


## Search grid for optimal parameters
rf_param_grid = {"max_depth": [None],
              "max_features": [1, 3, 10],
              "min_samples_split": [2, 3, 10],
              "min_samples_leaf": [1, 3, 10],
              "bootstrap": [False],
              "n_estimators" :[100,300],
              "criterion": ["gini"]}


gsRFC = GridSearchCV(RFC,param_grid = rf_param_grid, cv=kfold, scoring="accuracy", n_jobs= 4, verbose = 1)

gsRFC.fit(X_train,Y_train)

RFC_best = gsRFC.best_estimator_

# Best score
gsRFC.best_score_
[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.


Fitting 10 folds for each of 54 candidates, totalling 540 fits


[Parallel(n_jobs=4)]: Done  76 tasks      | elapsed:    5.4s
[Parallel(n_jobs=4)]: Done 376 tasks      | elapsed:   21.7s
[Parallel(n_jobs=4)]: Done 540 out of 540 | elapsed:   33.2s finished





0.8365493757094211
# Gradient boosting tunning

GBC = GradientBoostingClassifier()
gb_param_grid = {'loss' : ["deviance"],
              'n_estimators' : [100,200,300],
              'learning_rate': [0.1, 0.05, 0.01],
              'max_depth': [4, 8],
              'min_samples_leaf': [100,150],
              'max_features': [0.3, 0.1] 
              }

gsGBC = GridSearchCV(GBC,param_grid = gb_param_grid, cv=kfold, scoring="accuracy", n_jobs= 4, verbose = 1)

gsGBC.fit(X_train,Y_train)

GBC_best = gsGBC.best_estimator_

# Best score
gsGBC.best_score_

[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.


Fitting 10 folds for each of 72 candidates, totalling 720 fits


[Parallel(n_jobs=4)]: Done  76 tasks      | elapsed:    2.8s
[Parallel(n_jobs=4)]: Done 376 tasks      | elapsed:   11.9s
[Parallel(n_jobs=4)]: Done 720 out of 720 | elapsed:   23.1s finished





0.8331441543700341
### SVC classifier
SVMC = SVC(probability=True)
svc_param_grid = {'kernel': ['rbf'], 
                  'gamma': [ 0.001, 0.01, 0.1, 1],
                  'C': [1, 10, 50, 100,200,300, 1000]}

gsSVMC = GridSearchCV(SVMC,param_grid = svc_param_grid, cv=kfold, scoring="accuracy", n_jobs= 4, verbose = 1)

gsSVMC.fit(X_train,Y_train)

SVMC_best = gsSVMC.best_estimator_

# Best score
gsSVMC.best_score_
Fitting 10 folds for each of 28 candidates, totalling 280 fits


[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done  42 tasks      | elapsed:    4.0s
[Parallel(n_jobs=4)]: Done 192 tasks      | elapsed:   18.6s
[Parallel(n_jobs=4)]: Done 280 out of 280 | elapsed:   30.1s finished





0.8331441543700341

6.1.3 Plot learning curves

Learning curves are a good way to see the overfitting effect on the training set and the effect of the training size on the accuracy.

def plot_learning_curve(estimator, title, X, y, ylim=None, cv=None,
                        n_jobs=-1, train_sizes=np.linspace(.1, 1.0, 5)):
    """Generate a simple plot of the test and training learning curve"""
    plt.figure()
    plt.title(title)
    if ylim is not None:
        plt.ylim(*ylim)
    plt.xlabel("Training examples")
    plt.ylabel("Score")
    train_sizes, train_scores, test_scores = learning_curve(
        estimator, X, y, cv=cv, n_jobs=n_jobs, train_sizes=train_sizes)
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)
    plt.grid()

    plt.fill_between(train_sizes, train_scores_mean - train_scores_std,
                     train_scores_mean + train_scores_std, alpha=0.1,
                     color="r")
    plt.fill_between(train_sizes, test_scores_mean - test_scores_std,
                     test_scores_mean + test_scores_std, alpha=0.1, color="g")
    plt.plot(train_sizes, train_scores_mean, 'o-', color="r",
             label="Training score")
    plt.plot(train_sizes, test_scores_mean, 'o-', color="g",
             label="Cross-validation score")

    plt.legend(loc="best")
    return plt

g = plot_learning_curve(gsRFC.best_estimator_,"RF learning curves",X_train,Y_train,cv=kfold)
g = plot_learning_curve(gsExtC.best_estimator_,"ExtraTrees learning curves",X_train,Y_train,cv=kfold)
g = plot_learning_curve(gsSVMC.best_estimator_,"SVC learning curves",X_train,Y_train,cv=kfold)
g = plot_learning_curve(gsadaDTC.best_estimator_,"AdaBoost learning curves",X_train,Y_train,cv=kfold)
g = plot_learning_curve(gsGBC.best_estimator_,"GradientBoosting learning curves",X_train,Y_train,cv=kfold)

[Figure: output_87_0.png]

[Figure: output_87_1.png]

[Figure: output_87_2.png]

[Figure: output_87_3.png]

[Figure: output_87_4.png]

GradientBoosting and AdaBoost classifiers tend to overfit the training set. According to the growing cross-validation curves, GradientBoosting and AdaBoost could perform better with more training examples.

SVC and ExtraTrees classifiers seem to better generalize the prediction since the training and cross-validation curves are close together.

6.1.4 Feature importance of tree based classifiers

In order to see the most informative features for the prediction of passenger survival, I displayed the feature importance of the 4 tree-based classifiers.

nrows = ncols = 2
fig, axes = plt.subplots(nrows = nrows, ncols = ncols, sharex="all", figsize=(15,15))

names_classifiers = [("AdaBoosting", ada_best),("ExtraTrees",ExtC_best),("RandomForest",RFC_best),("GradientBoosting",GBC_best)]

nclassifier = 0
for row in range(nrows):
    for col in range(ncols):
        name = names_classifiers[nclassifier][0]
        classifier = names_classifiers[nclassifier][1]
        indices = np.argsort(classifier.feature_importances_)[::-1][:40]
        g = sns.barplot(y=X_train.columns[indices][:40],x = classifier.feature_importances_[indices][:40] , orient='h',ax=axes[row][col])
        g.set_xlabel("Relative importance",fontsize=12)
        g.set_ylabel("Features",fontsize=12)
        g.tick_params(labelsize=9)
        g.set_title(name + " feature importance")
        nclassifier += 1

[Figure: output_90_0.png]

I plotted the feature importance of the 4 tree-based classifiers (AdaBoost, ExtraTrees, RandomForest and GradientBoosting).

We note that the four classifiers have different top features according to the relative importance. It means that their predictions are not based on the same features. Nevertheless, they share some common important features for the classification, for example 'Fare', 'Title_2', 'Age' and 'Sex'.

Title_2 which indicates the Mrs/Mlle/Mme/Miss/Ms category is highly correlated with Sex.

We can say that:

  • Pc_1, Pc_2, Pc_3 and Fare refer to the general social standing of passengers.

  • Sex and Title_2 (Mrs/Mlle/Mme/Miss/Ms) and Title_3 (Mr) refer to the gender.

  • Age and Title_1 (Master) refer to the age of passengers.

  • Fsize, LargeF, MedF, Single refer to the size of the passenger family.

According to the feature importance of these 4 classifiers, the prediction of survival seems to be more associated with the Age, the Sex, the family size and the social standing of the passengers than with their location on the boat.
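To summarise this across models, a hedged sketch that averages the relative importances of the four tree-based classifiers into a single ranking (it reuses the names_classifiers list defined above):

# Hedged sketch: mean feature importance over the four tree-based models
importances = pd.DataFrame(
    {name: clf.feature_importances_ for name, clf in names_classifiers},
    index=X_train.columns)
print(importances.mean(axis=1).sort_values(ascending=False).head(10))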

test_Survived_RFC = pd.Series(RFC_best.predict(test), name="RFC")
test_Survived_ExtC = pd.Series(ExtC_best.predict(test), name="ExtC")
test_Survived_SVMC = pd.Series(SVMC_best.predict(test), name="SVC")
test_Survived_AdaC = pd.Series(ada_best.predict(test), name="Ada")
test_Survived_GBC = pd.Series(GBC_best.predict(test), name="GBC")


# Concatenate all classifier results
ensemble_results = pd.concat([test_Survived_RFC,test_Survived_ExtC,test_Survived_AdaC,test_Survived_GBC, test_Survived_SVMC],axis=1)


g= sns.heatmap(ensemble_results.corr(),annot=True)

[Figure: output_92_0.png]

6.2 Ensemble modeling

6.2.1 Combining models

I chose a voting classifier to combine the predictions coming from the 5 classifiers.

I preferred to pass the argument "soft" to the voting parameter to take into account the probability of each vote.
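For intuition, here is a hedged illustration of what soft voting does (not the VotingClassifier source): it averages the class probabilities of the individual classifiers and picks the class with the highest mean probability.

# Hedged sketch: manual soft vote with equal weights (run after the models above are fitted)
probas = np.mean([clf.predict_proba(test) for clf in
                  [RFC_best, ExtC_best, SVMC_best, ada_best, GBC_best]], axis=0)
manual_soft_vote = probas.argmax(axis=1)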

votingC = VotingClassifier(estimators=[('rfc', RFC_best), ('extc', ExtC_best),
('svc', SVMC_best), ('adac',ada_best),('gbc',GBC_best)], voting='soft', n_jobs=4)

votingC = votingC.fit(X_train, Y_train)

6.3 Prediction

6.3.1 Predict and Submit results
test_Survived = pd.Series(votingC.predict(test), name="Survived")

results = pd.concat([IDtest,test_Survived],axis=1)

results.to_csv("ensemble_python_voting.csv",index=False)
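A quick hedged sanity check on the submission file before uploading it:

# Sanity check: the submission should have 418 rows and the two expected columns
print(results.shape)     # expected (418, 2)
print(results.head())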