Titanic Kaggle Study Notes

三不傻勇闯kaggle

Published 2024-05-22 13:00:44

Tags: study notes

Article link: https://blog.csdn.net/2301_81087442/article/details/138622816

Preface

This post is mainly my personal study notes. The material I studied is at: https://github.com/ahmedbesbes/How-to-score-0.8134-in-Titanic-Kaggle-Challenge/blob/master/article_1.ipynb.

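The code here runs in a Kaggle notebook and has been tested. Much of it repeats the code at the link above; the repetition is there to fix the errors that newer Python versions cause in parts of the original code, so this version should be easier for beginners to follow. If anything here infringes on your rights, please contact me and I will remove it promptly.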

I also summarize some of the concepts used in the code.

1. Importing libraries and reading the data

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

from IPython.display import HTML
HTML("""
<style>
.output_png {
    display: table-cell;
    text-align: center;
    vertical-align: middle;
}
</style>
""");



%matplotlib inline

import warnings
warnings.filterwarnings('ignore')
warnings.filterwarnings('ignore', category=DeprecationWarning)

import pandas as pd
pd.options.display.max_columns = 100

from matplotlib import pyplot as plt
import numpy as np

import seaborn as sns

import pylab as plot
params = { 
    'axes.labelsize': "large",
    'xtick.labelsize': 'x-large',
    'legend.fontsize': 20,
    'figure.dpi': 150,
    'figure.figsize': [25, 7]
}
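plot.rcParams.update(params)

# load the training set; the same path is used in get_combined_data() further down
data = pd.read_csv('/kaggle/input/titanic/train.csv')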

2. Exploratory data analysis

print(data.shape)  # or just data.shape, which displays the same result

#   output   
#  (891, 12)

data.head()

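Both the original article and the official Kaggle page describe every column of this dataset in detail; this post only restates those descriptions. I recommend reading the original, and the summary below is here for convenience.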

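The Survived column is the target variable. If Survived = 1 the passenger survived; otherwise the passenger died. This is the variable we want to predict.

The other variables describe the passengers. They are the features.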

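PassengerId: the ID of each passenger on board
Pclass: the passenger class. It has three possible values: 1, 2, 3 (first, second and third class)
Name: the name of the passenger
Sex: the sex of the passenger
Age: the age of the passenger
SibSp: the number of siblings and spouses traveling with the passenger
Parch: the number of parents and children traveling with the passenger
Ticket: the ticket number
Fare: the ticket fare
Cabin: the cabin number
Embarked: the port where the passenger boarded the Titanic. It has three possible values: S, C, Q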

Pandas gives you a quick, high-level statistical summary of the numerical features through the describe() method.

data.describe()

The count row shows that the Age column has missing values. For now, fill them with the median.

data['Age'] = data['Age'].fillna(data['Age'].median())
data.describe()

Age no longer has missing values.

Next, a few charts give an intuitive feel for how each feature relates to the target variable.

2.1 Sex

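# Died is the complement of Survived; the plots below need it as a column
data['Died'] = 1 - data['Survived']
data.groupby('Sex')[['Survived', 'Died']].agg('sum').plot(kind='bar', figsize=(25, 7),
                                                          stacked=True, color=['g', 'r']);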

data.groupby('Sex')[['Survived', 'Died']].agg('mean').plot(kind='bar', figsize=(25, 7), 
                                                           stacked=True, color=['g', 'r']);

The charts show that women were more likely to survive.

2.2 Sex & Age

To explore how these two variables relate in combination, the author draws a violin plot.

fig = plt.figure(figsize=(25, 7))
sns.violinplot(x='Sex', y='Age', 
               hue='Survived', data=data, 
               split=True,
               palette={0: "r", 1: "g"}
              )

For how to read a violin plot, see https://www.cnblogs.com/metafullstack/p/17658735.html. In short, it uses kernel density estimation (KDE) to estimate the distribution of the sample and draws the resulting density curve, with a box plot embedded inside.
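
To make the KDE idea concrete, here is a minimal NumPy sketch of a Gaussian kernel density estimate. It is not what seaborn runs internally; the ages, the bandwidth of 5.0 and the helper name gaussian_kde_1d are all assumptions made up for illustration.

```python
import numpy as np

def gaussian_kde_1d(samples, grid, bandwidth):
    """Evaluate a Gaussian kernel density estimate of `samples` on `grid`."""
    samples = np.asarray(samples, dtype=float)
    grid = np.asarray(grid, dtype=float)
    # one Gaussian bump per sample, evaluated at every grid point
    z = (grid[:, None] - samples[None, :]) / bandwidth
    bumps = np.exp(-0.5 * z ** 2) / (bandwidth * np.sqrt(2 * np.pi))
    # the estimate is the average of the bumps
    return bumps.mean(axis=1)

# hypothetical ages, not taken from the Titanic data
ages = np.array([2.0, 4.0, 22.0, 24.0, 25.0, 28.0, 30.0, 35.0, 40.0, 62.0])
grid = np.linspace(-40.0, 110.0, 1501)
density = gaussian_kde_1d(ages, grid, bandwidth=5.0)

# a density should integrate to roughly 1 over a wide enough grid
area = float(np.sum(density) * (grid[1] - grid[0]))
print(round(area, 3))
```

Each half of a split violin is a curve of this kind, drawn vertically: the wider the violin at a given age, the more passengers of that group are estimated to be around that age.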

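The plot suggests the following:

Women survive at a higher rate than men, as shown by the larger green density for females.
Age conditions the survival of male passengers:
– Younger males tend to survive
– A large number of men between 20 and 40 died
Age does not seem to have a direct effect on female survival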

2.3 Fare

figure = plt.figure(figsize=(25, 7))
plt.hist([data[data['Survived'] == 1]['Fare'], data[data['Survived'] == 0]['Fare']], 
         stacked=True, color = ['g','r'],
         bins = 50, label = ['Survived','Dead'])
plt.xlabel('Fare')
plt.ylabel('Number of passengers')
plt.legend()

Passengers who paid higher fares, and who presumably had higher social status, were more likely to survive.

2.4 Age & Fare

plt.figure(figsize=(25, 7))
ax = plt.subplot()

ax.scatter(data[data['Survived'] == 1]['Age'], data[data['Survived'] == 1]['Fare'], 
           c='green', s=data[data['Survived'] == 1]['Fare'])
ax.scatter(data[data['Survived'] == 0]['Age'], data[data['Survived'] == 0]['Fare'], 
           c='red', s=data[data['Survived'] == 0]['Fare']);

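This is just a simple scatter plot with Age on the x-axis and Fare on the y-axis. The bubbles differ in size because the marker size is set to Fare, through the s parameter of scatter.

Green points are survivors and red points are passengers who died. Apart from the children between x = 0 and x = 7, passengers with higher fares were generally more likely to survive.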

2.5 Pclass & Fare

The author uses a bar chart for a quick look at how these two variables are related.

ax = plt.subplot()
ax.set_ylabel('Average fare')
data.groupby('Pclass')['Fare'].mean().plot(kind='bar', figsize=(25, 7), ax = ax)

2.6 Embarked

fig = plt.figure(figsize=(25, 7))
sns.violinplot(x='Embarked', y='Fare', hue='Survived', data=data, split=True, palette={0: "r", 1: "g"})

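Passengers who embarked at ports C and S paid the highest fares, and those high-fare passengers seem more likely to survive; the effect is weaker for port Q.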

3. Feature engineering

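In this part the author has a habit I think is very good: writing functions liberally, which makes the code much more concise and readable. Another strength is that the training and test sets are concatenated and processed together. The first time I built a model I did not do this, so my code was highly repetitive and also more error-prone.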

def status(feature):
    print('Processing', feature, ': ok')

The status function reports that a given preprocessing function has finished.

def get_combined_data():
    # reading train data
    train = pd.read_csv('/kaggle/input/titanic/train.csv')
    
    # reading test data
    test = pd.read_csv('/kaggle/input/titanic/test.csv')

    # extracting and then removing the targets from the training data 
    targets = train.Survived
    train.drop(['Survived'], axis = 1, inplace=True)
    

    # merging train data and test data for future feature engineering
    # we'll also remove the PassengerID since this is not an informative feature
    combined = pd.concat([train, test], axis = 0)
    combined.reset_index(inplace=True)
    combined.drop(['index', 'PassengerId'], inplace=True, axis=1)
    
    return combined
combined = get_combined_data()
print(combined.shape)
print(combined.head())

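get_combined_data() concatenates the training and test sets and returns the result. I do not show the output here; running it yourself should work without trouble, and you can message me if it does not.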
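
The concatenate, transform, split pattern can be sketched on a toy example. The two small frames below are hypothetical stand-ins for train.csv and test.csv, and the toy_ names are mine, not the author's.

```python
import pandas as pd

# toy stand-ins for train.csv and test.csv (hypothetical values)
toy_train = pd.DataFrame({'PassengerId': [1, 2, 3],
                          'Survived': [0, 1, 1],
                          'Fare': [7.25, 71.28, 7.92]})
toy_test = pd.DataFrame({'PassengerId': [4, 5],
                         'Fare': [8.05, None]})

n_train = len(toy_train)
toy_targets = toy_train['Survived']

# stack the rows, dropping the target and the id, and renumber the index 0..n-1
toy_combined = pd.concat([toy_train.drop(columns=['Survived']), toy_test],
                         axis=0, ignore_index=True).drop(columns=['PassengerId'])

# any transformation is written once and applied to both parts;
# the fill value is computed from the training rows only
toy_combined['Fare'] = toy_combined['Fare'].fillna(toy_combined.iloc[:n_train]['Fare'].mean())

# split back by position
train_part = toy_combined.iloc[:n_train]
test_part = toy_combined.iloc[n_train:]
print(train_part.shape, test_part.shape)
```

Splitting by position works only because the row order is never changed after the concatenation, which is why the notebook can rely on the fixed boundary at row 891.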

3.1 Extracting the passenger titles

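If we look closely at the Name column, we find titles such as Mr and Miss. In English usage a title often says something about a person's age and social status, so we want to extract it. First, see which titles occur.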

titles = set()
for name in data['Name']:
    titles.add(name.split(',')[1].split('.')[0].strip())
titles

Next, group titles with similar meanings together.

Title_Dictionary = {
    "Capt": "Officer",
    "Col": "Officer",
    "Major": "Officer",
    "Jonkheer": "Royalty",
    "Don": "Royalty",
    "Sir" : "Royalty",
    "Dr": "Officer",
    "Rev": "Officer",
    "the Countess":"Royalty",
    "Mme": "Mrs",
    "Mlle": "Miss",
    "Ms": "Mrs",
    "Mr" : "Mr",
    "Mrs" : "Mrs",
    "Miss" : "Miss",
    "Master" : "Master",
    "Lady" : "Royalty"
}

def get_titles():
    # we extract the title from each name
    combined['Title'] = combined['Name'].map(lambda name:name.split(',')[1].split('.')[0].strip())
    
    # a map of more aggregated title
    # we map each title
    combined['Title'] = combined.Title.map(Title_Dictionary)
    status('Title')
    return combined
combined = get_titles()
combined.head()

The author also runs a check:

combined[combined['Title'].isnull()]

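It turns up one missing value. The title in question occurs only in the test set and never in the training set, so it needs no extra handling; treating it specially would leak information from the test set.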

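That completes the title extraction. If your Python is rusty, the part of the code above most likely to confuse you is map, which is used once with a function and once with a dictionary. Anything else is easy to look up.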
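
As a quick illustration of the two ways Series.map is used in get_titles, here is a small sketch. The three names and the two-entry lookup dictionary are hypothetical; only the splitting expression is copied from the code above.

```python
import pandas as pd

# hypothetical names in the same "Surname, Title. First name" layout
names = pd.Series(['Smith, Mr. John', 'Brown, Mlle. Anne', 'Lopez, Dona. Maria'])

# map with a function: applied to each element in turn
raw_titles = names.map(lambda name: name.split(',')[1].split('.')[0].strip())

# map with a dict: each value is looked up as a key;
# values that are not keys of the dict become NaN
lookup = {'Mr': 'Mr', 'Mlle': 'Miss'}
grouped_titles = raw_titles.map(lookup)

print(raw_titles.tolist())
print(grouped_titles.tolist())
```

The NaN produced for a key that is missing from the dictionary is the same mechanism that leaves an empty Title when a title is absent from Title_Dictionary.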

3.2 Processing the ages

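Compared with my first attempt, where I simply filled in age with the mean, the author's approach is more principled: fill in age with the median of the group defined by Sex, Title and Pclass. First, count the missing values.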

print(combined.iloc[:891].Age.isnull().sum())
print(combined.iloc[891:].Age.isnull().sum())

Then group the training rows and compute the median age of each group.

grouped_train = combined.iloc[:891].groupby(['Sex','Pclass','Title'])
grouped_median_train = grouped_train[['Age']].median()
grouped_median_train = grouped_median_train.reset_index()[['Sex', 'Pclass', 'Title', 'Age']]
grouped_median_train

Now fill in the missing values.

def fill_age(row):
    condition = (
        (grouped_median_train['Sex'] == row['Sex']) & 
        (grouped_median_train['Title'] == row['Title']) & 
        (grouped_median_train['Pclass'] == row['Pclass'])
    ) 
    return grouped_median_train[condition]['Age'].values[0]

def process_age():
    global combined
    # a function that fills the missing values of the Age variable
    combined['Age'] = combined.apply(lambda row: fill_age(row) if np.isnan(row['Age']) else row['Age'], axis=1)
    status('age')
    return combined
    
combined = process_age()

That completes the preprocessing of age.
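
For comparison, the same idea of filling by group median can be written more compactly with groupby(...).transform. This is an alternative I am sketching, not the author's code, and it differs in one respect: the medians here come from all rows of the toy frame, whereas the notebook computes them from the first 891 (training) rows only. The data below are hypothetical.

```python
import numpy as np
import pandas as pd

# hypothetical rows, not taken from the Titanic data
toy = pd.DataFrame({
    'Sex':    ['male', 'male', 'male', 'female', 'female', 'female'],
    'Pclass': [3, 3, 3, 1, 1, 1],
    'Title':  ['Mr', 'Mr', 'Mr', 'Mrs', 'Mrs', 'Mrs'],
    'Age':    [20.0, 30.0, np.nan, 40.0, np.nan, 50.0],
})

# median age of each (Sex, Pclass, Title) group, broadcast back to every row
group_median = toy.groupby(['Sex', 'Pclass', 'Title'])['Age'].transform('median')

# keep known ages, use the group median only where Age is missing
toy['Age'] = toy['Age'].fillna(group_median)
print(toy['Age'].tolist())
```

The row-by-row apply in process_age does the same lookup one passenger at a time; transform does it for the whole column at once.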

3.3 Processing the names

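With the title extracted, the author considers the useful information in the name to be exhausted and drops the Name column, then one-hot encodes Title. The author also links to the official documentation on this encoding.

In fact, according to other articles I have read, the surname in Name may also carry useful information, but for now we follow this article's approach.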

def process_names():
    global combined
    # we clean the Name variable
    combined.drop('Name', axis=1, inplace=True)
    
    # encoding in dummy variable
    titles_dummies = pd.get_dummies(combined['Title'], prefix='Title')
    combined = pd.concat([combined, titles_dummies], axis=1)
    
    # removing the title variable
    combined.drop('Title', axis=1, inplace=True)
    
    status('names')
    return combined
combined = process_names()
combined.head()

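This article uses one-hot encoding. There are many other encoding schemes, though, and a different one might improve the final result; I am not sure. I only mention it in passing.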
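
To show what "a different encoding" could look like, here is a small sketch that puts one-hot encoding next to plain integer codes in pandas. The five titles are a made-up sample, and integer coding is just one alternative I picked for illustration; I have not tested whether it changes the score.

```python
import pandas as pd

titles = pd.Series(['Mr', 'Miss', 'Mrs', 'Mr', 'Master'], name='Title')

# one-hot: one 0/1 column per category, no order implied
one_hot = pd.get_dummies(titles, prefix='Title').astype(int)

# integer codes: a single column, categories numbered in sorted order
codes = titles.astype('category').cat.codes

print(one_hot.columns.tolist())
print(codes.tolist())
```

Integer codes keep the feature count small but impose an arbitrary order on the categories, which is generally less of a concern for tree-based models than for linear ones.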

3.4 Processing Fare

def process_fares():
    global combined
    # there's one missing fare value - replacing it with the mean.
    combined['Fare'] = combined['Fare'].fillna(combined.iloc[:891].Fare.mean())
    status('fare')
    return combined
combined = process_fares()

3.5 Processing Embarked

def process_embarked():
    global combined
    # two missing embarked values - filling them with the most frequent one in the train  set(S)
    combined['Embarked'] = combined['Embarked'].fillna('S')
    # dummy encoding 
    embarked_dummies = pd.get_dummies(combined['Embarked'], prefix='Embarked')
    combined = pd.concat([combined, embarked_dummies], axis=1)
    combined.drop('Embarked', axis=1, inplace=True)
    status('embarked')
    return combined
combined = process_embarked()
combined.head()

3.6 Processing Cabin

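When I first worked on this, I assumed Cabin was useless because so many values were missing, and I dropped the column. While studying this notebook I realized that a missing value here can itself be meaningful: quite possibly the cabin was simply never recorded. So the author's approach, which keeps that information, is more reasonable.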

train_cabin, test_cabin = set(), set()

for c in combined.iloc[:891]['Cabin']:
    try:
        train_cabin.add(c[0])
    except:
        train_cabin.add('U')
        
for c in combined.iloc[891:]['Cabin']:
    try:
        test_cabin.add(c[0])
    except:
        test_cabin.add('U')
train_cabin

test_cabin

Fortunately, the test set contains no cabin letter that is absent from the training set. Now process Cabin.

def process_cabin():
    global combined    
    # replacing missing cabins with U (for Unknown)
    combined['Cabin'] = combined['Cabin'].fillna('U')
    
    # mapping each Cabin value with the cabin letter
    combined['Cabin'] = combined['Cabin'].map(lambda c: c[0])
    
    # dummy encoding ...
    cabin_dummies = pd.get_dummies(combined['Cabin'], prefix='Cabin')    
    combined = pd.concat([combined, cabin_dummies], axis=1)

    combined.drop('Cabin', axis=1, inplace=True)
    status('cabin')
    return combined
combined = process_cabin()
combined.head()

3.7 Processing Sex

def process_sex():
    global combined
    # mapping string values to numerical one 
    combined['Sex'] = combined['Sex'].map({'male':1, 'female':0})
    status('Sex')
    return combined
combined = process_sex()

Since Sex has only two values, this is effectively one-hot encoding with one of the two columns dropped.
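
A quick check of that equivalence, on a made-up four-row Series: mapping male to 1 and female to 0 gives the same column as pd.get_dummies with drop_first=True.

```python
import pandas as pd

sex = pd.Series(['male', 'female', 'female', 'male'], name='Sex')

# the mapping used in process_sex
mapped = sex.map({'male': 1, 'female': 0})

# full one-hot gives two columns that always sum to 1
full = pd.get_dummies(sex, prefix='Sex').astype(int)

# dropping the first (alphabetical) column, Sex_female, leaves Sex_male
reduced = pd.get_dummies(sex, prefix='Sex', drop_first=True).astype(int)

print(full.columns.tolist(), reduced.columns.tolist())
print((reduced['Sex_male'] == mapped).all())
```

Because the two full one-hot columns always sum to 1, the second carries no extra information, so a single 0/1 column is enough.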

3.8 Processing Pclass

def process_pclass():
    
    global combined
    # encoding into 3 categories:
    pclass_dummies = pd.get_dummies(combined['Pclass'], prefix="Pclass")
    
    # adding dummy variable
    combined = pd.concat([combined, pclass_dummies],axis=1)
    
    # removing "Pclass"
    combined.drop('Pclass',axis=1,inplace=True)
    
    status('Pclass')
    return combined
combined = process_pclass()

3.9 Processing Ticket

def cleanTicket(ticket):
    ticket = ticket.replace('.', '')
    ticket = ticket.replace('/', '')
    ticket = ticket.split()
    ticket = map(lambda t : t.strip(), ticket)
    ticket = list(filter(lambda t : not t.isdigit(), ticket))
    if len(ticket) > 0:
        return ticket[0]
    else: 
        return 'XXX'

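I can follow this code line by line, but I do not yet understand the reasoning behind it. I need to think about it some more.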
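
Tracing the function on a few inputs helps. The sketch below repeats the same steps under a different name (clean_ticket, my renaming) and runs them on four example strings written in the style of the Ticket column. One possible reading, which is my assumption rather than something the source states, is that tickets sharing a prefix may have something in common, so the prefix is kept as a categorical feature and purely numeric tickets are lumped together as 'XXX'.

```python
def clean_ticket(ticket):
    """Same steps as cleanTicket, renamed only to keep this sketch separate."""
    ticket = ticket.replace('.', '').replace('/', '')   # drop punctuation
    parts = [t.strip() for t in ticket.split()]         # split on whitespace
    parts = [t for t in parts if not t.isdigit()]       # discard purely numeric pieces
    return parts[0] if parts else 'XXX'                 # prefix, or 'XXX' if none

for raw in ['A/5 21171', 'PC 17599', 'STON/O2. 3101282', '113803']:
    print(repr(raw), '->', clean_ticket(raw))
```

So the function reduces each ticket to its non-numeric prefix, with punctuation removed, and the set built next simply counts how many distinct prefixes exist.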

tickets = set()
for t in combined['Ticket']:
   tickets.add(cleanTicket(t))

len(tickets)

# output 
# 37

def process_ticket():
    
    global combined
    
    # a function that extracts each prefix of the ticket, returns 'XXX' if no prefix (i.e the ticket is a digit)
    def cleanTicket(ticket):
        ticket = ticket.replace('.','')
        ticket = ticket.replace('/','')
        ticket = ticket.split()
        ticket = map(lambda t : t.strip(), ticket)
        ticket = list(filter(lambda t : not t.isdigit(), ticket))
        if len(ticket) > 0:
            return ticket[0]
        else: 
            return 'XXX'
    

    # Extracting dummy variables from tickets:

    combined['Ticket'] = combined['Ticket'].map(cleanTicket)
    tickets_dummies = pd.get_dummies(combined['Ticket'], prefix='Ticket')
    combined = pd.concat([combined, tickets_dummies], axis=1)
    combined.drop('Ticket', inplace=True, axis=1)

    status('Ticket')
    return combined
combined = process_ticket()

3.10 Processing Family

def process_family():
    
    global combined
    # introducing a new feature : the size of families (including the passenger)
    combined['FamilySize'] = combined['Parch'] + combined['SibSp'] + 1
    
    # introducing other features based on the family size
    combined['Singleton'] = combined['FamilySize'].map(lambda s: 1 if s == 1 else 0)
    combined['SmallFamily'] = combined['FamilySize'].map(lambda s: 1 if 2 <= s <= 4 else 0)
    combined['LargeFamily'] = combined['FamilySize'].map(lambda s: 1 if 5 <= s else 0)
    
    status('family')
    return combined
combined = process_family()
combined.shape

# output
# (1309, 67)

combined.head()

4. Modeling

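In this part we will:

1. Split the combined dataset back into a training set and a test set.
2. Build a model on the training set.
3. Evaluate the model on the training set.
4. Finally, fit the model on the whole training set, predict on the test set and submit.

Steps 2 and 3 are repeated until the score is satisfactory.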

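In practice, a poorly performing model is sometimes not a problem with the model itself but with the feature engineering. I take the author to mean that steps 2 and 3 need to be repeated many times. If many rounds of steps 2 and 3 still do not give a satisfactory result, it is worth doing more work on the feature engineering.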

Import the libraries:

from sklearn.pipeline import make_pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_selection import SelectKBest
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV

Define a scoring function to evaluate a model.

def compute_score(clf, X, y, scoring='accuracy'):
    return cross_val_score(clf, X, y, cv = 5, scoring=scoring).mean()

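Here X holds the explanatory variables, y is the response variable, and clf is the model we define; the function runs 5-fold cross-validation. scoring selects the evaluation metric and defaults to accuracy.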
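
To see what the 5-fold cross-validation in compute_score amounts to, the sketch below computes the same mean accuracy twice: once with cross_val_score and once with an explicit loop over the folds. The data are synthetic (make_classification), and the names X_demo, y_demo and clf_demo are placeholders I made up. The loop uses StratifiedKFold, which, as far as I understand, is what cross_val_score uses for a classifier when cv is an integer.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# synthetic data standing in for the Titanic features (an assumption of this sketch)
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)
clf_demo = LogisticRegression(max_iter=1000)

# what compute_score does: mean accuracy over 5 folds
cv_mean = cross_val_score(clf_demo, X_demo, y_demo, cv=5, scoring='accuracy').mean()

# the same thing spelled out fold by fold
fold_scores = []
for train_idx, valid_idx in StratifiedKFold(n_splits=5).split(X_demo, y_demo):
    # fit a fresh copy on four folds, score it on the held-out fold
    fold_model = clone(clf_demo).fit(X_demo[train_idx], y_demo[train_idx])
    fold_scores.append(fold_model.score(X_demo[valid_idx], y_demo[valid_idx]))

print(cv_mean, np.mean(fold_scores))
```

Every row is used for validation exactly once, so the averaged score is a steadier estimate than a single train/validation split.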

Next, separate the training and test sets again.

def recover_train_test_target():
    global combined
    
    targets = pd.read_csv('/kaggle/input/titanic/train.csv', usecols=['Survived'])['Survived'].values
    train = combined.iloc[:891]
    test = combined.iloc[891:]
    
    return train, test, targets
train, test, targets = recover_train_test_target()

4.1 Feature selection

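We now have 67 features (the second entry of combined.shape above). According to the author, feature selection brings three benefits:

It reduces redundancy in the data
It speeds up model training
It reduces overfitting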

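With my limited background, I never knew how to reduce dimensionality, so I always fell back on principal component analysis, factor analysis and the like. Feature selection of this kind reduces dimensionality too.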

clf = RandomForestClassifier(n_estimators=50, max_features='sqrt')
clf = clf.fit(train, targets)

This builds a random forest model. Once fitted, the model has a feature_importances_ attribute that reflects the importance of each feature.

features = pd.DataFrame()
features['feature'] = train.columns
features['importance'] = clf.feature_importances_
features.sort_values(by=['importance'], ascending=True, inplace=True)
features.set_index('feature', inplace=True)
features.plot(kind='barh', figsize=(25, 25))

model = SelectFromModel(clf, prefit=True)
train_reduced = model.transform(train)
test_reduced = model.transform(test)

This selects the features and transforms the original datasets accordingly.
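
How SelectFromModel decides which columns to keep is easy to check on synthetic data. As I understand the scikit-learn documentation, when no threshold is given and the estimator is a random forest, features whose importance is at least the mean importance are kept. The sketch below compares the selector's mask against that rule; the data and the _demo names are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# synthetic data standing in for the engineered features (an assumption of this sketch)
X_demo, y_demo = make_classification(n_samples=300, n_features=12,
                                     n_informative=3, random_state=0)

forest_demo = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_demo, y_demo)
selector = SelectFromModel(forest_demo, prefit=True)

# with no explicit threshold, features at or above the mean importance are kept
importances = forest_demo.feature_importances_
manual_mask = importances >= importances.mean()

X_reduced = selector.transform(X_demo)
print(X_demo.shape, '->', X_reduced.shape)
print(np.array_equal(selector.get_support(), manual_mask))
```

Printing train_reduced.shape after the transform in the notebook shows in the same way how many of the original columns survive.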

logreg = LogisticRegression()
logreg_cv = LogisticRegressionCV()
rf = RandomForestClassifier()
gboost = GradientBoostingClassifier()

models = [logreg, logreg_cv, rf, gboost]

for model in models:
    print ('Cross-validation of : {0}'.format(model.__class__))
    score = compute_score(clf=model, X=train_reduced, y=targets, scoring='accuracy')
    print ('CV score = {0}'.format(score))
    print ('****')

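We choose a model based on these scores. Oddly, although gboost scores higher, the author still goes with the random forest. So we are free to use another model too, including ones not listed above.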

# turn run_gs to True if you want to run the gridsearch again.
run_gs = True

if run_gs:
    parameter_grid = {
                 'max_depth' : [4, 6, 8],
                 'n_estimators': [50, 10],
                 'max_features': ['sqrt', 'log2'],  # 'auto' dropped: recent scikit-learn rejects it, and for a classifier it meant 'sqrt'
                 'min_samples_split': [2, 3, 10],
                 'min_samples_leaf': [1, 3, 10],
                 'bootstrap': [True, False],
                 }
    forest = RandomForestClassifier()
    cross_validation = StratifiedKFold(n_splits=5)

    grid_search = GridSearchCV(forest,
                               scoring='accuracy',
                               param_grid=parameter_grid,
                               cv=cross_validation,
                               verbose=1
                              )

    grid_search.fit(train, targets)
    model = grid_search
    parameters = grid_search.best_params_

    print('Best score: {}'.format(grid_search.best_score_))
    print('Best parameters: {}'.format(grid_search.best_params_))
parameters = grid_search.best_params_
    
model = RandomForestClassifier(**parameters)
model.fit(train, targets)

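Here grid search is used to tune the hyperparameters, and the model is built with the best parameters found. The author's original code has to be run twice: once to find the best parameters, and a second time after writing those parameters into the code and setting run_gs to False. Here I store the best parameters directly and train the model with them, which is less work.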
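
A related shortcut, sketched here on synthetic data with a deliberately tiny grid: GridSearchCV refits the best configuration on the full data by default (refit=True), so best_estimator_ can be used for prediction directly, without building a second RandomForestClassifier from best_params_. The grid, the data and the variable names are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# synthetic data standing in for train/targets (an assumption of this sketch)
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

# deliberately tiny grid so the sketch runs quickly
small_grid = {'max_depth': [2, 4], 'n_estimators': [10, 20]}

search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid=small_grid,
                      scoring='accuracy',
                      cv=StratifiedKFold(n_splits=5))
search.fit(X_demo, y_demo)

# refit=True is the default, so best_estimator_ is already trained on all of X_demo
best_model = search.best_estimator_
predictions = best_model.predict(X_demo[:5])

print(search.best_params_)
print(len(search.cv_results_['params']))  # number of parameter combinations tried
```

The cost grows multiplicatively with the grid: every combination is fitted once per fold, so adding values to any one parameter multiplies the total number of fits.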

output = model.predict(test).astype(int)
df_output = pd.DataFrame()
aux = pd.read_csv('/kaggle/input/titanic/test.csv')
df_output['PassengerId'] = aux['PassengerId']
df_output['Survived'] = output
df_output[['PassengerId','Survived']].to_csv('gridsearch_rf.csv', index=False)