Titanic Survival Prediction


Runtime environment

Equipment environment:
system: Win10 64-bit
python version: 3.7.4
matplotlib version: 3.1.1
numpy version: 1.16.5
sklearn version: 0.21.3
pandas version: 0.25.1
seaborn version: 0.9.0
collections: Python standard library (no separate version)

An analysis of survival probability on the Titanic dataset.

1. Data Analysis

1.1 Inspecting the data

print(data_train.info())

From the data info we can tell:

  1. 'Age', 'Cabin', 'Embarked' have missing values; 'Cabin' retains only about 20% of its data;
  2. 'Name', 'Sex', 'Ticket', 'Cabin', 'Embarked' are of type 'object' and need encoding before they can enter the computation.

    RangeIndex: 891 entries, 0 to 890
    Data columns (total 12 columns):
    PassengerId    891 non-null int64
    Survived       891 non-null int64
    Pclass         891 non-null int64
    Name           891 non-null object
    Sex            891 non-null object
    Age            714 non-null float64
    SibSp          891 non-null int64
    Parch          891 non-null int64
    Ticket         891 non-null object
    Fare           891 non-null float64
    Cabin          204 non-null object
    Embarked       889 non-null object
    dtypes: float64(2), int64(5), object(5)
pd.set_option('display.max_rows',None)
print(data_train.describe())

From the statistics, numeric features such as age and fare have inconsistent scales; for distance-based models, skipping scaling slows convergence or prevents it altogether. A quick illustration follows the table.

       PassengerId    Survived      Pclass         Age       SibSp       Parch        Fare
count   891.000000  891.000000  891.000000  714.000000  891.000000  891.000000  891.000000
mean    446.000000    0.383838    2.308642   29.699118    0.523008    0.381594   32.204208
std     257.353842    0.486592    0.836071   14.526497    1.102743    0.806057   49.693429
min       1.000000    0.000000    1.000000    0.420000    0.000000    0.000000    0.000000
25%     223.500000    0.000000    2.000000   20.125000    0.000000    0.000000    7.910400
50%     446.000000    0.000000    3.000000   28.000000    0.000000    0.000000   14.454200
75%     668.500000    1.000000    3.000000   38.000000    1.000000    0.000000   31.000000
max     891.000000    1.000000    3.000000   80.000000    8.000000    6.000000  512.329200
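
A minimal sketch of the scale problem (two hypothetical passengers; the standard deviations are taken from the describe() output above):

import numpy as np

# Two hypothetical passengers described by (Age, Fare)
a = np.array([8.0, 30.0])
b = np.array([60.0, 250.0])

# The raw Euclidean distance is dominated almost entirely by the Fare gap
print(np.linalg.norm(a - b))  # ~226.1

# Measured in standard deviations (Age std ~14.53, Fare std ~49.69),
# the two gaps are in fact comparable
print(np.abs(a - b) / np.array([14.53, 49.69]))  # ~[3.58, 4.43]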

Check the correlation coefficients between features.
Among all variable pairs, SibSp and Parch show a relatively high correlation; one could try dropping one of them, or summing the two into a new feature. On the pairwise correlation coefficient itself: 1 means the two variables are perfectly linearly related and 0 means no linear relationship, but a value of 0 or a small value does not prove the variables are unrelated (see the sketch below the figure).

view_heatmap()

[Figure: correlation heatmap of the variables]
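
A small sketch of that last point (illustrative only, not project code): near-zero correlation does not imply independence.

import numpy as np

x = np.linspace(-1, 1, 101)
y = x ** 2  # y is completely determined by x
print(np.corrcoef(x, y)[0, 1])  # ~0.0: linearly uncorrelated, yet fully dependent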

2. Data Cleaning

2.1 Handling missing values

From the above, only 'Age', 'Cabin' and 'Embarked' have missing values; 'Cabin' retains only about 20% of its data, so the column is dropped outright.

2.1.1 Filling missing Embarked values

print(Counter(df['Embarked']))

Counter({'S': 644, 'C': 168, 'Q': 77, nan: 2})
From the result, 'Q' has relatively few entries, so the two missing values are simply filled with 'Q'.

data_train['Embarked'] = data_train['Embarked'].fillna('Q')
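
A more conventional alternative (not what this write-up chose) would be to fill with the most frequent port; a minimal sketch:

# Hypothetical alternative: fill with the most common port ('S') instead
data_train['Embarked'] = data_train['Embarked'].fillna(data_train['Embarked'].mode()[0])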

2.1.2 Filling missing Age values

The function view_age_group_survived() shows the survival rate for each age group (see the code area for the exact code).
The distribution shows that the very young and the very old were more likely to be rescued. Position 0 holds the count and survival rate of passengers with missing ages.
Plotting one bar per individual age would be hard to read, so ages are bucketed into 10-year groups for the plot; this does not affect the analysis.
[Figure: passenger count per age group (bars) and survival rate per group (line)]

  • After this inspection, the ages can be discretized: 10-, 15- or 20-year buckets would all work, and 10 years is chosen here. For NaN entries, a model predicts which age group they belong to.
# Feature engineering for Age: bucket the ages into groups and predict the missing ones with a model
def age_feature_engineer():
    data_train['age_group'] = data_train['Age'].apply(lambda x: int(x // 10 + 1) if pd.notnull(x) else x)
    data_use_age = data_train.filter(regex='age_group|Survived|SibSp|Parch|Fare|Embarked_.*|Sex_.*|Pclass_.*|name_title_.*|Ticket_.*')
    train_age_known = data_use_age.loc[data_use_age.age_group.notnull()]
    predict_age_unknown = data_use_age.loc[data_use_age.age_group.isnull()]

    x_train, age = train_age_known.drop('age_group', axis=1), train_age_known['age_group']
    x_test = predict_age_unknown.drop('age_group', axis=1)

    mode_predict_age = RandomForestClassifier(random_state=0, n_estimators=30, n_jobs=-1)
    mode_predict_age.fit(x_train, age)
    
    predict_ages = mode_predict_age.predict(x_test)
    data_train.loc[data_use_age.age_group.isnull(), 'age_group'] = predict_ages

2.2 Grouping the string (object) features

2.2.1 Grouping Name

Abroad, a person's title roughly indicates whether they are a government official, royalty, or a commoner, so first extract the title from the name.

# Alternative way to extract the title: data_train['name_title'] = data_train['Name'].apply(lambda x: x.split(',')[1].split('.')[0].strip())
data_train['name_title'] = data_train['Name'].str.extract('.+,(.+)', expand=False).str.extract('^(.+?)\.', expand=False).str.strip()
data_train['name_title'].replace(['Capt', 'Col', 'Major', 'Dr', 'Rev'], 'Officer', inplace=True)
data_train['name_title'].replace(['Jonkheer', 'Don', 'Sir', 'the Countess', 'Dona', 'Lady'], 'Royalty' , inplace=True)
data_train['name_title'].replace(['Mme', 'Ms', 'Mrs'], 'Mrs', inplace=True)
data_train['name_title'].replace(['Mlle', 'Miss'], 'Miss', inplace=True)
data_train['name_title'].replace(['Mr'], 'Mr' , inplace=True)
data_train['name_title'].replace(['Master'], 'Master', inplace=True)
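
To sanity-check the grouping, the resulting title counts can be inspected:

print(data_train['name_title'].value_counts())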

2.2.2 Grouping Ticket

Looking at the data, the first character of each value in the Ticket column appears to indicate the ticket category.

# Process Ticket -> keep the first character
data_train['Ticket'] = data_train['Ticket'].str[0]

2.3 One-hot encoding the categorical data

  • After grouping the string features, the survival rate of each group can be inspected via the object_analysis() function (see the code area for details).

From the plots, title, sex, ticket prefix and embarkation port each concentrate a higher rescue probability in particular values, so all of these features are considered important. The plots also suggest the ticket prefixes could be further bucketed into three tiers (high, medium and low survival rate); this project does not do that, but a sketch follows the figure.
[Figure: category counts (left bars) and survival rate per category (right bars)]
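
A minimal sketch of that three-tier idea (the thresholds are hypothetical; not applied in this project):

# Hypothetical: bucket ticket prefixes into three tiers by observed survival rate
ticket_rate = data_train.groupby('Ticket')['Survived'].mean()

def ticket_tier(prefix):
    rate = ticket_rate[prefix]
    if rate >= 0.5:   # tier thresholds are illustrative only
        return 'high'
    if rate >= 0.3:
        return 'mid'
    return 'low'

# kept commented: a 'Ticket_tier' column would also match the 'Ticket_.*' filter used later
# data_train['Ticket_tier'] = data_train['Ticket'].map(ticket_tier)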

The features found important above are then encoded (one-hot encoding is chosen here).

# Features that need one-hot encoding
need_one_hot_feature = ['name_title', 'Sex', 'Ticket', 'Embarked', 'age_group']
# Having found 'name_title', 'Sex', 'Ticket', 'Embarked' important, they need further processing (for a
# distance-based model, one-hot encode them; for tree models an ordinal encoding is best, as it saves memory).
# 'age_group' has few groups, so it is one-hot encoded too; leaving it as-is should also work, depending on experiments.
def one_hot_hander(feature_list):
    one_hot_feature = []
    for feature_name in feature_list:
        one_hot_feature.append(pd.get_dummies(data_train[feature_name], prefix= feature_name))
    return one_hot_feature
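
As the comment above notes, for tree models an ordinal (label) encoding would suffice; a minimal sketch using pandas category codes (hypothetical, not used in this project):

# Hypothetical ordinal encoding, enough for tree models and cheaper than one-hot
for feature_name in ['name_title', 'Sex', 'Ticket', 'Embarked']:
    data_train[feature_name + '_code'] = data_train[feature_name].astype('category').cat.codes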

2.4 Feature expansion

The data analysis showed the correlation between SibSp and Parch, so the two can be summed to create a new feature; whether it should replace SibSp and Parch is decided later based on experiments.

data_train['SibSp_Parch'] = data_train['SibSp'] + data_train['Parch']
one_hot_feature = one_hot_hander(need_one_hot_feature)
# Feature expansion: append the one-hot encoded columns
data_train = pd.concat([data_train]+one_hot_feature, axis=1)

2.5 Scaling (removing dimensional effects)

From data_train.describe() we can see that the values of 'Fare', 'Pclass', 'Parch' and 'SibSp_Parch' differ in magnitude by factors of tens, so scaling is applied.

# Features that need scaling
need_dimensionless_feature = ['Fare', 'Pclass', 'Parch', 'SibSp_Parch']
# Scale the selected features
def dimensionless_processing(feature):
    for feature_name in feature:
        scaler = Normalizer()
        # note: Normalizer scales rows (samples), so passing the column as one "sample"
        # rescales the whole column to unit L2 norm
        data_train[feature_name] = (scaler.fit_transform([data_train[feature_name]])).T
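
A more common per-feature alternative (not what this project uses) would be zero-mean, unit-variance standardization with sklearn's StandardScaler; a minimal sketch:

from sklearn.preprocessing import StandardScaler

# Hypothetical alternative: standardize each feature independently
scaler = StandardScaler()
data_train[need_dimensionless_feature] = scaler.fit_transform(data_train[need_dimensionless_feature])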

3. Model Prediction

Search for the best tree depth (see tree_depth_acc_relaption() in the code area).
The results show that an overly deep tree overfits; from the plot, any depth from 4 to 9 works.
[Figure: train/test score versus tree depth]

For the decision tree model a depth of 4 is used; all other tree-based models use a depth of 9.

Judging by test-set accuracy and recall, the Voting and GBDT models come out best, but GBDT needs far less prediction time than Voting.

          train_score  test_score  recall_score  predict_time
KNeigh       0.886035    0.798507      0.798507      0.022123
Decisi       0.834671    0.832090      0.832090      0.002992
Random       0.908507    0.835821      0.835821      0.109242
Gradie       0.869984    0.850746      0.850746      0.001995
AdaBoo       0.971108    0.828358      0.828358      0.010971
XGB          0.961477    0.839552      0.839552      0.008977
Baggin       0.950241    0.843284      0.843284      0.007978
Voting       0.939005    0.861940      0.861940      0.149600

[Figure: per-model train/test scores, recall, and prediction time]

Hand-written code to find the best feature for the first split of the Titanic survival tree
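
The 'gda' score below is the ID3-style information gain that choose_bestfeature() computes. With $D$ the training set, $C_k$ the class subsets (Survived = 0/1) and $D_i$ the subsets induced by the values of feature $A$:

$$H(D) = -\sum_{k} \frac{|C_k|}{|D|} \log_2 \frac{|C_k|}{|D|}, \qquad H(D \mid A) = \sum_{i} \frac{|D_i|}{|D|} H(D_i), \qquad g(D, A) = H(D) - H(D \mid A)$$

The feature with the largest $g(D, A)$ would be chosen for the first split.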

Feature ranking before one-hot encoding:

   name_feature       gda
4  Embarked      0.021059
7  age_group     0.024532
5  SibSp_Parch   0.068934
0  Pclass        0.083831
2  Ticket        0.098996
1  Sex           0.217660
6  name_title    0.244520
3  Fare          0.437122

Feature ranking after one-hot encoding:

    name_feature            gda
36  age_group_2.0           0.00010281105861986717
33  Embarked_Q              0.00013288715824610886
28  Ticket_L                0.0002616832410812231
38  age_group_4.0           0.0002962695821370209
26  Ticket_C                0.00032492277940243675
39  age_group_5.0           0.0005036305907689664
41  age_group_7.0           0.0005087157575672796
40  age_group_6.0           0.0007113917322836283
12  name_title_Officer      0.0007392000119662567
13  name_title_Royalty      0.0007753039393163519
27  Ticket_F                0.0008203473878271028
30  Ticket_S                0.0009051823841961237
21  Ticket_6                0.001100156767643301
19  Ticket_4                0.0012842019086071188
24  Ticket_9                0.0015518860521344102
23  Ticket_8                0.0015704375758041067
20  Ticket_5                0.0023573628344824016
31  Ticket_W                0.0027235933228008102
22  Ticket_7                0.002763045593756619
37  age_group_3.0           0.004660998930798521
43  age_group_9.0           0.004664457204693551
17  Ticket_2                0.00504810127099653
8   name_title_Master       0.005063132198816933
42  age_group_8.0           0.005516519162683808
35  age_group_1.0           0.010184239725661515
25  Ticket_A                0.01281719846510787
29  Ticket_P                0.015976162283304673
34  Embarked_S              0.01720377994142208
32  Embarked_C              0.019913005288076824
4   Embarked                0.02105850109994778
7   age_group               0.024532429830416258
18  Ticket_3                0.03383132649296061
16  Ticket_1                0.035267123106138
5   SibSp_Parch             0.06893376008169072
9   name_title_Miss         0.07846190141510256
0   Pclass                  0.0838310452960116
11  name_title_Mrs          0.08530661666666117
2   Ticket                  0.09899556877902549
14  Sex_female              0.2176601066606142
15  Sex_male                0.2176601066606142
1   Sex                     0.2176601066606142
10  name_title_Mr           0.22628903828782787
6   name_title              0.24452024023612073
3   Fare                    0.43712176756435306
def choose_bestfeature(data_train):
    n_category = set(data_train['Survived'])
    n_sample = len(data_train)
    hd = 0  # empirical entropy H(D) of the label 'Survived'
    gdas = {'name_feature': [], 'gda': []}
    for cat in n_category:
        p0 = np.sum(data_train['Survived'] == cat) / n_sample
        hd -= p0 * np.log2(p0)

    for col_name in data_train.drop('Survived', axis=1).columns:
        # counts of every (feature value, label) combination
        t = data_train.groupby(data_train[col_name])['Survived'].value_counts()
        group_name = set(data_train[col_name])

        had = 0  # conditional entropy H(D|A)
        for name_index in group_name:
            group_sample_sum = t[name_index].sum()
            p1 = group_sample_sum / n_sample

            ha = 0  # entropy of the label within this feature-value subset
            for category in n_category:
                if (name_index, category) not in t.index:
                    continue
                p2 = t[name_index][category] / group_sample_sum
                ha -= p2 * np.log2(p2)
            had += p1 * ha
        gda = hd - had  # information gain g(D, A) = H(D) - H(D|A)
        gdas['name_feature'].append(col_name)
        gdas['gda'].append(gda)

    print(hd)
    importance = pd.DataFrame(data=gdas).sort_values(by='gda')
    print(importance)
    # for i, j in zip(range(len(importance)), importance.index):
    #     print('|', j, "|", importance.iloc[i, 0], "|", importance.iloc[i, 1], "|")

Appendix

Code area

import pandas as pd
import numpy as np
from collections import Counter
from matplotlib import pyplot as plt
import seaborn as sns
# from sklearn.feature_selection import SelectKBest
# from sklearn.feature_selection import chi2
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import Normalizer
from sklearn.model_selection import learning_curve
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_validate
from sklearn.model_selection import train_test_split
from sklearn import ensemble
from sklearn import model_selection
import xgboost as xgb
from time import time
from sklearn.metrics import precision_score, recall_score

# Load the training data
data_train = pd.read_csv('3_FeatureENG_AutoML/preview_materials/a6_titanic/data/train.csv')

########################################### Data analysis ############################################
# print(data_train.info())
'''
From the data info we can tell:
1. 'Age', 'Cabin', 'Embarked' have missing values
2. 'Name', 'Sex', 'Ticket', 'Cabin', 'Embarked' are of type 'object' and need encoding before they can enter the computation
RangeIndex: 891 entries, 0 to 890    
Data columns (total 12 columns):     
PassengerId    891 non-null int64    
Survived       891 non-null int64    
Pclass         891 non-null int64    
Name           891 non-null object   
Sex            891 non-null object   
Age            714 non-null float64  
SibSp          891 non-null int64    
Parch          891 non-null int64    
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
'''

# pd.set_option('display.max_rows',None)
# print(data_train.describe())
'''
From the statistics, numeric features such as age and fare have inconsistent scales;
for distance-based models, skipping scaling slows convergence or prevents it altogether.
       PassengerId    Survived      Pclass         Age       SibSp       Parch        Fare
count   891.000000  891.000000  891.000000  714.000000  891.000000  891.000000  891.000000
mean    446.000000    0.383838    2.308642   29.699118    0.523008    0.381594   32.204208
std     257.353842    0.486592    0.836071   14.526497    1.102743    0.806057   49.693429
min       1.000000    0.000000    1.000000    0.420000    0.000000    0.000000    0.000000
25%     223.500000    0.000000    2.000000   20.125000    0.000000    0.000000    7.910400
50%     446.000000    0.000000    3.000000   28.000000    0.000000    0.000000   14.454200
75%     668.500000    1.000000    3.000000   38.000000    1.000000    0.000000   31.000000
max     891.000000    1.000000    3.000000   80.000000    8.000000    6.000000  512.329200
'''

# Plot the pairwise correlation coefficients between variables as a heatmap
def view_heatmap():
    train_corr = data_train.drop('PassengerId', axis=1).corr()
    print(train_corr)
    plt.figure(figsize=(8, 6))
    fig = sns.heatmap(train_corr, vmin=-1, vmax=1, annot=True)
    ax = plt.gca()
    # note: these fixed limits look like a work-around for the seaborn/matplotlib 3.1.1
    # heatmap cropping bug; ideally they should match the number of plotted features
    ax.set_xlim(0, 5)
    ax.set_ylim(0, 5)
    plt.title("Correlation between variables")
    plt.savefig('5_Tree/task/picture/heatmap.jpg')
    plt.show()

'''Among all variable pairs, SibSp and Parch show a relatively high correlation; one could try dropping one of them, or summing the two into a new feature.
On the pairwise correlation coefficient: 1 means perfectly linearly related, 0 means no linear relationship, but a value of 0 or a small value does not prove the variables are unrelated.'''
# view_heatmap()


################################################# Data cleaning #####################################
# Only about 20% of Cabin values remain, so consider dropping the column. NaN could also be treated as its own category. (Dropped here.)

# Based on the correlations, try merging the two columns into one new feature
data_train['SibSp_Parch'] = data_train['SibSp'] + data_train['Parch']

# Fill the missing Embarked values
# print(Counter(df['Embarked']))
'''Counter({'S': 644, 'C': 168, 'Q': 77, nan: 2})
From the result, 'Q' has relatively few entries, so the two missing values are simply filled with 'Q'.'''
data_train['Embarked'] = data_train['Embarked'].fillna('Q')

# Process 'Name'
# Alternative way to extract the title: data_train['name_title'] = data_train['Name'].apply(lambda x: x.split(',')[1].split('.')[0].strip())
data_train['name_title'] = data_train['Name'].str.extract('.+,(.+)', expand=False).str.extract('^(.+?)\.', expand=False).str.strip()
data_train['name_title'].replace(['Capt', 'Col', 'Major', 'Dr', 'Rev'], 'Officer', inplace=True)
data_train['name_title'].replace(['Jonkheer', 'Don', 'Sir', 'the Countess', 'Dona', 'Lady'], 'Royalty' , inplace=True)
data_train['name_title'].replace(['Mme', 'Ms', 'Mrs'], 'Mrs', inplace=True)
data_train['name_title'].replace(['Mlle', 'Miss'], 'Miss', inplace=True)
data_train['name_title'].replace(['Mr'], 'Mr' , inplace=True)
data_train['name_title'].replace(['Master'], 'Master', inplace=True)

# Process Ticket -> keep the first character
data_train['Ticket'] = data_train['Ticket'].str[0]


object_feature = ['name_title', 'Sex', 'Ticket', 'Embarked']

# Display a value or text label at each plotted point
def plt_text(X, Y, fmt='%d', pos_adjust=0, color='black', rotation=0):
    for x, y in zip(X, Y):
        plt.text(x, y + pos_adjust, fmt % y , ha='center', va= 'bottom',fontsize=9, color=color, rotation=rotation)

# Plot the survival rate for each value of every 'object' feature
def object_analysis(feature_list):
    fig = plt.figure(figsize=(10, 8))
    fig.subplots_adjust(wspace=0.25,hspace=0.15, bottom=0.05, top=0.9)
    bar_width = 0.2
    fig_number = 221
    for feature_name in feature_list:
        plt.subplot(fig_number)
        index_values = data_train[feature_name].value_counts()
        x_label, values = index_values.index, index_values.values
        x_index = np.arange(len(x_label))
        plt.bar(x=x_index-bar_width, height=values, width=bar_width*2, tick_label=x_label)
        plt_text(x_index-bar_width, values, pos_adjust=5, rotation=90)
        plt.xticks(rotation=360)

        plt.twinx()
        Survived_rate = data_train.groupby(data_train[feature_name])['Survived'].mean()
        plt.bar(x_index+bar_width, Survived_rate[x_label], width=bar_width*2, color='r')
        plt_text(x_index+bar_width, Survived_rate[x_label], fmt='%.2f', pos_adjust=0.01, rotation=90)

        fig_number += 1
    plt.suptitle("The picture on the left is category number, \nand on the right is the Survived rate of various categories", color='r')
    plt.savefig('5_Tree/task/picture/survived_rate_for_categories.jpg')
    plt.show()

'''From the plots, 'name_title', 'Sex', 'Ticket' and 'Embarked' all look important: the survival rate clearly differs a lot across their categories'''
# object_analysis(object_feature)

# Inspect survival by age group; missing ages are represented by 0 for now
def view_age_group_survived():
    data_train['age_group'] = data_train['Age'].apply(lambda x: x // 10 + 1 if pd.notnull(x) else 0)
    group_age_msg = data_train['age_group'].value_counts().sort_index()
    group_index, group_values = group_age_msg.index ,group_age_msg.values
    group_age_msg.plot(kind='bar')
    plt_text(group_index, group_values, pos_adjust=5)
    plt.twinx()
    groups_Survived = data_train.groupby(data_train['age_group'])['Survived'].mean()
    groups_Survived.plot(kind='line', color='r')
    plt_text(group_index, groups_Survived, fmt='%.4f', pos_adjust=-0.03)
    plt.savefig('5_Tree/task/picture/survived_rate_for_age.jpg')
    plt.show()

'''The plot shows that children and the elderly were more likely to be rescued, while the middle age groups differ little, so the ages are discretized.
Buckets of 10, 15 or 20 years are all reasonable; 10 years is chosen here. For NaN entries, a model predicts which age group they fall into.'''
# view_age_group_survived()

# Feature engineering for Age: bucket the ages into groups and predict the missing ones with a model
def age_feature_engineer():
    data_train['age_group'] = data_train['Age'].apply(lambda x: int(x // 10 + 1) if pd.notnull(x) else x)
    data_use_age = data_train.filter(regex='age_group|Survived|SibSp|Parch|Fare|Embarked_.*|Sex_.*|Pclass_.*|name_title_.*|Ticket_.*')
    train_age_known = data_use_age.loc[data_use_age.age_group.notnull()]
    predict_age_unknown = data_use_age.loc[data_use_age.age_group.isnull()]

    x_train, age = train_age_known.drop('age_group', axis=1), train_age_known['age_group']
    x_test = predict_age_unknown.drop('age_group', axis=1)

    mode_predict_age = RandomForestClassifier(random_state=0, n_estimators=30, n_jobs=-1)
    mode_predict_age.fit(x_train, age)
    
    predict_ages = mode_predict_age.predict(x_test)

    '''Why does filling via pd.Series(predict_ages) go wrong? fillna only accepts a
    scalar/dict/Series/DataFrame (hence the pd.Series wrap), but a Series is aligned by
    index, and a fresh pd.Series(predict_ages) is indexed 0..n-1, which does not match
    the row labels of the missing entries. The .loc assignment below uses a boolean mask
    and assigns the raw array positionally, so it needs no pd.Series wrap.'''
    # predict_ages = pd.Series(predict_ages)
    # data_train['age_group'].fillna(predict_ages, inplace=True)
    # print(len(predict_ages))
    data_train.loc[data_use_age.age_group.isnull(), 'age_group'] = predict_ages
    # print(data_train['age_group'])
    # print(data_train['age_group'].isnull().value_counts())

# Features that need one-hot encoding
need_one_hot_feature = ['name_title', 'Sex', 'Ticket', 'Embarked', 'age_group']
# Having found 'name_title', 'Sex', 'Ticket', 'Embarked' important, they need further processing (for a
# distance-based model, one-hot encode them; for tree models an ordinal encoding is best, as it saves memory).
# 'age_group' has few groups, so it is one-hot encoded too; leaving it as-is should also work, depending on experiments.
def one_hot_hander(feature_list):
    one_hot_feature = []
    for feature_name in feature_list:
        one_hot_feature.append(pd.get_dummies(data_train[feature_name], prefix= feature_name))
    return one_hot_feature

# Features that need scaling
need_dimensionless_feature = ['Fare', 'Pclass', 'Parch', 'SibSp_Parch']
# Scale the selected features
def dimensionless_processing(feature):
    for feature_name in feature:
        scaler = Normalizer()
        # note: Normalizer scales rows (samples), so passing the column as one "sample"
        # rescales the whole column to unit L2 norm
        data_train[feature_name] = (scaler.fit_transform([data_train[feature_name]])).T

# Use sklearn's learning_curve to get training_score and cv_score, then plot the learning curve with matplotlib
def plot_learning_curve(estimator, title, X, y, fig_position=None, ylim=None, cv=None, n_jobs=1,
                        train_sizes=np.linspace(0.6, 1, 5), verbose=0, plot=True):
    train_sizes, train_scores, test_scores = learning_curve(
        estimator, X, y, cv=5,n_jobs=n_jobs, train_sizes=train_sizes, verbose=verbose)
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)
    # plt.rcParams['font.sans-serif']=['SimHei']  # needed to render Chinese labels
    # plt.rcParams['axes.unicode_minus']=False    # needed to render the minus sign correctly
    if plot:
        # plt.figure()
        if fig_position != None:
            plt.subplot(fig_position)
        plt.title(title)
        if ylim is not None:
            plt.ylim(*ylim)
        plt.xlabel(u"训练样本数training samples num")
        plt.ylabel(u"得分score")
        # plt.gca().invert_yaxis()
        plt.grid()

        plt.fill_between(train_sizes, train_scores_mean - train_scores_std, train_scores_mean + train_scores_std,
                         alpha=0.1, color="b")
        plt.fill_between(train_sizes, test_scores_mean - test_scores_std, test_scores_mean + test_scores_std,
                         alpha=0.1, color="r")
        plt.plot(train_sizes, train_scores_mean, 'o-', color="b", label=u"score over train data set")
        plt.plot(train_sizes, test_scores_mean, 'o-', color="r", label=u"score over test data set")

        # plt.legend(loc="best")

        # plt.draw()
        # plt.gca().invert_yaxis()
        # plt.show()

    midpoint = ((train_scores_mean[-1] + train_scores_std[-1]) + (test_scores_mean[-1] - test_scores_std[-1])) / 2
    diff = (train_scores_mean[-1] + train_scores_std[-1]) - (test_scores_mean[-1] - test_scores_std[-1])
    return midpoint, diff

def search_k():
    k = np.arange(1, 15)
    train_score = []
    test_score = []
    for i in k:
        mode = KNeighborsClassifier(n_neighbors=i)
        ret = cross_validate(mode, x_train, y, cv=5, return_train_score=True)
        train_score.append(ret['train_score'].mean())
        test_score.append(ret['test_score'].mean())
    plt.plot(k, np.array(train_score), label='train_score')
    plt.plot(k, np.array(test_score), label='test_score')
    plt.xticks(k)
    plt.xlabel("n_neighbors")
    plt.ylabel('score')
    plt.legend()
    plt.grid(True)
    # plt.savefig('5_Tree/task/picture/k_score.jpg')
    plt.show()

# Examine the relationship between tree depth and accuracy
def tree_depth_acc_relaption():
    mode_3 = DecisionTreeClassifier()

    train_score, test_score = [], []
    tree_depth = range(3, 31, 1)
    for i in tree_depth:
        mode_3.set_params(max_depth=i)
        score = cross_validate(mode_3, x_train, y, cv=5, return_train_score=True)
        train_score.append(score['train_score'].mean())
        test_score.append(score['test_score'].mean())
    plt.plot(tree_depth, train_score, label=u"score over train data set")
    plt.plot(tree_depth, test_score, label=u"score over test data set")
    plt.xlabel('tree depth')
    plt.ylabel('score')
    plt.xticks(tree_depth[::2])
    plt.legend()
    plt.grid()
    plt.savefig('5_Tree/task/picture/tree_depth_score.jpg')
    plt.show()

def varints_mode_train(x_train, y):
    knn = KNeighborsClassifier(n_neighbors=3, leaf_size=50)
    DT = DecisionTreeClassifier(max_depth=4)
    RF = RandomForestClassifier(random_state=0, n_estimators=30, max_depth=9, n_jobs=-1)
    GBDT = ensemble.GradientBoostingClassifier(n_estimators=50, max_features=9)
    adaboosk = ensemble.AdaBoostClassifier(DecisionTreeClassifier(max_depth=9), n_estimators=50)
    xgboost = xgb.XGBClassifier()
    bagging = ensemble.BaggingClassifier(base_estimator=DecisionTreeClassifier(max_depth=9), n_estimators=50) # random_state=7

    estimators = [knn, DT, RF, GBDT, adaboosk, xgboost]
    voting_estimators = list(zip(['knn', 'DT', 'RF', 'GBDT', 'adaboosk', 'xgboost'], estimators))
    voting = ensemble.VotingClassifier(voting_estimators, voting='soft')

    estimators.extend([bagging, voting])
    df= pd.DataFrame()
    
    x_train, X_test,  y_train, y_test = train_test_split(x_train, y, test_size=0.3, stratify=y, random_state=7)

    for clf in estimators:
        # score = cross_validate(clf, x_train, y, cv=5, return_train_score=True, return_estimator=True)
        clf.fit(x_train, y_train)
        start = time()
        y_predict = clf.predict(X_test)
        end = time()
        mode_name = clf.__class__.__name__.replace('Classifier', '')[:6]
        df.loc[mode_name, 'train_score'] = np.mean(clf.predict(x_train)==y_train)
        df.loc[mode_name, 'test_score'] = np.mean(y_predict==y_test)
        # note: with average='micro', recall equals plain accuracy for binary labels
        df.loc[mode_name, 'recall_score'] = recall_score(y_test, y_predict, average='micro')
        df.loc[mode_name, 'predict_time'] = end - start
    
    print(df)
    plt.rcParams['font.sans-serif']=['SimHei']  # needed to render Chinese labels
    plt.rcParams['axes.unicode_minus']=False    # needed to render the minus sign correctly
    ax = df.plot(kind='line', secondary_y=['predict_time'])
    plt.xticks(rotation=90)
    # plt.grid(axis='x')  # grid lines parallel to the y-axis would not show up here, reason unclear
    # plt.xlabel('mode name')  # has no effect, likely because plt targets the secondary axis created by secondary_y; use ax.set_xlabel instead
    ax.set_xlabel('mode name', color='r')
    ax.set_ylabel('accuracy', color='r')
    ax.right_ax.set_ylabel('predict time', color='r')
    # plt.savefig('5_Tree/task/picture/result_variants_mode.jpg')
    plt.show()

def choose_bestfeature(data_train):
    n_category = set(data_train['Survived'])
    n_sample = len(data_train)
    hd = 0  # empirical entropy H(D) of the label 'Survived'
    gdas = {'name_feature': [], 'gda': []}
    for cat in n_category:
        p0 = np.sum(data_train['Survived'] == cat) / n_sample
        hd -= p0 * np.log2(p0)

    for col_name in data_train.drop('Survived', axis=1).columns:
        # counts of every (feature value, label) combination
        t = data_train.groupby(data_train[col_name])['Survived'].value_counts()
        group_name = set(data_train[col_name])

        had = 0  # conditional entropy H(D|A)
        for name_index in group_name:
            group_sample_sum = t[name_index].sum()
            p1 = group_sample_sum / n_sample

            ha = 0  # entropy of the label within this feature-value subset
            for category in n_category:
                if (name_index, category) not in t.index:
                    continue
                p2 = t[name_index][category] / group_sample_sum
                ha -= p2 * np.log2(p2)
            had += p1 * ha
        gda = hd - had  # information gain g(D, A) = H(D) - H(D|A)
        gdas['name_feature'].append(col_name)
        gdas['gda'].append(gda)

    print(hd)
    importance = pd.DataFrame(data=gdas).sort_values(by='gda')
    print(importance)
    # for i, j in zip(range(len(importance)), importance.index):
    #     print('|', j, "|", importance.iloc[i, 0], "|", importance.iloc[i, 1], "|")

if __name__ == '__main__':
    # note: the one-hot columns do not exist yet at this point, so the age model inside
    # age_feature_engineer() only sees the SibSp/Parch/Fare-type columns
    age_feature_engineer()
    one_hot_feature = one_hot_hander(need_one_hot_feature)
    # Feature expansion: append the one-hot encoded columns
    data_train = pd.concat([data_train]+one_hot_feature, axis=1)
    # the scaled data is updated in place inside the function
    dimensionless_processing(need_dimensionless_feature)

    train = data_train.filter(regex='Survived|SibSp_Parch|Fare|Embarked_.*|Sex_.*|Pclass_.*|name_title_.*|Ticket_.*|age_group_.*')
    x_train, y = train.drop('Survived', axis=1), train['Survived']    

    # Search for a good number of nearest neighbors
    # search_k()

    # The results show that an overly deep tree overfits; from the plot, any depth from 4 to 9 works
    # tree_depth_acc_relaption()

    # Check each model's accuracy on the train and test sets, plus recall and prediction time
    # varints_mode_train(x_train, y)
    # result
    #             train_score  test_score  recall_score  predict_time
    # KNeigh     0.886035    0.798507           1.0      0.047006
    # Decisi     0.834671    0.824627           1.0      0.001995
    # Random     0.908507    0.835821           1.0      0.124480
    # Gradie     0.860353    0.850746           1.0      0.001995
    # AdaBoo     0.971108    0.843284           1.0      0.012960
    # XGB        0.961477    0.839552           1.0      0.008976
    # Baggin     0.942215    0.843284           1.0      0.008944
    # Voting     0.939005    0.861940           1.0      0.170543

    # data_train = data_train.filter(regex='Survived|^Embarked$|^Sex$|^Pclass$|^name_title$|^age_group$|^Ticket$|Fare|SibSp_Parch')
    data_train = data_train.filter(regex='Survived|Embarked|Sex|Pclass|name_title|age_group|Ticket|Fare|SibSp_Parch')
    choose_bestfeature(data_train)
