Titanic Survival Prediction
Runtime environment
Environment:
system: Win10 64-bit
python version: 3.7.4
matplotlib version: 3.1.1
numpy version: 1.16.5
sklearn version: 0.21.3
pandas version: 0.25.1
seaborn version: 0.9.0
collections: Python standard library (no separate version)
An analysis of survival probability on the Titanic dataset.
1. Data Analysis
1.1 Inspect the data
print(data_train.info())
From the data info we can see:
- 'Age', 'Cabin', 'Embarked' have missing values; 'Cabin' retains only about 20% of its values;
- 'Name', 'Sex', 'Ticket', 'Cabin', 'Embarked' are of type 'object' and need to be encoded before they can enter the computation;
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId 891 non-null int64
Survived 891 non-null int64
Pclass 891 non-null int64
Name 891 non-null object
Sex 891 non-null object
Age 714 non-null float64
SibSp 891 non-null int64
Parch 891 non-null int64
Ticket 891 non-null object
Fare 891 non-null float64
Cabin 204 non-null object
Embarked 889 non-null object
dtypes: float64(2), int64(5), object(5)
pd.set_option('display.max_rows',None)
print(data_train.describe())
From these statistics, features such as age and fare are on very different scales; for distance-based models, skipping rescaling slows convergence or prevents convergence altogether.
PassengerId Survived Pclass Age SibSp Parch Fare
count 891.000000 891.000000 891.000000 714.000000 891.000000 891.000000 891.000000
mean 446.000000 0.383838 2.308642 29.699118 0.523008 0.381594 32.204208
std 257.353842 0.486592 0.836071 14.526497 1.102743 0.806057 49.693429
min 1.000000 0.000000 1.000000 0.420000 0.000000 0.000000 0.000000
25% 223.500000 0.000000 2.000000 20.125000 0.000000 0.000000 7.910400
50% 446.000000 0.000000 3.000000 28.000000 0.000000 0.000000 14.454200
75% 668.500000 1.000000 3.000000 38.000000 1.000000 0.000000 31.000000
max 891.000000 1.000000 3.000000 80.000000 8.000000 6.000000 512.329200
Check the correlation coefficients between features.
Relatively speaking, the correlation of SibSp and Parch stands out among all the variables, so we can try dropping one of them, or summing the two into a new feature. As for pairwise correlation coefficients: 1 means the two variables are perfectly linearly related, 0 means no linear relationship, but a value of 0 or a small value does not prove the variables are unrelated (only that they are not linearly related).
view_heatmap()
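For reference, DataFrame.corr() computes the Pearson coefficient by default:
$$r_{XY} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2}}$$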
2. Data Cleaning
2.1 Handling missing values
As shown above, only 'Age', 'Cabin', and 'Embarked' have missing values; 'Cabin' retains only about 20% of its values, so it is dropped outright.
2.1.1 Filling missing Embarked values
print(Counter(data_train['Embarked']))
Counter({'S': 644, 'C': 168, 'Q': 77, nan: 2})
Since 'Q' has relatively few samples, the two missing values are simply filled with 'Q'.
data_train['Embarked'] = data_train['Embarked'].fillna('Q')
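A common alternative, not used here, is filling with the most frequent port instead; a minimal one-line sketch:
# Hypothetical alternative: fill with the mode ('S', the most common value)
data_train['Embarked'] = data_train['Embarked'].fillna(data_train['Embarked'].mode()[0])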
2.1.2 Filling missing Age values
Use view_age_group_survived() to inspect survival rates per age group (see the code section for details).
From the plot, the very young and the very old were more likely to survive. Position 0 holds the count and survival rate of passengers with missing ages.
Drawing one bar per individual age would be hard to read, so ages are bucketed into 10-year groups for plotting; this does not affect the analysis.
- After inspection, age can be discretized into bins of 10, 15, or 20 years; 10-year bins are used here. Missing (NaN) ages are predicted with a model, as in the function below.
# Feature engineering on Age: bucket ages into groups and predict the missing ones with a model
def age_feature_engineer():
data_train['age_group'] = data_train['Age'].apply(lambda x: int(x // 10 + 1) if pd.notnull(x) else x)
data_use_age = data_train.filter(regex='age_group|Survived|SibSp|Parch|Fare|Embarked_.*|Sex_.*|Pclass_.*|name_title_.*|Ticket_.*')
train_age_known = data_use_age.loc[data_use_age.age_group.notnull()]
predict_age_unknown = data_use_age.loc[data_use_age.age_group.isnull()]
x_train, age = train_age_known.drop('age_group', axis=1), train_age_known['age_group']
x_test = predict_age_unknown.drop('age_group', axis=1)
mode_predict_age = RandomForestClassifier(random_state=0, n_estimators=30, n_jobs=-1)
mode_predict_age.fit(x_train, age)
predict_ages = mode_predict_age.predict(x_test)
data_train.loc[data_use_age.age_group.isnull(), 'age_group'] = predict_ages
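A quick sanity check after the imputation (a hypothetical addition, not in the original script):
# Confirm that no age_group values remain missing after the model-based fill
assert data_train['age_group'].isnull().sum() == 0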
2.2 Grouping the string-typed features
2.2.1 Grouping Name
In Western names, the title roughly tells whether a person is a government officer, royalty, or a commoner, so we first extract the title.
# The title could also be extracted with: data_train['name_title'] = data_train['Name'].apply(lambda x: x.split(',')[1].split('.')[0].strip())
data_train['name_title'] = data_train['Name'].str.extract(r'.+,(.+)', expand=False).str.extract(r'^(.+?)\.', expand=False).str.strip()
data_train['name_title'].replace(['Capt', 'Col', 'Major', 'Dr', 'Rev'], 'Officer', inplace=True)
data_train['name_title'].replace(['Jonkheer', 'Don', 'Sir', 'the Countess', 'Dona', 'Lady'], 'Royalty' , inplace=True)
data_train['name_title'].replace(['Mme', 'Ms', 'Mrs'], 'Mrs', inplace=True)
data_train['name_title'].replace(['Mlle', 'Miss'], 'Miss', inplace=True)
data_train['name_title'].replace(['Mr'], 'Mr' , inplace=True)
data_train['name_title'].replace(['Master'], 'Master', inplace=True)
2.2.2 Grouping Ticket
Looking at the data, the first character of each Ticket value appears to indicate the ticket category.
# Ticket processing -> take the first character
data_train['Ticket'] = data_train['Ticket'].str[0]
2.3 One-hot encoding the categorical features
- After grouping the string features, the survival rate of each group can be inspected via the object_analysis() function; see the code section for details.
From the plots, title, sex, ticket prefix, and embarkation port each have categories where the survival probability is clearly higher, so all of these features are treated as important. The ticket prefixes could also be processed further by bucketing them into high, medium, and low survival classes (not done in this project; a sketch follows below).
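A hedged sketch of that optional bucketing (the 0.5/0.3 thresholds and the bucket_ticket helper are assumptions, not part of this project):
# Hypothetical: bucket ticket prefixes by their observed survival rate
ticket_rate = data_train.groupby('Ticket')['Survived'].mean()
def bucket_ticket(prefix):
    r = ticket_rate[prefix]
    return 'high' if r >= 0.5 else ('mid' if r >= 0.3 else 'low')
# data_train['Ticket_bucket'] = data_train['Ticket'].map(bucket_ticket)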
Encode the features identified above as important (one-hot encoding is chosen here).
# Mark the features that need one-hot encoding
need_one_hot_feature = ['name_title', 'Sex', 'Ticket', 'Embarked', 'age_group']
# Having confirmed that 'name_title', 'Sex', 'Ticket', 'Embarked' are important features, they need further processing (if the model in use
# is distance-based, one-hot encode; for tree models, ordinal encoding is best, since it saves memory). 'age_group' has few levels, so it is one-hot encoded too; skipping that should also work, judge by the experiments
def one_hot_hander(feature_list):
one_hot_feature = []
for feature_name in feature_list:
one_hot_feature.append(pd.get_dummies(data_train[feature_name], prefix= feature_name))
return one_hot_feature
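For tree models, the ordinal-encoding alternative mentioned in the comment above could look like this minimal sketch (ordinal_encode is a hypothetical helper, not part of this pipeline):
# Hypothetical: ordinal encoding stores one integer column per feature instead of one column per category
def ordinal_encode(feature_list):
    for feature_name in feature_list:
        # pd.factorize assigns codes 0..k-1 in order of appearance (NaN becomes -1)
        data_train[feature_name + '_code'] = pd.factorize(data_train[feature_name])[0]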
2.4 Feature expansion
The correlation analysis showed how SibSp and Parch relate, so the two can be summed into a new extended feature; whether it replaces SibSp and Parch is decided later by experiment.
data_train['SibSp_Parch'] = data_train['SibSp'] + data_train['Parch']
one_hot_feature = one_hot_hander(need_one_hot_feature)
# Feature expansion: append the one-hot encoded columns
data_train = pd.concat([data_train]+one_hot_feature, axis=1)
2.5 Feature scaling
From data_train.describe(), the values of 'Fare', 'Pclass', 'Parch', and 'SibSp_Parch' differ by factors of tens, so they are rescaled.
# Mark the features that need scaling
need_dimensionless_feature = ['Fare', 'Pclass', 'Parch', 'SibSp_Parch']
# Scale the marked features
def dimensionless_processing(feature):
for feature_name in feature:
scaler = Normalizer()
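        # Wrapping the column in a list makes it a single "sample", so Normalizer divides the whole column by its L2 norm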
data_train[feature_name] = (scaler.fit_transform([data_train[feature_name]])).T
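For distance-based models, standardization is the more conventional scaling choice; a minimal sketch that swaps in sklearn's StandardScaler for the Normalizer approach above (standardize_features is a hypothetical helper):
from sklearn.preprocessing import StandardScaler
# Hypothetical alternative: rescale each column to zero mean and unit variance
def standardize_features(feature_list):
    data_train[feature_list] = StandardScaler().fit_transform(data_train[feature_list])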
3. Model Prediction
Search for the best tree depth (see tree_depth_acc_relaption()).
From the results, an overly deep tree overfits; judging from the plot, any depth from 4 to 9 works.
The decision tree model uses depth 4; every other tree-based model uses depth 9.
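The same search could also be done with GridSearchCV; a minimal sketch, assuming the x_train and y built in the main block:
from sklearn.model_selection import GridSearchCV
# Hypothetical alternative to the manual depth loop: exhaustive 5-fold search over max_depth
search = GridSearchCV(DecisionTreeClassifier(), {'max_depth': list(range(3, 31))}, cv=5)
search.fit(x_train, y)
print(search.best_params_, search.best_score_)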
Judging by test-set accuracy and recall, the Voting and GBDT models perform best, but GBDT's prediction time is much lower than Voting's.
| | train_score | test_score | recall_score | predict_time |
| --- | --- | --- | --- | --- |
| KNeigh | 0.886035 | 0.798507 | 0.798507 | 0.022123 |
| Decisi | 0.834671 | 0.832090 | 0.832090 | 0.002992 |
| Random | 0.908507 | 0.835821 | 0.835821 | 0.109242 |
| Gradie | 0.869984 | 0.850746 | 0.850746 | 0.001995 |
| AdaBoo | 0.971108 | 0.828358 | 0.828358 | 0.010971 |
| XGB | 0.961477 | 0.839552 | 0.839552 | 0.008977 |
| Baggin | 0.950241 | 0.843284 | 0.843284 | 0.007978 |
| Voting | 0.939005 | 0.861940 | 0.861940 | 0.149600 |

(recall_score is computed with average='micro', which equals overall accuracy; that is why it matches test_score.)
Hand-written feature selection: find the best first-split feature for the Titanic prediction by information gain.
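The ranking uses the ID3-style information gain implemented by choose_bestfeature() below; in the code, hd is $H(D)$, had is $H(D\mid A)$, and gda is $g(D,A)$:
$$H(D) = -\sum_{k} p_k \log_2 p_k, \qquad H(D \mid A) = \sum_{i} \frac{|D_i|}{|D|} H(D_i), \qquad g(D, A) = H(D) - H(D \mid A)$$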
Feature ranking before one-hot encoding:
| | name_feature | gda |
| --- | --- | --- |
| 4 | Embarked | 0.021059 |
| 7 | age_group | 0.024532 |
| 5 | SibSp_Parch | 0.068934 |
| 0 | Pclass | 0.083831 |
| 2 | Ticket | 0.098996 |
| 1 | Sex | 0.217660 |
| 6 | name_title | 0.244520 |
| 3 | Fare | 0.437122 |
Feature ranking after one-hot encoding:
| | name_feature | gda |
| --- | --- | --- |
| 36 | age_group_2.0 | 0.00010281105861986717 |
| 33 | Embarked_Q | 0.00013288715824610886 |
| 28 | Ticket_L | 0.0002616832410812231 |
| 38 | age_group_4.0 | 0.0002962695821370209 |
| 26 | Ticket_C | 0.00032492277940243675 |
| 39 | age_group_5.0 | 0.0005036305907689664 |
| 41 | age_group_7.0 | 0.0005087157575672796 |
| 40 | age_group_6.0 | 0.0007113917322836283 |
| 12 | name_title_Officer | 0.0007392000119662567 |
| 13 | name_title_Royalty | 0.0007753039393163519 |
| 27 | Ticket_F | 0.0008203473878271028 |
| 30 | Ticket_S | 0.0009051823841961237 |
| 21 | Ticket_6 | 0.001100156767643301 |
| 19 | Ticket_4 | 0.0012842019086071188 |
| 24 | Ticket_9 | 0.0015518860521344102 |
| 23 | Ticket_8 | 0.0015704375758041067 |
| 20 | Ticket_5 | 0.0023573628344824016 |
| 31 | Ticket_W | 0.0027235933228008102 |
| 22 | Ticket_7 | 0.002763045593756619 |
| 37 | age_group_3.0 | 0.004660998930798521 |
| 43 | age_group_9.0 | 0.004664457204693551 |
| 17 | Ticket_2 | 0.00504810127099653 |
| 8 | name_title_Master | 0.005063132198816933 |
| 42 | age_group_8.0 | 0.005516519162683808 |
| 35 | age_group_1.0 | 0.010184239725661515 |
| 25 | Ticket_A | 0.01281719846510787 |
| 29 | Ticket_P | 0.015976162283304673 |
| 34 | Embarked_S | 0.01720377994142208 |
| 32 | Embarked_C | 0.019913005288076824 |
| 4 | Embarked | 0.02105850109994778 |
| 7 | age_group | 0.024532429830416258 |
| 18 | Ticket_3 | 0.03383132649296061 |
| 16 | Ticket_1 | 0.035267123106138 |
| 5 | SibSp_Parch | 0.06893376008169072 |
| 9 | name_title_Miss | 0.07846190141510256 |
| 0 | Pclass | 0.0838310452960116 |
| 11 | name_title_Mrs | 0.08530661666666117 |
| 2 | Ticket | 0.09899556877902549 |
| 14 | Sex_female | 0.2176601066606142 |
| 15 | Sex_male | 0.2176601066606142 |
| 1 | Sex | 0.2176601066606142 |
| 10 | name_title_Mr | 0.22628903828782787 |
| 6 | name_title | 0.24452024023612073 |
| 3 | Fare | 0.43712176756435306 |
def choose_bestfeature(data_train):
    # Rank features by ID3 information gain: g(D,A) = H(D) - H(D|A)
    n_category = set(data_train['Survived'])
    n_sample = len(data_train)
    hd = 0  # dataset entropy H(D)
    gdas = {'name_feature': [], 'gda': []}
    for cat in n_category:
        p0 = np.sum(data_train['Survived'] == cat) / n_sample
        hd -= p0 * np.log2(p0)
    for col_name in data_train.drop('Survived', axis=1).columns:
        # counts of each Survived value within each group of this feature
        t = data_train.groupby(data_train[col_name])['Survived'].value_counts()
        group_name = set(data_train[col_name])
        had = 0  # conditional entropy H(D|A)
        for name_index in group_name:
            group_sum_sample = t[name_index].sum()
            p1 = group_sum_sample / n_sample
            ha = 0  # entropy of the current group
            for category in n_category:
                if (name_index, category) not in t.index:
                    continue
                p2 = t[name_index][category] / group_sum_sample
                ha -= p2 * np.log2(p2)
            had += p1 * ha
        gda = hd - had  # information gain of this feature
        gdas['name_feature'].append(col_name)
        gdas['gda'].append(gda)
    print(hd)
    importance = pd.DataFrame(data=gdas).sort_values(by='gda')
    print(importance)
    # for i, j in zip(range(len(importance)), importance.index):
    #     print('|', j, "|", importance.iloc[i, 0], "|", importance.iloc[i, 1], "|")
Appendix
Code section
import pandas as pd
import numpy as np
from collections import Counter
from matplotlib import colors, pyplot as plt
import seaborn as sns
# from sklearn.feature_selection import SelectKBest
# from sklearn.feature_selection import chi2
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import Normalizer
from sklearn.model_selection import learning_curve
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_validate
from sklearn.model_selection import train_test_split
from sklearn import ensemble
from sklearn import model_selection
import xgboost as xgb
from time import time
from sklearn.metrics import precision_score, recall_score
# Load the data
data_train = pd.read_csv('3_FeatureENG_AutoML/preview_materials/a6_titanic/data/train.csv')
########################################### Data analysis ############################################
# print(data_train.info())
'''
From the data info we can see:
1. 'Age', 'Cabin', 'Embarked' have missing values
2. 'Name', 'Sex', 'Ticket', 'Cabin', 'Embarked' are of type 'object' and need to be encoded before they can enter the computation
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId 891 non-null int64
Survived 891 non-null int64
Pclass 891 non-null int64
Name 891 non-null object
Sex 891 non-null object
Age 714 non-null float64
SibSp 891 non-null int64
Parch 891 non-null int64
Ticket 891 non-null object
Fare 891 non-null float64
Cabin 204 non-null object
Embarked 889 non-null object
dtypes: float64(2), int64(5), object(5)
'''
# pd.set_option('display.max_rows',None)
# print(data_train.describe())
'''
From these statistics, features such as age and fare are on very different scales;
for distance-based models, skipping rescaling slows convergence or prevents convergence altogether.
PassengerId Survived Pclass Age SibSp Parch Fare
count 891.000000 891.000000 891.000000 714.000000 891.000000 891.000000 891.000000
mean 446.000000 0.383838 2.308642 29.699118 0.523008 0.381594 32.204208
std 257.353842 0.486592 0.836071 14.526497 1.102743 0.806057 49.693429
min 1.000000 0.000000 1.000000 0.420000 0.000000 0.000000 0.000000
25% 223.500000 0.000000 2.000000 20.125000 0.000000 0.000000 7.910400
50% 446.000000 0.000000 3.000000 28.000000 0.000000 0.000000 14.454200
75% 668.500000 1.000000 3.000000 38.000000 1.000000 0.000000 31.000000
max 891.000000 1.000000 3.000000 80.000000 8.000000 6.000000 512.329200
'''
# Plot the pairwise correlation coefficients between variables as a heatmap
def view_heatmap():
train_corr = data_train.drop('PassengerId', axis=1).corr()
print(train_corr)
plt.figure(figsize=(8, 6))
fig = sns.heatmap(train_corr, vmin=-1, vmax=1, annot=True)
ax = plt.gca()
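    # Limit the axes (likely a workaround for the heatmap clipping bug in matplotlib 3.1.1)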
ax.set_xlim(0,5)
ax.set_ylim(0,5)
plt.title("Correlation between variables")
plt.savefig('5_Tree/task/picture/heatmap.jpg')
plt.show()
'''Relatively speaking, the correlation of SibSp and Parch stands out among all the variables, so we can try dropping one, or summing the two into a new feature.
On pairwise correlation coefficients: 1 means perfectly linearly related, 0 means no linear relationship, but a value of 0 or a small value does not prove the variables are unrelated'''
# view_heatmap()
################################################# Data cleaning #####################################
# Cabin has only ~20% of its values left, so consider dropping it. Treating NaN as its own category is another option. (Dropped here.)
# Based on the correlation coefficients, the two columns can be combined into one extended feature
data_train['SibSp_Parch'] = data_train['SibSp'] + data_train['Parch']
# Fill the missing Embarked values
# print(Counter(data_train['Embarked']))
'''Counter({'S': 644, 'C': 168, 'Q': 77, nan: 2})
Since 'Q' has relatively few samples, the two missing values are simply filled with 'Q' '''
data_train['Embarked'] = data_train['Embarked'].fillna('Q')
# Processing 'Name'
# The title could also be extracted with: data_train['name_title'] = data_train['Name'].apply(lambda x: x.split(',')[1].split('.')[0].strip())
data_train['name_title'] = data_train['Name'].str.extract(r'.+,(.+)', expand=False).str.extract(r'^(.+?)\.', expand=False).str.strip()
data_train['name_title'].replace(['Capt', 'Col', 'Major', 'Dr', 'Rev'], 'Officer', inplace=True)
data_train['name_title'].replace(['Jonkheer', 'Don', 'Sir', 'the Countess', 'Dona', 'Lady'], 'Royalty' , inplace=True)
data_train['name_title'].replace(['Mme', 'Ms', 'Mrs'], 'Mrs', inplace=True)
data_train['name_title'].replace(['Mlle', 'Miss'], 'Miss', inplace=True)
data_train['name_title'].replace(['Mr'], 'Mr' , inplace=True)
data_train['name_title'].replace(['Master'], 'Master', inplace=True)
# Ticket processing -> take the first character
data_train['Ticket'] = data_train['Ticket'].str[0]
object_feature = ['name_title', 'Sex', 'Ticket', 'Embarked']
# Display a value or text label at each plotted point
def plt_text(X, Y, fmt='%d', pos_adjust=0, color='black', rotation=0):
for x, y in zip(X, Y):
plt.text(x, y + pos_adjust, fmt % y , ha='center', va= 'bottom',fontsize=9, color=color, rotation=rotation)
# Plot the survival rate for each category of the 'object' features
def object_analysis(feature_list):
fig = plt.figure(figsize=(10, 8))
fig.subplots_adjust(wspace=0.25,hspace=0.15, bottom=0.05, top=0.9)
bar_width = 0.2
fig_number = 221
for feature_name in feature_list:
plt.subplot(fig_number)
index_values = data_train[feature_name].value_counts()
x_label, values = index_values.index, index_values.values
x_index = np.arange(len(x_label))
plt.bar(x=x_index-bar_width, height=values, width=bar_width*2, tick_label=x_label)
plt_text(x_index-bar_width, values, pos_adjust=5, rotation=90)
plt.xticks(rotation=360)
plt.twinx()
Survived_rate = data_train.groupby(data_train[feature_name])['Survived'].mean()
plt.bar(x_index+bar_width, Survived_rate[x_label], width=bar_width*2, color='r')
plt_text(x_index+bar_width, Survived_rate[x_label], fmt='%.2f', pos_adjust=0.01, rotation=90)
fig_number += 1
plt.suptitle("The picture on the left is category number, \nand on the right is the Survived rate of various categories", color='r')
plt.savefig('5_Tree/task/picture/survived_rate_for_categories.jpg')
plt.show()
'''From the plots, 'name_title', 'Sex', 'Ticket', and 'Embarked' all look important, since the survival rates clearly differ across their categories'''
# object_analysis(object_feature)
# Inspect survival by age group; missing ages are marked 0 for now
def view_age_group_survived():
data_train['age_group'] = data_train['Age'].apply(lambda x: x // 10 + 1 if pd.notnull(x) else 0)
group_age_msg = data_train['age_group'].value_counts().sort_index()
group_index, group_values = group_age_msg.index ,group_age_msg.values
group_age_msg.plot(kind='bar')
plt_text(group_index, group_values, pos_adjust=5)
plt.twinx()
groups_Survived = data_train.groupby(data_train['age_group'])['Survived'].mean()
groups_Survived.plot(kind='line', color='r')
plt_text(group_index, groups_Survived, fmt='%.4f', pos_adjust=-0.03)
plt.savefig('5_Tree/task/picture/survived_rate_for_age.jpg')
plt.show()
'''The plot shows children and the elderly were more likely to survive, with little difference across the middle ages, so age is discretized.
Bins of 10, 15, or 20 years could be used; 10-year bins are chosen here, and a model predicts which bin the NaN ages fall into'''
# view_age_group_survived()
# Feature engineering on Age: bucket ages into groups and predict the missing ones with a model
def age_feature_engineer():
data_train['age_group'] = data_train['Age'].apply(lambda x: int(x // 10 + 1) if pd.notnull(x) else x)
data_use_age = data_train.filter(regex='age_group|Survived|SibSp|Parch|Fare|Embarked_.*|Sex_.*|Pclass_.*|name_title_.*|Ticket_.*')
train_age_known = data_use_age.loc[data_use_age.age_group.notnull()]
predict_age_unknown = data_use_age.loc[data_use_age.age_group.isnull()]
x_train, age = train_age_known.drop('age_group', axis=1), train_age_known['age_group']
x_test = predict_age_unknown.drop('age_group', axis=1)
mode_predict_age = RandomForestClassifier(random_state=0, n_estimators=30, n_jobs=-1)
mode_predict_age.fit(x_train, age)
predict_ages = mode_predict_age.predict(x_test)
    '''Why does filling via pd.Series(predict_ages) go wrong? fillna requires a pd.Series because it aligns values by index,
    and pd.Series(predict_ages) gets a fresh 0..n-1 index that does not match the labels of the NaN rows, so the values land in the wrong places.
    The .loc assignment below works without pd.Series because it assigns the raw array positionally: data_train.loc[data_use_age.age_group.isnull(), 'age_group'] = predict_ages '''
# predict_ages = pd.Series(predict_ages)
# data_train['age_group'].fillna(predict_ages, inplace=True)
# print(len(predict_ages))
data_train.loc[data_use_age.age_group.isnull(), 'age_group'] = predict_ages
# print(data_train['age_group'])
# print(data_train['age_group'].isnull().value_counts())
# Mark the features that need one-hot encoding
need_one_hot_feature = ['name_title', 'Sex', 'Ticket', 'Embarked', 'age_group']
# Having confirmed that 'name_title', 'Sex', 'Ticket', 'Embarked' are important features, they need further processing (if the model in use
# is distance-based, one-hot encode; for tree models, ordinal encoding is best, since it saves memory). 'age_group' has few levels, so it is one-hot encoded too; skipping that should also work, judge by the experiments
def one_hot_hander(feature_list):
one_hot_feature = []
for feature_name in feature_list:
one_hot_feature.append(pd.get_dummies(data_train[feature_name], prefix= feature_name))
return one_hot_feature
# Mark the features that need scaling
need_dimensionless_feature = ['Fare', 'Pclass', 'Parch', 'SibSp_Parch']
# Scale the marked features
def dimensionless_processing(feature):
for feature_name in feature:
scaler = Normalizer()
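        # Wrapping the column in a list makes it a single "sample", so Normalizer divides the whole column by its L2 norm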
data_train[feature_name] = (scaler.fit_transform([data_train[feature_name]])).T
# Use sklearn's learning_curve to get training_score and cv_score, then plot the learning curve with matplotlib
def plot_learning_curve(estimator, title, X, y, fig_position=None, ylim=None, cv=5, n_jobs=1,
                        train_sizes=np.linspace(0.6, 1, 5), verbose=0, plot=True):
    train_sizes, train_scores, test_scores = learning_curve(
        estimator, X, y, cv=cv, n_jobs=n_jobs, train_sizes=train_sizes, verbose=verbose)
train_scores_mean = np.mean(train_scores, axis=1)
train_scores_std = np.std(train_scores, axis=1)
test_scores_mean = np.mean(test_scores, axis=1)
test_scores_std = np.std(test_scores, axis=1)
    # plt.rcParams['font.sans-serif'] = ['SimHei']  # to render Chinese labels correctly
    # plt.rcParams['axes.unicode_minus'] = False  # to render the minus sign correctly
if plot:
# plt.figure()
        if fig_position is not None:
plt.subplot(fig_position)
plt.title(title)
if ylim is not None:
plt.ylim(*ylim)
plt.xlabel(u"训练样本数training samples num")
plt.ylabel(u"得分score")
# plt.gca().invert_yaxis()
plt.grid()
plt.fill_between(train_sizes, train_scores_mean - train_scores_std, train_scores_mean + train_scores_std,
alpha=0.1, color="b")
plt.fill_between(train_sizes, test_scores_mean - test_scores_std, test_scores_mean + test_scores_std,
alpha=0.1, color="r")
        plt.plot(train_sizes, train_scores_mean, 'o-', color="b", label=u"score over train data set")
        plt.plot(train_sizes, test_scores_mean, 'o-', color="r", label=u"score over test data set")
# plt.legend(loc="best")
# plt.draw()
# plt.gca().invert_yaxis()
# plt.show()
midpoint = ((train_scores_mean[-1] + train_scores_std[-1]) + (test_scores_mean[-1] - test_scores_std[-1])) / 2
diff = (train_scores_mean[-1] + train_scores_std[-1]) - (test_scores_mean[-1] - test_scores_std[-1])
return midpoint, diff
def search_k():
k = np.arange(1, 15)
train_score = []
test_score = []
for i in k:
mode = KNeighborsClassifier(n_neighbors=i)
ret = cross_validate(mode, x_train, y, cv=5, return_train_score=True)
train_score.append(ret['train_score'].mean())
test_score.append(ret['test_score'].mean())
plt.plot(k, np.array(train_score), label='train_score')
plt.plot(k, np.array(test_score), label='test_score')
plt.xticks(k)
plt.xlabel("n_neighbors")
plt.ylabel('score')
plt.legend()
plt.grid(True)
# plt.savefig('5_Tree/task/picture/k_score.jpg')
plt.show()
# Examine how tree depth relates to accuracy
def tree_depth_acc_relaption():
mode_3 = DecisionTreeClassifier()
train_score, test_score = [], []
tree_depth = range(3, 31, 1)
for i in tree_depth:
mode_3.set_params(max_depth=i)
score = cross_validate(mode_3, x_train, y, cv=5, return_train_score=True)
train_score.append(score['train_score'].mean())
test_score.append(score['test_score'].mean())
plt.plot(tree_depth, train_score, label=u"score over train data set")
plt.plot(tree_depth, test_score, label=u"score over test data set")
plt.xlabel('tree depth')
plt.ylabel('score')
plt.xticks(tree_depth[::2])
plt.legend()
plt.grid()
plt.savefig('5_Tree/task/picture/tree_depth_score.jpg')
plt.show()
def varints_mode_train(x_train, y):
knn = KNeighborsClassifier(n_neighbors=3, leaf_size=50)
DT = DecisionTreeClassifier(max_depth=4)
RF = RandomForestClassifier(random_state=0, n_estimators=30, max_depth=9, n_jobs=-1)
GBDT = ensemble.GradientBoostingClassifier(n_estimators=50, max_features=9)
adaboosk = ensemble.AdaBoostClassifier(DecisionTreeClassifier(max_depth=9), n_estimators=50)
xgboost = xgb.XGBClassifier()
bagging = ensemble.BaggingClassifier(base_estimator=DecisionTreeClassifier(max_depth=9), n_estimators=50) # random_state=7
estimators = [knn, DT, RF, GBDT, adaboosk, xgboost]
voting_estimators = list(zip(['knn', 'DT', 'RF', 'GBDT', 'adaboosk', 'xgboost'], estimators))
voting = ensemble.VotingClassifier(voting_estimators, voting='soft')
estimators.extend([bagging, voting])
df= pd.DataFrame()
x_train, X_test, y_train, y_test = train_test_split(x_train, y, test_size=0.3, stratify=y, random_state=7)
for clf in estimators:
# score = cross_validate(clf, x_train, y, cv=5, return_train_score=True, return_estimator=True)
clf.fit(x_train, y_train)
start = time()
y_predict = clf.predict(X_test)
end = time()
mode_name = clf.__class__.__name__.replace('Classifier', '')[:6]
df.loc[mode_name, 'train_score'] = np.mean(clf.predict(x_train)==y_train)
df.loc[mode_name, 'test_score'] = np.mean(y_predict==y_test)
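        # recall with average='micro' equals overall accuracy here, which is why it matches test_score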
df.loc[mode_name, 'recall_score'] = recall_score(y_test, y_predict, average='micro')
df.loc[mode_name, 'predict_time'] = end - start
print(df)
    plt.rcParams['font.sans-serif'] = ['SimHei']  # to render Chinese labels correctly
    plt.rcParams['axes.unicode_minus'] = False  # to render the minus sign correctly
ax = df.plot(kind='line', secondary_y=['predict_time'])
plt.xticks(rotation=90)
    # plt.grid(axis='x')  # grid lines parallel to the y-axis don't show up, not sure why
    # plt.xlabel('mode name')  # this doesn't show, not sure why
ax.set_xlabel('mode name', color='r')
ax.set_ylabel('accuracy', color='r')
ax.right_ax.set_ylabel('predict time', color='r')
# plt.savefig('5_Tree/task/picture/result_variants_mode.jpg')
plt.show()
def choose_bestfeature(data_train):
    # Rank features by ID3 information gain: g(D,A) = H(D) - H(D|A)
    n_category = set(data_train['Survived'])
    n_sample = len(data_train)
    hd = 0  # dataset entropy H(D)
    gdas = {'name_feature': [], 'gda': []}
    for cat in n_category:
        p0 = np.sum(data_train['Survived'] == cat) / n_sample
        hd -= p0 * np.log2(p0)
    for col_name in data_train.drop('Survived', axis=1).columns:
        # counts of each Survived value within each group of this feature
        t = data_train.groupby(data_train[col_name])['Survived'].value_counts()
        group_name = set(data_train[col_name])
        had = 0  # conditional entropy H(D|A)
        for name_index in group_name:
            group_sum_sample = t[name_index].sum()
            p1 = group_sum_sample / n_sample
            ha = 0  # entropy of the current group
            for category in n_category:
                if (name_index, category) not in t.index:
                    continue
                p2 = t[name_index][category] / group_sum_sample
                ha -= p2 * np.log2(p2)
            had += p1 * ha
        gda = hd - had  # information gain of this feature
        gdas['name_feature'].append(col_name)
        gdas['gda'].append(gda)
    print(hd)
    importance = pd.DataFrame(data=gdas).sort_values(by='gda')
    print(importance)
    # for i, j in zip(range(len(importance)), importance.index):
    #     print('|', j, "|", importance.iloc[i, 0], "|", importance.iloc[i, 1], "|")
if __name__ == '__main__':
age_feature_engineer()
one_hot_feature = one_hot_hander(need_one_hot_feature)
    # Feature expansion: append the one-hot encoded columns
data_train = pd.concat([data_train]+one_hot_feature, axis=1)
    # The scaled columns are updated in place inside the function
dimensionless_processing(need_dimensionless_feature)
train = data_train.filter(regex='Survived|SibSp_Parch|Fare|Embarked_.*|Sex_.*|Pclass_.*|name_title_.*|Ticket_.*|age_group_.*')
x_train, y = train.drop('Survived', axis=1), train['Survived']
    # Search for a good number of nearest neighbors
# search_k()
    # From the results, an overly deep tree overfits; judging from the plot, any depth from 4 to 9 works
# tree_depth_acc_relaption()
    # Check each model's accuracy on the train and test sets, plus prediction time and recall
# varints_mode_train(x_train, y)
# result
# train_score test_score recall_score predict_time
# KNeigh 0.886035 0.798507 1.0 0.047006
# Decisi 0.834671 0.824627 1.0 0.001995
# Random 0.908507 0.835821 1.0 0.124480
# Gradie 0.860353 0.850746 1.0 0.001995
# AdaBoo 0.971108 0.843284 1.0 0.012960
# XGB 0.961477 0.839552 1.0 0.008976
# Baggin 0.942215 0.843284 1.0 0.008944
# Voting 0.939005 0.861940 1.0 0.170543
# data_train = data_train.filter(regex='Survived|^Embarked$|^Sex$|^Pclass$|^name_title$|^age_group$|^Ticket$|Fare|SibSp_Parch')
data_train = data_train.filter(regex='Survived|Embarked|Sex|Pclass|name_title|age_group|Ticket|Fare|SibSp_Parch')
choose_bestfeature(data_train)