Titanic


Note the entrance to the first-class cabins here.

The captain ordered women and children into the lifeboats first, but the women and children who boarded first were mostly from first class.


Simple conclusions: 1. First-class passengers had a higher chance of survival. 2. Women and children had a higher chance of survival. 3. Both factors were man-made.

1. Data Exploration

First, import the data:

import pandas as pd
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')


train_df = pd.read_csv(r'C:\Users\dahuo\Desktop\数据集2\train.csv')
print(train_df.columns)
PassengerId => passenger ID
Pclass => ticket class (1st/2nd/3rd)
Name => passenger name
Sex => sex
Age => age
SibSp => number of siblings/spouses aboard
Parch => number of parents/children aboard
Ticket => ticket number
Fare => fare
Cabin => cabin number
Embarked => port of embarkation

The features divide into continuous and discrete values:

  • Discrete: sex (male/female), port of embarkation (S, Q, C), ticket class (1, 2, 3)

  • Continuous: age, fare
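One quick way to sanity-check this split is to count the distinct values per column. A minimal sketch on a hypothetical four-row sample (not the real training set):

```python
import pandas as pd

# Hypothetical mini-sample with the same column types as the Titanic data
df = pd.DataFrame({
    'Sex': ['male', 'female', 'female', 'male'],
    'Pclass': [3, 1, 3, 2],
    'Embarked': ['S', 'C', 'Q', 'S'],
    'Age': [22.0, 38.0, 26.0, 35.0],
    'Fare': [7.25, 71.28, 7.92, 53.1],
})

# Heuristic: columns with only a handful of distinct values are discrete
discrete = [c for c in df.columns if df[c].nunique() <= 3]
continuous = [c for c in df.columns if c not in discrete]
print(discrete)     # ['Sex', 'Pclass', 'Embarked']
print(continuous)   # ['Age', 'Fare']
```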

Next, let's look at overall survival:

# Overall survival proportion
fig,axes = plt.subplots(1,2,figsize=(18,8))
train_df['Survived'].value_counts().plot.pie(explode=[0,0.1],autopct='%1.1f%%',ax=axes[0],shadow=True)
axes[0].set_title('Survived')
axes[0].set_ylabel(' ')

sns.countplot('Survived',data=train_df,ax=axes[1])
axes[1].set_title('Survived')

This is the training set only.

Exploring discrete features
Single features
Only 38.4% of the passengers aboard survived. Next, let's examine how each attribute relates to survival.

# Sex vs. survival rate
fig,axes = plt.subplots(1,2,figsize=(18,8))
train_df[['Sex','Survived']].groupby(['Sex']).mean().plot.bar(ax=axes[0])
axes[0].set_title('Survived vs Sex')

sns.countplot('Sex',hue='Survived',data=train_df,ax=axes[1])
axes[1].set_title('Sex:Survived vs Dead')


The left plot shows that about 75% of women were rescued, versus under 20% of men. The right plot shows that there were far more men than women aboard, yet their survival rate was much lower. This is a highly discriminative feature and must be used.
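The exact rates behind those bars come from a simple groupby mean (Survived is 0/1, so the mean is the survival rate). A toy sketch on made-up rows, not the real training set:

```python
import pandas as pd

# Made-up rows chosen so the rates echo the ~75% vs <20% pattern above
df = pd.DataFrame({
    'Sex': ['female'] * 4 + ['male'] * 6,
    'Survived': [1, 1, 1, 0, 0, 0, 0, 0, 1, 0],
})
rates = df.groupby('Sex')['Survived'].mean()
print(rates['female'])  # 0.75
print(rates['male'])    # ~0.17
```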

Next, let's look at ticket class vs. survival rate.

# Ticket class vs. survival (the Styler renders in a notebook, so print() is unnecessary)
pd.crosstab(train_df['Pclass'],train_df['Survived'],margins=True).style.background_gradient(cmap='summer_r')

# Ticket class vs. survival rate
fig,axes = plt.subplots(1,2,figsize=(18,8))
train_df['Pclass'].value_counts().plot.bar(ax=axes[0])
axes[0].set_title('Number Of Passengers By Pclass')
axes[0].set_ylabel('Count')

sns.countplot('Pclass',hue='Survived',data=train_df,ax=axes[1])
axes[1].set_title('Pclass:Survived vs Dead')

The left plot shows that third class had the most passengers, yet the right plot shows their survival rate was the lowest; first and second class survived at much higher rates. Passengers in classes 1 and 2 were of higher status and wealthier, so they had a better chance of being rescued.

# Port of embarkation vs. survival rate
fig,axes = plt.subplots(1,2,figsize=(18,8))
train_df['Embarked'].value_counts().plot.bar(ax=axes[0])
axes[0].set_title('Number Of Passengers By Embarked')
axes[0].set_ylabel('Count')

sns.countplot('Embarked',hue='Survived',data=train_df,ax=axes[1])
axes[1].set_title('Embarked:Survived vs Dead')

S and Q show no clear difference, and S had by far the most passengers. Port C stands out: it had the fewest passengers, yet more than 50% of them were rescued. Keep this feature in mind.
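A crosstab with normalize='index' turns those counts into per-port survival rates directly; a sketch on toy rows:

```python
import pandas as pd

df = pd.DataFrame({
    'Embarked': ['S', 'S', 'S', 'S', 'C', 'C', 'Q', 'Q'],
    'Survived': [0, 0, 0, 1, 1, 1, 0, 1],
})
# Each row of the result sums to 1: the death/survival share per port
rate = pd.crosstab(df['Embarked'], df['Survived'], normalize='index')
print(rate.loc['C', 1])  # 1.0 in this toy sample
print(rate.loc['S', 1])  # 0.25
```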

Now let's look at ticket class and sex together against survival:

sns.barplot('Pclass','Survived',hue='Sex',data=train_df)


Survival in class 1 is high, consistent with the earlier analysis; males in class 3 fare worst, at only about 15%.

Port of embarkation, ticket class, and survival rate:

sns.factorplot('Pclass','Survived',hue='Sex',col='Embarked',data=train_df)
plt.show()


Exploring continuous features

# Age
print('Oldest Passenger was of:',train_df['Age'].max(),'Years')
print('Youngest Passenger was of:',train_df['Age'].min(),'Years')
print('Average Age on the ship:',train_df['Age'].mean(),'Years')
Oldest Passenger was of: 80.0 Years
Youngest Passenger was of: 0.42 Years
Average Age on the ship: 29.69911764705882 Years
# Visualize how age, ticket class, and sex relate to survival
fig,axes = plt.subplots(1,3,figsize=(18,8))
sns.violinplot('Pclass','Age',hue='Survived',data=train_df,split=True,ax=axes[0])
axes[0].set_title('Pclass and Age vs Survived')
axes[0].set_yticks(range(0,110,10))

sns.violinplot("Sex","Age", hue="Survived", data=train_df,split=True,ax=axes[1])
axes[1].set_title('Sex and Age vs Survived')
axes[1].set_yticks(range(0,110,10))

sns.violinplot("Embarked","Age", hue="Survived", data=train_df,split=True,ax=axes[2])
axes[2].set_title('Embarked and Age vs Survived')
axes[2].set_yticks(range(0,110,10))

Conclusions:

  • Children under 10 survived at good rates across all classes.
  • Passengers aged 20-50 had better odds of rescue.
  • At port Q, passengers over 45 or under 5 did not survive.

Next, fill in the missing ages. The Initial column holds each passenger's title, extracted from Name:

# Extract titles from the names and merge the rare ones
train_df['Initial'] = train_df.Name.str.extract('([A-Za-z]+)\.')
train_df['Initial'].replace(['Mlle','Mme','Ms','Dr','Major','Lady','Countess','Jonkheer','Col','Rev','Capt','Sir','Don'],['Miss','Miss','Miss','Mr','Mr','Mrs','Mrs','Other','Other','Other','Mr','Mr','Mr'],inplace=True)
# Mean age per title group
train_df.groupby('Initial')['Age'].mean()
Initial
Master     4.574167
Miss      21.860000
Mr        32.739609
Mrs       35.981818
Other     45.888889
Name: Age, dtype: float64
# Fill missing ages with each group's (rounded) mean age
train_df.loc[(train_df.Age.isnull())&(train_df.Initial=='Mr'),'Age']=33
train_df.loc[(train_df.Age.isnull())&(train_df.Initial=='Mrs'),'Age']=36
train_df.loc[(train_df.Age.isnull())&(train_df.Initial=='Master'),'Age']=5
train_df.loc[(train_df.Age.isnull())&(train_df.Initial=='Miss'),'Age']=22
train_df.loc[(train_df.Age.isnull())&(train_df.Initial=='Other'),'Age']=46
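The five hard-coded assignments above can be collapsed into one line with a grouped transform, which fills each missing age with its own title group's mean. A sketch on toy data:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'Initial': ['Mr', 'Mr', 'Miss', 'Miss', 'Mr'],
    'Age': [30.0, np.nan, 20.0, np.nan, 36.0],
})
# Fill each missing Age with the mean Age of its title group
df['Age'] = df['Age'].fillna(df.groupby('Initial')['Age'].transform('mean'))
print(df['Age'].tolist())  # [30.0, 33.0, 20.0, 20.0, 36.0]
```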
# Visualize the age distributions of victims and survivors
fig,axes = plt.subplots(1,2,figsize=(20,10))
train_df[train_df['Survived']==0].Age.plot.hist(ax=axes[0],bins=20,edgecolor='black',color='red')
axes[0].set_title('Survived= 0')
x1 = list(range(0,85,5))
axes[0].set_xticks(x1)

train_df[train_df['Survived']==1].Age.plot.hist(ax=axes[1],color='green',bins=20,edgecolor='black')
axes[1].set_title('Survived= 1')
x2 = list(range(0,85,5))
axes[1].set_xticks(x2)


sns.factorplot('Pclass','Survived',col='Initial',data=train_df)


# Fill missing Embarked values with the mode, which is port S
train_df['Embarked'].fillna('S',inplace=True)
# Crosstab of sibling/spouse count vs. survival
pd.crosstab(train_df['SibSp'],train_df['Survived']).style.background_gradient(cmap='summer_r')
# Plot sibling/spouse count vs. survival
fig,axes = plt.subplots(1,2,figsize=(20,8))
sns.barplot('SibSp','Survived',data=train_df,ax=axes[0])
axes[0].set_title('SibSp vs Survived')

sns.factorplot('SibSp','Survived',data=train_df,ax=axes[1])
axes[1].set_title('SibSp vs Survived')

plt.close(2)


# Sibling/spouse count vs. ticket class
pd.crosstab(train_df['SibSp'],train_df['Pclass']).style.background_gradient(cmap='summer_r')
# Parent/child count vs. ticket class
pd.crosstab(train_df['Parch'],train_df['Pclass']).style.background_gradient(cmap='summer_r')
# Visualize parent/child count vs. survival
f,ax = plt.subplots(1,2,figsize=(20,8))
sns.barplot('Parch','Survived',data=train_df,ax=ax[0])
ax[0].set_title('Parch vs Survived')
sns.factorplot('Parch','Survived',data=train_df,ax=ax[1])
ax[1].set_title('Parch vs Survived')
plt.close(2)
plt.show()


# Fare
print('Highest Fare was:',train_df['Fare'].max())
print('Lowest Fare was:',train_df['Fare'].min())
print('Average Fare was:',train_df['Fare'].mean())
Highest Fare was: 512.3292
Lowest Fare was: 0.0
Average Fare was: 32.204207968574636
# Visualize the fare distribution within each class
f,ax = plt.subplots(1,3,figsize=(20,8))
sns.distplot(train_df[train_df['Pclass']==1].Fare,ax=ax[0])
ax[0].set_title('Fares in Pclass 1')
sns.distplot(train_df[train_df['Pclass']==2].Fare,ax=ax[1])
ax[1].set_title('Fares in Pclass 2')
sns.distplot(train_df[train_df['Pclass']==3].Fare,ax=ax[2])
ax[2].set_title('Fares in Pclass 3')


Summary:
  • Sex: women had a far higher chance of survival than men.
  • Pclass: first class survived at a high rate; unfortunately, survival in class 3 was very low.
  • Age: children under 5-10 survived at high rates; many passengers aged 15-35 died.
  • Embarked: survival also varied with the boarding port, with C faring best.
  • Family: having 1-3 siblings, spouses, parents, or children aboard gave a better chance of survival than travelling alone or with a very large family.

2. Data Cleaning

import pandas as pd
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
train_df = pd.read_csv(r'C:\Users\dahuo\Desktop\数据集2\train.csv')
print(train_df.columns)
# Heatmap of the correlations among the existing numeric features
# (numeric_only=True because the raw frame still contains string columns)
sns.heatmap(train_df.corr(numeric_only=True),annot=True,cmap='RdYlGn',linewidths=0.2)


The heatmap shows no strongly correlated feature pairs.

Filling missing values
Filling the missing ages

train_df['Initial'] = train_df.Name.str.extract('([A-Za-z]+)\.') 
train_df['Initial'].replace(['Mlle','Mme','Ms','Dr','Major','Lady','Countess','Jonkheer','Col','Rev','Capt','Sir','Don'],['Miss','Miss','Miss','Mr','Mr','Mrs','Mrs','Other','Other','Other','Mr','Mr','Mr'],inplace=True)
# Mean age per title group
train_df.groupby('Initial')['Age'].mean()
# Fill missing ages with each group's (rounded) mean age
train_df.loc[(train_df.Age.isnull())&(train_df.Initial=='Mr'),'Age']=33
train_df.loc[(train_df.Age.isnull())&(train_df.Initial=='Mrs'),'Age']=36
train_df.loc[(train_df.Age.isnull())&(train_df.Initial=='Master'),'Age']=5
train_df.loc[(train_df.Age.isnull())&(train_df.Initial=='Miss'),'Age']=22
train_df.loc[(train_df.Age.isnull())&(train_df.Initial=='Other'),'Age']=46

Filling the missing embarkation ports

# Use the mode, S
train_df.Embarked.fillna('S',inplace=True)
train_df.Embarked.isnull().any()
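Rather than hard-coding 'S', the mode can be read off the column itself; Series.mode() returns a Series, so take its first element. A sketch on a toy column:

```python
import pandas as pd
import numpy as np

s = pd.Series(['S', 'C', 'S', 'Q', np.nan, 'S'])
filled = s.fillna(s.mode()[0])   # mode() returns a Series; [0] is the most frequent value
print(filled.isnull().any())     # False
```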

Feature engineering

# Discretize the ages
# Split the age distribution into 5 bands of width 16
train_df.loc[train_df['Age']<=16,'Age_band'] = 0
train_df.loc[(train_df['Age']>16)&(train_df['Age']<=32),'Age_band'] = 1
train_df.loc[(train_df['Age']>32)&(train_df['Age']<=48),'Age_band'] = 2
train_df.loc[(train_df['Age']>48)&(train_df['Age']<=64),'Age_band'] = 3
train_df.loc[train_df['Age']>64,'Age_band'] = 4
# Total number of family members aboard
train_df['Family_num'] = train_df['Parch'] + train_df['SibSp']
# Whether the passenger travelled alone
train_df['Alone'] = 0
train_df.loc[train_df['Family_num']==0,'Alone'] = 1
fig,ax = plt.subplots(1,2,figsize=(18,6))
sns.factorplot('Family_num','Survived',data=train_df,ax=ax[0])
ax[0].set_title('Family_Size vs Survived')

sns.factorplot('Alone','Survived',data=train_df,ax=ax[1])
ax[1].set_title('Alone vs Survived')

plt.close(2)
plt.close(3)
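The five Age_band rules above are just fixed-width binning, which pd.cut expresses in one call (bin edges matching the manual thresholds; 120 is an assumed upper cap):

```python
import pandas as pd

ages = pd.Series([5, 17, 40, 50, 70])
# Same bands as the manual rules: (0,16], (16,32], (32,48], (48,64], (64,120]
band = pd.cut(ages, bins=[0, 16, 32, 48, 64, 120], labels=[0, 1, 2, 3, 4])
print(band.tolist())  # [0, 1, 2, 3, 4]
```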

The plots show that passengers travelling alone had only about a 30% chance of survival, far below those with relatives aboard; survival also starts to drop once the number of relatives reaches 4 or more.

sns.factorplot('Alone','Survived',data=train_df,hue='Sex',col='Pclass')


# Discretize the fares
train_df['Fare_range'] = pd.qcut(train_df['Fare'],4)
train_df.groupby(['Fare_range'])['Survived'].mean().to_frame()
train_df.loc[train_df['Fare']<=7.91,'Fare_cat'] = 0
train_df.loc[(train_df['Fare']>7.91)&(train_df['Fare']<=14.454),'Fare_cat']=1
train_df.loc[(train_df['Fare']>14.454)&(train_df['Fare']<=31),'Fare_cat']=2
train_df.loc[(train_df['Fare']>31)&(train_df['Fare']<=513),'Fare_cat']=3
sns.lineplot('Fare_cat','Survived',hue='Sex',data=train_df)
plt.show()
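The thresholds 7.91, 14.454, and 31 used above are exactly the quartile edges that pd.qcut computes, so Fare_cat can also be read straight from the category codes. A toy sketch:

```python
import pandas as pd

fares = pd.Series([0.0, 5.0, 8.0, 10.0, 15.0, 30.0, 40.0, 512.0])
fare_range = pd.qcut(fares, 4)          # 4 equal-frequency bins
# The category codes (0-3) play the role of the manual Fare_cat
print(fare_range.cat.codes.tolist())    # [0, 0, 1, 1, 2, 2, 3, 3]
```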

The plot shows survival chances rising with the fare category.

# Encode the categorical features as integers
train_df['Sex'].replace(['male','female'],[0,1],inplace=True)
train_df['Embarked'].replace(['S','C','Q'],[0,1,2],inplace=True)
train_df['Initial'].replace(['Mr','Mrs','Miss','Master','Other'],[0,1,2,3,4],inplace=True)
# Drop Name, Age, Ticket, Fare, Cabin, and other columns no longer needed
train_df.drop(['Name','Age','Ticket','Fare','Cabin','Fare_range','PassengerId'],axis=1,inplace=True)
train_df.info()
sns.heatmap(train_df.corr(),annot=True,cmap='RdYlGn',linewidths=0.2)
fig = plt.gcf()
fig.set_size_inches(18,15)
plt.xticks(fontsize=14)
plt.yticks(fontsize=14)
plt.show()
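The replace-based encoding above can equivalently be written with Series.map, which keeps the code-to-label mapping explicit in one place:

```python
import pandas as pd

s = pd.Series(['male', 'female', 'female'])
codes = s.map({'male': 0, 'female': 1})   # same result as replace([...], [...])
print(codes.tolist())  # [0, 1, 1]
```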



# Save the processed data for the modelling stages
train_df.to_csv(r'C:\Users\dahuo\Desktop\数据集2\processed_train.csv',encoding='utf-8',index=False)

3. Model Training

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn import svm
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.metrics import confusion_matrix


from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import cross_val_predict

from sklearn.model_selection import GridSearchCV

Splitting the dataset

data = pd.read_csv(r'C:\Users\dahuo\Desktop\数据集2\processed_train.csv')
train_df,test_df = train_test_split(data,test_size=0.3,random_state=0,stratify=data['Survived'])

train_X = train_df[train_df.columns[1:]]
train_y = train_df[train_df.columns[:1]]

test_X = test_df[test_df.columns[1:]]
test_y = test_df[test_df.columns[:1]]

X = data[data.columns[1:]]
y = data['Survived']
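stratify=data['Survived'] keeps the 0/1 ratio identical in both splits; a minimal check on synthetic labels with a 40% positive rate:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy labels mimicking the ~40% survival rate
X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 60 + [1] * 40)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0, stratify=y)
print(y_tr.mean(), y_te.mean())  # 0.4 in both halves
```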

SVM (RBF kernel)

model = svm.SVC(kernel='rbf',C=1,gamma=0.1)
model.fit(train_X,train_y)
prediction = model.predict(test_X)
metrics.accuracy_score(prediction,test_y)

SVM (linear kernel)

model = svm.SVC(kernel='linear',C=0.1,gamma=0.1)
model.fit(train_X,train_y)
prediction = model.predict(test_X)
metrics.accuracy_score(prediction,test_y)

Logistic Regression

model = LogisticRegression()
model.fit(train_X,train_y)
prediction = model.predict(test_X)
metrics.accuracy_score(prediction,test_y)

Decision Tree

model = DecisionTreeClassifier()
model.fit(train_X,train_y)
prediction = model.predict(test_X)
metrics.accuracy_score(prediction,test_y)

KNN

model = KNeighborsClassifier(n_neighbors=9)
model.fit(train_X,train_y)
prediction = model.predict(test_X)
metrics.accuracy_score(prediction,test_y)
0.835820895522388
# Sweep k from 1 to 10 and record the test accuracy for each
accuracies = []
for i in range(1,11):
    model = KNeighborsClassifier(n_neighbors=i)
    model.fit(train_X,train_y)
    prediction = model.predict(test_X)
    accuracies.append(metrics.accuracy_score(prediction,test_y))

# Plot accuracy against k
index = list(range(1,11))
plt.plot(index,accuracies)
plt.xticks(list(range(0,11)))
print(max(accuracies))


Naive Bayes

model = GaussianNB()
model.fit(train_X,train_y)
prediction = model.predict(test_X)
metrics.accuracy_score(prediction,test_y)
0.8134328358208955

Random Forest

model = RandomForestClassifier(n_estimators=100)
model.fit(train_X,train_y)
prediction = model.predict(test_X)
metrics.accuracy_score(prediction,test_y)

Cross-validation

kfold = KFold(n_splits=10,shuffle=True,random_state=2019)  # random_state requires shuffle=True
classifiers=['Linear Svm','Radial Svm','Logistic Regression','KNN','Decision Tree','Naive Bayes','Random Forest']
models = [svm.SVC(kernel='linear'),svm.SVC(kernel='rbf'),LogisticRegression(),KNeighborsClassifier(n_neighbors=9),DecisionTreeClassifier(),GaussianNB(),RandomForestClassifier(n_estimators=100)]

mean = []
std = []
auc = []
for model in models:
    res = cross_val_score(model,X,y,cv=kfold,scoring='accuracy')
    mean.append(res.mean())
    std.append(res.std())
    auc.append(res)
    
res_df = pd.DataFrame({'Mean':mean,'Std':std},index=classifiers)

Box plots of the cross-validation scores for each algorithm:

plt.subplots(figsize=(12,6))
box = pd.DataFrame(auc,index=classifiers)
box.T.boxplot()


Plot the confusion matrices, showing the counts of correct and incorrect classifications:

fig,ax = plt.subplots(3,3,figsize=(12,10))

y_pred = cross_val_predict(svm.SVC(kernel='rbf'),X,y,cv=10)
sns.heatmap(confusion_matrix(y,y_pred),ax=ax[0,0],annot=True,fmt='2.0f')
ax[0,0].set_title('Matrix for rbf-SVM')

y_pred = cross_val_predict(svm.SVC(kernel='linear'),X,y,cv=10)
sns.heatmap(confusion_matrix(y,y_pred),ax=ax[0,1],annot=True,fmt='2.0f')
ax[0,1].set_title('Matrix for Linear-SVM')

y_pred = cross_val_predict(KNeighborsClassifier(n_neighbors=9),X,y,cv=10)
sns.heatmap(confusion_matrix(y,y_pred),ax=ax[0,2],annot=True,fmt='2.0f')
ax[0,2].set_title('Matrix for KNN')

y_pred = cross_val_predict(RandomForestClassifier(n_estimators=100),X,y,cv=10)
sns.heatmap(confusion_matrix(y,y_pred),ax=ax[1,0],annot=True,fmt='2.0f')
ax[1,0].set_title('Matrix for Random-Forests')

y_pred = cross_val_predict(LogisticRegression(),X,y,cv=10)
sns.heatmap(confusion_matrix(y,y_pred),ax=ax[1,1],annot=True,fmt='2.0f')
ax[1,1].set_title('Matrix for Logistic Regression')

y_pred = cross_val_predict(DecisionTreeClassifier(),X,y,cv=10)
sns.heatmap(confusion_matrix(y,y_pred),ax=ax[1,2],annot=True,fmt='2.0f')
ax[1,2].set_title('Matrix for Decision Tree')

y_pred = cross_val_predict(GaussianNB(),X,y,cv=10)
sns.heatmap(confusion_matrix(y,y_pred),ax=ax[2,0],annot=True,fmt='2.0f')
ax[2,0].set_title('Matrix for Naive Bayes')
plt.subplots_adjust(hspace=0.2,wspace=0.2)
plt.show()
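Each heatmap cell maps to TN/FP/FN/TP, from which precision and recall follow; a tiny worked example:

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

y_true = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred = [0, 0, 1, 1, 1, 0, 1, 0]
cm = confusion_matrix(y_true, y_pred)    # rows: true class, columns: predicted
print(cm)                                # [[3 1] [1 3]]
print(precision_score(y_true, y_pred))   # 3 / (3 + 1) = 0.75
print(recall_score(y_true, y_pred))      # 3 / (3 + 1) = 0.75
```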

Hyperparameter tuning with grid search:

C = [0.05,0.1,0.2,0.3,0.25,0.4,0.5,0.6,0.7,0.8,0.9,1]
gamma = [0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1.0]
kernel = ['rbf','linear']

hyper = {'kernel':kernel,'C':C,'gamma':gamma}
gd = GridSearchCV(estimator=svm.SVC(),param_grid=hyper,verbose=True)
gd.fit(X,y)
print(gd.best_estimator_)
print(gd.best_score_)
C = 0.5
gamma = 0.1
n_estimators = range(100,1000,100)
hyper = {'n_estimators':n_estimators}

gd = GridSearchCV(estimator=RandomForestClassifier(random_state=0),param_grid=hyper,verbose=True)
gd.fit(X,y)
print(gd.best_score_)
print(gd.best_estimator_)
                         Mean       Std
Linear Svm           0.790075  0.033587
Radial Svm           0.828240  0.036060
Logistic Regression  0.809213  0.022927
KNN                  0.809176  0.026277
Decision Tree        0.817004  0.046226
Naive Bayes          0.803620  0.037507
Random Forest        0.815880  0.038200
Fitting 5 folds for each of 240 candidates, totalling 1200 fits
SVC(C=0.4, gamma=0.3)
0.8282593685267716
Fitting 5 folds for each of 9 candidates, totalling 45 fits
0.819327098110602
RandomForestClassifier(n_estimators=300, random_state=0)
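After fit, GridSearchCV refits the best parameter combination on the full data, so best_estimator_ can predict directly. A minimal sketch with a tiny KNN grid on toy data (not the SVM grid above):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Two well-separated clusters as toy data
X = np.array([[0], [1], [2], [10], [11], [12]])
y = np.array([0, 0, 0, 1, 1, 1])
gd = GridSearchCV(KNeighborsClassifier(), {'n_neighbors': [1, 3]}, cv=2)
gd.fit(X, y)
print(gd.best_estimator_.predict([[11]]))  # [1]
```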

4. Model Ensembling

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import cross_val_predict
from sklearn.model_selection import GridSearchCV


from sklearn.linear_model import LogisticRegression
from sklearn import svm
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.metrics import confusion_matrix

from sklearn.ensemble import VotingClassifier
import xgboost as xgb
from sklearn.ensemble import AdaBoostClassifier



data = pd.read_csv(r'C:\Users\dahuo\Desktop\数据集2\processed_train.csv')
train_df,test_df = train_test_split(data,test_size=0.3,random_state=0,stratify=data['Survived'])

train_X = train_df[train_df.columns[1:]]
train_y = train_df[train_df.columns[:1]]

test_X = test_df[test_df.columns[1:]]
test_y = test_df[test_df.columns[:1]]

X = data[data.columns[1:]]
y = data['Survived']



# Voting classifier
estimators = [
    ('linear_svm',svm.SVC(kernel='linear',probability=True)),
    ('rbf_svm',svm.SVC(kernel='rbf',C=0.5,gamma=0.1,probability=True)),
    ('lr',LogisticRegression(C=0.05)),
    ('knn',KNeighborsClassifier(n_neighbors=10)),
    ('rf',RandomForestClassifier(n_estimators=500,random_state=0)),
    ('dt',DecisionTreeClassifier(random_state=0)),
    ('nb',GaussianNB())
]
vc = VotingClassifier(estimators=estimators,voting='soft')
vc.fit(train_X,train_y)
print('The accuracy for ensembled model is:',vc.score(test_X,test_y))

# cross = cross_val_score(vc,X,y,cv=10,scoring='accuracy')
# mean=cross.mean()
# print(mean)


# XGBoost
fig,ax = plt.subplots(2,2,figsize=(15,12))
model = xgb.XGBClassifier(n_estimators=900,learning_rate=0.1)
model.fit(X,y)
pd.Series(model.feature_importances_,X.columns).sort_values(ascending=True).plot.barh(width=0.8,ax=ax[0,0])
ax[0,0].set_title('Feature Importance in XgBoost')
plt.grid()

model = RandomForestClassifier(n_estimators=500,random_state=0)
model.fit(X,y)
pd.Series(model.feature_importances_,X.columns).sort_values(ascending=True).plot.barh(width=0.8,ax=ax[0,1])
ax[0,1].set_title('Feature Importance in Random Forest')
plt.grid()

model=AdaBoostClassifier(n_estimators=200,learning_rate=0.05,random_state=0)
model.fit(X,y)
pd.Series(model.feature_importances_,X.columns).sort_values(ascending=True).plot.barh(width=0.8,ax=ax[1,0])
ax[1,0].set_title('Feature Importance in AdaBoost')
plt.grid()

plt.grid()
plt.show()
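Impurity-based feature_importances_ can be biased toward high-cardinality features; permutation importance is a model-agnostic cross-check. A sketch where one feature is informative and the other is pure noise:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

# Feature 0 separates the classes perfectly; feature 1 is random noise
rng = np.random.RandomState(0)
X = np.c_[np.r_[np.zeros(50), np.ones(50)], rng.rand(100)]
y = np.r_[np.zeros(50), np.ones(50)].astype(int)

model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
result = permutation_importance(model, X, y, n_repeats=5, random_state=0)
print(result.importances_mean[0] > result.importances_mean[1])  # True
```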

