kaggle入门题泰坦尼克号乘客生存预测

 题目是需要通过乘客的年龄,性别,舱级,社会阶级,上船地点,船上家人数量的信息对乘客的生存死亡做一个预测。

# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

1. 数据集情况查看以及数据清洗(填补空缺值)

import matplotlib.pyplot as plt
import seaborn as sns
 
train = pd.read_csv('../input/titanic/train.csv')
test = pd.read_csv('../input/titanic/test.csv')
    
    # 这里需要注意的是,如果没有后面的ignore_index=True
    #那么index的值在连接后的这个新数据中是不连续的 继续从 0开始,如果要按照index删除一行数据,可能会发现多删一条。
full = pd.concat([train, test], ignore_index=True)
full.head()  # 默认显示5行

#查看字符型数据情况:
full.describe(include=['O'])

 

full.describe().T
 
#describe()函数只能查看数据类型的描述统计信息,无法查看类似字符类型的信息。
#故需用info()函数进一步查看每一列的数据信息。
full.info()

 

print(full.isnull().sum()) # 查看数据的缺失情况

 

#用众数填补Embarked    
#查看众数
full.Embarked.mode()
full['Embarked'].fillna('S',inplace=True)

 

#填补Fare空缺值,用Pclass==3的客舱票价的中位数来填充
full[full.Fare.isnull()]#查看Fare缺失值的信息,获得Pclass
full.Fare.fillna(full[full.Pclass==3]['Fare'].median(),inplace=True)
#年龄(Age) 最小值为0.17,不存在0值,其数据缺失率为263/1309=20.09%,由于Age的平均数与中位数接近,故选择平均值作为缺失项的填充值。
full.Age.fillna(full.Age.mean(),inplace = True)
# full['Age']=full['Age'].fillna(full['Age'].mean())
full.Age.describe()

 full['Cabin'] = full['Cabin'].fillna('U')
full.Cabin.isnull().sum()
#1014
 
#创造新特征Cabin_exist,判断Cabin与Survived的相关性,相关性不大则删除
full['Cabin_exist'] = full['Cabin'].map(lambda x:"Yes" if type(x)==str else "No")# 判断类型是否相等
full[["Cabin_exist", "Survived"]].groupby("Cabin_exist",as_index=False).mean()

 

 

full =full.drop('Cabin_exist',axis = 1)
full.head()
 
#将字符转换为数值
full['Cabin_exist'] = full['Cabin'].map(lambda x: 1 if type(x)==str else 0)
full = full.drop('Cabin',axis = 1)
full.head()
 
# full.loc[full.Cabin.notnull(),'Cabin'] = 1
# full.loc[full.Cabin.isnull(),'Cabin'] = 0
# full.Cabin.isnull().sum()#验证填充效果
 
sns.barplot(x="Cabin_exist",y="Survived",data=full)
full.isnull().sum()
#数据处理完毕
 
full.head()

 2. 特征工程

#1,探究Sex与Survived的相关性
full[['Sex','Survived']].groupby('Sex',as_index = False).mean().sort_values('Survived',ascending = False)
sns.countplot(x = 'Sex',hue = 'Survived',data =full)

#Sex(性别),将字符映射到字符
sex_dict = {'male': 1,'female':0}
full['Sex'] = full['Sex'].map(sex_dict)
full['Sex'].head()
 
full[["Pclass","Sex","Survived"]].groupby(["Pclass","Sex"],as_index=False).mean().sort_values(by="Survived",ascending=False)
 
#两维变量关系图
sns.factorplot(x="Pclass",y="Survived",hue="Sex",data=full)

 

 #2,探究Pcalss与Survived的关联性,相关性较高
full[["Pclass","Survived"]].groupby(["Pclass"],as_index = False).mean().sort_values(by="Survived",ascending=False)
sns.barplot(x="Pclass",y="Survived",data=full)

 

#将客舱类型进行独热编码
PclassDf = pd.get_dummies(full['Pclass'],prefix = 'Pclass')
PclassDf.head()
#将编码后的客舱特征与原数据合并
full = pd.concat([full,PclassDf],axis = 1)
full.head()
#删除Pclass
ful = full.drop('Pclass',axis = 1)
full.head()

 

Embarked:不同的上船地点对生存率是否有影响

full[["Embarked","Survived"]].groupby("Embarked",as_index=False).count().sort_values("Survived",ascending=False)
sns.barplot(x="Embarked",y="Survived",data=full)

sns.factorplot(x="Sex",y="Survived",hue="Embarked",data=full)
full[["Sex","Survived","Embarked"]].groupby(["Sex","Embarked"],as_index=False).count().sort_values("Survived",ascending=False)

 

S口岸,登船人数644,女性乘客占比46%;C口岸,登船人数168,女性占比接近77%; Q口岸,登船人数77,女性占比接近88%。 前面已知女性生存率明显高于男性生存率,所以上述问题可能由性别因素引起。

 

#对Embarked进行独热编码
EmbarkedDf = pd.get_dummies(full['Embarked'],prefix = 'Embarked')
EmbarkedDf.head()
# 将EmbarkedDf的特征添加至full数据集
full = pd.concat([full,EmbarkedDf],axis = 1)#按列插入数据
full.head()
# 因为已经使用登船港口(Embarked)进行了one-hot编码产生了它的虚拟变量(dummy variables)
# 所以这里把登船港口(Embarked)删掉
full = full.drop('Embarked',axis = 1)
full.head()

 从name里提取title(头衔),地位高的人在灾难里会更有可能获得救援从而得以存活

def getTitle(Name):
    s1 = Name.split(',')[1]#Braund, Mr. Owen Harris	分割完取第二个元素Mr. Owen Harris
    s2 = s1.split('.')[0]#得到Mr
    return s2.strip( ) #strip() 方法用于移除字符串头尾指定的字符(这里是空格)
full['Title'] = full['Name'].map(getTitle)#将getTitle映射给Name  得到Title
full['Title'].value_counts()
 
full.drop('Name',axis = 1,inplace  = True)
full.head()
 

将头衔分为五类:

Officer:政府官员; Royalty:王室(皇室); Mr:已婚男士; Mrs:已婚妇女; Miss:年轻未婚女子; Master:有技能的人/教师

 

pd.crosstab(full.Title,full.Sex)#透视表
#探索title与生存的关系
full[["Title","Survived"]].groupby("Title",as_index=False).mean().sort_values("Survived",ascending=False)
 
sns.barplot(x="Title",y="Survived",data=full)

 

full['Title'].value_counts()
Mr              757
Miss            260
Mrs             197
Master           61
Rev               8
Dr                8
Col               4
Mlle              2
Major             2
Ms                2
Lady              1
Sir               1
Mme               1
Don               1
Capt              1
the Countess      1
Jonkheer          1
Dona              1
Name: Title, dtype: int64
#独热编码
TitleDf = pd.get_dummies(full['Title'],prefix = 'Title')
#添加进full
full = pd.concat([full,TitleDf],axis = 1)
full.head()
#删除不需要的列
full = full.drop(['Title'],axis = 1)
full.head()

 

Cabin与Survived之间的相关性

客场号的类别值是首字母,因此我们提取客舱号的首字母为特征 full['Cabin'] = full['Cabin'].map(lambda x : x[0]) full['Cabin'].value_counts full['Cabin']=full['Cabin'].map(lambda x:x[0]) full['Cabin'].value_counts() 因为之前用处理过,所以不再提取首字母,如果用u来填充的话,可以用这种方法

探究孤身一人和有家人陪伴的生存率(SibSp,Parch)

full[['SibSp','Survived']].groupby(['SibSp'],as_index = False).mean().sort_values('Survived',ascending = False)
sns.barplot(x = 'SibSp',y = 'Survived',data = full)
full[["Parch","Survived"]].groupby("Parch",as_index=False).mean().sort_values("Survived",ascending=False)
sns.barplot(x = 'Parch',y = 'Survived',data = full)
# 构建家庭人数和家庭类别的新特征
full['family'] = full['Parch'] + full['SibSp'] + 1#Parch,SibSp为0时,只有自己一个人,+1
full['Alone'] = np.where(full['family'] == 1,1,0)
 
full['family_small'] = np.where((full['family']>=2) & (full['family']<=4),1,0)
full['family_big'] = np.where(full['family']>=5,1,0)
full.head()
#探究家庭与存活率的关系  证实孤身一人存活率不高
full[['family','Survived']].groupby('family',as_index = False).mean().sort_values('Survived',ascending = False)
sns.barplot(x = 'family',y = 'Survived',data = full)
 
#孤身一人对生存率是否有影响
full[["Alone","Survived"]].groupby("Alone",as_index = False).mean().sort_values("Survived",ascending=False)
sns.barplot(x="Alone",y="Survived",data=full)
 
full = full.drop('family',axis = 1)
sns.factorplot(x="Pclass",y="Survived",hue="Alone",data=full)

探究年龄Age与Survived的相关性

#查看Age的分布情况
sns.violinplot(y="Age",data=full)
#查看生存与死亡乘客的年龄分布
sns.violinplot(y="Age",x="Survived",data=full)
#将年龄分为五组
full['AgeCut'] = pd.cut(full.Age,5)#要用full['AgeCut'],而不是full.AgeCut,这样AgeCut才会在Index里
full.AgeCut.value_counts().sort_index()
 
full[['AgeCut','Survived']].groupby('AgeCut',as_index = False).mean().sort_values('Survived',ascending = False)

 

#根据各个分段,重新给Age赋值
full.loc[full.Age <= 16.136,'Age'] = 1
full.loc[(full.Age > 16.136)&(full.Age <=32.102),'Age'] = 2
full.loc[(full.Age > 32.102)&(full.Age <=48.068),'Age'] = 3
full.loc[(full.Age > 48.068)&(full.Age <=64.034),'Age'] = 4
full.loc[full.Age > 64.034,'Age'] = 5
full.head()
 
AgeDf = pd.get_dummies(full['Age'],prefix =' Age')
full = pd.concat([full,AgeDf],axis = 1)
full = full.drop(['Age','AgeCut'],axis = 1)
full.head()
sns.violinplot(y="Fare",data=train)
#对比生死乘客的票价
sns.violinplot(y="Fare",x="Survived",data=train) 

 

# 当然这里也可以用seaborn的displot进行绘制,但是displot的纵坐标是比率,hist的纵坐标是实际个数count;
# figsize调整画布大小
full['Fare'].hist(color='green', bins=30, figsize=(8,4))

 

#分组
full['FareCut'] = pd.cut(full.Fare,5)
full.FareCut.value_counts().sort_index()
# full.head()
 
full[['FareCut','Survived']].groupby('FareCut',as_index = False).mean().sort_values('Survived',ascending = False)
#重新赋值
full.loc[full.Fare<=7.854,'Fare']=1
full.loc[(full.Fare>7.854)&(full.Fare<=10.5),'Fare']=2
full.loc[(full.Fare>10.5)&(full.Fare<=21.558),'Fare']=3
full.loc[(full.Fare>21.558)&(full.Fare<=41.579),'Fare']=4
full.loc[full.Fare>41.579,'Fare']=5
full.head()
 
#进行独热编码
FareDf = pd.get_dummies(full['Fare'],prefix = 'Fare')
full = pd.concat([full,FareDf],axis = 1)
full = full.drop(['Fare','FareCut'],axis = 1)
full.head()
 
full = full.drop(['SibSp','Parch','Pclass'],axis = 1)
full.head()

3.特征选择

full.info()
 
 
corr_df=full.corr()  
corr_df
 
#用图形直观查看线性相关系数
plt.figure(figsize=(16,16))
plt.title("Pearson Correlation of Features")
sns.heatmap(corr_df,linewidths=0.1,square=True,linecolor="white",annot=True,cmap='YlGnBu',vmin=-1,vmax=1)
 
corr_df['Survived'].sort_values(ascending = False)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 42 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   PassengerId         1309 non-null   int64  
 1   Survived            891 non-null    float64
 2   Sex                 1309 non-null   int64  
 3   Ticket              1309 non-null   object 
 4   Cabin_exist         1309 non-null   int64  
 5   Pclass_1            1309 non-null   uint8  
 6   Pclass_2            1309 non-null   uint8  
 7   Pclass_3            1309 non-null   uint8  
 8   Embarked_C          1309 non-null   uint8  
 9   Embarked_Q          1309 non-null   uint8  
 10  Embarked_S          1309 non-null   uint8  
 11  Title_Capt          1309 non-null   uint8  
 12  Title_Col           1309 non-null   uint8  
 13  Title_Don           1309 non-null   uint8  
 14  Title_Dona          1309 non-null   uint8  
 15  Title_Dr            1309 non-null   uint8  
 16  Title_Jonkheer      1309 non-null   uint8  
 17  Title_Lady          1309 non-null   uint8  
 18  Title_Major         1309 non-null   uint8  
 19  Title_Master        1309 non-null   uint8  
 20  Title_Miss          1309 non-null   uint8  
 21  Title_Mlle          1309 non-null   uint8  
 22  Title_Mme           1309 non-null   uint8  
 23  Title_Mr            1309 non-null   uint8  
 24  Title_Mrs           1309 non-null   uint8  
 25  Title_Ms            1309 non-null   uint8  
 26  Title_Rev           1309 non-null   uint8  
 27  Title_Sir           1309 non-null   uint8  
 28  Title_the Countess  1309 non-null   uint8  
 29  Alone               1309 non-null   int64  
 30  family_small        1309 non-null   int64  
 31  family_big          1309 non-null   int64  
 32   Age_1.0            1309 non-null   uint8  
 33   Age_2.0            1309 non-null   uint8  
 34   Age_3.0            1309 non-null   uint8  
 35   Age_4.0            1309 non-null   uint8  
 36   Age_5.0            1309 non-null   uint8  
 37  Fare_1.0            1309 non-null   uint8  
 38  Fare_2.0            1309 non-null   uint8  
 39  Fare_3.0            1309 non-null   uint8  
 40  Fare_4.0            1309 non-null   uint8  
 41  Fare_5.0            1309 non-null   uint8  
dtypes: float64(1), int64(6), object(1), uint8(34)
memory usage: 125.4+ KB

Out[46]:

Survived              1.000000
Title_Mrs             0.339040
Title_Miss            0.327093
Pclass_1              0.285904
family_small          0.279855
Fare_5.0              0.266217
Embarked_C            0.168240
 Age_1.0              0.121485
Pclass_2              0.093349
Title_Master          0.085221
Title_Mlle            0.060095
Fare_4.0              0.058052
Fare_3.0              0.043153
Title_Lady            0.042470
Title_Mme             0.042470
Title_Sir             0.042470
Title_Ms              0.042470
Title_the Countess    0.042470
 Age_4.0              0.030350
 Age_3.0              0.021711
Title_Major           0.011329
Title_Col             0.011329
Title_Dr              0.008185
Embarked_Q            0.003650
PassengerId          -0.005007
Title_Don            -0.026456
Title_Jonkheer       -0.026456
Title_Capt           -0.026456
Title_Rev            -0.064988
 Age_5.0             -0.067344
 Age_2.0             -0.097245
family_big           -0.125147
Embarked_S           -0.149683
Fare_1.0             -0.164287
Fare_2.0             -0.198067
Alone                -0.203367
Pclass_3             -0.322308
Sex                  -0.543351
Title_Mr             -0.549199
Cabin_exist                NaN
Title_Dona                 NaN
Name: Survived, dtype: float64

选择特征

full_x=pd.concat([TitleDf,PclassDf,EmbarkedDf,FareDf,AgeDf,full['Sex'],full['family_small'],full['Cabin_exist']
                  ,full['family_big'],full['Alone']],axis=1)
full_x.head()

分割训练数据和测试数据


#前891行为原始训练数据
source_x=full_x.loc[0:890,:]#提取特征值
source_y=full.loc[0:890,'Survived']#提取标签值
 
#后418行是test,预测数据
pred_x=full_x.loc[891:,:]
source_x.shape#(891, 23)
source_y.shape#(891,)
pred_x.shape#(418, 23)

构建训练集和数据集

#对x_train,x_test进行标准化
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
x_train_std=sc.fit_transform(x_train)
x_test_std=sc.transform(x_test)
#训练数据集和测试数据集,按照二八原则分为训练数据和测试数据,其中80%为训练数据
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test =train_test_split(source_x,source_y,train_size=0.8)
print('训练数据集特征:{0},训练数据集标签:{1}'.format(x_train.shape,y_train.shape))
print('测试数据集特征:{0},测试数据集标签:{1}'.format(x_test.shape,y_test.shape))

不同模型对比

from sklearn.model_selection import cross_val_score
 
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.svm import SVC
 
models=[KNeighborsClassifier(),GaussianNB(),DecisionTreeClassifier(),RandomForestClassifier(),
       GradientBoostingClassifier(),SVC()]
 
# 计算各模型得分
names=['KNN','NB','Tree','RF','GDBT','SVM']
for name, model in zip(names,models):
    score=cross_val_score(model , x_train , y_train ,cv=5)
    print("{}:{},{}".format(name,score.mean(),score))
KNN:0.8174431202600216,[0.79020979 0.82517483 0.8028169  0.84507042 0.82394366]
NB:0.7079976361666502,[0.62237762 0.6993007  0.76056338 0.67605634 0.78169014]
Tree:0.8244558258642766,[0.7972028  0.83916084 0.79577465 0.85915493 0.83098592]
RF:0.818831872352999,[0.7972028  0.83216783 0.79577465 0.85211268 0.81690141]
GDBT:0.8230473751600511,[0.8041958  0.83216783 0.79577465 0.83098592 0.85211268]
SVM:0.837082635674185,[0.8041958  0.86713287 0.8028169  0.83802817 0.87323944]
# 使用标准化的数据 scaled data
names=['KNN','NB','Tree','RF','GDBT','SVM']
for name, model in zip(names,models):
    score=cross_val_score(model,x_train_std,y_train,cv=5)
    print("{}:{},{}".format(name,score.mean(),score))
KNN:0.8258347286516301,[0.81818182 0.83916084 0.80985915 0.82394366 0.83802817]
NB:0.6798975672215108,[0.61538462 0.65734266 0.77464789 0.61971831 0.73239437]
Tree:0.8230473751600511,[0.7972028  0.83916084 0.79577465 0.85211268 0.83098592]
RF:0.8188121737417513,[0.8041958  0.83916084 0.78169014 0.83802817 0.83098592]
GDBT:0.8230473751600511,[0.8041958  0.83216783 0.79577465 0.83098592 0.85211268]
SVM:0.8188121737417513,[0.79020979 0.85314685 0.78169014 0.82394366 0.84507042]

通过决策树看各个特征的重要性

clf = DecisionTreeClassifier(criterion = 'entropy',random_state = 30,splitter = 'random')
clf.fit(x_train_std,y_train)
score = clf.score(x_test_std,y_test)
score
 
 
fi=pd.DataFrame({'importance':clf.feature_importances_},index=x_train.columns)
fi.sort_values('importance',ascending=False)
Sex0.386930
Pclass_30.141619
family_big0.093548
Title_Master0.047530
Pclass_10.042795
Alone0.034266
Fare_2.00.025950
Pclass_20.024562
Embarked_S0.022365
Fare_1.00.021891
Age_2.00.017045
Title_Mr0.017019
Fare_4.00.015637
Age_3.00.014921
Age_4.00.012283
Fare_5.00.011596
Embarked_C0.011315
family_small0.010904
Title_Mrs0.009843
Embarked_Q0.008676
Fare_3.00.007491
Age_5.00.007221
Title_Miss0.004979
Title_Sir0.003193
Title_Col0.002961
Age_1.00.001737
Title_Rev0.001293
Title_Dr0.000430
Title_Don0.000000
Title_Dona0.000000
Cabin_exist0.000000
Title_Major0.000000
Title_Jonkheer0.000000
Title_Lady0.000000
Title_the Countess0.000000
Title_Ms0.000000
Title_Mme0.000000
Title_Mlle0.000000
Title_Capt0.000000
fi.sort_values('importance',ascending=False).plot.bar(figsize=(11,7))
plt.xticks(rotation=30)
plt.title('Feature Importance',size='x-large')

 
from sklearn.model_selection import GridSearchCV
 
param_grid={'n_neighbors':[1,2,3,4,5,6,7,8,9]}
grid_search=GridSearchCV(KNeighborsClassifier(),param_grid,cv=5)
 
grid_search.fit(x_train_std,y_train)
 
grid_search.best_params_,grid_search.best_score_


#LogisticRegression
from sklearn.linear_model import LogisticRegression
param_grid={'C':[0.01,0.1,1,10]}
grid_search=GridSearchCV(LogisticRegression(),param_grid,cv=5)
 
grid_search.fit(x_train_std,y_train)
 
grid_search.best_params_,grid_search.best_score_
 
# second round grid search
param_grid={'C':[0.04,0.06,0.08,0.1,0.12,0.14]}
grid_search=GridSearchCV(LogisticRegression(),param_grid,cv=5)
 
grid_search.fit(x_train_std,y_train)
 
grid_search.best_params_,grid_search.best_score_


#Support Vector Machine
 
param_grid={'C':[0.01,0.1,1,10],'gamma':[0.01,0.1,1,10]}
grid_search=GridSearchCV(SVC(),param_grid,cv=5)
 
grid_search.fit(x_train_std,y_train)
 
grid_search.best_params_,grid_search.best_score_
 
 
#second round grid search
param_grid={'C':[2,4,6,8,10,12,14],'gamma':[0.008,0.01,0.012,0.015,0.02]}
grid_search=GridSearchCV(SVC(),param_grid,cv=5)
 
grid_search.fit(x_train_std,y_train)
 
grid_search.best_params_,grid_search.best_score_


#Gradient Boosting Decision Tree
 
param_grid={'n_estimators':[30,50,80,120,200],'learning_rate':[0.05,0.1,0.5,1],'max_depth':[1,2,3,4,5]}
grid_search=GridSearchCV(GradientBoostingClassifier(),param_grid,cv=5)
 
grid_search.fit(x_train_std,y_train)
 
grid_search.best_params_,grid_search.best_score_
 
#second round grid search
param_grid={'C':[2,4,6,8,10,12,14],'gamma':[0.008,0.01,0.012,0.015,0.02]}
grid_search=GridSearchCV(SVC(),param_grid,cv=5)
 
grid_search.fit(x_train_std,y_train)
 
grid_search.best_params_,grid_search.best_score_

 模型训练

#逻辑回归
from sklearn.linear_model import LogisticRegression
model1 = LogisticRegression()
model1.fit(x_train_std,y_train)
 
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
              intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
              penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
              verbose=0, warm_start=False)
 
#在测试集上得出模型正确率
model1.score(x_test_std,y_test)
 

 

0.8212290502793296

pred1 = model1.predict(x_test)
score1 = model1.score(x_train,y_train)
score1
#使用训练得到的模型对pred_x的生存情况进行预测
pred_x_std=sc.fit_transform(pred_x)
pred_y=model1.predict(pred_x_std)
pred_y
array([0., 1., 0., 0., 1., 0., 1., 0., 1., 0., 0., 0., 1., 0., 1., 1., 0.,
       0., 0., 1., 0., 1., 1., 0., 0., 0., 1., 0., 0., 0., 0., 0., 1., 1.,
       0., 0., 1., 1., 0., 0., 0., 0., 0., 1., 1., 0., 0., 0., 1., 1., 0.,
       0., 1., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 1., 1., 1., 1., 1.,
       0., 0., 1., 0., 1., 0., 1., 1., 0., 1., 0., 1., 1., 0., 0., 0., 0.,
       0., 1., 1., 1., 1., 1., 0., 1., 0., 0., 0., 0., 0., 1., 0., 1., 0.,
       0., 0., 1., 0., 0., 0., 0., 0., 0., 1., 1., 1., 1., 0., 0., 1., 1.,
       1., 1., 0., 1., 0., 0., 0., 0., 1., 0., 0., 0., 1., 0., 0., 0., 0.,
       0., 0., 1., 0., 0., 1., 0., 0., 0., 0., 1., 0., 0., 0., 1., 0., 0.,
       1., 0., 0., 1., 1., 0., 1., 1., 1., 1., 0., 0., 1., 1., 0., 1., 1.,
       0., 0., 0., 0., 0., 1., 1., 0., 1., 1., 0., 0., 1., 0., 1., 0., 1.,
       0., 0., 0., 0., 0., 1., 0., 1., 0., 1., 1., 0., 1., 1., 1., 1., 1.,
       0., 0., 1., 0., 1., 0., 0., 0., 0., 1., 0., 1., 1., 0., 1., 0., 1.,
       0., 1., 0., 1., 1., 0., 1., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0.,
       1., 1., 1., 1., 0., 0., 1., 0., 1., 0., 1., 1., 1., 0., 0., 0., 0.,
       0., 0., 0., 1., 0., 0., 0., 1., 1., 0., 0., 0., 0., 1., 0., 1., 0.,
       1., 0., 0., 1., 0., 0., 0., 0., 1., 1., 1., 1., 1., 0., 0., 0., 0.,
       0., 0., 1., 0., 0., 0., 0., 1., 0., 1., 0., 0., 0., 0., 0., 1., 1.,
       0., 1., 0., 1., 0., 0., 0., 1., 1., 1., 0., 0., 0., 0., 0., 0., 0.,
       0., 1., 0., 1., 1., 0., 0., 1., 0., 0., 1., 0., 0., 0., 0., 0., 1.,
       0., 0., 0., 1., 1., 1., 0., 1., 0., 1., 1., 0., 0., 0., 1., 0., 1.,
       0., 0., 1., 0., 1., 1., 0., 1., 0., 0., 1., 1., 0., 0., 1., 0., 0.,
       1., 1., 0., 0., 0., 0., 0., 0., 1., 1., 0., 1., 0., 0., 0., 0., 1.,
       1., 1., 0., 0., 1., 0., 1., 0., 0., 1., 0., 1., 1., 0., 0., 0., 0.,
       1., 1., 1., 1., 1., 0., 1., 0., 0., 1.])


 
 

  • 4
    点赞
  • 6
    收藏
    觉得还不错? 一键收藏
  • 打赏
    打赏
  • 0
    评论
好的,以下是使用Python进行预测泰坦尼克号乘客生存率的步骤: 1. 导入数据集 首先,我们需要导入泰坦尼克号数据集,可以使用Pandas库中的read_csv函数进行导入。例如: ```python import pandas as pd # 读取训练集和测试集数据 train_data = pd.read_csv('train.csv') test_data = pd.read_csv('test.csv') ``` 2. 数据预处理 接着,我们需要对数据进行预处理,包括填补缺失值、转换类别变量为数值变量、选择特征等等。 ```python # 填补缺失值 train_data['Age'].fillna(train_data['Age'].median(), inplace=True) test_data['Age'].fillna(test_data['Age'].median(), inplace=True) train_data['Embarked'].fillna(train_data['Embarked'].mode()[0], inplace=True) test_data['Fare'].fillna(test_data['Fare'].median(), inplace=True) # 转换类别变量为数值变量 train_data['Sex'] = train_data['Sex'].map({'female': 0, 'male': 1}) test_data['Sex'] = test_data['Sex'].map({'female': 0, 'male': 1}) train_data['Embarked'] = train_data['Embarked'].map({'C': 0, 'Q': 1, 'S': 2}) test_data['Embarked'] = test_data['Embarked'].map({'C': 0, 'Q': 1, 'S': 2}) # 选择特征 features = ['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked'] x_train = train_data[features] y_train = train_data['Survived'] x_test = test_data[features] ``` 3. 模型训练和预测 接下来,我们可以使用一些经典的分类算法,比如决策树、随机森林、逻辑回归等等,对数据进行训练和验证,以得到一个准确的模型。 这里以随机森林为例进行训练和预测。 ```python from sklearn.ensemble import RandomForestClassifier # 定义随机森林模型 rfc = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=1) # 训练模型 rfc.fit(x_train, y_train) # 预测测试集结果 y_pred = rfc.predict(x_test) ``` 4. 提交结果 最后,我们可以将预测结果进行提交,参加kaggle竞赛。 ```python # 将结果保存为csv文件 submission = pd.DataFrame({'PassengerId': test_data['PassengerId'], 'Survived': y_pred}) submission.to_csv('submission.csv', index=False) ``` 以上就是使用Python进行预测泰坦尼克号乘客生存率的步骤,希望能对你有所帮助。

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包

打赏作者

荔枝味啊~

你的鼓励将是我创作的最大动力

¥1 ¥2 ¥4 ¥6 ¥10 ¥20
扫码支付:¥1
获取中
扫码支付

您的余额不足,请更换扫码支付或充值

打赏作者

实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值