Kaggle - Titanic Survival Predictions 2: Notes

I. Import the necessary libraries

First, we need to import a few Python libraries, such as numpy, pandas, matplotlib, and seaborn.

II. Explore the data

Use the describe() function to view summary statistics of the training set.
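
As a minimal sketch of what describe() reports, the snippet below runs it on a tiny made-up frame. The name train_demo and its values are assumptions for illustration only, not the real Titanic data.

```python
import pandas as pd

# Tiny stand-in for the training data (values are made up for illustration).
train_demo = pd.DataFrame({
    "Survived": [0, 1, 1, 0],
    "Age": [22.0, 38.0, None, 35.0],
    "Sex": ["male", "female", "female", "male"],
})

# describe() summarises numeric columns by default;
# include="all" also reports count/unique/top/freq for the other columns.
summary = train_demo.describe(include="all")
print(summary)
```

The "count" row excludes missing values, so it gives a quick first hint of which columns are incomplete.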

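III. Data analysis

Look at the features in the data set and how complete they are, and identify the kind and data type of each feature:

Numerical features: Age (continuous), Fare (continuous), SibSp (discrete), Parch (discrete)
Categorical features: Survived, Sex, Embarked, Pclass
Alphanumeric features: Ticket, Cabin

Int: Survived, Pclass, SibSp, Parch
String: Name, Sex, Ticket, Cabin, Embarked
Float: Age, Fare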
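
The data types and completeness listed above can be checked directly in pandas. The following is a small sketch on a made-up frame (demo and its values are assumptions); on the real data the same calls would be applied to train and test.

```python
import pandas as pd

# Made-up rows that mimic a few Titanic columns.
demo = pd.DataFrame({
    "Pclass": [3, 1, 3],
    "Name": ["A", "B", "C"],
    "Age": [22.0, None, 26.0],
    "Cabin": [None, "C85", None],
})

# dtypes shows the storage type of each column.
print(demo.dtypes)

# isnull().sum() counts the missing values per column.
missing = demo.isnull().sum()
print(missing)
```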

IV. Data visualization

Sex Feature & Survived

#draw a bar plot of survival by sex
sns.barplot(x="Sex", y="Survived", data=train)

#print percentages of females vs. males that survive
print("Percentage of females who survived:", train["Survived"][train["Sex"] == 'female'].value_counts(normalize = True)[1]*100)
print("Percentage of males who survived:", train["Survived"][train["Sex"] == 'male'].value_counts(normalize = True)[1]*100)


Pclass & Survived

#draw a bar plot of survival by Pclass
sns.barplot(x="Pclass", y="Survived", data=train)

#print percentage of people by Pclass that survived
print("Percentage of Pclass = 1 who survived:", train["Survived"][train["Pclass"] == 1].value_counts(normalize = True)[1]*100)
print("Percentage of Pclass = 2 who survived:", train["Survived"][train["Pclass"] == 2].value_counts(normalize = True)[1]*100)
print("Percentage of Pclass = 3 who survived:", train["Survived"][train["Pclass"] == 3].value_counts(normalize = True)[1]*100)


SibSp & Survived

#draw a bar plot for SibSp vs. survival
sns.barplot(x="SibSp", y="Survived", data=train)

#I won't be printing individual percent values for all of these.
print("Percentage of SibSp = 0 who survived:", train["Survived"][train["SibSp"] == 0].value_counts(normalize = True)[1]*100)
print("Percentage of SibSp = 1 who survived:", train["Survived"][train["SibSp"] == 1].value_counts(normalize = True)[1]*100)
print("Percentage of SibSp = 2 who survived:", train["Survived"][train["SibSp"] == 2].value_counts(normalize = True)[1]*100)


Parch & Survived

#draw a bar plot for Parch vs. survival
sns.barplot(x="Parch", y="Survived", data=train)
plt.show()


Age & Survived

#sort the ages into logical categories
train["Age"] = train["Age"].fillna(-0.5)
test["Age"] = test["Age"].fillna(-0.5)
bins = [-1, 0, 5, 12, 18, 24, 35, 60, np.inf]
labels = ['Unknown', 'Baby', 'Child', 'Teenager', 'Student', 'Young Adult', 'Adult', 'Senior']
train['AgeGroup'] = pd.cut(train["Age"], bins, labels = labels)
test['AgeGroup'] = pd.cut(test["Age"], bins, labels = labels)

#draw a bar plot of Age vs. survival
sns.barplot(x="AgeGroup", y="Survived", data=train)
plt.show()
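Sort Age in the training and test sets into logical categories:
① first fill the missing ages with -0.5;
② then define the bin edges and the labels of the age groups;
③ finally use pd.cut() to split the ages into groups and store the matching label in AgeGroup.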
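
The three steps above can be traced on a handful of values. This is a small sketch with made-up ages; the bins and labels are the ones used in the code above.

```python
import numpy as np
import pandas as pd

# Made-up ages, including one missing value.
ages = pd.Series([22.0, np.nan, 4.0, 70.0, 30.0])

# Step 1: missing ages become -0.5 so that they land in the (-1, 0] bin.
ages = ages.fillna(-0.5)

# Step 2: bin edges and labels (each bin is open on the left, closed on the right).
bins = [-1, 0, 5, 12, 18, 24, 35, 60, np.inf]
labels = ['Unknown', 'Baby', 'Child', 'Teenager', 'Student', 'Young Adult', 'Adult', 'Senior']

# Step 3: pd.cut assigns each age the label of the bin it falls into.
age_group = pd.cut(ages, bins, labels=labels)
print(age_group.tolist())
# ['Student', 'Unknown', 'Baby', 'Senior', 'Young Adult']
```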


Cabin & Survived

train["CabinBool"] = (train["Cabin"].notnull().astype('int'))
test["CabinBool"] = (test["Cabin"].notnull().astype('int'))

#calculate percentages of CabinBool vs. survived
print("Percentage of CabinBool = 1 who survived:", train["Survived"][train["CabinBool"] == 1].value_counts(normalize = True)[1]*100)
print("Percentage of CabinBool = 0 who survived:", train["Survived"][train["CabinBool"] == 0].value_counts(normalize = True)[1]*100)

#draw a bar plot of CabinBool vs. survival
sns.barplot(x="CabinBool", y="Survived", data=train)
plt.show()
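Compute the survival rate of passengers with and without a recorded cabin number (Cabin).
① First create a new feature CabinBool in both the training and test sets: it is 1 if the passenger has a cabin number and 0 otherwise.
② Then, on the training set, compute the percentage of passengers with Survived = 1 among those with a cabin number and among those without one.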
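
The same percentages can also be obtained with a groupby, which avoids writing one print per group. This is only a sketch on made-up rows (demo is an assumed name); it is not the approach used in the code above.

```python
import pandas as pd

# Made-up rows: Cabin is missing for three of the five passengers.
demo = pd.DataFrame({
    "Cabin": ["C85", None, "E46", None, None],
    "Survived": [1, 0, 1, 0, 1],
})

# 1 if a cabin number is recorded, 0 otherwise.
demo["CabinBool"] = demo["Cabin"].notnull().astype("int")

# Survived is 0/1, so its mean within each group is the survival rate.
rates = demo.groupby("CabinBool")["Survived"].mean() * 100
print(rates)
```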


V. Data cleaning: handling missing values

  1. Cabin Feature and Ticket Feature: drop the Cabin and Ticket columns
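
The notes give no code for this step. A minimal sketch, assuming the usual DataFrame.drop call and using a made-up frame (train_demo is an assumed name), could look like this:

```python
import pandas as pd

# Made-up rows with the two columns that are going to be dropped.
train_demo = pd.DataFrame({
    "PassengerId": [1, 2],
    "Ticket": ["A/5 21171", "PC 17599"],
    "Cabin": [None, "C85"],
    "Fare": [7.25, 71.2833],
})

# axis=1 tells drop() to remove columns rather than rows.
train_demo = train_demo.drop(["Cabin", "Ticket"], axis=1)
print(list(train_demo.columns))
```

On the real data the same call would be applied to both train and test.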

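  2. Embarked Feature: fill the missing values in the Embarked column with "S", the most frequent value; then map Embarked in the training and test sets to numbers, with S as 1, C as 2, and Q as 3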

    #now we need to fill in the missing values in the Embarked feature
    print("Number of people embarking in Southampton (S):")
    southampton = train[train["Embarked"] == "S"].shape[0]
    print(southampton)
    
    print("Number of people embarking in Cherbourg (C):")
    cherbourg = train[train["Embarked"] == "C"].shape[0]
    print(cherbourg)
    
    print("Number of people embarking in Queenstown (Q):")
    queenstown = train[train["Embarked"] == "Q"].shape[0]
    print(queenstown)
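
The code above only counts the ports. A possible sketch of the fill and mapping steps described in this item is shown below on a made-up column; the names demo and embarked_mapping are assumptions, not taken from the original notebook.

```python
import pandas as pd

# Made-up Embarked column with one missing value.
demo = pd.DataFrame({"Embarked": ["S", "C", None, "Q", "S"]})

# mode() ignores missing values; here the most frequent port is "S".
most_common = demo["Embarked"].mode()[0]
demo["Embarked"] = demo["Embarked"].fillna(most_common)

# Map each port to the numeric code given in the text.
embarked_mapping = {"S": 1, "C": 2, "Q": 3}
demo["Embarked"] = demo["Embarked"].map(embarked_mapping)
print(demo["Embarked"].tolist())
# [1, 2, 1, 3, 1]
```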
    
  3. Name Feature

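    ① Put the training and test sets into a list named combine.
    ② Loop over each data set in combine and use a regular expression to extract the title from the Name field. The extracted value is stored in a new Title field.
    ③ Use pd.crosstab() to build a cross table showing how each Title relates to Sex.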
    

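    The regular expression ' ([A-Za-z]+)\.' extracts the Title from Name. In detail:

    • ( ) marks the start and end of a capture group, which saves the matched text.
    • [A-Za-z] matches any single uppercase or lowercase letter.
    • + matches the preceding element (here, a letter) one or more times.
    • \. matches a literal period (.); the backslash escape is needed because an unescaped period has a special meaning in regular expressions.
    • So the pattern matches a space, followed by one or more letters, followed by a period. str.extract() pulls the captured letters out of each name as the Title; the Name column is dropped at the end, because the Title feature has already been extracted from it.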
    #create a combined group of both datasets
    combine = [train, test]
    #extract a title for each Name in the train and test datasets
    #(raw string so that \. reaches the regex engine unchanged)
    for dataset in combine:
        dataset['Title'] = dataset.Name.str.extract(r' ([A-Za-z]+)\.', expand=False)
    pd.crosstab(train['Title'], train['Sex'])
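
To see what the pattern captures, here is a small sketch that applies the same str.extract() call to three names written in the Titanic "Surname, Title. Given names" style. The Series names is an assumption for illustration.

```python
import pandas as pd

names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)",
])

# The first run of letters that follows a space and ends with a period is the title.
# " the" is skipped in the third name because it is not followed by a period.
titles = names.str.extract(r' ([A-Za-z]+)\.', expand=False)
print(titles.tolist())
# ['Mr', 'Miss', 'Countess']
```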
    
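    Group and map the Title feature in the training and test sets:
    ① First, replace a number of rare titles with "Rare", replace the noble titles with "Royal", and merge some variants into the common titles.
    ② Then group by Title, compute the survival rate of each group, and display the result.
    ③ Next, define a dictionary named title_mapping that assigns a number to each Title.
    ④ Finally, use map() to convert the Title feature to those numbers, and fill any missing values with 0.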
    
    #replace various titles with more common names
    for dataset in combine:
        #'Lady' is left out of the 'Rare' list so that it is grouped under 'Royal' below
        dataset['Title'] = dataset['Title'].replace(['Capt', 'Col',
        'Don', 'Dr', 'Major', 'Rev', 'Jonkheer', 'Dona'], 'Rare')
        dataset['Title'] = dataset['Title'].replace(['Countess', 'Lady', 'Sir'], 'Royal')
        dataset['Title'] = dataset['Title'].replace('Mlle', 'Miss')
        dataset['Title'] = dataset['Title'].replace('Ms', 'Miss')
        dataset['Title'] = dataset['Title'].replace('Mme', 'Mrs')
    train[['Title', 'Survived']].groupby(['Title'], as_index=False).mean()

    #map each of the title groups to a numerical value
    title_mapping = {"Mr": 1, "Miss": 2, "Mrs": 3, "Master": 4, "Royal": 5, "Rare": 6}
    for dataset in combine:
        dataset['Title'] = dataset['Title'].map(title_mapping)
        dataset['Title'] = dataset['Title'].fillna(0)
    
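  4. Age Feature
    Fill in the missing age information in the training and test sets:
    ① First, group the passengers by Title and find the most common (mode) AgeGroup for each title.
    ② Then define a dictionary named age_title_mapping that maps each Title code to that age group.
    ③ Finally, go through AgeGroup in the training and test sets and, wherever it is "Unknown", fill it with the age group of that passenger's Title.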

    #fill missing age with mode age group for each title
    mr_age = train[train["Title"] == 1]["AgeGroup"].mode() #Young Adult
    miss_age = train[train["Title"] == 2]["AgeGroup"].mode() #Student
    mrs_age = train[train["Title"] == 3]["AgeGroup"].mode() #Adult
    master_age = train[train["Title"] == 4]["AgeGroup"].mode() #Baby
    royal_age = train[train["Title"] == 5]["AgeGroup"].mode() #Adult
    rare_age = train[train["Title"] == 6]["AgeGroup"].mode() #Adult
    
    age_title_mapping = {1: "Young Adult", 2: "Student", 3: "Adult", 4: "Baby", 5: "Adult", 6: "Adult"}
    
    #I tried to get this code to work with using .map(), but couldn't.
    #I've put down a less elegant, temporary solution for now.
    #train = train.fillna({"Age": train["Title"].map(age_title_mapping)})
    #test = test.fillna({"Age": test["Title"].map(age_title_mapping)})
    
    #use .loc instead of chained indexing so the assignment reaches the DataFrame itself
    for x in train[train["AgeGroup"] == "Unknown"].index:
        train.loc[x, "AgeGroup"] = age_title_mapping[train.loc[x, "Title"]]

    for x in test[test["AgeGroup"] == "Unknown"].index:
        test.loc[x, "AgeGroup"] = age_title_mapping[test.loc[x, "Title"]]
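
For comparison, a vectorised version of the same fill is sketched below on a made-up frame. This is one possible approach, not the original author's solution; df, unknown, and filled are assumed names.

```python
import pandas as pd

labels = ['Unknown', 'Baby', 'Child', 'Teenager', 'Student', 'Young Adult', 'Adult', 'Senior']
age_title_mapping = {1: "Young Adult", 2: "Student", 3: "Adult", 4: "Baby", 5: "Adult", 6: "Adult"}

# Made-up rows: two passengers have an unknown age group.
df = pd.DataFrame({
    "Title": [1, 4, 3, 2],
    "AgeGroup": pd.Categorical(["Unknown", "Baby", "Unknown", "Student"], categories=labels),
})

# Keep known groups; where the group is "Unknown", take the group mapped from Title.
unknown = df["AgeGroup"] == "Unknown"
filled = df["AgeGroup"].astype(object).where(~unknown, df["Title"].map(age_title_mapping))

# Restore the categorical dtype with the same set of labels.
df["AgeGroup"] = pd.Categorical(filled, categories=labels)
print(df["AgeGroup"].tolist())
# ['Young Adult', 'Baby', 'Adult', 'Student']
```

Converting to object first sidesteps the restrictions on assigning into a categorical column; a Title code missing from the dictionary would leave a missing value rather than raise an error.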
    
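    Map the AgeGroup feature in the training and test sets to numbers:
    ① First, define a dictionary named age_mapping that assigns a number to each age group.
    ② Then use .map() to convert AgeGroup in both sets to those numbers.
    ③ Finally, drop the original Age feature, since AgeGroup now carries the age information.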
    
    #map each Age value to a numerical value
    age_mapping = {'Baby': 1, 'Child': 2, 'Teenager': 3, 'Student': 4, 'Young Adult': 5, 'Adult': 6, 'Senior': 7}
    train['AgeGroup'] = train['AgeGroup'].map(age_mapping)
    test['AgeGroup'] = test['AgeGroup'].map(age_mapping)
    
    #dropping the Age feature for now, might change
    train = train.drop(['Age'], axis = 1)
    test = test.drop(['Age'], axis = 1)
    
  5. Sex Feature: map the Sex feature in the training and test sets to numbers, with male as 0 and female as 1

    #map each Sex value to a numerical value
    sex_mapping = {"male": 0, "female": 1}
    train['Sex'] = train['Sex'].map(sex_mapping)
    test['Sex'] = test['Sex'].map(sex_mapping)
    
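  6. Fare Feature: fill the missing fare with the mean fare of the passenger's Pclass, split the fares into four numbered bands (quartiles), and finally drop the Fare column.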

    #fill in missing Fare value in test set based on mean fare for that Pclass
    #(.loc avoids chained assignment, which may not modify the DataFrame)
    for x in test[test["Fare"].isnull()].index:
        pclass = test.loc[x, "Pclass"] #Pclass = 3
        test.loc[x, "Fare"] = round(train[train["Pclass"] == pclass]["Fare"].mean(), 4)
        
    #map Fare values into groups of numerical values
    train['FareBand'] = pd.qcut(train['Fare'], 4, labels = [1, 2, 3, 4])
    test['FareBand'] = pd.qcut(test['Fare'], 4, labels = [1, 2, 3, 4])
    
    #drop Fare values
    train = train.drop(['Fare'], axis = 1)
    test = test.drop(['Fare'], axis = 1)
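
To illustrate what pd.qcut() does in the banding step, here is a small sketch on made-up fares (fares is an assumed name). qcut places the bin edges at the quartiles, so each band receives about the same number of rows.

```python
import pandas as pd

# Eight made-up fares, already sorted to make the result easy to read.
fares = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])

# Four quantile-based bands, labelled 1 (cheapest) to 4 (most expensive).
fare_band = pd.qcut(fares, 4, labels=[1, 2, 3, 4])
print(fare_band.tolist())
# [1, 1, 2, 2, 3, 3, 4, 4]
```

Because the edges come from the data, calling qcut separately on train and test gives each set its own band boundaries.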
    

VI. Choosing a model

Split the training data, holding out 22% of it to test the models below

from sklearn.model_selection import train_test_split
predictors = train.drop(['Survived', 'PassengerId'], axis=1)
target = train["Survived"]
x_train, x_val, y_train, y_val = train_test_split(predictors, target, test_size = 0.22, random_state = 0)

Testing different models

For each model, fit it on 78% of the training data, predict the remaining 22%, and check the accuracy
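
The accuracy checked for each model is the share of validation rows whose prediction matches the true label. The sketch below computes the same percentage with plain numpy on made-up labels (y_val and y_pred here are illustrative arrays, not model output).

```python
import numpy as np

# Made-up true labels and predictions for five validation rows.
y_val = np.array([0, 1, 1, 0, 1])
y_pred = np.array([0, 1, 0, 0, 1])

# Fraction of matching entries, expressed as a percentage with two decimals.
acc = round(float(np.mean(y_pred == y_val)) * 100, 2)
print(acc)
# 80.0
```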

1. Gaussian Naive Bayes

from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score
gaussian = GaussianNB()
gaussian.fit(x_train, y_train)
y_pred = gaussian.predict(x_val)
acc_gaussian = round(accuracy_score(y_pred, y_val) * 100, 2)
print(acc_gaussian)

2. Logistic Regression

from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()
logreg.fit(x_train, y_train)
y_pred = logreg.predict(x_val)
acc_logreg = round(accuracy_score(y_pred, y_val) * 100, 2)
print(acc_logreg)

3. Support Vector Machines

from sklearn.svm import SVC
svc = SVC()
svc.fit(x_train, y_train)
y_pred = svc.predict(x_val)
acc_svc = round(accuracy_score(y_pred, y_val) * 100, 2)
print(acc_svc)

4. Linear SVC

from sklearn.svm import LinearSVC
linear_svc = LinearSVC()
linear_svc.fit(x_train, y_train)
y_pred = linear_svc.predict(x_val)
acc_linear_svc = round(accuracy_score(y_pred, y_val) * 100, 2)
print(acc_linear_svc)

5. Perceptron

from sklearn.linear_model import Perceptron
perceptron = Perceptron()
perceptron.fit(x_train, y_train)
y_pred = perceptron.predict(x_val)
acc_perceptron = round(accuracy_score(y_pred, y_val) * 100, 2)
print(acc_perceptron)

6. Decision Tree

from sklearn.tree import DecisionTreeClassifier
decisiontree = DecisionTreeClassifier()
decisiontree.fit(x_train, y_train)
y_pred = decisiontree.predict(x_val)
acc_decisiontree = round(accuracy_score(y_pred, y_val) * 100, 2)
print(acc_decisiontree)

7. Random Forest

from sklearn.ensemble import RandomForestClassifier
randomforest = RandomForestClassifier()
randomforest.fit(x_train, y_train)
y_pred = randomforest.predict(x_val)
acc_randomforest = round(accuracy_score(y_pred, y_val) * 100, 2)
print(acc_randomforest)

8. KNN (k-Nearest Neighbors)

from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier()
knn.fit(x_train, y_train)
y_pred = knn.predict(x_val)
acc_knn = round(accuracy_score(y_pred, y_val) * 100, 2)
print(acc_knn)

9. Stochastic Gradient Descent

from sklearn.linear_model import SGDClassifier
sgd = SGDClassifier()
sgd.fit(x_train, y_train)
y_pred = sgd.predict(x_val)
acc_sgd = round(accuracy_score(y_pred, y_val) * 100, 2)
print(acc_sgd)

10. Gradient Boosting Classifier

from sklearn.ensemble import GradientBoostingClassifier
gbk = GradientBoostingClassifier()
gbk.fit(x_train, y_train)
y_pred = gbk.predict(x_val)
acc_gbk = round(accuracy_score(y_pred, y_val) * 100, 2)
print(acc_gbk)

Comparing model accuracy

models = pd.DataFrame({
    'Model': ['Support Vector Machines', 'KNN', 'Logistic Regression', 
              'Random Forest', 'Naive Bayes', 'Perceptron', 'Linear SVC', 
              'Decision Tree', 'Stochastic Gradient Descent', 'Gradient Boosting Classifier'],
    'Score': [acc_svc, acc_knn, acc_logreg, 
              acc_randomforest, acc_gaussian, acc_perceptron,acc_linear_svc, acc_decisiontree,
              acc_sgd, acc_gbk]})
models.sort_values(by='Score', ascending=False)


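After comparing the accuracy of the 10 models, I decided to use the Gradient Boosting Classifier on the test data.

VII. Creating the submission file

If you have made it this far, you can create a submission.csv file and upload it to the Kaggle competition!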

#set ids as PassengerId and predict survival 
ids = test['PassengerId']
predictions = gbk.predict(test.drop('PassengerId', axis=1))

#set the output as a dataframe and convert to csv file named submission.csv
output = pd.DataFrame({ 'PassengerId' : ids, 'Survived': predictions })
output.to_csv('submission.csv', index=False)
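
As a quick check of the file layout, the sketch below builds the same two-column frame from placeholder values and writes it to an in-memory buffer instead of a file. The ids and predictions here are made up; they are not model output.

```python
import io
import pandas as pd

# Placeholder ids and predictions, for illustration only.
ids = pd.Series([892, 893, 894], name="PassengerId")
predictions = [0, 1, 0]

output = pd.DataFrame({"PassengerId": ids, "Survived": predictions})

# index=False keeps the row index out of the file, leaving exactly two columns.
buffer = io.StringIO()
output.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```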
