This article draws on code and ideas from Titanic Data Science Solutions and 第一次Kaggle项目——泰坦尼克号 ("My First Kaggle Project: Titanic").
Steps
Problem Statement
What kinds of people were more likely to survive on the Titanic?
Data Import and Cleaning
The data used in this article comes from Kaggle; we import it with pandas.
import pandas as pd
train_df = pd.read_csv('train.csv')
test_df = pd.read_csv('test.csv')
Missing values severely limit how the data can be used, so we first check for them with .info().
train_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId 891 non-null int64
Survived 891 non-null int64
Pclass 891 non-null int64
Name 891 non-null object
Sex 891 non-null object
Age 714 non-null float64
SibSp 891 non-null int64
Parch 891 non-null int64
Ticket 891 non-null object
Fare 891 non-null float64
Cabin 204 non-null object
Embarked 889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB
In the train set, Age, Cabin, and Embarked have missing values that need to be filled in.
test_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 11 columns):
PassengerId 418 non-null int64
Pclass 418 non-null int64
Name 418 non-null object
Sex 418 non-null object
Age 332 non-null float64
SibSp 418 non-null int64
Parch 418 non-null int64
Ticket 418 non-null object
Fare 417 non-null float64
Cabin 91 non-null object
Embarked 418 non-null object
dtypes: float64(2), int64(4), object(5)
memory usage: 36.0+ KB
In the test set, Age, Fare, and Cabin have missing values that need to be filled in.
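As an optional shortcut, .isna().sum() lists just the per-column missing counts (a minimal sketch using the DataFrames loaded above):

print(train_df.isna().sum())
print(test_df.isna().sum())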
Merge the two datasets so they can be cleaned together:
combine = pd.concat([train_df, test_df], axis = 0)
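Because test_df has no Survived column, the test rows receive NaN there after the concat; this is what lets us split the data back apart later. A quick shape check:

print(combine.shape)  # (1309, 12): 891 train rows + 418 test rows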
Feature Engineering
import numpy as np
# fix the random seed so the random imputation below is reproducible
np.random.seed(42)
Handling Age
Fill the missing ages with random draws from a normal distribution whose mean (np.mean()) and standard deviation (np.std()) match the observed ages.
# .copy() avoids pandas' SettingWithCopyWarning on the assignment below
Age_null = combine[combine['Age'].isna()].copy()
# draw one value per missing row (note: a 1-D size, not an (n, 1) shape)
Age_null['Age'] = np.random.normal(np.mean(combine['Age']), np.std(combine['Age']),
                                   Age_null.shape[0])
# Age_null['Age'] = Age_null['Age'].apply(round)
Age_notnull = combine[combine['Age'].notna()]
combine = pd.concat([Age_null, Age_notnull], axis = 0)
Handling Fare
Fare is filled in the same way:
Fare_null = combine[combine['Fare'].isna()].copy()
Fare_null['Fare'] = np.random.normal(np.mean(combine['Fare']), np.std(combine['Fare']),
                                     Fare_null.shape[0])
# Fare_null['Fare'] = Fare_null['Fare'].apply(round)
Fare_notnull = combine[combine['Fare'].notna()]
combine = pd.concat([Fare_null, Fare_notnull], axis = 0)
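A quick sanity check that the imputation left no missing Age or Fare values:

assert combine['Age'].isna().sum() == 0
assert combine['Fare'].isna().sum() == 0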
Handling Embarked: 'S' is the most common port, so we use it to fill the missing values.
combine['Embarked'] = combine['Embarked'].fillna('S')
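Equivalently, the most frequent port can be looked up with mode() rather than hard-coded (a minimal alternative sketch; most_common_port is a name introduced here):

most_common_port = combine['Embarked'].mode()[0]  # 'S' on this dataset
combine['Embarked'] = combine['Embarked'].fillna(most_common_port)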
Encode Sex as 0/1:
sex = {'male': 1, 'female': 0}
combine['Sex'] = combine['Sex'].map(sex)
One-hot encode the embarkation port Embarked and drop the original column:
data_Embark = pd.get_dummies(combine['Embarked'], prefix = 'Embarked')
combine = pd.concat([data_Embark, combine], axis = 1)
combine = combine.drop('Embarked', axis = 1)
One-hot encode the passenger class Pclass and drop the original column:
data_Pclass = pd.get_dummies(combine['Pclass'], prefix = 'Pclass')
combine = pd.concat([data_Pclass, combine], axis = 1)
combine = combine.drop('Pclass', axis = 1)
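Each get_dummies call replaces a single categorical column with one indicator column per category. A quick optional check of the new column names:

# expect Embarked_C / Embarked_Q / Embarked_S and Pclass_1 / Pclass_2 / Pclass_3
print([c for c in combine.columns if c.startswith(('Embarked_', 'Pclass_'))])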
Names
Extract each passenger's title from the name and consolidate rare or variant titles:
combine['NameTitle'] = combine.Name.str.extract(r' ([A-Za-z]+)\.', expand=False)
combine['NameTitle'] = combine['NameTitle'].replace(['Lady', 'Countess','Capt', 'Col',\
'Don', 'Dr', 'Major', 'Rev', 'Sir', 'Jonkheer', 'Dona'], 'Rare')
combine['NameTitle'] = combine['NameTitle'].replace(['Mlle', 'Ms'], 'Miss')
combine['NameTitle'] = combine['NameTitle'].replace('Mme', 'Mrs')
combine = combine.drop('Name', axis = 1)
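Before one-hot encoding, it is worth checking that only the five consolidated titles remain (a quick optional check):

# expect only Mr, Miss, Mrs, Master, and Rare
print(combine['NameTitle'].value_counts())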
One-hot encode the title and drop the original column:
data_NameTitle = pd.get_dummies(combine['NameTitle'], prefix = 'NameTitle')
combine = pd.concat([data_NameTitle, combine], axis = 1)
combine = combine.drop('NameTitle', axis = 1)
Family size aboard: the passenger plus siblings/spouses (SibSp) and parents/children (Parch); e.g. a passenger with SibSp = 1 and Parch = 2 gets FamilySize = 4.
combine['FamilySize'] = combine['SibSp'] + combine['Parch'] + 1
combine = combine.drop(['SibSp', 'Parch'], axis = 1)
Drop Cabin (too many missing values), Ticket (noisy, unstructured information), and PassengerId (carries no predictive signal).
combine = combine.drop(['Cabin', 'PassengerId', 'Ticket'], axis = 1)
Split back into train and test using the Survived column, which is NaN exactly for the test rows:
train = combine[combine['Survived'].notna()]
test = combine[combine['Survived'].isna()].drop('Survived', axis=1)
X_train = train.drop('Survived', axis = 1)
Y_train = train['Survived']
X_test = test
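A quick check that the split recovers the original row counts:

print(X_train.shape[0], X_test.shape[0])  # 891 418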
Machine Learning
from sklearn.svm import SVC, LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import Perceptron
from sklearn.linear_model import SGDClassifier
from sklearn.tree import DecisionTreeClassifier
svc = SVC()
svc.fit(X_train, Y_train)
acc_svc = round(svc.score(X_train, Y_train) * 100, 2)
knn = KNeighborsClassifier(n_neighbors = 33)
knn.fit(X_train, Y_train)
acc_knn = round(knn.score(X_train, Y_train) * 100, 2)
gaussian = GaussianNB()
gaussian.fit(X_train, Y_train)
acc_gaussian = round(gaussian.score(X_train, Y_train) * 100, 2)
perceptron = Perceptron()
perceptron.fit(X_train, Y_train)
acc_perceptron = round(perceptron.score(X_train, Y_train) * 100, 2)
linear_svc = LinearSVC()
linear_svc.fit(X_train, Y_train)
acc_linear_svc = round(linear_svc.score(X_train, Y_train) * 100, 2)
sgd = SGDClassifier()
sgd.fit(X_train, Y_train)
acc_sgd = round(sgd.score(X_train, Y_train) * 100, 2)
decision_tree = DecisionTreeClassifier()
decision_tree.fit(X_train, Y_train)
acc_decision_tree = round(decision_tree.score(X_train, Y_train) * 100, 2)
random_forest = RandomForestClassifier(n_estimators=100)
random_forest.fit(X_train, Y_train)
acc_random_forest = round(random_forest.score(X_train, Y_train) * 100, 2)
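All of these scores are accuracy on the training set itself, so models that can memorize the data (Decision Tree, Random Forest) look deceptively strong. A minimal cross-validation sketch, using scikit-learn's cross_val_score, gives a fairer estimate:

from sklearn.model_selection import cross_val_score

# 5-fold cross-validated accuracy for the random forest
cv_acc = cross_val_score(RandomForestClassifier(n_estimators=100),
                         X_train, Y_train, cv=5, scoring='accuracy')
print(round(cv_acc.mean() * 100, 2))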
Compute the scores:
models = pd.DataFrame({
'Model': ['Support Vector Machines', 'KNN',
'Random Forest', 'Naive Bayes', 'Perceptron',
'Stochastic Gradient Descent', 'Linear SVC',
'Decision Tree'],
'Score': [acc_svc, acc_knn,
acc_random_forest, acc_gaussian, acc_perceptron,
acc_sgd, acc_linear_svc, acc_decision_tree]})
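To rank the models, the DataFrame can be sorted by score (a minimal optional sketch):

print(models.sort_values(by='Score', ascending=False))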
Results
|   | Model | Score |
| --- | --- | --- |
| 0 | Support Vector Machines | 88.55 |
| 1 | KNN | 72.62 |
| 2 | Random Forest | 99.10 |
| 3 | Naive Bayes | 79.91 |
| 4 | Perceptron | 58.59 |
| 5 | Stochastic Gradient Descent | 73.51 |
| 6 | Linear SVC | 82.04 |
| 7 | Decision Tree | 99.10 |
Some Findings
- During missing-value imputation, rounding Age and Fare did not bring any improvement in the accuracy score.
- Age and Fare can also be one-hot encoded (after binning), but as with the previous finding, the accuracy score dropped.