This article draws on code and ideas from Titanic Data Science Solutions and 第一次Kaggle项目——泰坦尼克号 ("My First Kaggle Project: Titanic").
Steps
Problem Statement
What kinds of people were more likely to survive on the Titanic?
Data Import and Cleaning
The data used in this article comes from Kaggle; we import it with pandas.
import pandas as pd
train_df = pd.read_csv('train.csv')
test_df = pd.read_csv('test.csv')
Missing values severely limit how the data can be used, so we first check for them with .info().
train_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId 891 non-null int64
Survived 891 non-null int64
Pclass 891 non-null int64
Name 891 non-null object
Sex 891 non-null object
Age 714 non-null float64
SibSp 891 non-null int64
Parch 891 non-null int64
Ticket 891 non-null object
Fare 891 non-null float64
Cabin 204 non-null object
Embarked 889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB
In the train set, Age, Cabin, and Embarked have missing values that need to be filled in.
test_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 11 columns):
PassengerId 418 non-null int64
Pclass 418 non-null int64
Name 418 non-null object
Sex 418 non-null object
Age 332 non-null float64
SibSp 418 non-null int64
Parch 418 non-null int64
Ticket 418 non-null object
Fare 417 non-null float64
Cabin 91 non-null object
Embarked 418 non-null object
dtypes: float64(2), int64(4), object(5)
memory usage: 36.0+ KB
In the test set, Age, Fare, and Cabin have missing values that need to be filled in.
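As an optional shortcut, .isna().sum() lists just the per-column missing counts (a minimal sketch using the DataFrames loaded above):

print(train_df.isna().sum())
print(test_df.isna().sum())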
Merge the two datasets so they can be cleaned together:
combine = pd.concat([train_df, test_df], axis = 0)
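Because test_df has no Survived column, the test rows receive NaN there after the concat; this is what lets us split the data back apart later. A quick shape check:

print(combine.shape)  # (1309, 12): 891 train rows + 418 test rows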
Feature Engineering
import numpy as np
# fix the random seed so the random imputation below is reproducible
np.random.seed(42)
Handling Age
Fill the missing ages with random draws from a normal distribution whose mean (np.mean()) and standard deviation (np.std()) match the observed ages.
# .copy() avoids pandas' SettingWithCopyWarning on the assignment below
Age_null = combine[combine['Age'].isna()].copy()
# draw one value per missing row (note: a 1-D size, not an (n, 1) shape)
Age_null['Age'] = np.random.normal(np.mean(combine['Age']), np.std(combine['Age']),
                                   Age_null.shape[0])
# Age_null['Age'] = Age_null['Age'].apply(round)
Age_notnull = combine[combine['Age'].notna()]
combine = pd.concat([Age_null, Age_notnull], axis = 0)
Handling Fare
Fare is filled in the same way:
Fare_null = combine[combine['Fare'].isna()].copy()
Fare_null['Fare'] = np.random.normal(np.mean(combine['Fare']), np.std(combine['Fare']),
                                     Fare_null.shape[0])
# Fare_null['Fare'] = Fare_null['Fare'].apply(round)
Fare_notnull = combine[combine['Fare'].notna()]
combine = pd.concat([Fare_null, Fare_notnull], axis = 0)
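A quick sanity check that the imputation left no missing Age or Fare values:

assert combine['Age'].isna().sum() == 0
assert combine['Fare'].isna().sum() == 0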
Handling Embarked: 'S' is the most common port, so we use it to fill the missing values.
combine['Embarked'] = combine['Embarked'].fillna('S')
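Equivalently, the most frequent port can be looked up with mode() rather than hard-coded (a minimal alternative sketch; most_common_port is a name introduced here):

most_common_port = combine['Embarked'].mode()[0]  # 'S' on this dataset
combine['Embarked'] = combine['Embarked'].fillna(most_common_port)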
Encode Sex as 0/1:
sex = {'male': 1, 'female': 0}
combine['Sex'] = combine['Sex'].map(sex)
One-hot encode the embarkation port Embarked and drop the original column:
data_Embark = pd.get_dummies(combine['Embarked'], prefix = 'Embarked')
combine = pd.concat([data_Embark, combine], axis = 1)
combine = combine.drop('Embarked', axis = 1)
One-hot encode the passenger class Pclass and drop the original column:
data_Pclass = pd.get_dummies(combine['Pclass'], prefix = 'Pclass')
combine = pd.concat([data_Pclass, combine], axis = 1)
combine = combine.drop('Pclass', axis = 1)
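Each get_dummies call replaces a single categorical column with one indicator column per category. A quick optional check of the new column names:

# expect Embarked_C / Embarked_Q / Embarked_S and Pclass_1 / Pclass_2 / Pclass_3
print([c for c in combine.columns if c.startswith(('Embarked_', 'Pclass_'))])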
Names
Extract each passenger's title from the name and consolidate rare or variant titles:
combine['NameTitle'] = combine.Name.str.extract(r' ([A-Za-z]+)\.', expand=False)
combine['NameTitle'] = combine['NameTitle'].replace(['Lady', 'Countess','Capt', 'Col',\
'Don', 'Dr', 'Major', 'Rev', 'Sir', 'Jonkheer', 'Dona'], 'Rare')
combine['NameTitle'] = combine['NameTitle'].replace(['Mlle', 'Ms'], 'Miss')
combine['NameTitle'] = combine['NameTitle'].replace('Mme', 'Mrs')
combine = combine.drop('Name', axis = 1)
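Before one-hot encoding, it is worth checking that only the five consolidated titles remain (a quick optional check):

# expect only Mr, Miss, Mrs, Master, and Rare
print(combine['NameTitle'].value_counts())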
One-hot encode the title and drop the original column:
data_NameTitle = pd.get_dummies(combine['NameTitle'], prefix = 'NameTitle')
combine = pd.concat([data_NameTitle, combine], axis = 1)
combine = combine.drop('NameTitle', axis = 1)
Family size aboard: the passenger plus siblings/spouses (SibSp) and parents/children (Parch); e.g. a passenger with SibSp = 1 and Parch = 2 gets FamilySize = 4.
combine['FamilySize'] = combine['SibSp'] + combine['Parch'] + 1
combine = combine.drop(['SibSp', 'Parch'], axis = 1)
Drop Cabin (too many missing values), Ticket (noisy, unstructured information), and PassengerId (carries no predictive signal).
combine = combine.drop(['Cabin', 'PassengerId', 'Ticket'], axis = 1)
Split back into train and test using the Survived column, which is NaN exactly for the test rows:
train = combine[combine['Survived'].notna()]
test = combine[combine['Survived'].isna()].drop('Survived', axis=1)
X_train = train.drop('Survived', axis = 1)
Y_train = train['Survived']
X_test = test
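A quick check that the split recovers the original row counts:

print(X_train.shape[0], X_test.shape[0])  # 891 418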
Machine Learning
from sklearn.svm import SVC, LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import Perceptron
from sklearn.linear_model import SGDClassifier
from sklearn.tree import DecisionTreeClassifier
svc = SVC()
svc.fit(X_train, Y_train)
acc_svc = round(svc.score(X_train, Y_train) * 100, 2)
knn = KNeighborsClassifier(n_neighbors = 33)
knn.fit(X_train, Y_train)
acc_knn = round(knn.score(X_train, Y_train) * 100, 2)
gaussian = GaussianNB()
gaussian.fit(X_train, Y_train)
acc_gaussian = round(gaussian.score(X_train, Y_train) * 100, 2)
perceptron = Perceptron()
perceptron.fit(X_train, Y_train)
acc_perceptron = round(perceptron.score(X_train, Y_train) * 100, 2)
linear_svc = LinearSVC()
linear_svc.fit(X_train, Y_train)
acc_linear_svc = round(linear_svc.score(X_train, Y_train) * 100, 2)
sgd = SGDClassifier()
sgd.fit(X_train, Y_train)
acc_sgd = round(sgd.score(X_train, Y_train) * 100, 2)
decision_tree = DecisionTreeClassifier()
decision_tree.fit(X_train, Y_train)
acc_decision_tree = round(decision_tree.score(X_train, Y_train) * 100, 2)
random_forest = RandomForestClassifier(n_estimators=100)
random_forest.fit(X_train, Y_train)
acc_random_forest = round(random_forest.score(X_train, Y_train) * 100, 2)
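All of these scores are accuracy on the training set itself, so models that can memorize the data (Decision Tree, Random Forest) look deceptively strong. A minimal cross-validation sketch, using scikit-learn's cross_val_score, gives a fairer estimate:

from sklearn.model_selection import cross_val_score

# 5-fold cross-validated accuracy for the random forest
cv_acc = cross_val_score(RandomForestClassifier(n_estimators=100),
                         X_train, Y_train, cv=5, scoring='accuracy')
print(round(cv_acc.mean() * 100, 2))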
Compute the scores:
models = pd.DataFrame({
'Model': ['Support Vector Machines', 'KNN',
'Random Forest', 'Naive Bayes', 'Perceptron',
'Stochastic Gradient Descent', 'Linear SVC',
'Decision Tree'],
'Score': [acc_svc, acc_knn,
acc_random_forest, acc_gaussian, acc_perceptron,
acc_sgd, acc_linear_svc, acc_decision_tree]})
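To rank the models, the DataFrame can be sorted by score (a minimal optional sketch):

print(models.sort_values(by='Score', ascending=False))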
Results
|   | Model | Score |
| --- | --- | --- |
| 0 | Support Vector Machines | 88.55 |
| 1 | KNN | 72.62 |
| 2 | Random Forest | 99.10 |
| 3 | Naive Bayes | 79.91 |
| 4 | Perceptron | 58.59 |
| 5 | Stochastic Gradient Descent | 73.51 |
| 6 | Linear SVC | 82.04 |
| 7 | Decision Tree | 99.10 |
Some Findings
- During missing-value imputation, rounding Age and Fare did not bring any improvement in the accuracy score.
- Age and Fare can also be one-hot encoded (after binning), but as with the previous finding, the accuracy score dropped.