Kaggle Series (Part 1: The Titanic Starter Competition)

Background

The sinking of the Titanic in 1912 killed about 1,502 of the roughly 2,224 passengers and crew on board (our leading man among them). We have some data on the passengers, along with survival information for part of them. By exploring this data we hope to uncover a few hidden patterns, and along the way predict whether the other part of the passengers survived.

Data Import and Analysis

Import the required packages

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

import warnings
warnings.filterwarnings('ignore')

from collections import Counter

from sklearn.ensemble import RandomForestClassifier,AdaBoostClassifier,GradientBoostingClassifier,\
      ExtraTreesClassifier,VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB 
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV, cross_val_score, StratifiedKFold, learning_curve

Load the data

In [2]:
train = pd.read_csv("C:/Code/Kaggle/Titanic/train.csv")
test = pd.read_csv("C:/Code/Kaggle/Titanic/test.csv")
IDtest = test["PassengerId"]

Remove outliers

Using the Tukey rule, a passenger is treated as an outlier and dropped when more than two of the features Age, SibSp, Parch and Fare lie more than 1.5 * IQR outside the first or third quartile.

In [3]:
def detect_outliers(df, n, features):
    """Return the indices of rows that are outliers in more than n of the
    given features, using the Tukey rule (1.5 * IQR beyond the quartiles)."""
    outlier_indices = []
    for col in features:
        # first quartile, third quartile and interquartile range of this feature
        Q1 = np.percentile(df[col], 25)
        Q3 = np.percentile(df[col], 75)
        IQR = Q3 - Q1
        outlier_step = 1.5 * IQR
        # rows whose value lies outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
        outlier_list_col = df[(df[col] < Q1 - outlier_step) | (df[col] > Q3 + outlier_step)].index
        outlier_indices.extend(outlier_list_col)

    # keep only the rows flagged as outliers in more than n features
    outlier_indices = Counter(outlier_indices)
    multiple_outliers = list(k for k, v in outlier_indices.items() if v > n)
    return multiple_outliers
Outliers_to_drop = detect_outliers(train,2,["Age","SibSp","Parch","Fare"])
In [4]:
train.loc[Outliers_to_drop]
Out[4]:
  PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
27 28 0 1 Fortune, Mr. Charles Alexander male 19.0 3 2 19950 263.00 C23 C25 C27 S
88 89 1 1 Fortune, Miss. Mabel Helen female 23.0 3 2 19950 263.00 C23 C25 C27 S
159 160 0 3 Sage, Master. Thomas Henry male NaN 8 2 CA. 2343 69.55 NaN S
180 181 0 3 Sage, Miss. Constance Gladys female NaN 8 2 CA. 2343 69.55 NaN S
201 202 0 3 Sage, Mr. Frederick male NaN 8 2 CA. 2343 69.55 NaN S
792 793 0 3 Sage, Miss. Stella Anna female NaN 8 2 CA. 2343 69.55 NaN S
324 325 0 3 Sage, Mr. George John Jr male NaN 8 2 CA. 2343 69.55 NaN S
846 847 0 3 Sage, Mr. Douglas Bullen male NaN 8 2 CA. 2343 69.55 NaN S
341 342 1 1 Fortune, Miss. Alice Elizabeth female 24.0 3 2 19950 263.00 C23 C25 C27 S
863 864 0 3 Sage, Miss. Dorothy Edith "Dolly" female NaN 8 2 CA. 2343 69.55 NaN S
In [5]:
train = train.drop(Outliers_to_drop,axis=0).reset_index(drop=True)

Concatenate the training and test data

Working on a single combined frame keeps missing-value handling and feature engineering consistent across both sets; train_len records where to split them apart again later.

In [6]:
train_len = len(train)
dataset = pd.concat([train,test], axis=0).reset_index(drop=True)
dataset.tail()
Out[6]:
  Age Cabin Embarked Fare Name Parch PassengerId Pclass Sex SibSp Survived Ticket
1294 NaN NaN S 8.0500 Spector, Mr. Woolf 0 1305 3 male 0 NaN A.5. 3236
1295 39.0 C105 C 108.9000 Oliva y Ocana, Dona. Fermina 0 1306 1 female 0 NaN PC 17758
1296 38.5 NaN S 7.2500 Saether, Mr. Simon Sivertsen 0 1307 3 male 0 NaN SOTON/O.Q. 3101262
1297 NaN NaN S 8.0500 Ware, Mr. Frederick 0 1308 3 male 0 NaN 359309
1298 NaN NaN C 22.3583 Peter, Master. Michael J 1 1309 3 male 1 NaN 2668
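
The train_len recorded above is what lets us split the combined frame back apart once preprocessing is done. A minimal sketch of that later split (the variable names here are illustrative, not from the original notebook):

# Sketch: recover the processed train / test sets from the combined frame.
# Survived is only defined for the training rows, so it is dropped from the test part.
train_processed = dataset[:train_len]
test_processed = dataset[train_len:].drop(labels=["Survived"], axis=1)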

Check missing values

Note that Survived is missing only for the 418 test rows, so it is not a genuine missing value; Age, Cabin, Embarked and Fare are the columns that actually need filling.

In [7]:
#dataset = dataset.fillna(np.nan)
dataset.isnull().sum()
Out[7]:
Age             256
Cabin          1007
Embarked          2
Fare              1
Name              0
Parch             0
PassengerId       0
Pclass            0
Sex               0
SibSp             0
Survived        418
Ticket            0
dtype: int64
In [8]:
train.info()
train.isnull().sum()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 881 entries, 0 to 880
Data columns (total 12 columns):
PassengerId    881 non-null int64
Survived       881 non-null int64
Pclass         881 non-null int64
Name           881 non-null object
Sex            881 non-null object
Age            711 non-null float64
SibSp          881 non-null int64
Parch          881 non-null int64
Ticket         881 non-null object
Fare           881 non-null float64
Cabin          201 non-null object
Embarked       879 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 82.7+ KB
Out[8]:
PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            170
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          680
Embarked         2
dtype: int64
In [9]:
train.describe()
Out[9]:
  PassengerId Survived Pclass Age SibSp Parch Fare
count 881.000000 881.000000 881.000000 711.000000 881.000000 881.000000 881.000000
mean 446.713961 0.385925 2.307605 29.731603 0.455165 0.363224 31.121566
std 256.617021 0.487090 0.835055 14.547835 0.871571 0.791839 47.996249
min 1.000000 0.000000 1.000000 0.420000 0.000000 0.000000 0.000000
25% 226.000000 0.000000 2.000000 20.250000 0.000000 0.000000 7.895800
50% 448.000000 0.000000 3.000000 28.000000 0.000000 0.000000 14.454200
75% 668.000000 1.000000 3.000000 38.000000 1.000000 0.000000 30.500000
max 891.000000 1.000000 3.000000 80.000000 5.000000 6.000000 512.329200

Feature Analysis and Data Preprocessing

Numerical variables

Start with the correlation between the numerical features (SibSp, Parch, Age, Fare) and Survived.

In [10]:
g = sns.heatmap(train[["Survived","SibSp","Parch","Age","Fare"]].corr(), annot=True, fmt=".2f", cmap = "coolwarm")

Explore SibSp feature vs Survived

In [11]:
g = sns.factorplot(x="SibSp",y="Survived",data=train,kind="bar")
g = g.set_ylabels("survival probability")
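
A side note: factorplot was renamed to catplot in seaborn 0.9 and later removed, so on a recent seaborn install the equivalent of this call (and of the later factorplot cells) would be roughly:

# Assumes seaborn >= 0.9, where factorplot became catplot.
g = sns.catplot(x="SibSp", y="Survived", data=train, kind="bar")
g = g.set_ylabels("survival probability")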

Explore Parch feature vs Survived

In [12]:
g  = sns.factorplot(x="Parch",y="Survived",data=train,kind="bar")
g = g.set_ylabels("survival probability")

Explore Age distribution

In [13]:
g = sns.kdeplot(train["Age"][(train["Survived"] == 0) & (train["Age"].notnull())], color="Red", shade = True)
g = sns.kdeplot(train["Age"][(train["Survived"] == 1) & (train["Age"].notnull())], ax =g, color="Blue", shade= True)
g.set_xlabel("Age")
g.set_ylabel("Frequency")
g = g.legend(["Not Survived","Survived"])
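
Likewise, recent seaborn versions replace the shade argument of kdeplot with fill; an equivalent sketch, assuming seaborn 0.11 or later:

# Assumes seaborn >= 0.11, where shade= was replaced by fill=.
g = sns.kdeplot(train["Age"][(train["Survived"] == 0) & (train["Age"].notnull())], color="Red", fill=True)
g = sns.kdeplot(train["Age"][(train["Survived"] == 1) & (train["Age"].notnull())], ax=g, color="Blue", fill=True)
g.set_xlabel("Age")
g.set_ylabel("Frequency")
g = g.legend(["Not Survived", "Survived"])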

Filling missing Age values
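
A minimal sketch of one common way to fill these, assuming each missing Age is imputed with the median Age of passengers sharing the same SibSp, Parch and Pclass, with the overall median as a fallback (an assumed approach, not code from the original notebook):

# Sketch only: impute each missing Age from the median Age of "similar" passengers
# (same SibSp, Parch and Pclass); fall back to the overall median otherwise.
age_median = dataset["Age"].median()
for i in dataset[dataset["Age"].isnull()].index:
    similar = dataset["Age"][(dataset["SibSp"] == dataset.loc[i, "SibSp"]) &
                             (dataset["Parch"] == dataset.loc[i, "Parch"]) &
                             (dataset["Pclass"] == dataset.loc[i, "Pclass"])].median()
    dataset.loc[i, "Age"] = similar if not np.isnan(similar) else age_median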