Python and Data Analysis 5: Analyzing an Entry-Level Competition, Titanic

賴博伯

Last modified 2022-02-05 15:34:59

Categories: Lecture Notes, Notes, Matplotlib. Tags: python, matrices, algorithms, linear algebra, sklearn

First published 2021-08-04 17:41:55

Article link: https://blog.csdn.net/m0_47985483/article/details/119386583


“Talk is cheap. Show me the code.”
― Linus Torvalds

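Laozi, chapter 41
The highest virtue is like a valley
The purest white seems sullied
The great square has no corners
The great vessel is completed late
The great sound is barely heard
The great image has no form
The Tao is hidden and nameless

Throw a punch a thousand times and the form comes naturally

“There’s no shortage of remarkable ideas, what’s missing is the will to execute them.” – Seth Godin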

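110_1 high school periodic course: Introduction to Python Programming and a First Look at Data Analysis, 道明高中

Links to the articles in this series

Python Programming and Data Analysis 1 link
Python Programming and Data Analysis 1.1: a roadmap of the free courses on Kaggle link
Python and Data Analysis 2: Data Visualization, Introduction to Matplotlib.pyplot link
Python and Data Analysis 3.1: Data Visualization, Basic Chart Types link
Python and Data Analysis 3.2: Data Visualization, Seen Through seaborn's Function Categories link
Python and Data Analysis 3.3: Data Visualization, More on seaborn link
Python and Data Analysis 4: Data Visualization, Iris link
Python and Data Analysis 5: Analyzing an Entry-Level Competition, Titanic link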

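Preface

Next, we demonstrate a complete data analysis workflow from start to finish, using the Titanic dataset provided by Kaggle as the example.
The reference is
Kaggle入门，看这一篇就够了 (a Kaggle getting-started article) link
which introduces three entry-level competitions.
We follow its English Titanic tutorial and walk through the whole thing:
Titanic
English tutorial: Helge Bjorland & Stian Eide, An Interactive Data Science Tutorial-Based on the Titanic competition on Kaggle
https://www.kaggle.com/helgejo/an-interactive-data-science-tutorial link

There is a gap between the earlier data visualization articles and this Titanic analysis: this example uses several sklearn data analysis (machine learning) methods. They can be applied directly, so there is no need to be intimidated:
DecisionTreeClassifier: decision tree
LogisticRegression: logistic regression
KNeighborsClassifier: k-nearest neighbors
GradientBoostingClassifier: gradient boosting
and so on.
Beginners do not need to understand the theory yet. First learn in which situations these models apply and how to call them in sklearn.

In step 3 (Data Preparation), beginners can learn just the cleaning part. The later transformation and merging steps are somewhat tedious and can be skipped at first.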

The English tutorial follows the CRISP-DM process.

CRISP-DM stands for Cross-Industry Standard Process for Data Mining.

Ref: What is the CRISP-DM methodology? https://www.sv-europe.com/crisp-dm-methodology/ link

CRISP-DM has 6 steps:

Business understanding: understand the business context and goals
Data understanding: understand the dataset
Data preparation: prepare the data (cleaning, transformation, etc.)
Modeling: build models
Evaluation: evaluate the models
Deployment: deploy

1. Objective: predict survivors

1.1 Objective
Predict survival on the Titanic

2. Data Understanding

2.1 Import Libraries

The code in the English tutorial by Helge Bjorland & Stian Eide is somewhat dated (2017).
Below we have replaced the sklearn imports that no longer exist in current versions.
For example,
#from sklearn.preprocessing import Imputer , Normalizer , scale
becomes:
from sklearn.impute import SimpleImputer
and
#from sklearn.cross_validation import train_test_split , StratifiedKFold
becomes:
from sklearn.model_selection import train_test_split , StratifiedKFold

# 20210804 P-J Lai MATH NKNU
##Titanic
##- Kaggle入门，看这一篇就够了 (a Kaggle getting-started article),
##English tutorial: An Interactive Data Science Tutorial-Based on the Titanic
##competition on Kaggle
##https://www.kaggle.com/helgejo/an-interactive-data-science-tutorial

# An Interactive Data Science Tutorial_Titanic.py
################################################################

# Ignore warnings
import warnings
warnings.filterwarnings('ignore')

# Handle table-like data and matrices
import numpy as np
import pandas as pd

# Modelling Algorithms
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC, LinearSVC
from sklearn.ensemble import RandomForestClassifier , GradientBoostingClassifier

# Modelling Helpers
#from sklearn.preprocessing import Imputer , Normalizer , scale
from sklearn.impute import SimpleImputer
#from sklearn.cross_validation import train_test_split , StratifiedKFold
from sklearn.model_selection import train_test_split , StratifiedKFold
from sklearn.feature_selection import RFECV

# Visualisation
import matplotlib as mpl
import matplotlib.pyplot as plt
import matplotlib.pylab as pylab
import seaborn as sns

# Configure visualisations
#%matplotlib inline
mpl.style.use( 'ggplot' )
sns.set_style( 'white' )
pylab.rcParams[ 'figure.figsize' ] = 8 , 6

Changes to the Imputer module in sklearn:

As sklearn has been updated, the way to use the imputer has changed. The original form was

from sklearn.preprocessing import Imputer
imputer = Imputer(strategy='median')

This now needs to be updated to

from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy='median')

Source: original article by CSDN blogger 「远方与你」, licensed under CC 4.0 BY-SA, https://blog.csdn.net/qq_37388085/article/details/104350508
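As a minimal sketch of the newer API, the snippet below fills missing values with the column median using SimpleImputer. The small `ages` frame is made up for illustration and is not part of the Titanic data.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy column with two missing values (not the Titanic data)
ages = pd.DataFrame({"Age": [22.0, np.nan, 26.0, 35.0, np.nan]})

imputer = SimpleImputer(strategy="median")
# fit_transform returns a NumPy array, so wrap it back into a DataFrame
filled = pd.DataFrame(imputer.fit_transform(ages), columns=ages.columns)

print(filled["Age"].tolist())
```

The median is computed from the known values only (22, 26, 35), so both gaps receive 26.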

Changes to the cross_validation module in sklearn

The newer module sklearn.model_selection combines the former sklearn.cross_validation, sklearn.grid_search and sklearn.learning_curve modules.
For example, the cross_validation module is gone and its functions now live in model_selection. Commonly used names such as train_test_split and StratifiedKFold are unchanged.
For details see
http://scikit-learn.org/stable/modules/cross_validation.html#cross-validation-iterators-with-stratification-based-on-class-labels
Ref: https://blog.csdn.net/qi_1221/article/details/76071555

2.2 Setup helper Functions

The original tutorial says there is no need to understand this code, just run it. The helpers make the later explanations easier.

def plot_histograms( df , variables , n_rows , n_cols ):
    fig = plt.figure( figsize = ( 16 , 12 ) )
    for i, var_name in enumerate( variables ):
        ax=fig.add_subplot( n_rows , n_cols , i+1 )
        df[ var_name ].hist( bins=10 , ax=ax )
        ax.set_title( 'Skew: ' + str( round( float( df[ var_name ].skew() ) , ) ) ) # + ' ' + var_name ) #var_name+" Distribution")
        ax.set_xticklabels( [] , visible=False )
        ax.set_yticklabels( [] , visible=False )
    fig.tight_layout()  # Improves appearance a bit.
    plt.show()

def plot_distribution( df , var , target , **kwargs ):
    row = kwargs.get( 'row' , None )
    col = kwargs.get( 'col' , None )
    facet = sns.FacetGrid( df , hue=target , aspect=4 , row = row , col = col )
    facet.map( sns.kdeplot , var , fill = True )  # 'fill' replaces the deprecated 'shade' argument
    facet.set( xlim=( 0 , df[ var ].max() ) )
    facet.add_legend()

def plot_categories( df , cat , target , **kwargs ):
    row = kwargs.get( 'row' , None )
    col = kwargs.get( 'col' , None )
    facet = sns.FacetGrid( df , row = row , col = col )
    facet.map( sns.barplot , cat , target )
    facet.add_legend()

def plot_correlation_map( df ):
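    # Use the df argument (the original read the global variable titanic).
    # With pandas >= 2.0, use df.corr( numeric_only = True ) to skip text columns.
    corr = df.corr()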
    _ , ax = plt.subplots( figsize =( 12 , 10 ) )
    cmap = sns.diverging_palette( 220 , 10 , as_cmap = True )
    _ = sns.heatmap(
        corr, 
        cmap = cmap,
        square=True, 
        cbar_kws={ 'shrink' : .9 }, 
        ax=ax, 
        annot = True, 
        annot_kws = { 'fontsize' : 12 }
    )

def describe_more( df ):
    var = [] ; l = [] ; t = []
    for x in df:
        var.append( x )
        l.append( len( df[ x ].value_counts() ) )
        t.append( df[ x ].dtypes )
    levels = pd.DataFrame( { 'Variable' : var , 'Levels' : l , 'Datatype' : t } )
    levels.sort_values( by = 'Levels' , inplace = True )
    return levels

def plot_variable_importance( X , y ):
    tree = DecisionTreeClassifier( random_state = 99 )
    tree.fit( X , y )
    plot_model_var_imp( tree , X , y )
    
def plot_model_var_imp( model , X , y ):
    imp = pd.DataFrame( 
        model.feature_importances_  , 
        columns = [ 'Importance' ] , 
        index = X.columns 
    )
    imp = imp.sort_values( [ 'Importance' ] , ascending = True )
    imp[ : 10 ].plot( kind = 'barh' )
    print (model.score( X , y ))

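2.3 Load data

The Titanic dataset can be downloaded from Kaggle:
There are three files in the data:
(1) train.csv,
(2) test.csv, and
(3) gender_submission.csv.
Ref: https://www.kaggle.com/c/titanic link

Here I download the files and run the program on my own computer,
so the original paths below must be changed to your own location.

(Note: the alternative is to create a new Notebook on Kaggle and write and run the code in the cloud there. In that case keep the paths as in the tutorial, since they point to locations on Kaggle's cloud storage.)

After downloading, put the three .csv files in the same place as your .py file (the same folder on your computer),
replace the ../input/ part of the load path
with ./ , which means the current folder.
That is, change
train = pd.read_csv("../input/train.csv")
to
train = pd.read_csv("./train.csv")

Open Python IDLE, or use Anaconda Jupyter Notebook, PyCharm, etc.,
create a new script and enter the code.
(Note that the default working folder of Anaconda Jupyter Notebook is somewhere deeper in the file system, so check where it is.
With Python IDLE, if you save the script on, say, the desktop and also put the downloaded Titanic dataset on the desktop, Python IDLE will find it.)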

##train = pd.read_csv("../input/train.csv")
##test    = pd.read_csv("../input/test.csv")

train = pd.read_csv("./train.csv")
test    = pd.read_csv("./test.csv")

full = pd.concat( [ train , test ] , ignore_index = True )  # DataFrame.append() was removed in pandas 2.0
titanic = full[ :891 ]

del train , test

print ('Datasets:' , 'full:' , full.shape , 'titanic:' , titanic.shape)

Output:

>>> 
Datasets: full: (1309, 12) titanic: (891, 12)
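To see what the row-wise merge does, here is a small sketch with two made-up frames standing in for train.csv and test.csv (the values and the three-column layout are illustrative assumptions). The test part has no Survived column, so those rows get NaN after the merge.

```python
import pandas as pd

# Hypothetical miniature stand-ins for train.csv and test.csv
train = pd.DataFrame({"PassengerId": [1, 2, 3], "Survived": [0, 1, 1], "Age": [22.0, 38.0, 26.0]})
test = pd.DataFrame({"PassengerId": [4, 5], "Age": [34.5, 47.0]})  # no Survived column

# pd.concat stacks the rows; ignore_index=True renumbers them 0..n-1
full = pd.concat([train, test], ignore_index=True)
titanic = full[:len(train)]  # the labelled part (891 rows in the real data)

print(full.shape, titanic.shape)
print(full["Survived"].isna().sum())  # the test rows have no label
```

This is also why Survived is printed as 0.0 and 1.0 in the real data: a column that contains NaN is stored as floating point.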

2.4 Statistical summaries and visualisations

First print the first five rows with .head(), mainly to see which attributes there are.

# 2.4 Statistical summaries and visualisations

print(titanic.head())

Output:

>>> 
Datasets: full: (1309, 12) titanic: (891, 12)
   PassengerId  Survived  Pclass  ...     Fare Cabin  Embarked
0            1       0.0       3  ...   7.2500   NaN         S
1            2       1.0       1  ...  71.2833   C85         C
2            3       1.0       3  ...   7.9250   NaN         S
3            4       1.0       1  ...  53.1000  C123         S
4            5       0.0       3  ...   8.0500   NaN         S

[5 rows x 12 columns]

You can also open train.csv directly in an editor such as Sublime Text, Notepad++ or WordPad.
The first row lists all the attributes:
PassengerId: passenger number
Survived: whether the passenger survived
Pclass: ticket class
Name: name
Sex: sex
Age: age
SibSp, Parch, Ticket, Fare, Cabin, Embarked (port of embarkation, abbreviated C, Q, S), and so on.
[Figure: Titanic_train.csv]

Some of these, such as SibSp, Parch and Ticket, are not obvious at first glance (is Ticket a price or a number?). Detailed descriptions are on Kaggle:
on the Titanic competition page https://www.kaggle.com/c/titanic link
open the Data tab, which explains the dataset
https://www.kaggle.com/c/titanic/data link

Variable Description

Survived: Survived (1) or died (0)
Pclass: Passenger’s class
Name: Passenger’s name
Sex: Passenger’s sex
Age: Passenger’s age
SibSp: Number of siblings/spouses aboard
Parch: Number of parents/children aboard
Ticket: Ticket number
Fare: Fare
Cabin: Cabin
Embarked: Port of embarkation (abbreviated C, Q, S)

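2.4.1 `describe()`: minimum, maximum and quartiles of each attribute

Similar to the summary() command in R.

print(titanic.describe())

Output:
For each numeric attribute it lists the count, mean, standard deviation, minimum, maximum and quartiles.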

       PassengerId    Survived      Pclass  ...       SibSp       Parch        Fare
count   891.000000  891.000000  891.000000  ...  891.000000  891.000000  891.000000
mean    446.000000    0.383838    2.308642  ...    0.523008    0.381594   32.204208
std     257.353842    0.486592    0.836071  ...    1.102743    0.806057   49.693429
min       1.000000    0.000000    1.000000  ...    0.000000    0.000000    0.000000
25%     223.500000    0.000000    2.000000  ...    0.000000    0.000000    7.910400
50%     446.000000    0.000000    3.000000  ...    0.000000    0.000000   14.454200
75%     668.500000    1.000000    3.000000  ...    1.000000    0.000000   31.000000
max     891.000000    1.000000    3.000000  ...    8.000000    6.000000  512.329200

[8 rows x 7 columns]

2.4.2 A heat map of correlation

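Here the tutorial draws a heat map of the correlation coefficients between the attributes with
plot_correlation_map( titanic )
plot_correlation_map() is one of the helper functions defined earlier.
It first calls
data.corr()
to compute the correlation coefficients,
then calls seaborn's heatmap() to draw the heat map:
sns.heatmap()

plot_correlation_map( titanic )

[Figure: Titanic_2.4.2 A heat map of correlation]
This gives a quick first impression: among the numeric attributes, Survived has its largest positive correlation with Fare (about 0.26), while Pclass has the strongest correlation in absolute value (about -0.34).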
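The numbers behind such a heat map can also be inspected directly. The sketch below uses a made-up six-row frame with Titanic-like column names (an assumption for illustration) and ranks the numeric attributes by the absolute value of their correlation with Survived.

```python
import pandas as pd

# Hypothetical toy rows with the same column names as the Titanic data
df = pd.DataFrame({
    "Survived": [0, 1, 1, 1, 0, 0],
    "Pclass":   [3, 1, 3, 1, 3, 2],
    "Fare":     [7.25, 71.28, 7.92, 53.10, 8.05, 13.00],
    "Sex":      ["male", "female", "female", "female", "male", "male"],
})

# Keep only numeric columns so that corr() does not fail on strings
corr = df.select_dtypes(include="number").corr()

# Correlation of every numeric attribute with Survived, strongest first
ranking = corr["Survived"].drop("Survived").abs().sort_values(ascending=False)
print(ranking)
```

Ranking by absolute value matters because a strong negative correlation (as with Pclass) is as informative as a strong positive one.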

2.4.3 Explore further how each attribute relates to survival

##2.4.3 Let's further explore the relationship
##between the features and survival of passengers

# Plot distributions of Age of passengers who survived or did not survive
plot_distribution( titanic , var = 'Age' , target = 'Survived' , row = 'Sex' )
plt.show()

[Figure: Titanic_2.4.3 distributions of Age of passengers who survived]
The tutorial's point is this: split the plot into two panels by sex and, in each panel, draw the Age distribution as two curves, one for survivors and one for non-survivors (the figure above).
If the two curves barely differ, the attribute is not a good predictor.
In the figure above,
for males, the age distributions of survivors and non-survivors clearly differ,
while for females the difference is small.

Ex: Consider some key questions such as: at what ages do males/females have a higher or lower probability of survival?

2.4.3 Exercise 1: Investigating numeric variables

Take the plot above (two panels by sex, Age distribution drawn as two curves for survived and not survived) and replace Age with Fare. Do the two curves differ clearly? Is Fare a good predictor?

################################################################################
##2.4.3 Exercise 1: Investigating numeric variables
##Replace Age with Fare in the distribution plot above (two panels by sex),
##and check whether the curves differ clearly, i.e. whether Fare is a good predictor
plot_distribution( titanic , var = 'Fare' , target = 'Survived' , row = 'Sex' )
plt.show()

[Figure: Titanic_2.4.3 distributions of Fare of passengers who survived]
In the figure above,
the Fare distributions of survivors and non-survivors do not differ much.

2.4.4 Is the port of embarkation (Embarked) related to survival?

The Embarked attribute holds one of three ports of embarkation, abbreviated C, Q, S:
C = Cherbourg
Q = Queenstown
S = Southampton

Embarked is a categorical attribute.
Categorical attributes can be plotted with seaborn's categorical plotting functions, such as sns.catplot() or sns.barplot() (the plot_categories() helper uses sns.barplot()).

# Plot survival rate by Embarked
### 2.4.4 Is Embarked related to survival?
##The three ports of embarkation are abbreviated C, Q, S
##C = Cherbourg
##Q = Queenstown
##S = Southampton

plot_categories( titanic , cat = 'Embarked' , target = 'Survived' )
plt.show()

[Figure: Titanic_2.4.4 embarked vs survived]

2.4.4 Exercise 2 - 5: Investigating categorical variables

Even more coding practice! Try to plot the survival rate of Sex, Pclass, SibSp and Parch below.

Hint: use the code from the previous cell as a starting point.

After considering these graphs, which variables do you expect to be good predictors of survival?

Exercise 2
Plot survival rate by Sex

The bar chart below shows that sex is indeed strongly related to survival!

We only need to change cat = 'Embarked' in the code above to
cat = 'Sex' (cat stands for categorical here).

##Exercise 2
##Plot survival rate by Sex

plot_categories( titanic , cat = 'Sex' , target = 'Survived' )
plt.show()

[Figure: Titanic_2.4.4 Sex vs survived]

Exercise 3
Plot survival rate by Pclass (ticket class)
Exercise 4
Plot survival rate by SibSp (number of siblings/spouses aboard)
Exercise 5
Plot survival rate by Parch (number of parents/children aboard)
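The bar heights in these plots are simply the mean of the 0/1 Survived column within each category, so the same numbers can be checked without plotting. The sketch below assumes a hypothetical helper name, `survival_rate`, and a made-up six-row frame.

```python
import pandas as pd

def survival_rate(df, cat, target="Survived"):
    """Mean of the 0/1 target per category, i.e. the bar heights of the plot."""
    return df.groupby(cat)[target].mean()

# Hypothetical toy rows (not the Titanic data)
toy = pd.DataFrame({
    "Sex":      ["male", "female", "female", "male", "female", "male"],
    "Pclass":   [3, 1, 3, 1, 2, 3],
    "Survived": [0, 1, 1, 0, 1, 0],
})

for cat in ["Sex", "Pclass"]:
    print(survival_rate(toy, cat))
```

On the real data the call would be `survival_rate(titanic, 'Pclass')`, and likewise for SibSp and Parch.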

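3. Data Preparation

In a typical data analysis project, data preparation is usually the most laborious part.

It includes cleaning, transformation and merging:

Cleaning: for example, filling empty positions with new values
Transformation: transforming, reshaping, merging or splitting existing attributes into new ones
Merging: combining the processed attributes into the training set

Beginners must learn the cleaning part. Otherwise error messages will appear later, when the data is fed into a model for training.
The transformation and merging steps are somewhat tedious and can be skipped at first, so that beginners are not scared off by too many details and give up before reaching the end and enjoying the satisfaction of completing the whole workflow.

Cleaning
Before the data is fed into a model for training, it must be cleaned.
Raw data often cannot be fed into a model directly, for example:

1. Some attributes are categorical rather than integers or real numbers (floats), for example Sex in Titanic, with the values male and female.
2. Some positions are empty, or contain typos so that the original meaning cannot be recognized.

The fix for the first case is to convert the values to numbers such as 0 and 1.
The fix for the second case is to mark those positions as missing (np.nan) and then fill them, for example with the column mean (see 3.1.2).

Transformation
For example, defining a new attribute, family size, as the number of parents/children aboard plus the number of siblings/spouses aboard.

Merging
Selecting the old and new attributes that seem important and merging them into the training set.

3.1 Cleaning

3.1.1 Categorical variables need to be transformed to numeric variables

An example of a categorical attribute is Sex: male, female.
Another is Embarked, which holds one of three ports of embarkation, abbreviated C, Q, S.
The first step of data preparation is to convert these categorical attributes to numeric ones.
A common approach, for example, is to
map male to the integer 1 and female to the integer 0.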

# 3. Data Preparation

# Transform Sex into binary values 0 and 1
sex = pd.Series( np.where( full.Sex == 'male' , 1 , 0 ) , name = 'Sex' )
print(full.Sex)
print(sex)

Output (the tail of print(full.Sex), followed by print(sex)):

Name: Sex, Length: 1309, dtype: object
0       1
1       0
2       0
3       0
4       1
       ..
1304    1
1305    0
1306    1
1307    1
1308    1
Name: Sex, Length: 1309, dtype: int32

A common alternative for Embarked (C, Q, S) is to map the values to 0, 1, 2.
The English tutorial instead splits Embarked into three attributes, Embarked_C, Embarked_Q and Embarked_S:
Embarked_C is 1 only for passengers who boarded at port C and 0 otherwise, and likewise for the others.

# Create a new variable for every unique value of Embarked
embarked = pd.get_dummies( full.Embarked , prefix='Embarked' )
print(embarked.head())

Output:
   Embarked_C  Embarked_Q  Embarked_S
0           0           0           1
1           1           0           0
2           0           0           1
3           0           0           1
4           0           0           1

# Create a new variable for every unique value of Pclass
pclass = pd.get_dummies( full.Pclass , prefix='Pclass' )
print(pclass.head())

Output:

   Pclass_1  Pclass_2  Pclass_3
0         0         0         1
1         1         0         0
2         0         0         1
3         1         0         0
4         0         0         1
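One practical detail when reproducing the 0/1 tables above: newer pandas versions return True/False columns from get_dummies by default. The sketch below, on a made-up Embarked sample, passes `dtype=int` to get integer columns and also shows what happens to a missing value.

```python
import pandas as pd

# Hypothetical sample of Embarked values, including one missing entry
embarked_raw = pd.Series(["S", "C", "S", None, "Q"], name="Embarked")

# dtype=int forces 0/1 columns; newer pandas versions default to True/False
embarked = pd.get_dummies(embarked_raw, prefix="Embarked", dtype=int)

print(embarked)
```

The row with the missing port gets 0 in all three columns, so no separate fill step is needed for this encoding.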

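3.1.2 Fill missing values in variables

Datasets usually have some positions that are empty, or that contain typos so that the original meaning cannot be recognized.
Such positions are treated as missing values (np.nan) and are commonly filled with the mean of the attribute.
This step is part of data cleaning.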

# Create dataset
imputed = pd.DataFrame()

# Fill missing values of Age with the average of Age (mean)
imputed[ 'Age' ] = full.Age.fillna( full.Age.mean() )

# Fill missing values of Fare with the average of Fare (mean)
imputed[ 'Fare' ] = full.Fare.fillna( full.Fare.mean() )

print(imputed.head())

Output:

    Age     Fare
0  22.0   7.2500
1  38.0  71.2833
2  26.0   7.9250
3  35.0  53.1000
4  35.0   8.0500
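The first five rows happen to contain no gaps, so the effect of the fill is not visible in the output above. A small sketch on made-up data (the `full_toy` frame is an assumption) counts the missing values before and after the same fillna step.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with gaps in both columns
full_toy = pd.DataFrame({"Age": [22.0, np.nan, 26.0], "Fare": [7.25, 71.28, np.nan]})

print(full_toy.isna().sum())            # missing values per column before filling

imputed_toy = pd.DataFrame()
for col in ["Age", "Fare"]:
    # Series.mean() skips NaN by default, so the mean uses only known values
    imputed_toy[col] = full_toy[col].fillna(full_toy[col].mean())

print(imputed_toy.isna().sum().sum())   # 0 when nothing is left missing
```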

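3.2 Feature Engineering: Creating new variables

3.2.1 Extract titles from passenger names

This step is a bit more involved.
Many of the titles in the passenger names are forms of address used in British and American society about 100 years ago. We simply follow the English tutorial,
which says:
“Titles reflect social status and may predict survival probability”

In British society about 100 years ago, titles were probably quite important. Whether a title decided who got a seat in a lifeboat first is the kind of question data analysis can try to uncover.

#3.2.1 Extract titles from passenger names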

title = pd.DataFrame()
# we extract the title from each name
title[ 'Title' ] = full[ 'Name' ].map( lambda name: name.split( ',' )[1].split( '.' )[0].strip() )

# a map of more aggregated titles
Title_Dictionary = {
                    "Capt":       "Officer",
                    "Col":        "Officer",
                    "Major":      "Officer",
                    "Jonkheer":   "Royalty",
                    "Don":        "Royalty",
                    "Sir" :       "Royalty",
                    "Dr":         "Officer",
                    "Rev":        "Officer",
                    "the Countess":"Royalty",
                    "Dona":       "Royalty",
                    "Mme":        "Mrs",
                    "Mlle":       "Miss",
                    "Ms":         "Mrs",
                    "Mr" :        "Mr",
                    "Mrs" :       "Mrs",
                    "Miss" :      "Miss",
                    "Master" :    "Master",
                    "Lady" :      "Royalty"

                    }

# we map each title
title[ 'Title' ] = title.Title.map( Title_Dictionary )
title = pd.get_dummies( title.Title )
#title = pd.concat( [ title , titles_dummies ] , axis = 1 )

print(title.head())

Output:

   Master  Miss  Mr  Mrs  Officer  Royalty
0       0     0   1    0        0        0
1       0     0   0    1        0        0
2       0     1   0    0        0        0
3       0     0   0    1        0        0
4       0     0   1    0        0        0
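The title extraction can also be written with a single vectorized pandas call. This is an alternative sketch, not the tutorial's method: the regular expression below is my own assumption about the "Surname, Title. Given names" layout, applied to a few names in that format.

```python
import pandas as pd

# A few names in the "Surname, Title. Given names" format used by the dataset
names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
    "Heikkinen, Miss. Laina",
    "Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)",
])

# Capture everything between the first comma and the following period
titles = names.str.extract(r",\s*([^.]+)\.", expand=False).str.strip()

print(titles.tolist())
```

The resulting strings are the keys that Title_Dictionary expects, including the multi-word "the Countess".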

3.2.2 Extract Cabin category information from the Cabin number

Opening test.csv in WordPad shows that
Cabin has many missing values.
The Cabin number is the second-to-last column:
PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked

Entry 16: Cabin is E31
Entry 916: Cabin is B57 B59 B63 B66
Cleaning steps:

Replace missing values with U: full.Cabin.fillna( 'U' )
Keep only the first letter of the Cabin value: cabin[ 'Cabin' ].map( lambda c : c[0] )
Create the dummy variables Cabin_A, Cabin_B, ..., Cabin_U

# 3.2.2 Extract Cabin category information from the Cabin number
cabin = pd.DataFrame()

# replacing missing cabins with U (for Unknown)
cabin[ 'Cabin' ] = full.Cabin.fillna( 'U' )

# mapping each Cabin value with the cabin letter
cabin[ 'Cabin' ] = cabin[ 'Cabin' ].map( lambda c : c[0] )

# dummy encoding ...
cabin = pd.get_dummies( cabin['Cabin'] , prefix = 'Cabin' )

print(cabin.head())

Output:

   Cabin_A  Cabin_B  Cabin_C  Cabin_D  ...  Cabin_F  Cabin_G  Cabin_T  Cabin_U
0        0        0        0        0  ...        0        0        0        1
1        0        0        1        0  ...        0        0        0        0
2        0        0        0        0  ...        0        0        0        1
3        0        0        1        0  ...        0        0        0        0
4        0        0        0        0  ...        0        0        0        1

[5 rows x 9 columns]
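Taking the first letter does not need a lambda: the pandas `.str` accessor supports indexing. A minimal sketch on a made-up Cabin sample:

```python
import numpy as np
import pandas as pd

# Hypothetical sample of Cabin values, with missing entries as NaN
cabin_raw = pd.Series([np.nan, "C85", np.nan, "C123", "B57 B59 B63 B66", "E31"])

# .str[0] takes the first character of every string in the column
deck = cabin_raw.fillna("U").str[0]

print(deck.tolist())
```

A passenger with several cabins, such as "B57 B59 B63 B66", is assigned the letter of the first one, which is the same behaviour as `c[0]` in the tutorial's code.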

3.2.3 Extract ticket class from ticket number

Opening test.csv in WordPad shows that
Ticket appears to have no missing values.
The Ticket number is the fourth-to-last column:
PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked

Some ticket numbers consist of digits only, others have a letter prefix:
Entry 892: 330911
Entry 906: W.E.P. 5734
Entry 907: SC/PARIS 2167
Entry 1022: STON/OQ. 369943
and so on.

The operations in the code below, str.replace(), str.split() and str.strip(), are built-in Python string methods. (Pandas has corresponding methods that apply the same operation to a whole column at once.)

Remove the periods: ticket = ticket.replace( '.' , '' )
For example, W.E.P. 5734 $\rightarrow$ WEP 5734
Remove the slashes: ticket = ticket.replace( '/' , '' )
For example, SC/PARIS 2167 $\rightarrow$ SCPARIS 2167
Split the string: ticket = ticket.split(). In the default setting, the string is split by whitespace.
For example, SCPARIS 2167 $\rightarrow$ ['SCPARIS', '2167']
lambda t : t.strip()
Python's str.strip() removes the specified characters (by default whitespace, including newlines) from both ends of a string.
Because split() has already discarded the whitespace, strip() changes nothing here: ['SCPARIS', '2167'] $\rightarrow$ ['SCPARIS', '2167']
Finally, filter( lambda t : not t.isdigit() , ticket ) drops the purely numeric items, leaving ['SCPARIS'].

The tutorial's code first defines a small function cleanTicket()
that applies this chain of operations to an input string
(note: the input of cleanTicket() is a string, not a pandas DataFrame),
then uses map() to apply cleanTicket() to a whole attribute (column) of the DataFrame at once: full[ 'Ticket' ].map( cleanTicket )
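The chain of operations described above can be traced on one ticket string. This is only a walk-through sketch with intermediate variable names of my own (step1 to step5), using the sample ticket from the list above.

```python
ticket = "W.E.P. 5734"

step1 = ticket.replace(".", "")     # remove periods
step2 = step1.replace("/", "")      # remove slashes (none in this ticket)
step3 = step2.split()               # split on whitespace into a list of items
step4 = [t.strip() for t in step3]  # strip is a no-op after split()
step5 = [t for t in step4 if not t.isdigit()]  # drop purely numeric items

for name, value in [("replace '.'", step1), ("replace '/'", step2),
                    ("split", step3), ("strip", step4), ("filter", step5)]:
    print(name, "->", value)
```

The first item that survives the filter is the prefix. If the list is empty, the ticket was digits only and the function returns 'XXX'.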

#3.2.3 Extract ticket class from ticket number

# a function that extracts each prefix of the ticket, returns 'XXX' if no prefix (i.e the ticket is a digit)
def cleanTicket( ticket ):
    ticket = ticket.replace( '.' , '' )
    ticket = ticket.replace( '/' , '' )
    ticket = ticket.split()
    ticket = map( lambda t : t.strip() , ticket )
    ticket = list(filter( lambda t : not t.isdigit() , ticket ))
    if len( ticket ) > 0:
        return ticket[0]
    else: 
        return 'XXX'

ticket = pd.DataFrame()

# Extracting dummy variables from tickets:
ticket[ 'Ticket' ] = full[ 'Ticket' ].map( cleanTicket )
ticket = pd.get_dummies( ticket[ 'Ticket' ] , prefix = 'Ticket' )

print(ticket.shape)
print(ticket.head())

Output: the cleaned prefixes of full.Ticket, dummy-encoded into one column per prefix

(1309, 37)
   Ticket_A  Ticket_A4  Ticket_A5  ...  Ticket_WC  Ticket_WEP  Ticket_XXX
0         0          0          1  ...          0           0           0
1         0          0          0  ...          0           0           0
2         0          0          0  ...          0           0           0
3         0          0          0  ...          0           0           1
4         0          0          0  ...          0           0           1

[5 rows x 37 columns]

3.2.3.1 Doing the same directly with Pandas (vectorized) methods

Below we try to replace the somewhat tedious approach of Helge Bjorland & Stian Eide with Pandas methods.

Their code first defines a small function cleanTicket()
that applies the chain of operations above to an input string,
then uses map() to apply cleanTicket() to a whole attribute (column) of the DataFrame at once: full[ 'Ticket' ].map( cleanTicket )

This may be an older style. As Pandas has gained more and more vectorized methods, much of it can be expressed with Pandas string methods.

Above we used Python's built-in string methods
str.replace(), str.split(), str.strip()
Searching for the Pandas counterparts:
the user guide section "Working with text data" documents all the Pandas string methods link
We find the corresponding methods:

pandas.Series.str.replace(): Replace each occurrence of pattern/regex in the Series/Index.
Equivalent to str.replace() or re.sub(), depending on the regex value.
pandas.Series.str.split(): Split strings around given separator/delimiter.
Splits the string in the Series/Index from the beginning, at the specified delimiter string. Equivalent to str.split().
pandas.Series.str.strip(): Strip whitespaces (including newlines) or a set of specified characters from each string in the Series/Index from left and right sides. Equivalent to str.strip().

Usage is similar to the Python methods. Pandas applies the operation to a whole column at once, so these are vectorized methods.

Note that none of these operations modify the original data. They return new objects!

For a Pandas Series, say data, the call differs from the Python call only by the .str. accessor in the middle:

Python method	Pandas method
(single) string.replace()	data (a whole column of strings).str.replace()
string.split()	data (a whole column of strings).str.split()
string.strip()	data (a whole column of strings).str.strip()

Below we apply a chain of Series.str.replace() calls directly to full.Ticket.

First look at the original values:

full.Ticket.head()

Output:

0           A/5 21171
1            PC 17599
2    STON/O2. 3101282
3              113803
4              373450
Name: Ticket, dtype: object

Run Ticket.str.replace('.', '')

# regex=False treats '.' as a literal period, not as the regex wildcard
ticket1 = full.Ticket.str.replace('.', '', regex=False)
ticket1.head()

Output: the periods have been removed from the whole column

0          A/5 21171
1           PC 17599
2    STON/O2 3101282
3             113803
4             373450
Name: Ticket, dtype: object

Run Ticket.str.replace('/', '')

ticket2 = ticket1.str.replace('/', '', regex=False)
ticket2.head()

Output: the slashes have been removed from the whole column

0          A5 21171
1          PC 17599
2    STONO2 3101282
3            113803
4            373450
Name: Ticket, dtype: object

Run Ticket.str.split()

ticket3 = ticket2.str.split()
ticket3.head()

Output: in each row the prefix and the number are now separate items of a list

0          [A5, 21171]
1          [PC, 17599]
2    [STONO2, 3101282]
3             [113803]
4             [373450]
Name: Ticket, dtype: object

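At this point each element of ticket3 is a list: two items if the ticket has a letter prefix,
one item if it has none.
Next we rewrite the original approach of Helge Bjorland & Stian Eide.
The goal is to keep only the letter prefix of each element, or the string XXX if the ticket consists of digits only.
We define an anonymous function that takes a list: if len(list) is 1, it assumes there is no letter prefix, only digits, and returns the string 'XXX'. Otherwise it returns list[0], the first item, which is the prefix:
lambda t: 'XXX' if len(t)==1 else t[0]
Then map() applies it to all of ticket3:
ticket4=ticket3.map(lambda t: 'XXX' if len(t)==1 else t[0])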

ticket4=ticket3.map(lambda t: 'XXX' if len(t)==1 else t[0])
ticket4.head()

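Output: for these rows the result matches the original code of Helge Bjorland & Stian Eide. (The shortcut is not identical in every case: a ticket that is a single non-numeric item, such as LINE, becomes XXX here, while cleanTicket() returns LINE.)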

0        A5
1        PC
2    STONO2
3       XXX
4       XXX
Name: Ticket, dtype: object

The rest of the code is the same as in Helge Bjorland & Stian Eide:
pd.get_dummies()
is then applied to ticket4 to create the dummy columns.
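To check how closely the pandas chain follows cleanTicket(), the sketch below runs both on a handful of ticket strings. `clean_ticket` is a compact re-statement of the tutorial's function, and `faithful` is my own variant that keeps the digit filter instead of the length test. Both names are assumptions for illustration.

```python
import pandas as pd

def clean_ticket(ticket):
    """Re-statement of the tutorial's cleanTicket(): first non-numeric item or 'XXX'."""
    parts = ticket.replace(".", "").replace("/", "").split()
    parts = [p for p in parts if not p.isdigit()]
    return parts[0] if parts else "XXX"

tickets = pd.Series(["A/5 21171", "PC 17599", "STON/O2. 3101282",
                     "113803", "LINE", "W.E.P. 5734"])

reference = tickets.map(clean_ticket)

# Vectorized cleaning, then two ways of picking the prefix from each list
items = (tickets.str.replace(".", "", regex=False)
                .str.replace("/", "", regex=False)
                .str.split())
shortcut = items.map(lambda t: "XXX" if len(t) == 1 else t[0])
faithful = items.map(lambda t: next((p for p in t if not p.isdigit()), "XXX"))

print(pd.DataFrame({"reference": reference, "shortcut": shortcut, "faithful": faithful}))
```

The length test and the digit filter agree except for single-item tickets without digits (LINE in this sample), so `faithful` is the safer choice when the goal is to reproduce the tutorial's 37 dummy columns.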

3.2.4 Create family size and category for family size

Parch is the number of parents/children aboard,
SibSp is the number of siblings/spouses aboard.
The code below uses Parch and SibSp to compute the family size aboard, including the passenger.
The two variables Parch and SibSp are used to create the family size variable

The tutorial distinguishes three cases:
family size 1 is 'Family_Single',
family size 2 to 4 is 'Family_Small',
family size 5 or more is 'Family_Large'.

##3.2.4 Create family size and category for family size
##The two variables Parch and SibSp are used to create the family size variable


family = pd.DataFrame()

# introducing a new feature : the size of families (including the passenger)
family[ 'FamilySize' ] = full[ 'Parch' ] + full[ 'SibSp' ] + 1

# introducing other features based on the family size
family[ 'Family_Single' ] = family[ 'FamilySize' ].map( lambda s : 1 if s == 1 else 0 )
family[ 'Family_Small' ]  = family[ 'FamilySize' ].map( lambda s : 1 if 2 <= s <= 4 else 0 )
family[ 'Family_Large' ]  = family[ 'FamilySize' ].map( lambda s : 1 if 5 <= s else 0 )

print(family.head())

Output:

   FamilySize  Family_Single  Family_Small  Family_Large
0           2              0             1             0
1           2              0             1             0
2           1              1             0             0
3           2              0             1             0
4           1              1             0             0
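The three lambda-based columns can also be produced by binning the family size. This is an alternative sketch, not the tutorial's code: it uses `pd.cut` with bin edges chosen to match the thresholds above, on made-up Parch and SibSp values.

```python
import numpy as np
import pandas as pd

# Hypothetical Parch and SibSp counts for six passengers
toy = pd.DataFrame({"Parch": [0, 0, 1, 2, 2, 5], "SibSp": [0, 1, 1, 1, 3, 1]})

size = toy["Parch"] + toy["SibSp"] + 1   # +1 counts the passenger
# Intervals are right-closed: (0, 1], (1, 4], (4, inf)
category = pd.cut(size, bins=[0, 1, 4, np.inf], labels=["Single", "Small", "Large"])

family = pd.get_dummies(category, prefix="Family", dtype=int)
family.insert(0, "FamilySize", size)
print(family)
```

Keeping the thresholds in one list of bin edges makes it easier to experiment with other groupings than editing three separate lambdas.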

3.3 Assemble final datasets for modelling

Below, full_X is defined by selecting some of the cleaned, newly created attributes
imputed , embarked , pclass , sex , family , cabin , ticket
(the ones you think are important for survival) and merging them into one DataFrame.

This DataFrame is then used to train models such as regression and to make predictions.

3.4.1 Variable selection: merge the important attributes with `pd.concat`

First choose the attributes that you think influence survival.
Select which features/variables to include in the dataset from the list below:

imputed
embarked
pclass
sex
family
cabin
ticket

For example, you can
define full_X from the cleaned attributes
imputed , embarked , cabin , sex
merged into one DataFrame
directly with
pd.concat

# Select which features/variables to include in the dataset from the list below:
# imputed , embarked , pclass , sex , family , cabin , ticket

full_X = pd.concat( [ imputed , embarked , cabin , sex ] , axis=1 )
full_X.head()

Output:

    Age     Fare  Embarked_C  Embarked_Q  ...  Cabin_G  Cabin_T  Cabin_U  Sex
0  22.0   7.2500           0           0  ...        0        0        1    1
1  38.0  71.2833           1           0  ...        0        0        0    0
2  26.0   7.9250           0           0  ...        0        0        1    0
3  35.0  53.1000           0           0  ...        0        0        0    0
4  35.0   8.0500           0           0  ...        0        0        1    1

[5 rows x 15 columns]

3.4.2 Create datasets: separate the training and test sets

For an intuitive explanation of training, validation and test sets, see the appendix.

Next we split the labelled part of full_X into a training set and a validation set for the training that follows.
Below we will separate the data into training and test datasets.

The call to train_test_split() does the split into training and validation sets:
train_X , valid_X , train_y , valid_y = train_test_split( train_valid_X , train_valid_y , train_size = .7 )

Different articles use different names here:

In sklearn's terms: original data $\rightarrow$ 70% goes to data_train, 30% goes to data_test $\rightarrow$ then predict on data whose answers are unknown (data_predict).
In this Titanic example, Kaggle provides the answers (Survived) only for the 891 rows of train.csv. Helge Bjorland & Stian Eide proceed as follows:
1. The first 891 rows of full_X (about 68%, 891/1309) are called train_valid_X $\rightarrow$ 70% of train_valid_X becomes the training set train_X, 30% becomes the validation set valid_X
2. 70% of train_valid_y becomes the training answers train_y, 30% becomes the validation answers valid_y
3. The remaining 418 rows of full_X (about 32%) are called test_X. Their answers are not provided, and the model predicts them.

Above, full_X uses only four groups of cleaned attributes:
imputed , embarked , cabin , sex.

Next, take the first part of full_X, rows 0 to 890 (about 68%, 891/1309),
and define it as the training X set train_valid_X:
train_valid_X = full_X[ 0:891 ]
From the original full (note: not full_X), rows 0 to 890 were
titanic = full[ :891 ]
already defined earlier as titanic.
Take only the Survived attribute of titanic as the answers, used to check whether the results are correct,
and define it as the y set train_valid_y:
train_valid_y = titanic.Survived
Then take the last part of full_X, rows 891 to 1308,
and define it as the test set test_X:
test_X = full_X[ 891: ]

Dataset before splitting	Training set X	Validation set X	Training y	Validation y	Test (prediction) set: last rows 891 to 1308, answers not provided
full	-	-	from full[ :891 ].Survived	from full[ :891 ].Survived	-
full_X (only the attribute groups imputed , embarked , cabin , sex)	-	-	-	-	full_X[ 891: ]
full_X[ 0:891 ] (first rows 0 to 890)	70% via `train_test_split()`: train_X	30% via `train_test_split()`: valid_X	-	-	-
full[ :891 ].Survived	-	-	70% via `train_test_split()`: train_y	30% via `train_test_split()`: valid_y	-

The function that splits the training and validation sets below, train_test_split(),
is imported from sklearn.model_selection (sklearn.cross_validation in old versions).
It splits the X set train_valid_X and the answer set train_valid_y in the ratio 0.7 (70%).
train_X , valid_X , train_y , valid_y = train_test_split( train_valid_X , train_valid_y , train_size = .7 )

#3.4.2 Create datasets
#Below we will separate the data into training and test datasets.

# Create all datasets that are necessary to train, validate and test models
train_valid_X = full_X[ 0:891 ]
train_valid_y = titanic.Survived
test_X = full_X[ 891: ]

# train_test_split() comes from sklearn.model_selection
# (it was in sklearn.cross_validation in old sklearn versions)
train_X , valid_X , train_y , valid_y = train_test_split(
    train_valid_X , train_valid_y , train_size = .7 )

print (full_X.shape , train_X.shape , valid_X.shape , \
       train_y.shape , valid_y.shape , test_X.shape)

##[5 rows x 15 columns]
##(1309, 15) (623, 15) (268, 15) (623,) (268,) (418, 15)

Output:

[5 rows x 15 columns]
(1309, 15) (623, 15) (268, 15) (623,) (268,) (418, 15)
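Because train_test_split() shuffles at random, every run of the call above gives a different split and slightly different scores. The sketch below, on made-up data with 891 rows, adds two optional arguments that are not used in the tutorial: `random_state` for a reproducible split and `stratify` to keep the class ratio similar in both parts.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in with the same number of rows as the labelled Titanic data
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(891, 3)), columns=["a", "b", "c"])
y = pd.Series(rng.integers(0, 2, size=891), name="Survived")

train_X, valid_X, train_y, valid_y = train_test_split(
    X, y,
    train_size=0.7,
    random_state=0,   # fixed seed: the same split on every run
    stratify=y,       # keep the survived/died ratio similar in both parts
)

print(train_X.shape, valid_X.shape, train_y.shape, valid_y.shape)
```

The row counts are the same as in the output above: 623 for training and 268 for validation.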

3.4.3 Feature importance

Selecting the optimal features in the model is important. We will now try to evaluate what the most important variables are for the model to make the prediction.

The plot_variable_importance( X , y ) function used below was defined in
2.2 Setup helper Functions.

### 3.4.3 Feature importance
##Selecting the optimal features in the model is important.
##We will now try to evaluate what the most important variables
##are for the model to make the prediction.

plot_variable_importance(train_X, train_y)
plt.show()

[Figure: Titanic_3.4.3 feature importance]

4. Modeling

4.1 Model Selection

Next, choose a suitable model. Methods commonly used in earlier data analysis and data mining include regression, decision trees (random forests), k-means, KNN (k-nearest neighbors), Bayesian models, Support Vector Machines and association analysis. Later methods include neural networks, gradient boosting, ensemble learning and deep learning.
The tutorial suggests starting with logistic regression.
Then there are several options to choose from when it comes to models. A good starting point is logistic regression.
We will now select a model we would like to try then use the training dataset to train this model and thereby check the performance of the model using the test set.

In this chapter the tutorial lists some of the ready-made models in sklearn:

4.1.1 Random Forests Model

model = RandomForestClassifier(n_estimators=100)

Output (this value equals 619/623, which is consistent with an accuracy measured on the 623-row training set rather than on the validation set):

0.9935794542536116

4.1.2 Support Vector Machines

4.1.3 Gradient Boosting Classifier

4.1.4 K-nearest neighbors

4.1.5 Gaussian Naive Bayes

4.1.6 Logistic Regression
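All of the models listed above share the same fit/score interface, so they can be tried in one loop. The sketch below uses synthetic data from `make_classification` so that it runs without the CSV files. The hyperparameters (`n_neighbors=3`, `max_iter=1000`, the seeds) are my own illustrative choices, not values confirmed by the text above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Synthetic data, used only so that the sketch runs without the CSV files
X, y = make_classification(n_samples=600, n_features=8, random_state=0)
train_X, valid_X, train_y, valid_y = train_test_split(X, y, train_size=0.7, random_state=0)

models = {
    "RandomForest": RandomForestClassifier(n_estimators=100, random_state=0),
    "SVC": SVC(),
    "GradientBoosting": GradientBoostingClassifier(random_state=0),
    "KNeighbors": KNeighborsClassifier(n_neighbors=3),
    "GaussianNB": GaussianNB(),
    "LogisticRegression": LogisticRegression(max_iter=1000),
}

scores = {}
for name, model in models.items():
    model.fit(train_X, train_y)
    scores[name] = (model.score(train_X, train_y), model.score(valid_X, valid_y))
    print(f"{name:20s} train={scores[name][0]:.3f} valid={scores[name][1]:.3f}")
```

With the Titanic data, train_X, valid_X, train_y and valid_y from section 3.4.2 would take the place of the synthetic split.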

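4.2 Train the selected model

The training process:
choose a model, say logistic regression,
feed in the data row by row (or batch by batch),
let the model update its parameters according to the error, using its own algorithm,
and repeat for many rounds until the parameters converge and the model is stable.

The input data is split in certain ways first.
Usually it is data whose answers are known
(for example, use the stock market of 2000-2020 as training data and predict the stock prices of that period from attributes of the time such as weather, GDP and the price index. The answers are the stock prices of 2000-2020, which are of course already known.)
A method (model) trained on data with known answers is called a supervised model, for example regression, neural networks and KNN.
(A method trained on data without answers is called an unsupervised model, for example k-means.)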
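The difference shows up directly in the sklearn calls: a supervised model is fitted with both X and y, an unsupervised one with X only. A minimal sketch on six made-up points, contrasting KNN (supervised) with k-means (unsupervised):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import KNeighborsClassifier

# Two hypothetical, well separated groups of points
X = np.array([[0.0, 0.1], [0.2, 0.0], [0.1, 0.3], [5.0, 5.1], [5.2, 4.9], [4.9, 5.3]])
y = np.array([0, 0, 0, 1, 1, 1])   # known answers

# Supervised: KNN needs the answers y during fit
knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)
print(knn.predict([[0.1, 0.1], [5.1, 5.0]]))

# Unsupervised: k-means only sees X and invents its own group ids
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)
```

The cluster ids from k-means are arbitrary: the two groups are found, but which one is called 0 and which 1 carries no meaning.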

一般將訓練資料例如叫 X, 拆開為訓練集/驗證集,
可能按　0.7, 0.3 來切( 70%, 30%) 或 60%, 40%等, 沒有一定的規定,
也就是分出 X 的 70% 叫 train_X 訓練集, 分出 X 的 30% 叫 valid_X 驗證集,
以利後面進行訓練.

之前3.4.2 有講到,
從 sklearn.cross_validation 載入負責切割訓練集跟驗證集的函數 train_test_split()
用這個函數對訓練集 train_valid_X 與答案集 train_valid_y 做切分, 按 0.7 的比例(70%).
train_X , valid_X , train_y , valid_y = train_test_split( train_valid_X , train_valid_y , train_size = .7 )

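Once a model has been chosen, for example logistic regression,
run it on the training set and the training answers produced by the split above:
model.fit( train_X , train_y )
The model is now trained on train_X. The answers for train_X are train_y, and comparing the model's predictions with train_y gives the error.
Training means that the data goes in one record at a time, one batch at a time, or all at once, depending on the algorithm, and the model uses its own algorithm to correct and update its parameters based on the error.
For a neural network, for example, the network weights are corrected from the error using gradient descent or a similar method.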

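With just this one command, it runs many rounds of training on its own until the parameters converge and the model becomes stable.
The converged model then has to be evaluated to see whether it is fit for use.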
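To make "update the parameters from the error, round after round" concrete, here is a hand-written sketch of logistic regression trained by plain gradient descent in NumPy. It only illustrates the idea; sklearn's LogisticRegression uses its own solvers internally, and the data, learning rate, and number of rounds here are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)  # known answers

w = np.zeros(2)       # model parameters (weights)
b = 0.0               # model parameter (intercept)
learning_rate = 0.5


def predict_proba(w, b):
    """Predicted probability of class 1 for every row of X."""
    return 1.0 / (1.0 + np.exp(-(X @ w + b)))


def log_loss(w, b):
    """Average logistic loss of the current parameters."""
    p = np.clip(predict_proba(w, b), 1e-12, 1 - 1e-12)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))


initial_loss = log_loss(w, b)
for epoch in range(200):                   # many rounds of training
    error = predict_proba(w, b) - y        # prediction minus answer
    w -= learning_rate * (X.T @ error) / len(y)
    b -= learning_rate * error.mean()
final_loss = log_loss(w, b)

accuracy = np.mean((predict_proba(w, b) > 0.5) == (y == 1))
print(initial_loss, final_loss, accuracy)
```

model.fit( train_X , train_y ) hides a loop of this kind, which is why one command is enough.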

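So in data analysis the complicated part is often the early stage, collecting and cleaning the data. The later stage, training the model, often has ready-made, well-packaged commands and is comparatively quick.

Still, understanding the principles behind the algorithms takes time.
When starting out, it helps to practice by operating the tools directly, which is more rewarding. Diving straight into the complicated algorithmic principles of data analysis and machine learning tends to cause frustration and a failure to see the forest for the trees.

The validation set valid_X that train_test_split() split off earlier
is what we use to evaluate the model; see sec 5 Evaluation next.

The test set test_X defined earlier, sometimes also called the prediction set,
test_X = full_X[ 891: ]
is used at the very end, to deploy the model; see sec 6 Deployment.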

5. Evaluation

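Here is how model evaluation usually works: compare the accuracy on two sets, the training set and the held-out set (sklearn calls it the test set; the tutorial names it valid_X and so on).
The model is trained on the training set train_X, which yields an accuracy over the whole training set.
To see that accuracy, just run
model.score( train_X , train_y )

Then have it run on the validation set to get a second accuracy.
To get it, just run
model.score( valid_X , valid_y )
If the two accuracies are close, the model is stable. If both are also high, the model is one we consider usable (at least as a first pass).

Finally, use test_X = full_X[ 891: ], defined earlier, as the prediction set and produce the predictions. The last section, 6. Deployment, shows how to submit this result to Kaggle.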

We can evaluate the accuracy of the model by using the validation set where we know the actual outcome. This data set has not been used for training the model, so it's completely new to the model.

We then compare this accuracy score with the accuracy when using the model on the training data. If the difference between these is significant, this is an indication of overfitting. We try to avoid this because it means the model will not generalize well to new data and is expected to perform poorly.
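A sketch of that comparison on synthetic, deliberately noisy data. A decision tree with no depth limit is used here only because it overfits easily; the data, the model choice, and the names train_acc, valid_acc, gap are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# flip_y=0.2 randomizes about 20% of the labels, i.e. adds noise.
X, y = make_classification(
    n_samples=500, n_features=10, n_informative=3, flip_y=0.2, random_state=0
)
train_X, valid_X, train_y, valid_y = train_test_split(
    X, y, train_size=0.7, random_state=0
)

# No depth limit: the tree can memorize the training set, noise included.
model = DecisionTreeClassifier(random_state=0).fit(train_X, train_y)

train_acc = model.score(train_X, train_y)
valid_acc = model.score(valid_X, valid_y)
gap = train_acc - valid_acc

print(f"train accuracy: {train_acc:.3f}")
print(f"valid accuracy: {valid_acc:.3f}")
print(f"gap:            {gap:.3f}")
```

A large gap between the two numbers is the overfitting signal described above; limiting the tree (for example with max_depth) usually narrows it.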

6. Deployment

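This section is important for anyone who intends to enter Kaggle competitions. It is short, but it covers the main steps.

Submitting on Kaggle is actually a little complicated. The web page is not very clear about what to click, and it took a lot of trial and error to work it out:

Create a new notebook

First log in to your own Kaggle account, then create a new notebook, which is a page containing code.
Go to Code and press + New Notebook.
Figure: creating a new notebook from Code

You land on a new page that already contains the necessary code (loading libraries and so on, generated by the system).
Figure: the default page of a new notebook
You can now enter code, in either Python or R. It works much like Jupyter Notebook.

Another option is to copy someone else's Notebook directly, modify it, and submit it.
The submission demonstrated below started as a direct copy of the Notebook
Helge Bjorland & Stian Eide, An Interactive Data Science Tutorial-Based on the Titanic competition on Kaggle, https://www.kaggle.com/helgejo/an-interactive-data-science-tutorial
which I then modified and submitted.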

Run the code

Run the following code in a notebook under your own Kaggle account, in Edit mode:

test_Y = model.predict( test_X )
passenger_id = full[891:].PassengerId
test = pd.DataFrame( { 'PassengerId': passenger_id , 'Survived': test_Y } )
test.shape
test.head()
test.to_csv( 'titanic_pred.csv' , index = False )

Click Run All at the top to run all the code.

Next,

Submit to Competition

Submit to Competition appears at the top right of the Notebook only after you leave edit mode. Click it to start the submission process.
Figure: submit screen of the modified notebook

If the code has errors, the submit button stays greyed out.
Figure: submit screen with the button disabled

View My Submissions

Once the code is fixed, the submit button can be pressed and the notebook submitted.
Then click View My Submissions to see the result.
Figure: submission succeeded

Checking the submission status shows a score of 0

Figure: submission with a score of 0

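Submit Predictions

It turned out that the prediction .csv file (titanic_pred.csv) had not been submitted yet. Running the Notebook had in fact already saved a copy on Kaggle's cloud storage.
Checking with
pwd
shows that it is stored under
'/kaggle/working'
but on the Kaggle web page it took ages of searching before I found it under work.
So I downloaded it to the local disk first and then worked out how to upload it.

Another option is to download the code to the local disk, run it on your own computer to produce the .csv file, and then submit that.

Below is the Submit Predictions screen. Point it at the file in your own folder and it uploads.
Figure: submitting the csv

Check your position on the leaderboard

With a score of 0, my entry, Peng-Jen Lai, was in last place.
Figure: leaderboard

Fixing the upload

The reason for the 0 score was that Survived in the uploaded titanic_pred.csv had dtype float64. It needs to be int for the system to score it.

Figure: Survived turns out to be float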

Next, change the dtype.

Read the uploaded titanic_pred.csv into Python as a pd.DataFrame,
adding dtype='int64' to specify the dtype:
titanic_pred2=pd.read_csv('titanic_pred.csv', dtype='int64')
Then save it as a .csv file with to_csv().
Note: index=False is needed to drop the index column!
titanic_pred2.to_csv('titanic_pred_int.csv', index = False )

The code is as follows

import pandas as pd
titanic_pred1=pd.read_csv('titanic_pred.csv')
titanic_pred2=pd.read_csv('titanic_pred.csv',dtype='int64')
print(titanic_pred1.head())
print(titanic_pred2.head())

## Output
##>>> 
##=========== RESTART: C:/Users/user/Desktop/pd.read_csv_dtype=int64.py ==========
##   PassengerId  Survived
##0          892       0.0
##1          893       0.0
##2          894       0.0
##3          895       1.0
##4          896       1.0
##   PassengerId  Survived
##0          892         0
##1          893         0
##2          894         0
##3          895         1
##4          896         1

# Note: index=False is needed to drop the index column!
titanic_pred2.to_csv('titanic_pred_int.csv',index = False )
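An alternative sketch is to cast the predictions to int before writing the file at all, so no second pass over the .csv is needed. Here test_Y and passenger_id are small made-up stand-ins for model.predict( test_X ) and full[891:].PassengerId, and the name submission is my own.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for model.predict(test_X) and full[891:].PassengerId.
test_Y = np.array([0.0, 0.0, 1.0])
passenger_id = pd.Series([892, 893, 894])

submission = pd.DataFrame({
    "PassengerId": passenger_id,
    "Survived": test_Y.astype(int),   # float 0.0 / 1.0 becomes int 0 / 1
})

# With no path given, to_csv() returns the CSV text instead of writing a file.
csv_text = submission.to_csv(index=False)
print(csv_text)
```

In the real notebook the last step would be submission.to_csv( 'titanic_pred.csv' , index = False ), exactly as before.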

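Re-upload the predictions

After uploading the file with the correct dtype, the score became 0.7488. For a first competition submission that is not bad at all! (Of course, only because I was standing on the shoulders of helpful people online!)
Figure: after switching to int, score 0.7488

Check the leaderboard again

Click Jump to your position on the leaderboard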
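The rank has moved up a little, to 47212.
The score 0.7488 is not the ratio 47212/49103 = 0.9614891 (49103 being the total number of participants). It is the accuracy of the predictions, and many participants share the same score.

Figure: updated leaderboard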

The Titanic competition is a practice competition, so these rankings are only meant as an entry-level exercise for beginners.

Deployment in this context means publishing the resulting prediction from the model to the Kaggle leaderboard. To do this do the following:

The submission steps below, from the original tutorial, describe how it worked back in 2017; the page looks quite different now:
select the cell below and run it by pressing the play button.
Press the Publish button in top right corner.
Select Output on the notebook menubar
Select the result dataset and press Submit to Competition button

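Appendix

Train / Validation / Test datasets

Ref: CHEN TSU PEI, 機器學習怎麼切分資料：訓練、驗證、測試集, https://medium.com/nlp-tsupei/%E6%A9%9F%E5%99%A8%E5%AD%B8%E7%BF%92%E6%80%8E%E9%BA%BC%E5%88%87%E5%88%86%E8%B3%87%E6%96%99-%E8%A8%93%E7%B7%B4-%E9%A9%97%E8%AD%89-%E6%B8%AC%E8%A9%A6%E9%9B%86-f5a92576d1aa

As an example, think of the model as a high school student. The material absorbed in everyday classes, such as textbooks and handouts, is the training set. Mock exams are like the validation set: they give a rough indicator of which student has learned better, and the student adjusts the direction of study based on the mock exam results. The test set is the real college entrance exam (Taiwan's GSAT or AST), and the score obtained there directly measures how good the student is.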

训练集(train)验证集(validation)测试集(test)与交叉验证法, https://zhuanlan.zhihu.com/p/35394638 link
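In code, a three-way split can be sketched as two successive calls to train_test_split. The 500 dummy rows, the sizes (100 rows each for test and validation), and the variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(1000).reshape(500, 2)   # 500 dummy rows
y = np.arange(500) % 2

# First hold out the "entrance exam": 100 rows the model never sees.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=100, random_state=0
)

# Then split the rest into "classwork" (train) and "mock exam" (validation).
X_train, X_valid, y_train, y_valid = train_test_split(
    X_rest, y_rest, test_size=100, random_state=0
)

print(len(X_train), len(X_valid), len(X_test))
```

The validation set is used repeatedly while tuning the model; the test set is used once, at the end.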

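Why do we so often see X and y in Machine Learning?

The following is excerpted from an article by Seachaos. I find his explanations of machine learning conventions and terms intuitive and vivid, easy to grasp right away:
資料視覺化之 Decision tree (決策樹)範例與 Machine Learning (機器學習) 概念簡單教學(入門)
https://tree.rocks/decision-tree-graphviz-contour-with-pandas-gen-train-test-dataset-for-beginner-9137b7c8416a link

Tip: why do we so often see X and y in Machine Learning?
It comes from mathematics, as a convenient way to write y = f(X),
where f is a function, that is, our machine learning method (a model or any algorithm).
Generally, X stands for the input data and y for the output.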
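A tiny sketch of the y = f(X) convention in sklearn terms: X is a 2-D table (one row per sample, one column per feature), y is a 1-D vector of answers, and the fitted model plays the role of f. The numbers are made up.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])  # 6 samples, 1 feature
y = np.array([0, 0, 0, 1, 1, 1])                          # 6 answers

f = LogisticRegression().fit(X, y)          # learn f from (X, y)
y_hat = f.predict(np.array([[0.5], [4.5]])) # y_hat = f(new X)

print(X.shape, y.shape, y_hat)
```

The capital X and lowercase y follow the mathematical habit of writing matrices in upper case and vectors in lower case.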

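Train / Test dataset

The idea behind Train / Test datasets (training set / test set)
Next we separate the data into a training set and a test set. The training set is used to train the Decision Tree, and the test set is used to check whether our results are correct.
The process is just like studying and taking exams at school. We can randomly split the data we have into train / test; in plain terms, quiz questions / the midterm exam.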
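(The answer sheets that go with the train / test questions are the labels y, such as train_y and valid_y. A validation set is something else: a further held-out set of questions, like the mock exam in the analogy above. This note is mine, Lai, not Seachaos's.)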

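Suppose we have 100 questions that the student (the Decision tree model) has never seen. We can take some of them (say 80 questions = the training set) for the student to practice on as quizzes, and the student corrects their own mistakes along the way. At the midterm we hand over the remaining questions (20 questions = the test set). We do it this way because we do not want the student to know the exam answers in advance (the "past papers" idea); otherwise those 20 questions could not tell us whether the student has really learned the material.
The same idea applies here: we show the decision tree 80 mushrooms, then use 20 mushrooms it has never seen to find out whether it has really learned (and how well).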

Reference

Kaggle入门，看这一篇就够了, https://zhuanlan.zhihu.com/p/25686876 link

Titanic
Chinese tutorial: 机器学习系列(3)_逻辑回归应用之Kaggle泰坦尼克之灾 https://blog.csdn.net/han_xiaoyang/article/details/49797143 link
English tutorial: Helge Bjorland & Stian Eide, An Interactive Data Science Tutorial-Based on the Titanic competition on Kaggle
https://www.kaggle.com/helgejo/an-interactive-data-science-tutorial link

The Pandas user guide page Working with text data documents all the string-handling methods available in Pandas
https://pandas.pydata.org/docs/user_guide/text.html#text-string-methods link
Seachaos's article, whose explanations of machine learning conventions and terms I find intuitive and vivid:
資料視覺化之 Decision tree (決策樹)範例與 Machine Learning (機器學習) 概念簡單教學(入門)
https://tree.rocks/decision-tree-graphviz-contour-with-pandas-gen-train-test-dataset-for-beginner-9137b7c8416a link
CHEN TSU PEI, 機器學習怎麼切分資料：訓練、驗證、測試集, https://medium.com/nlp-tsupei/%E6%A9%9F%E5%99%A8%E5%AD%B8%E7%BF%92%E6%80%8E%E9%BA%BC%E5%88%87%E5%88%86%E8%B3%87%E6%96%99-%E8%A8%93%E7%B7%B4-%E9%A9%97%E8%AD%89-%E6%B8%AC%E8%A9%A6%E9%9B%86-f5a92576d1aa link
训练集(train)验证集(validation)测试集(test)与交叉验证法, https://zhuanlan.zhihu.com/p/35394638 link