Machine Learning - Titanic Survivor Prediction (A Step-by-Step Walkthrough)

import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn.linear_model import LogisticRegression
df = pd.read_csv('/Users/gaoliang/Documents/Kaggle/titanic/train.csv') 
df.head()
   PassengerId  Survived  Pclass                                               Name     Sex   Age  SibSp  Parch            Ticket     Fare Cabin Embarked
0            1         0       3                            Braund, Mr. Owen Harris    male  22.0      1      0         A/5 21171   7.2500   NaN        S
1            2         1       1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1      0          PC 17599  71.2833   C85        C
2            3         1       3                             Heikkinen, Miss. Laina  female  26.0      0      0  STON/O2. 3101282   7.9250   NaN        S
3            4         1       1       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1      0            113803  53.1000  C123        S
4            5         0       3                           Allen, Mr. William Henry    male  35.0      0      0            373450   8.0500   NaN        S
df.isna().sum()
PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64
# Check if the dataset is balanced by counting the unique values of the target variable:
df.Survived.value_counts()
0    549
1    342
Name: Survived, dtype: int64
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
df.shape
(891, 12)
# We do not need column PassengerId. We can drop it as follows:
df.drop(columns = ['PassengerId'], inplace = True)

# Without inplace=True, DataFrame.drop() does not change the original data. It simply produces a new copy.
# Alternatively, we could therefore write:
# df = df.drop(columns = ['PassengerId'])
df.head()
   Survived  Pclass                                               Name     Sex   Age  SibSp  Parch            Ticket     Fare Cabin Embarked
0         0       3                            Braund, Mr. Owen Harris    male  22.0      1      0         A/5 21171   7.2500   NaN        S
1         1       1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1      0          PC 17599  71.2833   C85        C
2         1       3                             Heikkinen, Miss. Laina  female  26.0      0      0  STON/O2. 3101282   7.9250   NaN        S
3         1       1       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1      0            113803  53.1000  C123        S
4         0       3                           Allen, Mr. William Henry    male  35.0      0      0            373450   8.0500   NaN        S
df.drop(columns = ['Cabin','Ticket','Name'],inplace = True)
# change Fare to Price 
df.rename(columns = {'Fare':'Price'},inplace = True)
df.head()
   Survived  Pclass     Sex   Age  SibSp  Parch    Price Embarked
0         0       3    male  22.0      1      0   7.2500        S
1         1       1  female  38.0      1      0  71.2833        C
2         1       3  female  26.0      0      0   7.9250        S
3         1       1  female  35.0      1      0  53.1000        S
4         0       3    male  35.0      0      0   8.0500        S
# Let's plot a histogram of Price (the renamed Fare):
df.Price.hist(bins=100)
<AxesSubplot:>

[Figure: histogram of Price (output_11_1.png)]

# We need to replace the values of the column 'Sex' with numbers:
# Solution 1:
df.loc[df['Sex'] == 'female','Sex'] = 0
df.loc[df['Sex'] == 'male','Sex'] = 1
df.head(5)
   Survived  Pclass  Sex   Age  SibSp  Parch    Price Embarked
0         0       3    1  22.0      1      0   7.2500        S
1         1       1    0  38.0      1      0  71.2833        C
2         1       3    0  26.0      0      0   7.9250        S
3         1       1    0  35.0      1      0  53.1000        S
4         0       3    1  35.0      0      0   8.0500        S
# Solution 2:
df2 = df.copy()

def Sex2Num(Sex_String):
    if Sex_String == 'female':
        return 0
    elif Sex_String == 'male':
        return 1
    else:
        return Sex_String
df2['Sex'] = df2['Sex'].apply(Sex2Num)
df2.head(3)
   Survived  Pclass  Sex   Age  SibSp  Parch    Price Embarked
0         0       3    1  22.0      1      0   7.2500        S
1         1       1    0  38.0      1      0  71.2833        C
2         1       3    0  26.0      0      0   7.9250        S
# Solution 3:
df3 = df.copy()
df3['Sex'] = df3['Sex'].apply(lambda x:0 if x == 'female' else 1 if x == 'male' else x)
df3.head(3)
   Survived  Pclass  Sex   Age  SibSp  Parch    Price Embarked
0         0       3    1  22.0      1      0   7.2500        S
1         1       1    0  38.0      1      0  71.2833        C
2         1       3    0  26.0      0      0   7.9250        S

pandas.get_dummies() allows us to convert a categorical variable with k possible values into k new binary variables called dummy variables. This conversion is also called one-hot encoding in computer science. Below, we convert column Embarked into dummies.

  • Note that dummy conversion is meaningful only if k is small. Otherwise, it creates too many new independent variables, each carrying only a negligible amount of information.
df = pd.get_dummies(df, columns = ['Embarked'])
df.head(10)
   Survived  Pclass  Sex   Age  SibSp  Parch    Price  Embarked_C  Embarked_Q  Embarked_S
0         0       3    1  22.0      1      0   7.2500           0           0           1
1         1       1    0  38.0      1      0  71.2833           1           0           0
2         1       3    0  26.0      0      0   7.9250           0           0           1
3         1       1    0  35.0      1      0  53.1000           0           0           1
4         0       3    1  35.0      0      0   8.0500           0           0           1
5         0       3    1   NaN      0      0   8.4583           0           1           0
6         0       1    1  54.0      0      0  51.8625           0           0           1
7         0       3    1   2.0      3      1  21.0750           0           0           1
8         1       3    0  27.0      0      2  11.1333           0           0           1
9         1       2    0  14.0      1      0  30.0708           1           0           0

We then need to drop one of the created dummies to avoid the multicollinearity problem. Let’s drop the most frequent one, Embarked_S.

df.drop(columns = 'Embarked_S', inplace = True)
df.head()
   Survived  Pclass  Sex   Age  SibSp  Parch    Price  Embarked_C  Embarked_Q
0         0       3    1  22.0      1      0   7.2500           0           0
1         1       1    0  38.0      1      0  71.2833           1           0
2         1       3    0  26.0      0      0   7.9250           0           0
3         1       1    0  35.0      1      0  53.1000           0           0
4         0       3    1  35.0      0      0   8.0500           0           0
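As an aside, pandas can also drop one dummy at encoding time via the drop_first argument of get_dummies(). A minimal sketch, assuming a DataFrame (called df_raw here, a hypothetical name) that still holds the raw Embarked column; note that drop_first=True drops the first category alphabetically (Embarked_C), not necessarily the most frequent one:

# Hypothetical alternative, not used in this walkthrough:
df_alt = pd.get_dummies(df_raw, columns = ['Embarked'], drop_first = True)
# keeps Embarked_Q and Embarked_S; dropping any one dummy avoids the multicollinearity problem,
# it merely changes which category serves as the reference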

Rearrange column order

Suppose we want to move column Pclass to after column Parch. We can do so in two ways.

  • Use DataFrame.reindex(columns=[the columns in the order that you want]); a short sketch appears after the table below
  • Or, use the following:
df = df[['Survived','Sex','Age','SibSp','Parch','Pclass','Price','Embarked_C','Embarked_Q']]
# hint: we can use df.columns.to_list() to first produce the old column order, then copy & edit
df.head()
   Survived  Sex   Age  SibSp  Parch  Pclass    Price  Embarked_C  Embarked_Q
0         0    1  22.0      1      0       3   7.2500           0           0
1         1    0  38.0      1      0       1  71.2833           1           0
2         1    0  26.0      0      0       3   7.9250           0           0
3         1    0  35.0      1      0       1  53.1000           0           0
4         0    1  35.0      0      0       3   8.0500           0           0
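For completeness, the reindex route from the first bullet above would look like the following; a minimal sketch that yields the same column order (it is a no-op here, because we already rearranged the columns):

df = df.reindex(columns = ['Survived','Sex','Age','SibSp','Parch',
                           'Pclass','Price','Embarked_C','Embarked_Q'])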

# Separate the features from the label; scikit-learn takes them as two separate inputs

# Separate the data into the feature matrix and the target array
X = df.drop(columns=['Survived'])
y = df['Survived']

# Next, split train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, 
                               test_size=0.2, # reserve 20% data for testing
                               random_state=365)

# (Not required in our class) The following is to avoid the well-known SettingWithCopyWarning
# associated with the problematic implementation of train_test_split()
#X_train = X_train.copy()
# X_test = X_test.copy()

print(X_train.shape)
print(X_test.shape)
(712, 8)
(179, 8)
# Any missing data?
df.isna().sum()
Survived        0
Pclass          0
Sex             0
Age           177
SibSp           0
Parch           0
Price           0
Embarked_C      0
Embarked_Q      0
dtype: int64
# This dataset has missing values in column Age, which we need to impute first
X_train_Age_mean = X_train['Age'].mean()
X_train['Age'] = X_train['Age'].fillna(X_train_Age_mean)

# Verify that there's no more missing values:
X_train.Age.isna().sum()
0
# Important: make sure to do exactly the same data wrangling over the test dataset!
X_test['Age'] = X_test['Age'].fillna(X_train_Age_mean)
df.isna().sum()  # note: the original df is unchanged; the imputation was applied to the X_train/X_test splits
Survived        0
Pclass          0
Sex             0
Age           177
SibSp           0
Parch           0
Price           0
Embarked_C      0
Embarked_Q      0
dtype: int64
X_test.Age.isna().sum()
0
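(Not required in our class) The same fit-on-train, apply-to-both discipline can also be expressed with scikit-learn's SimpleImputer. A minimal sketch, shown only as an alternative to the manual fillna() above (running it now would change nothing, since Age has already been imputed):

from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy = 'mean')                    # learn the mean from the train split only
X_train[['Age']] = imputer.fit_transform(X_train[['Age']])
X_test[['Age']] = imputer.transform(X_test[['Age']])          # reuse the train mean on the test split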
# Let's try logistic regression as the learning algorithm
# First, load the package
from sklearn.linear_model import LogisticRegression
# Next, set the hyperparameters of this classifier
clf_lr = LogisticRegression(
    penalty='none', # Otherwise regularization will happen (to study later)
    max_iter=1000) # The model didn't converge with default 100 iterations
# Next, fit (a.k.a. train) this model over the train dataset
clf_lr.fit(X_train,y_train)
LogisticRegression(max_iter=1000, penalty='none')
# Run this code cell to observe the coefficients of the trained model:
coef_lr = pd.DataFrame(clf_lr.coef_[0],index=X_train.columns,columns=['coefficient'])
coef_lr.transpose()
               Pclass       Sex       Age     SibSp     Parch     Price  Embarked_C  Embarked_Q
coefficient -0.995441 -2.602657 -0.032009 -0.346223 -0.059504  0.003378    0.246601    0.346033

One weakness of the scikit-learn package, compared to R packages, is that it focuses on prediction and offers less complete statistical reporting.
For example, LogisticRegression does not report p-values. If you need them, try another package, statsmodels, as follows:

import statsmodels.api as sm
logit_model=sm.Logit(y_train.astype(float),sm.add_constant(X_train.astype(float)))
result=logit_model.fit()
print(result.summary())
Optimization terminated successfully.
         Current function value: 0.452236
         Iterations 6
                           Logit Regression Results                           
==============================================================================
Dep. Variable:               Survived   No. Observations:                  712
Model:                          Logit   Df Residuals:                      703
Method:                           MLE   Df Model:                            8
Date:                Tue, 11 Oct 2022   Pseudo R-squ.:                  0.3193
Time:                        11:24:58   Log-Likelihood:                -321.99
converged:                       True   LL-Null:                       -473.03
Covariance Type:            nonrobust   LLR p-value:                 1.491e-60
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const          4.2980      0.577      7.446      0.000       3.167       5.429
Pclass        -0.9946      0.157     -6.337      0.000      -1.302      -0.687
Sex           -2.6030      0.219    -11.902      0.000      -3.032      -2.174
Age           -0.0320      0.008     -3.761      0.000      -0.049      -0.015
SibSp         -0.3462      0.123     -2.818      0.005      -0.587      -0.105
Parch         -0.0598      0.139     -0.431      0.666      -0.331       0.212
Price          0.0034      0.003      1.247      0.213      -0.002       0.009
Embarked_C     0.2466      0.260      0.947      0.344      -0.264       0.757
Embarked_Q     0.3443      0.376      0.915      0.360      -0.393       1.082
==============================================================================


/Users/gaoliang/opt/anaconda3/lib/python3.9/site-packages/statsmodels/tsa/tsatools.py:142: FutureWarning: In a future version of pandas all arguments of concat except for the argument 'objs' will be keyword-only
  x = pd.concat(x[::order], 1)
# Now back to LogisticRegression in scikit-learn. Let's evaluate the performance of the 
# trained model. To do so, we first use the trained model to predict the test dataset.
y_predict = clf_lr.predict(X_test)
# Then, compare the predicted values with the truth to get accuracy
from sklearn.metrics import accuracy_score
accuracy_score(y_test, y_predict).round(4)
0.8156
# Observe the confusion matrix
from sklearn.metrics import confusion_matrix
print(confusion_matrix(y_test, y_predict))
[[96 12]
 [21 50]]
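As a sanity check, the accuracy can be read directly off the confusion matrix: (96 + 50) correct predictions out of 179 test passengers gives 0.8156, matching accuracy_score above. A small sketch:

tn, fp, fn, tp = confusion_matrix(y_test, y_predict).ravel()
print(((tn + tp) / (tn + fp + fn + tp)).round(4))   # (96 + 50) / 179 = 0.8156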

When it comes to creating a trained algorithm (a.k.a. a trained model, or simply a model), we know there are many possible choices:

  • There are many learning algorithms to choose from.
    • E.g., regression, trees, kNN, SVM, neural networks, ensembles, …
  • Also importantly, most learning algorithms have hyperparameters that need to be set before training. The choices of these hyperparameters may result in different trained models.
    • E.g., for decision tree learning, how deep to allow a tree to grow? For regression, should we consider regularization? For kNN, what value of k to set? …

There is no free lunch – there is no sure choice that dominates all other choices. Otherwise, we wouldn’t see all these choices in today’s analytics practices.

In this lecture, we will try and discuss the pros/cons of a few popular learning algorithms. We leave the topic of hyperparameter tuning, as well as the discussion of the state-of-the-art boosting-based algorithms (that always require hyperparameter tuning), to the next lecture.

# A template for implementing various supervised learning algorithms 
# I assume that, prior to running this code, we have already pre-processed the data

# Load the learning algorithm
from sklearn.linear_model import LogisticRegression

# Set the hyperparameters of this algorithm
clf = LogisticRegression(penalty='none', max_iter=1000)

# Fit the model over the train data
clf.fit(X_train,y_train)

# Use the fitted model to predict the test data
y_predict = clf.predict(X_test)

# Obtain performance metrics
accuracy = accuracy_score(y_test, y_predict).round(4)
print(f"The accuracy is: {accuracy:.2%}")
print("The confusion matrix is:")
cm = confusion_matrix(y_test, y_predict)
print(cm)

# Save the model and the performance metrics for later comparison.
# Here I use suffix "lr" because we just tried logistic regression.
# Change the suffix when you switch to a new learning algorithm!
clf_lr = clf
accuracy_lr = accuracy
cm_lr = cm
The accuracy is: 81.56%
The confusion matrix is:
[[96 12]
 [21 50]]
# k-Nearest Neighbors (kNN)
from sklearn.neighbors import KNeighborsClassifier

clf = KNeighborsClassifier(n_neighbors=3)

clf.fit(X_train,y_train)

y_predict = clf.predict(X_test)

accuracy = accuracy_score(y_test, y_predict).round(4)
print(f"The accuracy is: {accuracy:.2%}")
print("The confusion matrix is:")
cm = confusion_matrix(y_test, y_predict)
print(cm)

# save the results for later comparison
clf_knn = clf
accuracy_knn = accuracy
cm_knn = cm
The accuracy is: 70.95%
The confusion matrix is:
[[92 16]
 [36 35]]
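kNN is distance-based, so features on larger scales (such as Price and Age) can dominate the 0/1 dummies when computing distances. A hedged sketch of adding standardization via a scikit-learn Pipeline, worth trying even though scaling is not part of this lecture (the accuracy you obtain may differ):

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

clf_knn_scaled = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3))
clf_knn_scaled.fit(X_train, y_train)
print(accuracy_score(y_test, clf_knn_scaled.predict(X_test)).round(4))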
# Decision Trees
from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier(max_depth=2)
clf.fit(X_train,y_train)

y_predict = clf.predict(X_test)

accuracy = accuracy_score(y_test, y_predict).round(4)
print(f"The accuracy is: {accuracy:.2%}")
print("The confusion matrix is:")
cm = confusion_matrix(y_test, y_predict)
print(cm)

# save the results for later comparison
clf_dt = clf
accuracy_dt = accuracy
cm_dt = cm
The accuracy is: 77.65%
The confusion matrix is:
[[104   4]
 [ 36  35]]

Plotting the trained tree

One advantage of decision tree learning is that the trained model is often intuitive to human beings. Therefore, despite its often inferior predictive performance, especially on large and complicated datasets, analysts use it a lot in practice for understanding the data and for communicating with others. Let’s plot the trained tree we just obtained.

from sklearn import tree
import matplotlib.pyplot as plt
# warning: if the tree is too big to read, limit the max_depth of the tree during training

plt.figure(figsize=(15,10))  # set plot size (denoted in inches)
tree.plot_tree(clf_dt,
               feature_names=X_train.columns,
               filled = True,
               fontsize=12)
plt.show()

[Figure: the trained decision tree (output_47_0.png)]

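# Random Forest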
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(max_depth=2, random_state=0)
clf.fit(X_train,y_train)

y_predict = clf.predict(X_test)

accuracy = accuracy_score(y_test, y_predict).round(4)
print(f"The accuracy is: {accuracy:.2%}")
print("The confusion matrix is:")
cm = confusion_matrix(y_test, y_predict)
print(cm)

# save the results for later comparison
clf_rf = clf
accuracy_rf = accuracy
cm_rf = cm
The accuracy is: 79.33%
The confusion matrix is:
[[107   1]
 [ 36  35]]

Using random forest for ranking the importance of features

A handy feature of RandomForestClassifier is that it provides a robust ranking of the relative importance of all input variables.

  • Often subsequently used for manual feature selection (a short sketch appears after the output below).
  • Note that this ranking sidesteps the (messy) problem of choosing among correlated variables.
importances = clf_rf.feature_importances_
pd.Series(importances, index=X_train.columns).sort_values(ascending=False)
Sex           0.322660
Price         0.215141
Pclass        0.201329
Age           0.122027
SibSp         0.070367
Parch         0.031325
Embarked_C    0.028627
Embarked_Q    0.008524
dtype: float64
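A minimal sketch of the manual feature selection mentioned above, keeping only features whose importance exceeds a hypothetical threshold of 0.05 (the threshold is an assumption, not part of the lecture):

importance_series = pd.Series(importances, index = X_train.columns)
selected = importance_series[importance_series > 0.05].index.tolist()   # hypothetical cutoff
X_train_sel = X_train[selected]   # Sex, Price, Pclass, Age, SibSp given the ranking above
X_test_sel = X_test[selected]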

A recap of the learning algorithms in the scikit-learn package

The scikit-learn package contains a large selection of traditional supervised learning algorithms, with excellent documentation and coding examples.

  • “traditional” means not using the deep learning approach

I expect you to be able to use the following learning algorithms:

  • Linear Models including both linear regression and logistic regression
    • For better predictive power, use Python
    • For better explanatory power, use R
  • Decision Trees
  • Nearest Neighbors
    • Simple, with only one key hyperparameter (the number of neighbors k), and can also be used for imputation
  • Ensemble methods including random forest and XGBoost (next lecture)
  • Neural Networks (to study in the second-half of the semester)

(Models NOT required for this course) It is a good idea for you to at least read a bit about the following learning algorithms:

  • Support Vector Machines – the idea is to use a hyperplane to separate data in a high-dimensional space; it was very popular before ensemble methods took off
  • Naive Bayes – based on the Bayes’ Theorem
  • Stochastic Gradient Descent (SGD) – It provides an efficient computational approach to fitting supervised learning models, especially when the data is big. This is a core fitting method used in deep learning (where the data is almost always big).

You should also be familiar with the concept of regularization, which is commonly used in machine learning (see Part 4 of this lecture); a brief scikit-learn sketch follows the list below.

  • purpose is to control overfitting
  • two variations
    • L1 (a.k.a. Lasso) regularization
    • L2 (a.k.a. Ridge) regularization
    • (if you use both L1 and L2, it’s called Elastic Net)
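A minimal scikit-learn sketch of turning regularization on for logistic regression (regularization itself is covered in Part 4; the C values below are illustrative assumptions, and smaller C means a stronger penalty):

clf_l2 = LogisticRegression(penalty = 'l2', C = 1.0, max_iter = 1000)                        # L2 / Ridge
clf_l1 = LogisticRegression(penalty = 'l1', C = 1.0, solver = 'liblinear', max_iter = 1000)  # L1 / Lasso
# penalty = 'elasticnet' combines both; it requires solver = 'saga' and an l1_ratio argument
clf_l2.fit(X_train, y_train)
clf_l1.fit(X_train, y_train)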
