TItanic using SVM Random Forest NB NN

import warnings
import pandas as pd
import numpy as np
import pydotplus as pydotplus
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
warnings.filterwarnings("ignore")

Information of the dataset and Preprocessing

The titanic dataset isn’t ready for the whole machine learning process. Some features are useless for classification and some values are invalid for the models we would used latter.

Overview of the Titanic dataset

titanic = pd.read_csv("D:/python/shlomo/titanic/train.csv")
titanic.head()
PassengerIdSurvivedPclassNameSexAgeSibSpParchTicketFareCabinEmbarked
0103Braund, Mr. Owen Harrismale22.010A/5 211717.2500NaNS
1211Cumings, Mrs. John Bradley (Florence Briggs Th...female38.010PC 1759971.2833C85C
2313Heikkinen, Miss. Lainafemale26.000STON/O2. 31012827.9250NaNS
3411Futrelle, Mrs. Jacques Heath (Lily May Peel)female35.01011380353.1000C123S
4503Allen, Mr. William Henrymale35.0003734508.0500NaNS
titanic.shape
(891, 12)
titanic.dtypes.value_counts().plot.pie(explode=[0.1, 0.1, 0.1], autopct='%1.2f%%')
plt.title('Data Type');

在这里插入图片描述

The Titanic training dataset consists of 12 columns, 891 rows. Inside the dataset, only the colums ‘Pclass’, ‘Sex’, ‘Age’, ‘SibSp’, ‘Parch’, ‘Fare’, ‘Cabin’, ‘Embarked’ is useful features.

Correlation Analysis

corr=titanic.corr()#
sns.heatmap(corr, vmax=.8, linewidths=0.01, square=True,annot=True,linecolor="black")
plt.title('Correlation between features')
plt.show()

在这里插入图片描述

To emphasize the highly related pattern, we chose the correlation over 25% features showing in the heat plot below.

hig_corr = titanic.corr()
hig_corr_features = hig_corr.index[abs(hig_corr["Fare"]) >= 0.25]
hig_corr_features
Index(['Survived', 'Pclass', 'Fare'], dtype='object')
ax = sns.heatmap(titanic[hig_corr_features].corr(), annot=True, linewidth=3)
bottom, top = ax.get_ylim()
ax.set_ylim(bottom + 0.5, top - 0.5)
plt.show()

在这里插入图片描述

Obviously there is a relation between ‘Pclass’ and ‘Fare’ for the more you paid the higher class you get on the Titanic.

After finding the highly related features, we could use a histogram to indicate the relation between ‘Pclass’ and ‘Age’.

g = sns.FacetGrid(titanic, col="Pclass")
g = g.map(plt.hist, "Age")

在这里插入图片描述

We could see ‘Pclass’=3 is significantly younger tha the other two class.

g = sns.FacetGrid(titanic, col="Pclass")
g = g.map(plt.hist, "Survived")

在这里插入图片描述

From the histogram we could see that poor people in the ‘Pclass’=3, a huge amout of them didn’t survive from the disaster.

g = sns.FacetGrid(titanic, col="Sex")
g = g.map(plt.hist, "Survived")

在这里插入图片描述

Just like the movie, chivalry like ‘female and children first’ caused the female survived number is higher than male.

NAN Analysis

def missing_value (df):
    missing_Number = df.isnull().sum().sort_values(ascending=False)[df.isnull().sum().sort_values(ascending=False) !=0]
    missing_percent=round((df.isnull().sum()/df.isnull().count())*100,2)[round((df.isnull().sum()/df.isnull().count())*100,2) !=0]
    missing = pd.concat([missing_Number,missing_percent],axis=1,keys=['Missing Number','Missing Percentage'])
    return missing
missing_values = titanic.isnull().sum()
missing_values = missing_values[missing_values > 0]
missing_values.sort_values(inplace=True)
missing_values.plot.pie(explode=[0.1, 0.1, 0.1], autopct='%1.2f%%')
plt.title('Missing Values')
Text(0.5, 1.0, 'Missing Values')

在这里插入图片描述

sns.heatmap(titanic.isnull(),cmap='cool');

在这里插入图片描述

Data cleaning

Since there’s missing data, we need to drop some rows that included NAN and the whole column ‘Cabin’

titanic = titanic.drop(['Cabin'],axis=1)
titanic['Age'] = titanic['Age'].fillna(titanic['Age'].mean())
titanic[titanic['Embarked'].isnull()]
PassengerIdSurvivedPclassNameSexAgeSibSpParchTicketFareEmbarked
616211Icard, Miss. Ameliefemale38.00011357280.0NaN
82983011Stone, Mrs. George Nelson (Martha Evelyn)female62.00011357280.0NaN
titanic['Embarked'] = titanic['Embarked'].fillna(method='bfill')
titanic = titanic.drop(['Name','Ticket'],axis=1)
titanic = titanic.drop(['PassengerId'],axis=1)
titanic.head()
SurvivedPclassSexAgeSibSpParchFareEmbarked
003male22.0107.2500S
111female38.01071.2833C
213female26.0007.9250S
311female35.01053.1000S
403male35.0008.0500S

One hot encoding

Since the columns ‘Sex’, ‘Embarked’ are discrete feature, we use the one hot encoding turning them into dummy variables.

titanic = pd.get_dummies(titanic,columns=['Sex','Embarked'],drop_first=True)
titanic.head()
SurvivedPclassAgeSibSpParchFareSex_maleEmbarked_QEmbarked_S
00322.0107.2500101
11138.01071.2833000
21326.0007.9250001
31135.01053.1000001
40335.0008.0500101

Train test split

The Titanic dataset didn’t separate the dataset into two half. The ‘test_data’ is a column vector which need us to predict. So we should use a random shuffle to split the train data and the test data

X = titanic.drop(['Survived'],axis=1)
y = titanic['Survived']
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=21)

Standardization

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
X_train_plot=X_train
X_test_plot=X_test

X_train = pd.DataFrame(X_train, columns=X.columns)
X_test = pd.DataFrame(X_test, columns=X.columns)
X_train.head()
PclassAgeSibSpParchFareSex_maleEmbarked_QEmbarked_S
0-1.5843960.010681-0.479698-0.460682-0.0186000.728823-0.311564-1.611198
1-1.584396-0.119643-0.479698-0.4606820.0792450.728823-0.3115640.620656
2-1.584396-0.503148-0.4796980.8106570.6466240.728823-0.311564-1.611198
3-0.381742-1.1934560.493365-0.460682-0.031329-1.372075-0.311564-1.611198
40.8209130.033758-0.479698-0.460682-0.4798180.728823-0.3115640.620656
X_test.head()
PclassAgeSibSpParchFareSex_maleEmbarked_QEmbarked_S
00.820913-0.2730450.493365-0.460682-0.315867-1.372075-0.3115640.620656
10.820913-0.809952-0.479698-0.460682-0.4854190.728823-0.3115640.620656
20.820913-0.733251-0.479698-0.460682-0.4673430.728823-0.3115640.620656
30.8209130.010681-0.479698-0.4606820.5068580.728823-0.3115640.620656
4-0.3817420.4939640.4933652.081997-0.0785960.728823-0.3115640.620656

After standardizing the data, the data is suitable for our models. We finally could train our model.

SVM

from sklearn.svm import SVC
svc = SVC()
svc.fit(X_train, y_train)
Y_pred = svc.predict(X_test)

fig, axes = plt.subplots(2, 2, figsize=(10, 10))
plt.subplot(2,2,1)
plt.scatter(X_train['Pclass'], X_train['Age'], c=y_train, cmap='viridis',label='Train')
plt.scatter(X_test['Pclass'], X_test['Age'], c=Y_pred, cmap='viridis', marker='x',label='Test')
plt.xlabel('Pclass')
plt.ylabel('Age')
plt.title('SVM Classification')

plt.subplot(2,2,2)
plt.scatter(X_train['Embarked_Q'], X_train['Parch'], c=y_train, cmap='viridis',label='Train')
plt.scatter(X_test['Embarked_Q'], X_test['Parch'], c=Y_pred, cmap='viridis', marker='x',label='Test')
plt.xlabel('Embarked_Q')
plt.ylabel('Parch')
plt.title('SVM Classification')


plt.subplot(2,2,3)
plt.scatter(X_train['Fare'], X_train['SibSp'], c=y_train, cmap='viridis',label='Train')
plt.scatter(X_test['Fare'], X_test['SibSp'], c=Y_pred, cmap='viridis', marker='x',label='Test')
plt.xlabel('Fare')
plt.ylabel('Sibsp')
plt.title('SVM Classification')


plt.subplot(2,2,4)
plt.scatter(X_train['Sex_male'], X_train['Embarked_S'], c=y_train, cmap='viridis',label='Train')
plt.scatter(X_test['Sex_male'], X_test['Embarked_S'], c=Y_pred, cmap='viridis', marker='x',label='Test')
plt.xlabel('Sex_male')
plt.ylabel('Embarked_S')
plt.title('SVM Classification')
plt.legend()
plt.show()

在这里插入图片描述

The result that SVM classification indicated upwards is not ideal. Due to it is hard to find a hyperplane which separating the dataset in two correctly. So we need PCA to dimensionally reducing the dataset as the processed one is more suitable for either SVM or visualization.

PCA

from sklearn.decomposition import PCA
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
from mlxtend.plotting import plot_decision_regions

X_train_reduced = PCA(n_components = 2).fit_transform(X_train)
X_test_reduced  = PCA(n_components=  2).fit_transform(X_test)
svc = SVC()
svc.fit(X_train_reduced, y_train)
Y_pred = svc.predict(X_test_reduced)
t = np.array(y_train)
t = t.astype(np.integer)
plt.figure(figsize = [15,10])
plot_decision_regions(X_train_reduced, t, clf = svc, hide_spines = False, colors = 'purple,limegreen',markers = ['^','v'])
<AxesSubplot:>

在这里插入图片描述

from sklearn.model_selection import cross_val_score
scores = cross_val_score(svc,X_train_reduced,y_train,cv=5)
scores.mean()
0.7204865556978233

Finding the best parameter of different kernels

from sklearn import svm
from sklearn.model_selection import cross_val_score
import matplotlib.pyplot as plt

score=[]
gams = []
maxscore = 0
maxgam = 0
for gam in range(1,11):
    clf = svm.SVC(kernel='sigmoid', gamma=gam/10)
    scores = cross_val_score(clf,X_train_reduced,y_train,cv=5)
    score.append(scores.mean())
    if scores.mean()>maxscore:
        maxscore=scores.mean()
        maxgam=gam
    gams.append(gam/10)
    plt.plot(gams,score)
    plt.title('kernel=sigmoid')
    plt.xlabel('gamma')
    plt.ylabel('score')
print('kernel=sigmoid, best gamma',maxgam/10)
kernel=sigmoid, best gamma 0.1

在这里插入图片描述

score=[]
gams = []
maxscore = 0
maxgam = 0
maxpoly = 0
for gam in range(1,11):
    clf = svm.SVC(kernel='poly', gamma=gam/10)
    scores = cross_val_score(clf,X_train_reduced,y_train,cv=5)
    score.append(scores.mean())
    if scores.mean()>maxscore:
        maxscore=scores.mean()
        maxgam=gam
    gams.append(gam/10)
    plt.plot(gams,score)
    plt.title('kernel=poly')
    plt.xlabel('gamma')
    plt.ylabel('score')
print('kernel=poly, best gamma',maxgam/10)
kernel=poly, best gamma 0.2

在这里插入图片描述

score=[]
gams = []
maxscore = 0
maxgam = 0
for gam in range(1,11):
    clf = svm.SVC(kernel='rbf', gamma=gam/10)
    scores = cross_val_score(clf,X_train_reduced,y_train,cv=5)
    score.append(scores.mean())
    if scores.mean()>maxscore:
        maxscore=scores.mean()
        maxgam=gam
    gams.append(gam/10)
    plt.plot(gams,score)
    plt.title('kernel=rbf')
    plt.xlabel('gamma')
    plt.ylabel('score')
print('kernel=rbf, best gamma',maxgam/10)
kernel=rbf, best gamma 1.0

在这里插入图片描述

Comparison of different kernel

from mlxtend.plotting import plot_decision_regions
C=1.0
maxscore_among_kernels=0
bestkernel=''
models = (svm.SVC(kernel='linear', C=C),
          svm.SVC(kernel='rbf', gamma=1.0, C=C),
          svm.SVC(kernel='poly', degree=3, gamma=0.2, C=C),
          svm.SVC(kernel='sigmoid',gamma=0.1,C=C)
         )
titles = ('SVC with linear kernel',
          'SVC with RBF kernel',
          'SVC with polynomial (degree 3) kernel',
          'SVC with sigmoid kernel')
models = (clf.fit(X_train_reduced, y_train) for clf in models)

t = np.array(y_train)
t = t.astype(np.integer)

for clf,title in zip(models,titles):
    clf.fit(X_train_reduced,t)
    plt.figure(figsize = [15,10])
    plot_decision_regions(X_train_reduced, t, clf = clf, hide_spines = False, colors = 'purple,limegreen',markers = ['^','v'])
    plt.title(title)
    scores = cross_val_score(clf,X_train_reduced,y_train,cv=5)
    print(title,clf.score(X_train_reduced,y_train))
    if scores.mean()>maxscore_among_kernels:
        maxscore_among_kernels=scores.mean()
        bestkernel=title
    print("%s mean ACC:%f"%(title,scores.mean()))
print('best kernel is:', bestkernel)
SVC with linear kernel 0.675561797752809
SVC with linear kernel mean ACC:0.678351
SVC with RBF kernel 0.7598314606741573
SVC with RBF kernel mean ACC:0.737349
SVC with polynomial (degree 3) kernel 0.6573033707865169
SVC with polynomial (degree 3) kernel mean ACC:0.653068
SVC with sigmoid kernel 0.6334269662921348
SVC with sigmoid kernel mean ACC:0.651650
best kernel is: SVC with RBF kernel

在这里插入图片描述
在这里插入图片描述
在这里插入图片描述
在这里插入图片描述

C=1.0
svc = SVC(kernel='rbf', gamma=1.0, C=C)
svc.fit(X_train_reduced, y_train)
y_pred_svc = svc.predict(X_test_reduced)
y_pred_svc
array([1, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1,
       0, 0, 0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0,
       0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0,
       1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0,
       0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0,
       0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0,
       0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0,
       1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 1,
       0, 0, 0], dtype=int64)

Naive Bayes Classifier

In this section we could use more distribution function for the naive Bayes classifier, but we encounter that MultinomialNB cannot receive a less than zero input. So in this section we will only be using the Gaussian and Bernoulli distribution.

from sklearn.naive_bayes import GaussianNB, BernoulliNB, MultinomialNB
gnb = GaussianNB()
gnb.fit(X_train, y_train)
y_pred_gnb = gnb.predict(X_test)
bnb = BernoulliNB()
bnb.fit(X_train, y_train)
y_pred_bnb = gnb.predict(X_test)
y_pred_gnb,y_pred_bnb
(array([1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1,
        0, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0,
        0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0,
        1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 0, 1, 1,
        1, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 0, 0,
        0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0,
        0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0,
        0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1,
        1, 0, 0], dtype=int64),
 array([1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1,
        0, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0,
        0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0,
        1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 0, 1, 1,
        1, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 0, 0,
        0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0,
        0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0,
        0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1,
        1, 0, 0], dtype=int64))

Random Forest Classifier

from sklearn.ensemble import RandomForestClassifier
random_forest = RandomForestClassifier(n_estimators=100)
random_forest.fit(X_train, y_train)
y_pred_rf = random_forest.predict(X_test)
random_forest.score(X_train, y_train)

random_forest_train = round(random_forest.score(X_train, y_train) * 100, 2)
random_forest_accuracy = round(accuracy_score(y_pred_rf, y_test) * 100, 2)
score = cross_val_score(random_forest,X_train,y_train,cv=5)
print(score.mean())
print("Training Accuracy     :",random_forest_train)
print("Model Accuracy Score  :",random_forest_accuracy)
0.7893725992317541
Training Accuracy     : 98.6
Model Accuracy Score  : 82.12
from sklearn import tree
import graphviz

dot_data = tree.export_graphviz(random_forest.estimators_[-1], out_file=None, feature_names=X_train.columns, class_names=['0','1'],filled=True,rounded=True,special_characters=True)
graph = graphviz.Source(dot_data)
graph.view()
'Source.gv.pdf'

For it is way too big to demonstrate the random forest in the .ipynb, we only screenshot part of it for overviewing. The function ‘graph.view’ would open the whole random forest as a pdf file.

在这里插入图片描述

Neron Network

from keras.models import Sequential
from keras.layers import Dense
from sklearn.model_selection import train_test_split

model = Sequential()
model.add(Dense(units=64, activation='relu', input_dim=X_train.shape[1]))
model.add(Dense(units=1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()
Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 dense (Dense)               (None, 64)                576       
                                                                 
 dense_1 (Dense)             (None, 1)                 65        
                                                                 
=================================================================
Total params: 641
Trainable params: 641
Non-trainable params: 0
_________________________________________________________________
model.fit(X_train, y_train, epochs=1000, batch_size=32, verbose=1)
loss, accuracy = model.evaluate(X_test, y_test)
print("Test Loss:", loss)
print("Test Accuracy:", accuracy)
Epoch 1000/1000
23/23 [==============================] - 0s 1ms/step - loss: 0.2903 - accuracy: 0.8778
6/6 [==============================] - 0s 1ms/step - loss: 0.4956 - accuracy: 0.8380
Test Loss: 0.49561816453933716
Test Accuracy: 0.8379888534545898

The neuron network’ s result is not suitable for the Titanic case since it’s output is not binary. We need to transform the result into binary.

y_pred_nn = model.predict(X_test)
y_pred_nn =binary_predictions = np.where(y_pred_bnb >= 0.5, 1, 0)
y_pred_nn
6/6 [==============================] - 0s 800us/step





array([1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1,
       0, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0,
       0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0,
       1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 0, 1, 1,
       1, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 0, 0,
       0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0,
       0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0,
       0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1,
       1, 0, 0])

Model evaluation

In this section we will be using method confusion matrix, accuracy, precision score, recall score, F1 score to evaluate the performance.

Confusion matrix

The confusion matrix is a tabular representation that provides a detailed breakdown of the performance of a classification model. It summarizes the predictions made by the model and compares them to the actual class labels of the dataset.

from sklearn.metrics import confusion_matrix
import seaborn as sns

# Compute the confusion matrix
cm_svm = confusion_matrix(y_test, y_pred_svc)
cm_bnb = confusion_matrix(y_test,y_pred_bnb)
cm_gnb = confusion_matrix(y_test,y_pred_gnb)
cm_rf = confusion_matrix(y_test, y_pred_rf)
cm_nn = confusion_matrix(y_test, y_pred_nn)

# plot confusion matrix
fig, axes = plt.subplots(5, 1, figsize=(12, 20))
plt.subplots_adjust(hspace=0.5, wspace=0.5)

plt.subplot(5,1,1)
sns.heatmap(cm_svm, annot=True, fmt='d', cmap='plasma')
plt.xlabel('Predicted')
plt.ylabel('True')
plt.title('Confusion Matrix_SVM_kernel=\'rbf\'')
plt.subplot(5,1,2)
sns.heatmap(cm_bnb, annot=True, fmt='d', cmap='plasma')
plt.xlabel('Predicted')
plt.ylabel('True')
plt.title('Confusion Matrix_Naive Bayes_Bernoulli')
plt.subplot(5,1,3)
sns.heatmap(cm_gnb, annot=True, fmt='d', cmap='plasma')
plt.xlabel('Predicted')
plt.ylabel('True')
plt.title('Confusion Matrix_Naive Bayes_Gaussian')
plt.subplot(5,1,4)
sns.heatmap(cm_rf, annot=True, fmt='d', cmap='plasma')
plt.xlabel('Predicted')
plt.ylabel('True')
plt.title('Confusion Matrix_Random Forest')
plt.subplot(5,1,5)
sns.heatmap(cm_rf, annot=True, fmt='d', cmap='plasma')
plt.xlabel('Predicted')
plt.ylabel('True')
plt.title('Confusion Matrix_Neron Network')

plt.show()

在这里插入图片描述

Accuracy measures the overall correctness of a classification model. It is the ratio of the correctly predicted samples to the total number of samples in the dataset. Precision measures the proportion of correctly predicted positive samples out of the total predicted positive samples. It focuses on the accuracy of positive predictions.Recall, also known as sensitivity or true positive rate, measures the proportion of correctly predicted positive samples out of the total actual positive samples.The F1 score is a harmonic mean of precision and recall. It provides a single metric that balances both precision and recall.

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Compute accuracy
accuracy_svm = accuracy_score(y_test, y_pred_svc)
accuracy_bnb = accuracy_score(y_test, y_pred_bnb)
accuracy_gnb = accuracy_score(y_test, y_pred_gnb)
accuracy_rf = accuracy_score(y_test, y_pred_rf)
accuracy_nn = accuracy_score(y_test, y_pred_nn)


# Compute precision
precision_svm = precision_score(y_test, y_pred_svc)
precision_bnb = precision_score(y_test, y_pred_bnb)
precision_gnb = precision_score(y_test, y_pred_gnb)
precision_rf = precision_score(y_test, y_pred_rf)
precision_nn = precision_score(y_test, y_pred_nn)

# Compute recall
recall_svm = recall_score(y_test, y_pred_svc)
recall_bnb = recall_score(y_test, y_pred_bnb)
recall_gnb = recall_score(y_test, y_pred_gnb)
recall_rf = recall_score(y_test, y_pred_rf)
recall_nn = recall_score(y_test, y_pred_nn)

# Compute F-score
fscore_svm = f1_score(y_test, y_pred_svc)
fscore_bnb = f1_score(y_test, y_pred_bnb)
fscore_gnb = f1_score(y_test, y_pred_gnb)
fscore_rf = f1_score(y_test, y_pred_rf)
fscore_nn = f1_score(y_test, y_pred_nn)

# Create a DataFrame
result = {
    'Accuracy': [accuracy_svm, accuracy_bnb, accuracy_gnb, accuracy_rf, accuracy_nn],
    'Precision': [precision_svm, precision_bnb, precision_gnb, precision_rf, precision_nn],
    'Recall': [recall_svm, recall_bnb, recall_gnb, recall_rf, recall_nn],
    'F1-score': [fscore_svm, fscore_bnb, fscore_gnb, fscore_rf, fscore_nn]
}
score_df = pd.DataFrame(result, index=['SVM', 'BNB', 'GNB', 'RF', 'NN'])
score_df
AccuracyPrecisionRecallF1-score
SVM0.6703910.6315790.4864860.549618
BNB0.8156420.8059700.7297300.765957
GNB0.8156420.8059700.7297300.765957
RF0.8212290.8181820.7297300.771429
NN0.8156420.8059700.7297300.765957
print("Best ACC performance model:", score_df['Accuracy'].idxmax())
print("Best Precision performance model:", score_df['Precision'].idxmax())
print("Best Recall performance model:", score_df['Recall'].idxmax())
print("Best F1 performance model:", score_df['F1-score'].idxmax())
Best ACC performance model: RF
Best Precision performance model: RF
Best Recall performance model: BNB
Best F1 performance model: RF
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, plot_roc_curve

# Compute ROC curve for each model
fpr_svm, tpr_svm, _ = roc_curve(y_test, y_pred_svc)
fpr_bnb, tpr_bnb, _ = roc_curve(y_test, y_pred_bnb)
fpr_gnb, tpr_gnb, _ = roc_curve(y_test, y_pred_gnb)
fpr_rf, tpr_rf, _ = roc_curve(y_test, y_pred_rf)
fpr_nn, tpr_nn, _ = roc_curve(y_test, y_pred_nn)

# Plot ROC curve
plt.figure(figsize=(8, 6))
plt.plot(fpr_svm, tpr_svm, label='SVM')
plt.plot(fpr_bnb, tpr_bnb, label='BNB')
plt.plot(fpr_gnb, tpr_gnb, label='GNB')
plt.plot(fpr_rf, tpr_rf, label='RF')
plt.plot(fpr_nn, tpr_nn, label='NN')
plt.plot([0, 1], [0, 1], 'k--')  # Diagonal line
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) Curve')
plt.legend()
plt.show()

在这里插入图片描述

AUC is used for evaluating the performance of binary classification models based on the Receiver Operating Characteristic (ROC) curve. The ROC curve is created by plotting the true positive rate (TPR) against the false positive rate (FPR) at various classification thresholds. AUC represents the area under this curve and provides an aggregate measure of the model’s ability to discriminate between positive and negative classes. A higher AUC value indicates better overall performance, with a value of 1 representing a perfect classifier and a value of 0.5 indicating a random classifier.

from sklearn.metrics import roc_auc_score

# Compute AUC for each model
auc_svm = roc_auc_score(y_test, y_pred_svc)
auc_bnb = roc_auc_score(y_test, y_pred_bnb)
auc_gnb = roc_auc_score(y_test, y_pred_gnb)
auc_rf = roc_auc_score(y_test, y_pred_rf)
auc_nn = roc_auc_score(y_test, y_pred_nn)

# Create a DataFrame for AUC
auc = {
    'AUC': [auc_svm, auc_bnb, auc_gnb, auc_rf, auc_nn]
}

df_auc = pd.DataFrame(auc, index=['SVM', 'BNB', 'GNB', 'RF', 'NN'])

# Output the DataFrame
print(df_auc)
print("Best AUC performed model:",df_auc.idxmax())
          AUC
SVM  0.643243
BNB  0.802960
GNB  0.802960
RF   0.807722
NN   0.802960
Best AUC performed model: AUC    RF
dtype: object

Comprehensively, Random Forest Classifier has the best performance in the previous evaluation. For binary classification, Random Forest is an ensemble learning method that combines multiple decision trees. The model aggregates the predictions of individual trees and selects the majority vote or average prediction, resulting in a more robust and accurate final prediction.Random Forest can effectively capture nonlinear relationships between features and the target variable. Each decision tree in the ensemble is constructed based on different random subsets of features, allowing the model to learn diverse patterns and capture complex interactions between variables. This flexibility makes Random Forest well-suited for capturing complex decision boundaries and handling nonlinear relationships in the data.Random Forest is robust to outliers and irrelevant features. Outliers have limited impact on the overall model performance as each decision tree in the ensemble is built independently. Additionally, the random feature selection process ensures that irrelevant features have a diminished influence on the final predictions, resulting in a more focused and accurate model.

  • 2
    点赞
  • 0
    收藏
    觉得还不错? 一键收藏
  • 2
    评论
评论 2
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值