TItanic using SVM Random Forest NB NN

最新推荐文章于 2024-09-10 23:56:09 发布

送快递的勃仕

最新推荐文章于 2024-09-10 23:56:09 发布

阅读量210

点赞数 2

文章标签：支持向量机随机森林算法

本文链接：https://blog.csdn.net/tew_315/article/details/131641499

版权

import warnings
import pandas as pd
import numpy as np
import pydotplus as pydotplus
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
warnings.filterwarnings("ignore")

Information of the dataset and Preprocessing

The titanic dataset isn’t ready for the whole machine learning process. Some features are useless for classification and some values are invalid for the models we would used latter.

Overview of the Titanic dataset

titanic = pd.read_csv("D:/python/shlomo/titanic/train.csv")
titanic.head()

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Ticket	Fare	Cabin	Embarked
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	A/5 21171	7.2500	NaN	S
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	PC 17599	71.2833	C85	C
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	STON/O2. 3101282	7.9250	NaN	S
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	113803	53.1000	C123	S
4	5	0	3	Allen, Mr. William Henry	male	35.0	0	373450	8.0500	NaN	S

titanic.shape

(891, 12)

titanic.dtypes.value_counts().plot.pie(explode=[0.1, 0.1, 0.1], autopct='%1.2f%%')
plt.title('Data Type');

在这里插入图片描述

The Titanic training dataset consists of 12 columns, 891 rows. Inside the dataset, only the colums ‘Pclass’, ‘Sex’, ‘Age’, ‘SibSp’, ‘Parch’, ‘Fare’, ‘Cabin’, ‘Embarked’ is useful features.

Correlation Analysis

corr=titanic.corr()#
sns.heatmap(corr, vmax=.8, linewidths=0.01, square=True,annot=True,linecolor="black")
plt.title('Correlation between features')
plt.show()

在这里插入图片描述

To emphasize the highly related pattern, we chose the correlation over 25% features showing in the heat plot below.

hig_corr = titanic.corr()
hig_corr_features = hig_corr.index[abs(hig_corr["Fare"]) >= 0.25]
hig_corr_features

Index(['Survived', 'Pclass', 'Fare'], dtype='object')

ax = sns.heatmap(titanic[hig_corr_features].corr(), annot=True, linewidth=3)
bottom, top = ax.get_ylim()
ax.set_ylim(bottom + 0.5, top - 0.5)
plt.show()

在这里插入图片描述

Obviously there is a relation between ‘Pclass’ and ‘Fare’ for the more you paid the higher class you get on the Titanic.

After finding the highly related features, we could use a histogram to indicate the relation between ‘Pclass’ and ‘Age’.

g = sns.FacetGrid(titanic, col="Pclass")
g = g.map(plt.hist, "Age")

在这里插入图片描述

We could see ‘Pclass’=3 is significantly younger tha the other two class.

g = sns.FacetGrid(titanic, col="Pclass")
g = g.map(plt.hist, "Survived")

在这里插入图片描述

From the histogram we could see that poor people in the ‘Pclass’=3, a huge amout of them didn’t survive from the disaster.

g = sns.FacetGrid(titanic, col="Sex")
g = g.map(plt.hist, "Survived")

在这里插入图片描述

Just like the movie, chivalry like ‘female and children first’ caused the female survived number is higher than male.

NAN Analysis

def missing_value (df):
    missing_Number = df.isnull().sum().sort_values(ascending=False)[df.isnull().sum().sort_values(ascending=False) !=0]
    missing_percent=round((df.isnull().sum()/df.isnull().count())*100,2)[round((df.isnull().sum()/df.isnull().count())*100,2) !=0]
    missing = pd.concat([missing_Number,missing_percent],axis=1,keys=['Missing Number','Missing Percentage'])
    return missing

missing_values = titanic.isnull().sum()
missing_values = missing_values[missing_values > 0]
missing_values.sort_values(inplace=True)
missing_values.plot.pie(explode=[0.1, 0.1, 0.1], autopct='%1.2f%%')
plt.title('Missing Values')

Text(0.5, 1.0, 'Missing Values')

在这里插入图片描述

sns.heatmap(titanic.isnull(),cmap='cool');

在这里插入图片描述

Data cleaning

Since there’s missing data, we need to drop some rows that included NAN and the whole column ‘Cabin’

titanic = titanic.drop(['Cabin'],axis=1)

titanic['Age'] = titanic['Age'].fillna(titanic['Age'].mean())

titanic[titanic['Embarked'].isnull()]

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Embarked
61	62	1	1	Icard, Miss. Amelie	female	38.0	0	0	113572	80.0	NaN
829	830	1	1	Stone, Mrs. George Nelson (Martha Evelyn)	female	62.0	0	0	113572	80.0	NaN

titanic['Embarked'] = titanic['Embarked'].fillna(method='bfill')

titanic = titanic.drop(['Name','Ticket'],axis=1)

titanic = titanic.drop(['PassengerId'],axis=1)

titanic.head()

	Survived	Pclass	Sex	Age	SibSp	Fare	Embarked
0	0	3	male	22.0	1	7.2500	S
1	1	1	female	38.0	1	71.2833	C
2	1	3	female	26.0	0	7.9250	S
3	1	1	female	35.0	1	53.1000	S
4	0	3	male	35.0	0	8.0500	S

One hot encoding

Since the columns ‘Sex’, ‘Embarked’ are discrete feature, we use the one hot encoding turning them into dummy variables.

titanic = pd.get_dummies(titanic,columns=['Sex','Embarked'],drop_first=True)
titanic.head()

	Survived	Pclass	Age	SibSp	Fare	Sex_male	Embarked_S
0	0	3	22.0	1	7.2500	1	1
1	1	1	38.0	1	71.2833	0	0
2	1	3	26.0	0	7.9250	0	1
3	1	1	35.0	1	53.1000	0	1
4	0	3	35.0	0	8.0500	1	1

Train test split

The Titanic dataset didn’t separate the dataset into two half. The ‘test_data’ is a column vector which need us to predict. So we should use a random shuffle to split the train data and the test data

X = titanic.drop(['Survived'],axis=1)
y = titanic['Survived']

from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=21)

Standardization

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
X_train_plot=X_train
X_test_plot=X_test

X_train = pd.DataFrame(X_train, columns=X.columns)
X_test = pd.DataFrame(X_test, columns=X.columns)

X_train.head()

	Pclass	Age	SibSp	Parch	Fare	Sex_male	Embarked_Q	Embarked_S
0	-1.584396	0.010681	-0.479698	-0.460682	-0.018600	0.728823	-0.311564	-1.611198
1	-1.584396	-0.119643	-0.479698	-0.460682	0.079245	0.728823	-0.311564	0.620656
2	-1.584396	-0.503148	-0.479698	0.810657	0.646624	0.728823	-0.311564	-1.611198
3	-0.381742	-1.193456	0.493365	-0.460682	-0.031329	-1.372075	-0.311564	-1.611198
4	0.820913	0.033758	-0.479698	-0.460682	-0.479818	0.728823	-0.311564	0.620656

X_test.head()

	Pclass	Age	SibSp	Parch	Fare	Sex_male	Embarked_Q	Embarked_S
0	0.820913	-0.273045	0.493365	-0.460682	-0.315867	-1.372075	-0.311564	0.620656
1	0.820913	-0.809952	-0.479698	-0.460682	-0.485419	0.728823	-0.311564	0.620656
2	0.820913	-0.733251	-0.479698	-0.460682	-0.467343	0.728823	-0.311564	0.620656
3	0.820913	0.010681	-0.479698	-0.460682	0.506858	0.728823	-0.311564	0.620656
4	-0.381742	0.493964	0.493365	2.081997	-0.078596	0.728823	-0.311564	0.620656

After standardizing the data, the data is suitable for our models. We finally could train our model.

SVM

from sklearn.svm import SVC
svc = SVC()
svc.fit(X_train, y_train)
Y_pred = svc.predict(X_test)

fig, axes = plt.subplots(2, 2, figsize=(10, 10))
plt.subplot(2,2,1)
plt.scatter(X_train['Pclass'], X_train['Age'], c=y_train, cmap='viridis',label='Train')
plt.scatter(X_test['Pclass'], X_test['Age'], c=Y_pred, cmap='viridis', marker='x',label='Test')
plt.xlabel('Pclass')
plt.ylabel('Age')
plt.title('SVM Classification')

plt.subplot(2,2,2)
plt.scatter(X_train['Embarked_Q'], X_train['Parch'], c=y_train, cmap='viridis',label='Train')
plt.scatter(X_test['Embarked_Q'], X_test['Parch'], c=Y_pred, cmap='viridis', marker='x',label='Test')
plt.xlabel('Embarked_Q')
plt.ylabel('Parch')
plt.title('SVM Classification')


plt.subplot(2,2,3)
plt.scatter(X_train['Fare'], X_train['SibSp'], c=y_train, cmap='viridis',label='Train')
plt.scatter(X_test['Fare'], X_test['SibSp'], c=Y_pred, cmap='viridis', marker='x',label='Test')
plt.xlabel('Fare')
plt.ylabel('Sibsp')
plt.title('SVM Classification')


plt.subplot(2,2,4)
plt.scatter(X_train['Sex_male'], X_train['Embarked_S'], c=y_train, cmap='viridis',label='Train')
plt.scatter(X_test['Sex_male'], X_test['Embarked_S'], c=Y_pred, cmap='viridis', marker='x',label='Test')
plt.xlabel('Sex_male')
plt.ylabel('Embarked_S')
plt.title('SVM Classification')
plt.legend()
plt.show()

在这里插入图片描述

The result that SVM classification indicated upwards is not ideal. Due to it is hard to find a hyperplane which separating the dataset in two correctly. So we need PCA to dimensionally reducing the dataset as the processed one is more suitable for either SVM or visualization.

PCA

from sklearn.decomposition import PCA
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
from mlxtend.plotting import plot_decision_regions

X_train_reduced = PCA(n_components = 2).fit_transform(X_train)
X_test_reduced  = PCA(n_components=  2).fit_transform(X_test)
svc = SVC()
svc.fit(X_train_reduced, y_train)
Y_pred = svc.predict(X_test_reduced)
t = np.array(y_train)
t = t.astype(np.integer)
plt.figure(figsize = [15,10])
plot_decision_regions(X_train_reduced, t, clf = svc, hide_spines = False, colors = 'purple,limegreen',markers = ['^','v'])

<AxesSubplot:>

在这里插入图片描述

from sklearn.model_selection import cross_val_score
scores = cross_val_score(svc,X_train_reduced,y_train,cv=5)
scores.mean()

0.7204865556978233

Finding the best parameter of different kernels

from sklearn import svm
from sklearn.model_selection import cross_val_score
import matplotlib.pyplot as plt

score=[]
gams = []
maxscore = 0
maxgam = 0
for gam in range(1,11):
    clf = svm.SVC(kernel='sigmoid', gamma=gam/10)
    scores = cross_val_score(clf,X_train_reduced,y_train,cv=5)
    score.append(scores.mean())
    if scores.mean()>maxscore:
        maxscore=scores.mean()
        maxgam=gam
    gams.append(gam/10)
    plt.plot(gams,score)
    plt.title('kernel=sigmoid')
    plt.xlabel('gamma')
    plt.ylabel('score')
print('kernel=sigmoid, best gamma',maxgam/10)

kernel=sigmoid, best gamma 0.1

在这里插入图片描述

score=[]
gams = []
maxscore = 0
maxgam = 0
maxpoly = 0
for gam in range(1,11):
    clf = svm.SVC(kernel='poly', gamma=gam/10)
    scores = cross_val_score(clf,X_train_reduced,y_train,cv=5)
    score.append(scores.mean())
    if scores.mean()>maxscore:
        maxscore=scores.mean()
        maxgam=gam
    gams.append(gam/10)
    plt.plot(gams,score)
    plt.title('kernel=poly')
    plt.xlabel('gamma')
    plt.ylabel('score')
print('kernel=poly, best gamma',maxgam/10)

kernel=poly, best gamma 0.2

在这里插入图片描述

score=[]
gams = []
maxscore = 0
maxgam = 0
for gam in range(1,11):
    clf = svm.SVC(kernel='rbf', gamma=gam/10)
    scores = cross_val_score(clf,X_train_reduced,y_train,cv=5)
    score.append(scores.mean())
    if scores.mean()>maxscore:
        maxscore=scores.mean()
        maxgam=gam
    gams.append(gam/10)
    plt.plot(gams,score)
    plt.title('kernel=rbf')
    plt.xlabel('gamma')
    plt.ylabel('score')
print('kernel=rbf, best gamma',maxgam/10)

kernel=rbf, best gamma 1.0

在这里插入图片描述

Comparison of different kernel

from mlxtend.plotting import plot_decision_regions
C=1.0
maxscore_among_kernels=0
bestkernel=''
models = (svm.SVC(kernel='linear', C=C),
          svm.SVC(kernel='rbf', gamma=1.0, C=C),
          svm.SVC(kernel='poly', degree=3, gamma=0.2, C=C),
          svm.SVC(kernel='sigmoid',gamma=0.1,C=C)
         )
titles = ('SVC with linear kernel',
          'SVC with RBF kernel',
          'SVC with polynomial (degree 3) kernel',
          'SVC with sigmoid kernel')
models = (clf.fit(X_train_reduced, y_train) for clf in models)

t = np.array(y_train)
t = t.astype(np.integer)

for clf,title in zip(models,titles):
    clf.fit(X_train_reduced,t)
    plt.figure(figsize = [15,10])
    plot_decision_regions(X_train_reduced, t, clf = clf, hide_spines = False, colors = 'purple,limegreen',markers = ['^','v'])
    plt.title(title)
    scores = cross_val_score(clf,X_train_reduced,y_train,cv=5)
    print(title,clf.score(X_train_reduced,y_train))
    if scores.mean()>maxscore_among_kernels:
        maxscore_among_kernels=scores.mean()
        bestkernel=title
    print("%s mean ACC:%f"%(title,scores.mean()))
print('best kernel is:', bestkernel)

SVC with linear kernel 0.675561797752809
SVC with linear kernel mean ACC:0.678351
SVC with RBF kernel 0.7598314606741573
SVC with RBF kernel mean ACC:0.737349
SVC with polynomial (degree 3) kernel 0.6573033707865169
SVC with polynomial (degree 3) kernel mean ACC:0.653068
SVC with sigmoid kernel 0.6334269662921348
SVC with sigmoid kernel mean ACC:0.651650
best kernel is: SVC with RBF kernel

在这里插入图片描述

C=1.0
svc = SVC(kernel='rbf', gamma=1.0, C=C)
svc.fit(X_train_reduced, y_train)
y_pred_svc = svc.predict(X_test_reduced)
y_pred_svc

array([1, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1,
       0, 0, 0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0,
       0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0,
       1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0,
       0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0,
       0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0,
       0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0,
       1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 1,
       0, 0, 0], dtype=int64)

Naive Bayes Classifier

In this section we could use more distribution function for the naive Bayes classifier, but we encounter that MultinomialNB cannot receive a less than zero input. So in this section we will only be using the Gaussian and Bernoulli distribution.

from sklearn.naive_bayes import GaussianNB, BernoulliNB, MultinomialNB
gnb = GaussianNB()
gnb.fit(X_train, y_train)
y_pred_gnb = gnb.predict(X_test)
bnb = BernoulliNB()
bnb.fit(X_train, y_train)
y_pred_bnb = gnb.predict(X_test)
y_pred_gnb,y_pred_bnb

(array([1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1,
        0, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0,
        0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0,
        1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 0, 1, 1,
        1, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 0, 0,
        0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0,
        0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0,
        0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1,
        1, 0, 0], dtype=int64),
 array([1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1,
        0, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0,
        0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0,
        1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 0, 1, 1,
        1, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 0, 0,
        0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0,
        0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0,
        0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1,
        1, 0, 0], dtype=int64))

Random Forest Classifier

from sklearn.ensemble import RandomForestClassifier
random_forest = RandomForestClassifier(n_estimators=100)
random_forest.fit(X_train, y_train)
y_pred_rf = random_forest.predict(X_test)
random_forest.score(X_train, y_train)

random_forest_train = round(random_forest.score(X_train, y_train) * 100, 2)
random_forest_accuracy = round(accuracy_score(y_pred_rf, y_test) * 100, 2)
score = cross_val_score(random_forest,X_train,y_train,cv=5)
print(score.mean())
print("Training Accuracy     :",random_forest_train)
print("Model Accuracy Score  :",random_forest_accuracy)

0.7893725992317541
Training Accuracy     : 98.6
Model Accuracy Score  : 82.12

from sklearn import tree
import graphviz

dot_data = tree.export_graphviz(random_forest.estimators_[-1], out_file=None, feature_names=X_train.columns, class_names=['0','1'],filled=True,rounded=True,special_characters=True)
graph = graphviz.Source(dot_data)
graph.view()

'Source.gv.pdf'

For it is way too big to demonstrate the random forest in the .ipynb, we only screenshot part of it for overviewing. The function ‘graph.view’ would open the whole random forest as a pdf file.

在这里插入图片描述

Neron Network

from keras.models import Sequential
from keras.layers import Dense
from sklearn.model_selection import train_test_split

model = Sequential()
model.add(Dense(units=64, activation='relu', input_dim=X_train.shape[1]))
model.add(Dense(units=1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 dense (Dense)               (None, 64)                576       
                                                                 
 dense_1 (Dense)             (None, 1)                 65        
                                                                 
=================================================================
Total params: 641
Trainable params: 641
Non-trainable params: 0
_________________________________________________________________

model.fit(X_train, y_train, epochs=1000, batch_size=32, verbose=1)
loss, accuracy = model.evaluate(X_test, y_test)
print("Test Loss:", loss)
print("Test Accuracy:", accuracy)

Epoch 1000/1000
23/23 [==============================] - 0s 1ms/step - loss: 0.2903 - accuracy: 0.8778
6/6 [==============================] - 0s 1ms/step - loss: 0.4956 - accuracy: 0.8380
Test Loss: 0.49561816453933716
Test Accuracy: 0.8379888534545898

The neuron network’ s result is not suitable for the Titanic case since it’s output is not binary. We need to transform the result into binary.

y_pred_nn = model.predict(X_test)
y_pred_nn =binary_predictions = np.where(y_pred_bnb >= 0.5, 1, 0)
y_pred_nn

6/6 [==============================] - 0s 800us/step





array([1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1,
       0, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0,
       0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0,
       1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 0, 1, 1,
       1, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 0, 0,
       0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0,
       0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0,
       0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1,
       1, 0, 0])

Model evaluation

In this section we will be using method confusion matrix, accuracy, precision score, recall score, F1 score to evaluate the performance.

Confusion matrix

The confusion matrix is a tabular representation that provides a detailed breakdown of the performance of a classification model. It summarizes the predictions made by the model and compares them to the actual class labels of the dataset.

from sklearn.metrics import confusion_matrix
import seaborn as sns

# Compute the confusion matrix
cm_svm = confusion_matrix(y_test, y_pred_svc)
cm_bnb = confusion_matrix(y_test,y_pred_bnb)
cm_gnb = confusion_matrix(y_test,y_pred_gnb)
cm_rf = confusion_matrix(y_test, y_pred_rf)
cm_nn = confusion_matrix(y_test, y_pred_nn)

# plot confusion matrix
fig, axes = plt.subplots(5, 1, figsize=(12, 20))
plt.subplots_adjust(hspace=0.5, wspace=0.5)

plt.subplot(5,1,1)
sns.heatmap(cm_svm, annot=True, fmt='d', cmap='plasma')
plt.xlabel('Predicted')
plt.ylabel('True')
plt.title('Confusion Matrix_SVM_kernel=\'rbf\'')
plt.subplot(5,1,2)
sns.heatmap(cm_bnb, annot=True, fmt='d', cmap='plasma')
plt.xlabel('Predicted')
plt.ylabel('True')
plt.title('Confusion Matrix_Naive Bayes_Bernoulli')
plt.subplot(5,1,3)
sns.heatmap(cm_gnb, annot=True, fmt='d', cmap='plasma')
plt.xlabel('Predicted')
plt.ylabel('True')
plt.title('Confusion Matrix_Naive Bayes_Gaussian')
plt.subplot(5,1,4)
sns.heatmap(cm_rf, annot=True, fmt='d', cmap='plasma')
plt.xlabel('Predicted')
plt.ylabel('True')
plt.title('Confusion Matrix_Random Forest')
plt.subplot(5,1,5)
sns.heatmap(cm_rf, annot=True, fmt='d', cmap='plasma')
plt.xlabel('Predicted')
plt.ylabel('True')
plt.title('Confusion Matrix_Neron Network')

plt.show()

在这里插入图片描述

Accuracy measures the overall correctness of a classification model. It is the ratio of the correctly predicted samples to the total number of samples in the dataset. Precision measures the proportion of correctly predicted positive samples out of the total predicted positive samples. It focuses on the accuracy of positive predictions.Recall, also known as sensitivity or true positive rate, measures the proportion of correctly predicted positive samples out of the total actual positive samples.The F1 score is a harmonic mean of precision and recall. It provides a single metric that balances both precision and recall.

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Compute accuracy
accuracy_svm = accuracy_score(y_test, y_pred_svc)
accuracy_bnb = accuracy_score(y_test, y_pred_bnb)
accuracy_gnb = accuracy_score(y_test, y_pred_gnb)
accuracy_rf = accuracy_score(y_test, y_pred_rf)
accuracy_nn = accuracy_score(y_test, y_pred_nn)


# Compute precision
precision_svm = precision_score(y_test, y_pred_svc)
precision_bnb = precision_score(y_test, y_pred_bnb)
precision_gnb = precision_score(y_test, y_pred_gnb)
precision_rf = precision_score(y_test, y_pred_rf)
precision_nn = precision_score(y_test, y_pred_nn)

# Compute recall
recall_svm = recall_score(y_test, y_pred_svc)
recall_bnb = recall_score(y_test, y_pred_bnb)
recall_gnb = recall_score(y_test, y_pred_gnb)
recall_rf = recall_score(y_test, y_pred_rf)
recall_nn = recall_score(y_test, y_pred_nn)

# Compute F-score
fscore_svm = f1_score(y_test, y_pred_svc)
fscore_bnb = f1_score(y_test, y_pred_bnb)
fscore_gnb = f1_score(y_test, y_pred_gnb)
fscore_rf = f1_score(y_test, y_pred_rf)
fscore_nn = f1_score(y_test, y_pred_nn)

# Create a DataFrame
result = {
    'Accuracy': [accuracy_svm, accuracy_bnb, accuracy_gnb, accuracy_rf, accuracy_nn],
    'Precision': [precision_svm, precision_bnb, precision_gnb, precision_rf, precision_nn],
    'Recall': [recall_svm, recall_bnb, recall_gnb, recall_rf, recall_nn],
    'F1-score': [fscore_svm, fscore_bnb, fscore_gnb, fscore_rf, fscore_nn]
}
score_df = pd.DataFrame(result, index=['SVM', 'BNB', 'GNB', 'RF', 'NN'])
score_df

	Accuracy	Precision	Recall	F1-score
SVM	0.670391	0.631579	0.486486	0.549618
BNB	0.815642	0.805970	0.729730	0.765957
GNB	0.815642	0.805970	0.729730	0.765957
RF	0.821229	0.818182	0.729730	0.771429
NN	0.815642	0.805970	0.729730	0.765957

print("Best ACC performance model:", score_df['Accuracy'].idxmax())
print("Best Precision performance model:", score_df['Precision'].idxmax())
print("Best Recall performance model:", score_df['Recall'].idxmax())
print("Best F1 performance model:", score_df['F1-score'].idxmax())

Best ACC performance model: RF
Best Precision performance model: RF
Best Recall performance model: BNB
Best F1 performance model: RF

import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, plot_roc_curve

# Compute ROC curve for each model
fpr_svm, tpr_svm, _ = roc_curve(y_test, y_pred_svc)
fpr_bnb, tpr_bnb, _ = roc_curve(y_test, y_pred_bnb)
fpr_gnb, tpr_gnb, _ = roc_curve(y_test, y_pred_gnb)
fpr_rf, tpr_rf, _ = roc_curve(y_test, y_pred_rf)
fpr_nn, tpr_nn, _ = roc_curve(y_test, y_pred_nn)

# Plot ROC curve
plt.figure(figsize=(8, 6))
plt.plot(fpr_svm, tpr_svm, label='SVM')
plt.plot(fpr_bnb, tpr_bnb, label='BNB')
plt.plot(fpr_gnb, tpr_gnb, label='GNB')
plt.plot(fpr_rf, tpr_rf, label='RF')
plt.plot(fpr_nn, tpr_nn, label='NN')
plt.plot([0, 1], [0, 1], 'k--')  # Diagonal line
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) Curve')
plt.legend()
plt.show()

在这里插入图片描述

AUC is used for evaluating the performance of binary classification models based on the Receiver Operating Characteristic (ROC) curve. The ROC curve is created by plotting the true positive rate (TPR) against the false positive rate (FPR) at various classification thresholds. AUC represents the area under this curve and provides an aggregate measure of the model’s ability to discriminate between positive and negative classes. A higher AUC value indicates better overall performance, with a value of 1 representing a perfect classifier and a value of 0.5 indicating a random classifier.

from sklearn.metrics import roc_auc_score

# Compute AUC for each model
auc_svm = roc_auc_score(y_test, y_pred_svc)
auc_bnb = roc_auc_score(y_test, y_pred_bnb)
auc_gnb = roc_auc_score(y_test, y_pred_gnb)
auc_rf = roc_auc_score(y_test, y_pred_rf)
auc_nn = roc_auc_score(y_test, y_pred_nn)

# Create a DataFrame for AUC
auc = {
    'AUC': [auc_svm, auc_bnb, auc_gnb, auc_rf, auc_nn]
}

df_auc = pd.DataFrame(auc, index=['SVM', 'BNB', 'GNB', 'RF', 'NN'])

# Output the DataFrame
print(df_auc)
print("Best AUC performed model:",df_auc.idxmax())

          AUC
SVM  0.643243
BNB  0.802960
GNB  0.802960
RF   0.807722
NN   0.802960
Best AUC performed model: AUC    RF
dtype: object

Comprehensively, Random Forest Classifier has the best performance in the previous evaluation. For binary classification, Random Forest is an ensemble learning method that combines multiple decision trees. The model aggregates the predictions of individual trees and selects the majority vote or average prediction, resulting in a more robust and accurate final prediction.Random Forest can effectively capture nonlinear relationships between features and the target variable. Each decision tree in the ensemble is constructed based on different random subsets of features, allowing the model to learn diverse patterns and capture complex interactions between variables. This flexibility makes Random Forest well-suited for capturing complex decision boundaries and handling nonlinear relationships in the data.Random Forest is robust to outliers and irrelevant features. Outliers have limited impact on the overall model performance as each decision tree in the ensemble is built independently. Additionally, the random feature selection process ensures that irrelevant features have a diminished influence on the final predictions, resulting in a more focused and accurate model.