How to Work Through a Binary Classification Machine Learning Case Study Project

How do you work through a predictive modeling machine learning problem end-to-end? In this lesson you will work through a binary classification predictive modeling case study in Python, covering each step of the applied machine learning process. After completing this project, you will know:

  • How to work through a classification predictive modeling problem end-to-end.
  • How to use data transforms to improve model performance.
  • How to use algorithm tuning to improve model performance.
  • How to use ensemble methods and tuning of ensemble methods to improve model performance.

1.1 Problem Definition

The focus of this project will be the Sonar Mines vs Rocks dataset. The problem is to predict metal or rock objects from sonar return data. Each pattern is a set of 60 numbers in the range 0.0 to 1.0. Each number represents the energy within a particular frequency band, integrated over a certain period of time. The label associated with each record contains the letter R if the object is a rock and M if it is a mine (metal cylinder). The numbers in the labels are in increasing order of aspect angle, but they do not encode the angle directly.

1.2 Load the Dataset

Let’s start off by loading the libraries required for this project.

# Load libraries
import numpy as np
from matplotlib import pyplot
from pandas import read_csv
from pandas import set_option
from pandas.plotting import scatter_matrix
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import ExtraTreesClassifier

You can download the dataset from the UCI Machine Learning Repository website (https://goo.gl/NXoJfR) and save it in the local working directory with the filename sonar.all-data.csv.

# Load dataset
url = 'sonar.all-data.csv'
dataset = read_csv(url, header=None)

You can see that we are not specifying the names of the attributes this time. This is because, other than the class attribute (the last column), the variables do not have meaningful names. We also indicate that there is no header row, so that the file-loading code does not take the first record as the column names. Now that we have the dataset loaded, we can take a look at it.

1.3 Analyze Data

Let’s take a closer look at our loaded data.

1.3.1 Descriptive Statistics

We will start off by confirming the dimensions of the dataset, e.g. the number of rows and columns.

# shape
print(dataset.shape)

We have 208 instances to work with and can confirm the data has 61 attributes including the class attribute.

 Let’s also look at the data types of each attribute.

# types
set_option('display.max_rows',500)
print(dataset.dtypes)

We can see that all of the attributes are numeric (float) and that the class value has been read in as an object.

 Let’s now take a peek at the first 20 rows of the data.

# head
set_option('display.width',100)
print(dataset.head(20))

This does not show all of the columns, but we can see all of the data has the same scale. We can also see that the class attribute (60) has string values.

 Let’s summarize the distribution of each attribute.

# descriptions, change precision to 3 places
set_option('display.precision', 3)
print(dataset.describe())

 Let’s take a quick look at the breakdown of class values.

# class distribution
print(dataset.groupby(60).size())

 We can see that the classes are reasonably balanced between M (mines) and R (rocks).

1.3.2 Unimodal Data Visualizations

Let’s look at visualizations of individual attributes. It is often useful to look at your data using multiple different visualizations in order to spark ideas. Let’s look at histograms of each attribute to get a sense of the data distributions.

# histogram
dataset.hist(sharex=False,sharey=False,xlabelsize=1, ylabelsize=1)
pyplot.show()

We can see that there are a lot of Gaussian-like distributions and perhaps some exponential-like distributions for other attributes.

 Let’s take a look at the same perspective of the data using density plots.

# Visualize the dataset with Density Plots
# density
dataset.plot(kind='density',subplots=True, layout=(8,8), sharex=False, legend=False, fontsize=1)
pyplot.show()

This is useful; you can see that many of the attributes have a skewed distribution. A power transform such as the Box-Cox transform, which can correct for the skew in these distributions, might be useful.
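
If we wanted to experiment with that idea, a minimal optional sketch (not part of the original recipe) could use scikit-learn's PowerTransformer. The Yeo-Johnson method is assumed here because Box-Cox requires strictly positive inputs and some sonar attributes can be 0.0.

# Optional sketch: reduce skew with a power transform
from sklearn.preprocessing import PowerTransformer
power = PowerTransformer(method='yeo-johnson')  # also standardizes the output by default
X_raw = dataset.values[:, 0:60].astype(float)
X_power = power.fit_transform(X_raw)
print(X_power[:3, :5])  # peek at a few transformed values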

 It is always good to look at box and whisker plots of numeric attributes to get an idea of the spread of values.

# box and whisker plots
dataset.plot(kind='box', subplots=True, layout=(8,8), sharex=False, sharey=False, fontsize=1)
pyplot.show()

1.3.3 Multimodal Data Visualizations

Let’s visualize the correlations between the attributes.

# Visualize the correlations between attributes
# correlation matrix
fig = pyplot.figure()
ax = fig.add_subplot(111)
cax = ax.matshow(dataset.iloc[:, 0:60].corr(), vmin=-1, vmax=1, interpolation='none')
fig.colorbar(cax)
pyplot.show()

It looks like there is also some structure in the order of the attributes. The red around the diagonal suggests that attributes that are next to each other are generally more correlated with each other. The blue patches also suggest some moderate negative correlation between attributes that are further apart in the ordering. This makes sense if the order of the attributes refers to the angle of sensors for the sonar chirp.
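
As an optional complement to the plot, the strongest pairwise correlations can also be listed numerically. A minimal sketch, assuming the first 60 columns are the numeric attributes as loaded above:

# Optional sketch: list the most correlated attribute pairs
corr = dataset.iloc[:, 0:60].corr().abs()
# keep only the upper triangle to drop the diagonal and duplicate pairs
upper = corr.where(np.triu(np.ones(corr.shape), k=1).astype(bool))
print(upper.unstack().dropna().sort_values(ascending=False).head(10))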

1.4 Validation Dataset

It is a good idea to use a validation hold-out set. This is a sample of the data that we hold back from our analysis and modeling. We use it right at the end of our project to confirm the accuracy of our final model. It is a smoke test that we can use to see if we messed up and to give us confidence on our estimates of accuracy on unseen data. We will use 80% of the dataset for modeling and hold back 20% for validation.

# Split-out validation dataset
array = dataset.values
X = array[:,0:60].astype(float)
Y = array[:,60]
validation_size = 0.20
seed = 7
X_train,X_validation,Y_train,Y_validation = train_test_split(X, Y, test_size=validation_size,random_state=seed)

1.5 Evaluate Algorithms: Baseline

We don’t know what algorithms will do well on this dataset. Gut feel suggests distance-based algorithms like k-Nearest Neighbors and Support Vector Machines may do well. Let’s design our test harness. We will use 10-fold cross validation; the dataset is not too small and this is a good standard test harness configuration. We will evaluate algorithms using the accuracy metric. This is a gross metric that gives a quick idea of how correct a given model is, and it is especially useful on binary classification problems like this one.

# Prepare the Test Harness for Evaluating Algorithms
# Test options and evaluation metric
num_folds = 10
seed = 7
scoring = 'accuracy'

Let’s create a baseline of performance on this problem and spot-check a number of different algorithms. We will select a suite of different algorithms capable of working on this classification problem. The six algorithms selected include:

  • Linear Algorithms: Logistic Regression (LR) and Linear Discriminant Analysis (LDA).
  • Nonlinear Algorithms: Classification and Regression Trees (CART), Support Vector Machines (SVM), Gaussian Naive Bayes (NB) and k-Nearest Neighbors (KNN).

# Prepare Algorithms to Evaluate
models = []
models.append(('LR', LogisticRegression()))
models.append(('LDA', LinearDiscriminantAnalysis()))
models.append(('KNN', KNeighborsClassifier()))
models.append(('CART', DecisionTreeClassifier()))
models.append(('NB', GaussianNB()))
models.append(('SVM', SVC()))

The algorithms all use default tuning parameters. Let’s compare the algorithms. We will display the mean and standard deviation of accuracy for each algorithm as we calculate it and collect the results for use later.

# Evaluate Algorithms Using the Test Harness
results = []
names = []
for name, model in models:
    kfold = KFold(n_splits=num_folds, random_state=seed, shuffle=True)
    cv_results = cross_val_score(model, X_train, Y_train, cv=kfold, scoring=scoring)
    results.append(cv_results)
    names.append(name)
    msg = "%s: %f (%f)" % (name, cv_results.mean(),cv_results.std())
    print(msg)

Running the example prints the mean and standard deviation of accuracy for each algorithm. The results suggest that both Logistic Regression and k-Nearest Neighbors may be worth further study.

 These are just mean accuracy values. It is always wise to look at the distribution of accuracy values calculated across cross validation folds. We can do that graphically using box and whisker plots.

# Visualization of the Distribution of Algorithm Performance
# Compare Algorithms
fig = pyplot.figure()
fig.suptitle('Algorithm Comparison')
ax = fig.add_subplot(111)
pyplot.boxplot(results)
ax.set_xticklabels(names)
pyplot.show()

The results show a tight distribution for KNN which is encouraging, suggesting low variance. The poor results for SVM are surprising.

 It is possible that the varied distribution of the attributes is having an effect on the accuracy of algorithms such as SVM. In the next section we will repeat this spot-check with a standardized copy of the training dataset.

1.6 Evaluate Algorithms: Standardize Data

We suspect that the differing distributions of the raw data may be negatively impacting the skill of some of the algorithms. Let’s evaluate the same algorithms with a standardized copy of the dataset. This is where the data is transformed such that each attribute has a mean value of zero and a standard deviation of one. We also need to avoid data leakage when we transform the data. A good way to avoid leakage is to use pipelines that standardize the data and build the model for each fold in the cross validation test harness. That way we can get a fair estimation of how each model with standardized data might perform on unseen data.

# Standardize the dataset
pipelines = []
pipelines.append(('ScaledLR',Pipeline([('Scaler',StandardScaler()),('LR',LogisticRegression())])))
pipelines.append(('ScaledLDA',Pipeline([('Scaler',StandardScaler()),('LDA',LinearDiscriminantAnalysis())])))
pipelines.append(('ScaledKNN',Pipeline([('Scaler',StandardScaler()),('KNN',KNeighborsClassifier())])))
pipelines.append(('ScaledCART',Pipeline([('Scaler',StandardScaler()),('CART',DecisionTreeClassifier())])))
pipelines.append(('ScaledNB',Pipeline([('Scaler',StandardScaler()),('NB',GaussianNB())])))
pipelines.append(('ScaledSVM',Pipeline([('Scaler',StandardScaler()),('SVM',SVC())])))
results = []
names = []
for name,model in pipelines:
    kfold = KFold(n_splits=num_folds,random_state=seed,shuffle=True)
    cv_results = cross_val_score(model,X_train,Y_train,cv=kfold,scoring=scoring)
    results.append(cv_results)
    names.append(name)
    msg = "%s: %f (%f)" % (name,cv_results.mean(),cv_results.std())
    print(msg)

Running the example provides the accuracy for each of the scaled pipelines. We can see that KNN is still doing well, even better than before. We can also see that standardizing the data has lifted the skill of SVM to be the most accurate algorithm tested so far.

 Again, we should plot the distribution of the accuracy scores using box and whisker plots.

# Visualization of the Distribution of Algorithm Performance on the Scaled Dataset
# Compare Algorithms
fig = pyplot.figure()
fig.suptitle('Scaled Algorithm Comparison')
ax = fig.add_subplot(111)
pyplot.boxplot(results)
ax.set_xticklabels(names)
pyplot.show()

The results suggest digging deeper into the SVM and KNN algorithms. It is very likely that configurations beyond the defaults can yield even more accurate models.

1.7 Algorithm Tuning

In this section we investigate tuning the parameters for two algorithms that show promise from the spot-checking in the previous section: KNN and SVM.

1.7.1 Tuning KNN

We can start off by tuning the number of neighbors for KNN. The default number of neighbors is 5. Below we try all odd values of k from 1 to 21, covering the default value of 5. Each k value is evaluated using 10-fold cross validation on the standardized training dataset.

# Tune the KNN Algorithm on the Scaled Dataset
# Tune scaled KNN
scaler = StandardScaler().fit(X_train)
rescaledX = scaler.transform(X_train)
neighbors = [1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21]
param_grid = dict(n_neighbors=neighbors)
model = KNeighborsClassifier()
kfold = KFold(n_splits=num_folds,random_state=seed, shuffle=True)
grid = GridSearchCV(estimator=model,param_grid=param_grid, scoring=scoring, cv=kfold)
grid_result = grid.fit(rescaledX,Y_train)
print("Best: %f using %s" % (grid_result.best_score_,grid_result.best_params_))
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" %(mean, stdev, param))

We can print out the configuration that resulted in the highest accuracy as well as the accuracy for all values tried. Running the example, we see the results below.

Best: 0.836029 using {'n_neighbors': 1}
0.836029 (0.079487) with: {'n_neighbors': 1}
0.813603 (0.088021) with: {'n_neighbors': 3}
0.814338 (0.096870) with: {'n_neighbors': 5}
0.777574 (0.120387) with: {'n_neighbors': 7}
0.730147 (0.099376) with: {'n_neighbors': 9}
0.741544 (0.073970) with: {'n_neighbors': 11}
0.710662 (0.105829) with: {'n_neighbors': 13}
0.723162 (0.080983) with: {'n_neighbors': 15}
0.698897 (0.072669) with: {'n_neighbors': 17}
0.710662 (0.091337) with: {'n_neighbors': 19}
0.698897 (0.091195) with: {'n_neighbors': 21}

We can see that the optimal configuration is K=1. This is interesting as the algorithm will make predictions using the most similar instance in the training dataset alone.
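
Note that the grid search above standardizes the whole training set before cross validation. A minimal alternative sketch, keeping the scaling inside each fold with a Pipeline as in Section 1.6 (the scores may differ slightly from those above):

# Alternative sketch: tune KNN with scaling inside the cross validation folds
pipeline = Pipeline([('Scaler', StandardScaler()), ('KNN', KNeighborsClassifier())])
param_grid = dict(KNN__n_neighbors=neighbors)
kfold = KFold(n_splits=num_folds, random_state=seed, shuffle=True)
grid = GridSearchCV(estimator=pipeline, param_grid=param_grid, scoring=scoring, cv=kfold)
grid_result = grid.fit(X_train, Y_train)
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))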

1.7.2 Tuning SVM

We can tune two key parameters of the SVM algorithm: the value of C (how much to relax the margin) and the type of kernel. The default for SVM (the SVC class) is to use the Radial Basis Function (RBF) kernel with a C value of 1.0. Like with KNN, we will perform a grid search using 10-fold cross validation on a standardized copy of the training dataset. We will try a number of kernel types and a range of C values both below and above the default of 1.0.

# Tune scaled SVM
scaler = StandardScaler().fit(X_train)
rescaledX = scaler.transform(X_train)
c_values = [0.1, 0.3, 0.5, 0.7, 0.9, 1.0, 1.3, 1.5, 1.7, 2.0]
kernel_values = ['linear','poly','rbf', 'sigmoid']
param_grid = dict(C=c_values, kernel=kernel_values)
model = SVC()
kfold = KFold(n_splits=num_folds, random_state=seed,shuffle=True)
grid = GridSearchCV(estimator=model, param_grid=param_grid, scoring=scoring,cv=kfold)
grid_result = grid.fit(rescaledX,Y_train)
print("Best: %f using %s" %(grid_result.best_score_,grid_result.best_params_))
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means,stds,params):
    print("%f (%f) with: %r" %(mean, stdev,param))

Running the example prints out the best configuration and its accuracy, as well as the accuracies for all configuration combinations.

Best: 0.850000 using {'C': 1.7, 'kernel': 'rbf'}
0.748529 (0.069953) with: {'C': 0.1, 'kernel': 'linear'}
0.582721 (0.127062) with: {'C': 0.1, 'kernel': 'poly'}
0.601103 (0.184435) with: {'C': 0.1, 'kernel': 'rbf'}
0.712868 (0.116579) with: {'C': 0.1, 'kernel': 'sigmoid'}
0.754412 (0.082337) with: {'C': 0.3, 'kernel': 'linear'}
0.644118 (0.099873) with: {'C': 0.3, 'kernel': 'poly'}
0.742279 (0.081853) with: {'C': 0.3, 'kernel': 'rbf'}
0.748529 (0.069953) with: {'C': 0.3, 'kernel': 'sigmoid'}
0.765809 (0.070336) with: {'C': 0.5, 'kernel': 'linear'}
0.704779 (0.098225) with: {'C': 0.5, 'kernel': 'poly'}
0.784559 (0.068922) with: {'C': 0.5, 'kernel': 'rbf'}
0.760662 (0.065632) with: {'C': 0.5, 'kernel': 'sigmoid'}
0.759926 (0.083206) with: {'C': 0.7, 'kernel': 'linear'}
0.759559 (0.093807) with: {'C': 0.7, 'kernel': 'poly'}
0.814338 (0.059832) with: {'C': 0.7, 'kernel': 'rbf'}
0.761029 (0.079602) with: {'C': 0.7, 'kernel': 'sigmoid'}
0.765441 (0.066964) with: {'C': 0.9, 'kernel': 'linear'}
0.789706 (0.094189) with: {'C': 0.9, 'kernel': 'poly'}
0.808088 (0.062884) with: {'C': 0.9, 'kernel': 'rbf'}
0.760662 (0.079898) with: {'C': 0.9, 'kernel': 'sigmoid'}
0.771691 (0.062141) with: {'C': 1.0, 'kernel': 'linear'}
0.814338 (0.093230) with: {'C': 1.0, 'kernel': 'poly'}
0.825735 (0.072291) with: {'C': 1.0, 'kernel': 'rbf'}
0.754779 (0.085671) with: {'C': 1.0, 'kernel': 'sigmoid'}
0.772059 (0.076696) with: {'C': 1.3, 'kernel': 'linear'}
0.826103 (0.075722) with: {'C': 1.3, 'kernel': 'poly'}
0.838235 (0.092498) with: {'C': 1.3, 'kernel': 'rbf'}
0.761029 (0.075130) with: {'C': 1.3, 'kernel': 'sigmoid'}
0.772426 (0.085441) with: {'C': 1.5, 'kernel': 'linear'}
0.832353 (0.076911) with: {'C': 1.5, 'kernel': 'poly'}
0.844118 (0.089455) with: {'C': 1.5, 'kernel': 'rbf'}
0.730882 (0.080208) with: {'C': 1.5, 'kernel': 'sigmoid'}
0.778676 (0.085856) with: {'C': 1.7, 'kernel': 'linear'}
0.838971 (0.081826) with: {'C': 1.7, 'kernel': 'poly'}
0.850000 (0.089841) with: {'C': 1.7, 'kernel': 'rbf'}
0.718750 (0.087640) with: {'C': 1.7, 'kernel': 'sigmoid'}
0.778676 (0.085856) with: {'C': 2.0, 'kernel': 'linear'}
0.838971 (0.072879) with: {'C': 2.0, 'kernel': 'poly'}
0.850000 (0.081776) with: {'C': 2.0, 'kernel': 'rbf'}
0.730147 (0.056579) with: {'C': 2.0, 'kernel': 'sigmoid'}

We can see that the most accurate configuration was SVM with an RBF kernel and a C value of 1.7 (C=2.0 with the RBF kernel achieved the same mean accuracy). The accuracy of 85.00% is seemingly better than what KNN could achieve.

1.8 Ensemble Methods

Another way that we can improve the performance of algorithms on this problem is by using ensemble methods. In this section we will evaluate four different ensemble machine learning algorithms, two boosting and two bagging methods:

  • Boosting Methods: AdaBoost (AB) and Gradient Boosting (GBM).
  • Bagging Methods: Random Forests (RF) and Extra Trees (ET).

We will use the same test harness as before, 10-fold cross validation. No data standardization is used in this case because all four ensemble algorithms are based on decision trees that are less sensitive to data distributions.

# Evaluate Ensemble Algorithms
# ensembles
ensembles = []
ensembles.append(('AB',AdaBoostClassifier()))
ensembles.append(('GBM',GradientBoostingClassifier()))
ensembles.append(('RF', RandomForestClassifier()))
ensembles.append(('ET', ExtraTreesClassifier()))
results = []
names = []
for name,model in ensembles:
    kfold = KFold(n_splits=num_folds, random_state=seed, shuffle=True)
    cv_results = cross_val_score(model,X_train,Y_train,cv=kfold, scoring=scoring)
    results.append(cv_results)
    names.append(name)
    msg = "%s: %f (%f)" % (name, cv_results.mean(),cv_results.std())
    print(msg)

Running the example provides the following accuracy scores.

AB: 0.782721 (0.072445)
GBM: 0.808088 (0.110720)
RF: 0.820221 (0.074578)
ET: 0.861765 (0.065379)

We can see that all four ensemble methods provide accuracy scores from the high 70s to the mid 80s (%) with default configurations, with the bagging methods, and Extra Trees (ET) in particular, scoring highest. We can plot the distribution of accuracy scores across the cross validation folds.

# Visualize the Distribution of Ensemble Algorithm Performance
# Compare Algorithms
fig = pyplot.figure()
fig.suptitle('Ensemble Algorithm Comparison')
ax = fig.add_subplot(111)
pyplot.boxplot(results)
ax.set_xticklabels(names)
pyplot.show()

The results suggest ET may be worthy of further study, with a strong mean and a spread that skews up towards the high 90s (%) in accuracy.
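
The lesson objectives also mention tuning ensemble methods. A minimal sketch of what that might look like, tuning only the number of trees for Extra Trees with the same grid-search harness (the grid values are illustrative; other parameters such as max_features could be added to the grid):

# Sketch: tune the number of trees in the Extra Trees ensemble
param_grid = dict(n_estimators=[50, 100, 200, 300])
model = ExtraTreesClassifier()
kfold = KFold(n_splits=num_folds, random_state=seed, shuffle=True)
grid = GridSearchCV(estimator=model, param_grid=param_grid, scoring=scoring, cv=kfold)
grid_result = grid.fit(X_train, Y_train)
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))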

 

1.9 Finalize Model

The SVM showed the most promise as a low-complexity and stable model for this problem. In this section we will finalize the model by training it on the entire training dataset and making predictions on the hold-out validation dataset to confirm our findings. Part of the findings was that SVM performs better when the dataset is standardized so that all attributes have a mean value of zero and a standard deviation of one. We can calculate the standardization parameters from the entire training dataset and apply the same transform to the input attributes of the validation dataset.

# prepare the model
scaler = StandardScaler().fit(X_train)
rescaledX = scaler.transform(X_train)
model = SVC(C=1.7)  # tuned configuration from the grid search (RBF kernel is the default)
model.fit(rescaledX,Y_train)

# estimate accuracy on validation dataset
rescaledValidationX = scaler.transform(X_validation)
predictions = model.predict(rescaledValidationX)
print(accuracy_score(Y_validation,predictions))
print(confusion_matrix(Y_validation,predictions))
print(classification_report(Y_validation,predictions))
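
As an optional last step that is not part of the original code, the fitted scaler and model could be saved to disk so they can be reused without retraining. A minimal sketch using the standard-library pickle module (the filename is arbitrary):

# Optional sketch: persist the scaler and final model for later use
import pickle
with open('final_sonar_model.pkl', 'wb') as f:
    pickle.dump({'scaler': scaler, 'model': model}, f)
# Later: reload with pickle.load(open('final_sonar_model.pkl', 'rb')), apply
# scaler.transform() to new data, then call model.predict() on the result.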

 

1.10 Summary

In this chapter you worked through a classification predictive modeling machine learning problem from end-to-end using Python. Specifically, the steps covered were:

  • Problem Definition (Sonar return data).
  • Loading the Dataset.
  • Analyze Data (same scale but different distributions of data).
  • Evaluate Algorithms (KNN looked good).
  • Evaluate Algorithms with Standardization (KNN and SVM looked good).
  • Algorithm Tuning (K=1 for KNN was good, SVM with an RBF kernel and C=1.7 was best).
  • Ensemble Methods (Bagging and Boosting, not quite as good as SVM).
  • Finalize Model (use all training data and confirm using validation dataset).

Working through this case study showed you how the recipes for specific machine learning tasks can be pulled together into a complete project. Working through this case study is good practice at applied machine learning using Python.
