My First Machine Learning Project in Python Step-By-Step

You need to see how all of the pieces of a predictive modeling machine learning project actually fit together. In this lesson you will complete your first machine learning project using Python. In this step-by-step tutorial project you will:

  • Download and install Python SciPy and get the most useful packages for machine learning in Python.
  • Load a dataset and understand its structure using statistical summaries and data visualization.
  • Create 6 machine learning models, pick the best, and build confidence that the accuracy is reliable.

If you are a machine learning beginner looking to finally get started with Python, this tutorial was designed for you. Let’s get started!

1.1 The Hello World of Machine Learning

The best small project to start with on a new tool is the classification of iris flowers. This is a good dataset for your first project because it is so well understood.

  • Attributes are numeric so you have to figure out how to load and handle data.
  • It is a classification problem, allowing you to practice with an easier type of supervised learning algorithm.
  • It is a multiclass classification problem (multinomial), which may require some specialized handling.
  • It only has 4 attributes and 150 rows, meaning it is small and easily fits into memory (and a screen or single sheet of paper).
  • All of the numeric attributes are in the same units and the same scale, not requiring any special scaling or transforms to get started.

In this tutorial we are going to work through a small machine learning project end-to-end. Here is an overview of what we are going to cover:

  1. Loading the dataset.
  2. Summarizing the dataset.
  3. Visualizing the dataset.
  4. Evaluating some algorithms.
  5. Making some predictions.

Take your time and work through each step. Try to type in the commands yourself or copy-and-paste the commands to speed things up. Start your Python interactive environment and let’s get started with your hello world machine learning project in Python.

1.2 Load The Data

In this step we are going to load the libraries and the iris data CSV file from URL.

1.2.1 Import libraries

First, let’s import all of the modules, functions and objects we are going to use in this tutorial.

# Load libraries
from pandas import read_csv
from pandas.plotting import scatter_matrix
from matplotlib import pyplot
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

Everything should load without error. If you have an error, stop. You need a working SciPy environment before continuing. See the advice in Chapter 2 about setting up your environment.
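
If you want to confirm exactly which versions you are working with, an optional check like the following prints them; this can help explain small differences when comparing your results with others:

# Optional: check library versions
import sys
import scipy
import numpy
import pandas
import sklearn
print('Python: %s' % sys.version)
print('scipy: %s' % scipy.__version__)
print('numpy: %s' % numpy.__version__)
print('pandas: %s' % pandas.__version__)
print('sklearn: %s' % sklearn.__version__)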

1.2.2 Load Dataset

The iris dataset can be downloaded from the UCI Machine Learning repository. We are using Pandas to load the data. We will also use Pandas next to explore the data both with descriptive statistics and data visualization. Note that we are specifying the names of each column when loading the data. This will help later when we explore the data.

# Load dataset
filename = 'iris.data.csv'
names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'class']
dataset = read_csv(filename, names=names)
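
If you do not have iris.data.csv saved locally, read_csv can also load the file straight from a URL. As a sketch, assuming the dataset is still hosted at its usual UCI address and you have an internet connection:

# Alternative: load the dataset directly from the UCI repository (requires internet access)
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
dataset = read_csv(url, names=names)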

1.3 Summarize the Dataset

Now it is time to take a look at the data. In this step we are going to take a look at the data a few different ways:

  • Dimensions of the dataset.
  • Peek at the data itself.
  • Statistical summary of all attributes.
  • Breakdown of the data by the class variable.

Don’t worry, each look at the data is one command. These are useful commands that you can use again and again on future projects.

1.3.1 Dimensions of Dataset

We can get a quick idea of how many instances (rows) and how many attributes (columns) the data contains with the shape property.

# shape
print(dataset.shape)

You should see 150 instances and 5 attributes:

(150, 5)

1.3.2 Peek at the Data

It is also always a good idea to actually eyeball your data.

# head 
print(dataset.head(20))

You should see the first 20 rows of the data.
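
While you are eyeballing the rows, it can also be worth confirming that each column loaded with the type you expect. An optional one-line check:

# optional: confirm column types (the four measurements should be floats, class an object/string column)
print(dataset.dtypes)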

1.3.3 Statistical Summary

Now we can take a look at a summary of each attribute. This includes the count, mean, standard deviation, the min and max values, as well as some percentiles.

# descriptions
print(dataset.describe())

We can see that all of the numerical values have the same scale (centimeters) and similar ranges between 0 and 8 centimeters.

1.3.4 Class Distribution

Let's now take a look at the number of instances (rows) that belong to each class. We can view this as an absolute count.

# class distribution
print(dataset.groupby('class').size())

We can see that each class has the same number of instances (50 or 33% of the dataset).
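
If you prefer the class breakdown as a percentage rather than an absolute count, an optional one-line variation does it:

# optional: class distribution as a percentage
print(dataset['class'].value_counts(normalize=True) * 100)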

1.4 Data Visualization

We now have a basic idea about the data. We need to extend this with some visualizations. We are going to look at two types of plots:

  • Univariate plots to better understand each attribute.
  • Multivariate plots to better understand the relationships between attributes.

1.4.1 Univariate Plots

We will start with some univariate plots, that is, plots of each individual variable. Given that the input variables are numeric, we can create box and whisker plots of each.

# box and whisker plots
dataset.plot(kind='box', subplots=True, layout=(2,2), sharex=False, sharey=False)
pyplot.show()

We can also create a histogram of each input variable to get an idea of the distribution.

# histogram
dataset.hist()
pyplot.show()

It looks like perhaps two of the input variables have a Gaussian distribution. This is useful to note, as we can use algorithms that can exploit this assumption.
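
If you want a smoother view of those distributions than the histograms give, density plots are an optional alternative; pandas plots only the numeric columns, so the class column is skipped automatically:

# optional: density plots, a smoother alternative to histograms
dataset.plot(kind='density', subplots=True, layout=(2,2), sharex=False)
pyplot.show()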

1.4.2 Multivariate Plots

Now we can look at the interactions between the variables. Let's look at scatter plots of all pairs of attributes. This can be helpful to spot structured relationships between input variables.

# scatter plot matrix
scatter_matrix(dataset)
pyplot.show()

Note the diagonal grouping of some pairs of attributes. This suggests a high correlation and a predictable relationship. 
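
To put numbers on what those diagonal groupings suggest, you can also print the pairwise correlations of the numeric attributes. The numeric_only flag assumes a reasonably recent version of pandas:

# optional: correlation matrix of the numeric attributes (numeric_only assumes pandas >= 1.5)
print(dataset.corr(numeric_only=True))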

1.5 Evaluate Some Algorithms

Now it is time to create some models of the data and estimate their accuracy on unseen data. Here is what we are going to cover in this step:

  1. Separate out a validation dataset.
  2. Set up the test harness to use 10-fold cross validation.
  3. Build 6 different models to predict species from flower measurements.
  4. Select the best model.

1.5.1 Create a Validation Dataset

We need to know whether or not the model that we created is any good. Later, we will use statistical methods to estimate the accuracy of the models that we create on unseen data. We also want a more concrete estimate of the accuracy of the best model on unseen data by evaluating it on actual unseen data. That is, we are going to hold back some data that the algorithms will not get to see, and we will use this data to get a second and independent idea of how accurate the best model might actually be. We will split the loaded dataset into two parts: 80% of which we will use to train our models and 20% that we will hold back as a validation dataset.

# Split-out validation dataset
array = dataset.values
X = array[:,0:4]
Y = array[:,4]
validation_size = 0.20
seed = 7
X_train, X_validation, Y_train, Y_validation = train_test_split(X, Y, test_size=validation_size, random_state=seed)

You now have training data in X_train and Y_train for preparing models, and X_validation and Y_validation sets that we can use later.
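
As a quick sanity check on the split, you can print the shapes; with a 20% hold-out from 150 rows you should see 120 training rows and 30 validation rows:

# confirm the split sizes
print(X_train.shape, Y_train.shape)
print(X_validation.shape, Y_validation.shape)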

1.5.2 Test Harness

We will use 10-fold cross validation to estimate accuracy. This will split our dataset into 10 parts, train on 9 and test on 1, and repeat for all combinations of train-test splits. We are using the metric of accuracy to evaluate models. This is the number of correctly predicted instances divided by the total number of instances in the dataset, multiplied by 100 to give a percentage (e.g. 95% accurate). We will pass this metric via the scoring argument when we build and evaluate each model next.
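
As a minimal sketch of the harness in isolation (the loop in the next step repeats this for every model), evaluating a single model with 10-fold cross validation looks like this:

# sketch: 10-fold cross validation of one model on the training data
kfold = KFold(n_splits=10, random_state=seed, shuffle=True)
scores = cross_val_score(KNeighborsClassifier(), X_train, Y_train, cv=kfold, scoring='accuracy')
print('mean accuracy: %f (std: %f)' % (scores.mean(), scores.std()))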

1.5.3 Build Models

We don't know which algorithms would be good on this problem or what configurations to use. We get an idea from the plots that some of the classes are partially linearly separable in some dimensions, so we are expecting generally good results. Let's evaluate six different algorithms:

  • Logistic Regression (LR)
  • Linear Discriminant Analysis (LDA)
  • k-Nearest Neighbors (KNN)
  • Classification and Regression Trees (CART)
  • Gaussian Naive Bayes (NB)
  • Support Vector Machines (SVM)

This list is a good mixture of simple linear (LR and LDA) and nonlinear (KNN, CART, NB and SVM) algorithms. We reset the random number seed before each run to ensure that the evaluation of each algorithm is performed using exactly the same data splits. This ensures the results are directly comparable. Let's build and evaluate our six models:

# Spot-Check Algorithms
models = []
models.append(('LR', LogisticRegression()))
models.append(('LDA', LinearDiscriminantAnalysis()))
models.append(('KNN', KNeighborsClassifier()))
models.append(('CART', DecisionTreeClassifier()))
models.append(('NB', GaussianNB()))
models.append(('SVM', SVC()))
# evaluate each model in turn
results = []
names = []
for name, model in models:
    kfold = KFold(n_splits=10, random_state=seed, shuffle=True)
    cv_results = cross_val_score(model, X_train, Y_train, cv=kfold, scoring='accuracy')
    results.append(cv_results)
    names.append(name)
    msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
    print(msg)

1.5.4 Select The Best Model

We now have 6 models and accuracy estimations for each. We need to compare the models to each other and select the most accurate. Running the example above prints the mean and standard deviation of the estimated accuracy for each model.

We can see that it looks like KNN has the largest estimated accuracy score. We can also create a plot of the model evaluation results and compare the spread and the mean accuracy of each model. There is a population of accuracy measures for each algorithm because each algorithm was evaluated 10 times (10-fold cross validation).

# Compare Algorithms
fig = pyplot.figure()
fig.suptitle('Algorithm Comparison')
ax = fig.add_subplot(111)
pyplot.boxplot(results)
ax.set_xticklabels(names)
pyplot.show()

You can see that the box and whisker plots are squashed at the top of the range, with many samples achieving 100% accuracy.
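
If you would rather confirm the ranking numerically than read it off the plot, an optional snippet can sort the collected results by their mean accuracy:

# optional: rank the models by mean cross-validation accuracy
for name, cv in sorted(zip(names, results), key=lambda pair: pair[1].mean(), reverse=True):
    print('%s: %f' % (name, cv.mean()))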


1.6 Make Predictions

The KNN algorithm was the most accurate model that we tested. Now we want to get an idea of the accuracy of the model on our validation dataset. This will give us an independent final check on the accuracy of the best model. It is important to keep a validation set just in case you made a slip during training, such as overfitting to the training set or a data leak. Both will result in an overly optimistic result. We can run the KNN model directly on the validation set and summarize the results as a final accuracy score, a confusion matrix and a classification report.

# Make predictions on validation dataset
knn = KNeighborsClassifier()
knn.fit(X_train, Y_train)
predictions = knn.predict(X_validation)
print(accuracy_score(Y_validation, predictions))
print(confusion_matrix(Y_validation, predictions))
print(classification_report(Y_validation, predictions))

We can see that the accuracy is 0.9, or 90%. The confusion matrix provides an indication of the three errors made. Finally, the classification report provides a breakdown of each class by precision, recall, f1-score and support, showing excellent results (granted the validation dataset was small).
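
Having validated the model, you can also use it to classify a new measurement. A small illustrative example; the input values here are made-up measurements in centimeters, in the same column order used for training:

# illustrative: predict the class of a single new flower (hypothetical measurements in cm)
# column order: sepal-length, sepal-width, petal-length, petal-width
new_flower = [[5.0, 3.4, 1.5, 0.2]]
print(knn.predict(new_flower))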

1.7 Summary

In this lesson you discovered step-by-step how to complete your first machine learning project in Python. You discovered that completing a small end-to-end project from loading the data to making predictions is the best way to get familiar with the platform.
