Difference between Classification and Regression
I. Abstract
As we all know, both regression and classification can be used for prediction problems. After reading the paper "You Only Look Once" (YOLO), in which object detection is framed as a regression problem instead of repurposing classifiers, I wrote this blog to help myself understand the difference.
II. Definition
1. Classification
From my perspective, a classification model takes input variables (attributes such as age or gender) and outputs a label or category. The main classification algorithms are as follows:
(1) Decision Tree Classification
In this algorithm, a classification model is created by building a decision tree in which every node is a test on an attribute and each branch leaving the node corresponds to a possible value of that attribute. Here is an example:
# Load the iris dataset; use only the first two features
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

dataset=load_iris()
X=dataset.data[:,:2]
y=dataset.target
# Generate a grid of test points covering the feature space
xZero,xFirst=X[:,0],X[:,1]
xMin,xMax=xZero.min()-1,xZero.max()+1
yMin,yMax=xFirst.min()-1,xFirst.max()+1
# Grid step size
h=0.2
xx,yy=np.meshgrid(np.arange(xMin,xMax,h),
                  np.arange(yMin,yMax,h))
# Create the classifier and train it
model=DecisionTreeClassifier()
model.fit(X,y)
# Create the figure
fig,ax=plt.subplots(figsize=(5,5))
plt.subplots_adjust(wspace=0.5,hspace=0.5)
# Predict on the grid and plot the decision regions
z=model.predict(np.c_[xx.ravel(),yy.ravel()])
z=z.reshape(xx.shape)
ax.contourf(xx,yy,z,cmap=plt.cm.coolwarm,alpha=0.8)
# Plot the training samples
ax.scatter(xZero,xFirst,c=y,cmap=plt.cm.coolwarm,s=10,edgecolors='k')
ax.set_xlim(xx.min(),xx.max())
ax.set_ylim(yy.min(),yy.max())
ax.set_xlabel('sepal length')
ax.set_ylabel('sepal width')
ax.set_title('DecisionTreeClassifier')
# The three colors in the figure correspond to the three species
print(dataset['target_names'])
plt.show()
The result is shown in Fig. 1.
(2) K-Nearest Neighbors
The K-nearest neighbors algorithm assumes that similar things exist in close proximity to each other: a data point is assigned to the group that appears most often among its k nearest neighbors. Here is an example of KNN:
# Generate the data and visualize it
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.neighbors import KNeighborsClassifier

# centers=5 means five clusters
data=make_blobs(n_samples=500,n_features=2,centers=5,
                cluster_std=1.0,random_state=3)
x,y=data
plt.scatter(x[:,0],x[:,1],s=80,c=y,cmap=plt.cm.spring,edgecolors='k')
# Create the classifier and fit it
clf=KNeighborsClassifier()
clf.fit(x,y)
xMin,xMax=x[:,0].min()-1,x[:,0].max()+1
yMin,yMax=x[:,1].min()-1,x[:,1].max()+1
xx,yy=np.meshgrid(np.arange(xMin,xMax,.02),np.arange(yMin,yMax,.02))
z=clf.predict(np.c_[xx.ravel(),yy.ravel()])
z=z.reshape(xx.shape)
plt.pcolormesh(xx,yy,z,cmap=plt.cm.Pastel1)
plt.scatter(x[:,0],x[:,1],s=80,c=y,cmap=plt.cm.spring,edgecolors='k')
plt.xlim(xx.min(),xx.max())
plt.ylim(yy.min(),yy.max())
plt.title("Classifier:KNN")
# Predict the class of a new point
plt.scatter(-5,5,marker="*",c='red',s=200)
res=clf.predict([[-5,5]])
plt.text(-5,5,str(res[0]))
plt.text(3.75,-13,"Score:{:.2f}".format(clf.score(x,y)))
plt.show()
The result is shown in Fig. 2.
2. Regression
Regression algorithms predict a continuous value based on the input variables. The main goal of regression problems is to estimate a mapping function from the input variables to the output variable. The main algorithms are as follows:
(1) Simple Linear Regression
With simple linear regression, you can estimate the relationship between one independent variable and one dependent variable using a straight line, given that both variables are quantitative. Here is an example:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn import linear_model
from sklearn.metrics import mean_squared_error,r2_score
# Data preparation
# load the dataset
diabeteX,diabeteY=load_diabetes(return_X_y=True)
# use one feature
diabeteX=diabeteX[:,np.newaxis,2]
# split the data into train/test sets
# from the first element up to -20, so the train size is n-20
diabeteX_train=diabeteX[:-20]
# from -20 to the end, so the test size is 20
diabeteX_test=diabeteX[-20:]
# split the labels into train/test sets
diabeteY_train=diabeteY[:-20]
diabeteY_test=diabeteY[-20:]
# Model
# create an ordinary linear regression model
reg=linear_model.LinearRegression()
# train the model using the train set
reg.fit(diabeteX_train,diabeteY_train)
# Prediction and scoring
# make predictions using the test set
diabeteY_prediction=reg.predict(diabeteX_test)
# print the parameters
print('coefficients:',reg.coef_)
print('mean squared error:{:.2f}'.format(mean_squared_error(
    diabeteY_test,diabeteY_prediction)))
print('coefficient of determination:{:.2f}'.format(r2_score(diabeteY_test,diabeteY_prediction)))
# plot the result
plt.scatter(diabeteX_test,diabeteY_test,color='black')
plt.plot(diabeteX_test,diabeteY_prediction,color='blue',linewidth=3)
plt.title('Ordinary linear regression')
plt.show()
The result is shown in Fig. 3.
(2) Multiple Linear Regression
An extension of simple linear regression, multiple regression can predict the values of a dependent variable based on the values of two or more independent variables.
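As a minimal sketch (my own extension of the diabetes example above, not part of the original post), the same dataset can be fit with all ten features instead of one, giving one coefficient per independent variable:
# A minimal sketch: multiple linear regression on all ten features
# of the diabetes dataset (extends the simple example above)
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

X,y=load_diabetes(return_X_y=True)
# keep the last 20 samples as the test set, as in the simple example
X_train,X_test=X[:-20],X[-20:]
y_train,y_test=y[:-20],y[-20:]
reg=LinearRegression()
reg.fit(X_train,y_train)
# one coefficient per independent variable
print('coefficients:',reg.coef_)
print('coefficient of determination:{:.2f}'.format(r2_score(y_test,reg.predict(X_test))))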
(3) Polynomial Regression
The main aim of polynomial regression is to model or find a nonlinear relationship between dependent and independent variables.
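As a minimal sketch (the quadratic data here is synthetic, made up purely for illustration), the input can be expanded with polynomial terms so that an ordinary linear model captures the nonlinear relationship:
# A minimal sketch: polynomial regression via a polynomial feature
# expansion (the quadratic data is synthetic, for illustration)
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

rng=np.random.RandomState(0)
x=np.sort(rng.uniform(-3,3,50))[:,np.newaxis]
y=0.5*x.ravel()**2-x.ravel()+rng.normal(0,0.5,50)
# expand x into [1, x, x^2] so a linear model can fit the curve
poly=PolynomialFeatures(degree=2)
xPoly=poly.fit_transform(x)
reg=LinearRegression().fit(xPoly,y)
print('coefficients:',reg.coef_,'intercept:',reg.intercept_)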
III. Conclusion
Perhaps the biggest difference between regression and classification is that classification predicts discrete class labels, while regression predicts continuous quantities. Of course, there is also overlap between them:
a regression model can predict a discrete value if it is in the form of an integer quantity; for example, we can use the sigmoid function to map the outputs into the range from zero to one and then convert them to different categories. A classification model can predict a continuous value if it is in the form of a class-label probability.
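Here is a minimal sketch of that overlap (my own illustration, with made-up scores): a sigmoid squashes continuous regression-style outputs into probabilities, and a threshold converts those probabilities into class labels.
# A minimal sketch of the regression-to-classification overlap
# (the scores below are made up purely for illustration)
import numpy as np

def sigmoid(z):
    return 1.0/(1.0+np.exp(-z))

scores=np.array([-2.0,-0.3,0.1,1.5])   # continuous outputs of a regression-style model
probs=sigmoid(scores)                  # mapped into the range (0, 1)
labels=(probs>=0.5).astype(int)        # thresholded into discrete categories
print(probs,labels)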