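Regression differs from classification in that the target to be predicted is a continuous variable, such as a price or an amount of rainfall.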
Model Introduction
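To map results from the real numbers onto the interval (0, 1), the linear classifier introduced the logistic function. In linear regression the prediction target is itself a real-valued number, so the optimization objective is simpler: minimize the difference between the predicted values and the true values.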
Given a set of $m$ training feature vectors $X = \langle x^1, x^2, \cdots, x^m \rangle$ and their corresponding regression targets $y = \langle y^1, y^2, \cdots, y^m \rangle$, we want the linear regression model to minimize the least-squares prediction loss $L(w, b)$. The usual optimization objective of a linear regressor is therefore the one shown in Equation (13).
$$\underset{w, b}{\operatorname{argmin}}\, L(w, b)=\underset{w, b}{\operatorname{argmin}} \sum_{k=1}^{m}\left(f(w, x^{k}, b)-y^{k}\right)^{2} \qquad (13)$$
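As a minimal sketch of the objective in Equation (13), the NumPy snippet below evaluates the squared-error loss for a candidate pair $(w, b)$, assuming the linear form $f(w, x, b) = w^T x + b$. The function name `squared_error_loss` and the toy data are illustrative assumptions, not part of the original text.

```python
import numpy as np

def squared_error_loss(w, b, X, y):
    """Sum of squared differences between predictions X @ w + b and targets y."""
    predictions = X @ w + b
    residuals = predictions - y
    return float(np.sum(residuals ** 2))

# Toy data (made up for illustration): y = 2 * x + 1 exactly.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])

# Loss at the parameters that generated the data, and at a worse guess.
loss_true = squared_error_loss(np.array([2.0]), 1.0, X, y)
loss_off = squared_error_loss(np.array([1.0]), 0.0, X, y)
print(loss_true, loss_off)
```

The loss is zero at the generating parameters and grows as the candidate moves away from them, which is the quantity the argmin in Equation (13) searches over.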
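Likewise, to learn the parameters that determine the model, namely the coefficients $w$ and the intercept $b$, we can again use either an exact analytical solution or a fast approximate method, stochastic gradient descent (SGD).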
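The two approaches can be compared on a small example. The sketch below is an illustration under stated assumptions rather than the method used later in the text: the data are synthetic, the learning rate and epoch count are arbitrary choices, and the analytical solution is obtained with `np.linalg.lstsq` after appending a column of ones for the intercept.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (an assumption for illustration): y = 3*x1 - 2*x2 + 0.5 + small noise.
X = rng.normal(size=(200, 2))
true_w, true_b = np.array([3.0, -2.0]), 0.5
y = X @ true_w + true_b + 0.01 * rng.normal(size=200)

# Analytical solution: append a column of ones so the intercept is solved with w.
X_aug = np.hstack([X, np.ones((X.shape[0], 1))])
theta, *_ = np.linalg.lstsq(X_aug, y, rcond=None)
w_exact, b_exact = theta[:2], theta[2]

# Stochastic gradient descent: update on one randomly ordered sample at a time.
w_sgd, b_sgd = np.zeros(2), 0.0
learning_rate = 0.01  # arbitrary choice for this sketch
for epoch in range(50):
    for k in rng.permutation(len(y)):
        error = X[k] @ w_sgd + b_sgd - y[k]
        # Gradient of the single-sample squared error with respect to w and b.
        w_sgd -= learning_rate * 2 * error * X[k]
        b_sgd -= learning_rate * 2 * error

print("analytical:", w_exact, b_exact)
print("SGD:       ", w_sgd, b_sgd)
```

On data this small the analytical solution is cheap and exact. SGD only approximates it, but each update touches a single sample, which is why it is attractive for large datasets.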
Programming Practice
Predicting House Prices in the Boston Area
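# Import the Boston house-price data loader from sklearn.datasets.
# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2, so this requires an older version.
from sklearn.datasets import load_boston
# Load the house-price data and store it in the variable boston.
boston = load_boston()
# Print the dataset description.
print(boston.DESCR)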
Boston House Prices dataset
Notes
------
Data Set Characteristics:
:Number of Instances: 506
:Number of Attributes: 13 numeric/categorical predictive
:Median Value (attribute 14) is usually the target
:Attribute Information (in order):
- CRIM per capita crime rate by town
- ZN proportion of residential land zoned for lots over 25,000 sq.ft.
- INDUS proportion of non-retail business acres per town
- CHAS Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
- NOX nitric oxides concentration (parts per 10 million)
- RM average number of rooms per dwelling
- AGE proportion of owner-occupied units built prior to 1940
- DIS weighted distances to five Boston employment centres
- RAD index of accessibility to radial highways
- TAX full-value property-tax rate per $10,000
- PTRATIO pupil-teacher ratio by town
- B 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
- LSTAT % lower status of the population
- MEDV Median value of owner-occupied homes in $1000's
:Missing Attribute Values: None
:Creator: Harrison, D. and Rubinfeld, D.L.
This is a copy of UCI ML housing dataset.
http://archive.ics.uci.edu/ml/datasets/Housing
This dataset was taken from the StatLib library which is maintained at Carnegie Mellon University.
The Boston house-price data of Harrison, D. and Rubinfeld, D.L. 'Hedonic
prices and the demand for clean air', J. Environ. Economics & Management,
vol.5, 81-102, 1978. Used in Belsley, Kuh & Welsch, 'Regression diagnostics
...', Wiley, 1980. N.B. Various transformations are used in the table on
pages 244-261 of the latter.
The Boston house-price data has been used in many machine learning papers that address regression
problems.
**References**
- Belsley, Kuh & Welsch, 'Regression diagnostics: Identifying Influential Data and Sources of Collinearity', Wiley, 1980. 244-261.
- Quinlan,R. (1993). Combining Instance-Based and Model-Based Learning. In Proceedings on the Tenth International Conference of Machine Learnin