I. Unary Linear Regression
1. Model and Formulas
Given a dataset $D=\{(x_1,y_1),(x_2,y_2),\cdots,(x_m,y_m)\}$, we try to build a model that, for any $x_i$, yields the corresponding $y_i$. Pictorially, this means fitting the sample points with a straight line; since the fitted model is linear, the method is called linear regression. Once the linear regression model is obtained, we can predict the $y$ corresponding to any given $x$.
Since the model is linear, we can model it as $f(x_i)=wx_i+b$; once we solve for $w$ and $b$, we have the linear regression model.
How do we judge whether the model we built is good, i.e., whether it meets the requirement of recovering the sample values correctly? A common choice is the mean squared error (MSE). The MSE measures how well the model fits the sample points: the better the fit, the smaller the MSE, and vice versa. So we want to minimize the model's MSE, i.e.
$$\left(w^*,b^*\right)=\underset{(w,b)}{\arg\min}\sum_{i=1}^m\left(f\left(x_i\right)-y_i\right)^2=\underset{(w,b)}{\arg\min}\sum_{i=1}^m\left(y_i-wx_i-b\right)^2$$
The function being minimized in linear regression is convex: it has no local optima, only a single global optimum.
One way to minimize it is to find the stationary point of this two-variable MSE function: set the partial derivatives of the MSE with respect to $w$ and $b$ to zero. This yields a closed-form solution for the optimal $w$ and $b$, i.e., the values that minimize the MSE.
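Concretely, with $E(w,b)=\sum_{i=1}^m\left(y_i-wx_i-b\right)^2$, setting both partial derivatives to zero gives the pair of equations

```latex
\frac{\partial E}{\partial w}=2\left(w\sum_{i=1}^m x_i^2-\sum_{i=1}^m\left(y_i-b\right)x_i\right)=0,\qquad
\frac{\partial E}{\partial b}=2\left(mb-\sum_{i=1}^m\left(y_i-wx_i\right)\right)=0
```

Solving these two equations simultaneously yields the closed-form expressions below.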
$$w=\frac{\sum_{i=1}^m y_i\left(x_i-\bar{x}\right)}{\sum_{i=1}^m x_i^2-\frac{1}{m}\left(\sum_{i=1}^m x_i\right)^2}$$
$$b=\frac{1}{m}\sum_{i=1}^m\left(y_i-wx_i\right)$$
where $\bar{x}=\frac{1}{m}\sum_{i=1}^m x_i$ is the mean of the $x_i$.
Python implementations:
2. Plain-List Implementation
from matplotlib import pyplot as plt

# Read the data files
filename = "regression.txt"
filename2 = "regression_test.txt"
data_x = []
data_y = []
test_x = []
test_y = []
with open(filename, 'r') as f:  # read the training data
    for line in f.readlines():
        data_x.append(float(line.split(',')[1]))
        data_y.append(float(line.split(',')[0]))
with open(filename2, 'r') as f:  # read the test data
    for line in f.readlines():
        test_x.append(float(line.split(',')[1]))
        test_y.append(float(line.split(',')[0]))
# Apply the closed-form formulas
sum_x = sum(data_x)
square_x = []
for x in data_x:
    square_x.append(x * x)
sum_squarex = sum(square_x)
mul = lambda x, y: x * y
data_x_2 = []
meanx = sum(data_x) / len(data_x)
for x in data_x:
    data_x_2.append(x - meanx)
multi = list(map(mul, data_y, data_x_2))
w = sum(multi) / (sum_squarex - sum_x * sum_x / len(data_x))
su = lambda x, y: y - w * x
sub = list(map(su, data_x, data_y))
b = sum(sub) / len(data_x)
func_y = [w * x + b for x in data_x]
print(func_y)
# Plot
fig = plt.figure(figsize=(20, 8), dpi=80)
ax1 = fig.add_subplot(111)
ax1.scatter(data_x, data_y, color='red', linewidth=5.0)
ax1.plot(data_x, func_y, '-', color='blue', linewidth=5.0)
plt.show()
3. NumPy Implementation
import numpy as np
from matplotlib import pyplot as plt
file1 = "regression.txt"
data_x = np.loadtxt(file1, delimiter=',',usecols=1)
data_y = np.loadtxt(file1, delimiter=',',usecols=0)
# Apply the closed-form formulas
data_x_square = data_x ** 2
data_x_sum_square = sum(data_x) ** 2
m = len(data_x)
num = sum(data_y * (data_x - data_x.mean()))  # numerator
den = sum(data_x_square) - data_x_sum_square / m  # denominator
w = num / den
b = sum(data_y - w * data_x) / m
# Plot
plt.plot(data_x, w * data_x + b, color='red')
plt.scatter(data_x, data_y)
plt.show()
4. NumPy Library-Function Implementation
import numpy as np  # uses NumPy's built-in polynomial-model constructor np.poly1d
from matplotlib import pyplot as plt

file = "regression.txt"
data_x = np.loadtxt(file, delimiter=',', usecols=1)
data_y = np.loadtxt(file, delimiter=',', usecols=0)
linear_model = np.poly1d(np.polyfit(data_x, data_y, 1))  # build the polynomial regression model; it is callable like a function
plt.scatter(data_x, data_y)
myline = np.linspace(-10, 10, 1000)  # 1000 evenly spaced points from -10 to 10
plt.plot(myline, linear_model(myline), color='red')
plt.show()
5. Batch Gradient Descent (BGD) Implementation
Reference video: Andrew Ng's machine learning course; the code follows a reference article.
The MSE cost function used in gradient descent is:
$$J\left(\Theta\right)=\frac{1}{2m}\sum_{i=1}^{m}\left(h_{\Theta}\left(x^{(i)}\right)-y^{(i)}\right)^2$$
The linear regression hypothesis is assumed to be:
$$h_{\Theta}\left(x^{(i)}\right)=\Theta_0+\Theta_1 x_1^{(i)}$$
To get the gradient of the cost function, take the partial derivatives of $J(\Theta)$ with respect to $\Theta_0$ and $\Theta_1$:
$$\frac{\partial J}{\partial \Theta_0}=\frac{1}{m}\sum_{i=1}^{m}\left(h_{\Theta}\left(x^{(i)}\right)-y^{(i)}\right)$$
$$\frac{\partial J}{\partial \Theta_1}=\frac{1}{m}\sum_{i=1}^{m}\left(h_{\Theta}\left(x^{(i)}\right)-y^{(i)}\right)x_1^{(i)}$$
Finally, rewrite the cost function and its gradient in vectorized form:
$$J\left(\Theta\right)=\frac{1}{2m}\left(X\Theta-\vec{y}\right)^T\left(X\Theta-\vec{y}\right)$$
$$\nabla J\left(\Theta\right)=\frac{1}{m}X^T\left(X\Theta-\vec{y}\right)$$
Code:
import numpy as np
from matplotlib import pyplot as plt

file = "regression.txt"
data_x = np.loadtxt(file, delimiter=',', usecols=1)
data_y = np.loadtxt(file, delimiter=',', usecols=0)
m = len(data_y)
xx = np.ones((m, 2))  # rewrite y = wx + b as y = b*x0 + w*x1, so the matrix has two columns; x0 is always 1, so column 0 is all ones
xx[:, 1] = data_x
yy = data_y.reshape(m, 1)  # y as a column vector
alpha = 0.01  # learning rate

def cost_function(theta, xx, yy):  # MSE cost function in vectorized form
    func = np.dot(xx, theta) - yy
    return (1 / (2 * m)) * np.dot(func.transpose(), func)

def gradient_function(theta, xx, yy):  # gradient of the cost function, also vectorized
    func = np.dot(xx, theta) - yy
    return np.dot(xx.transpose(), func) / m

def gradient_descent(alpha, xx, yy):
    theta = np.array([1, 1]).reshape(2, 1)  # initialize both w and b to 1, as a column vector
    grad = gradient_function(theta, xx, yy)
    while not np.all(np.abs(grad) <= 1e-6):  # stop once every entry of the gradient is below 1e-6
        theta = theta - alpha * grad  # gradient step
        grad = gradient_function(theta, xx, yy)  # recompute the gradient
    return theta

def draw(theta, x, y):
    w = theta[1]
    b = theta[0]
    plt.scatter(x, y)
    plt.scatter(x, w * x + b, color='red')
    plt.show()

theta = gradient_descent(alpha, xx, yy)
print('cost function:', cost_function(theta, xx, yy)[0][0])  # final cost value
draw(theta, data_x, data_y)
II. Multivariable Regression and Polynomial Regression
1. Model and Formulas
In practice, a given $y$ is usually influenced not by a single $x$ but by several variables, and all of these attributes must be considered together. The model can then no longer be the simple linear one; it becomes the more general multiple-regression model $f(x)=w_1x_1+w_2x_2+\cdots+w_nx_n$, i.e., multiple linear regression, and the optimum satisfies $\hat{w}^*=\underset{\hat{w}}{\arg\min}\left(y-X\hat{w}\right)^T\left(y-X\hat{w}\right)$.
Here $X$ is a matrix with $m$ rows and $d+1$ columns: the number of rows is the number of samples, and the number of columns minus one ($d$) is the number of attributes (the variables $x$); the extra column of ones absorbs the bias term.
The cost function is:
$$E_{\hat{w}}=\left(y-X\hat{w}\right)^T\left(y-X\hat{w}\right)$$
After some matrix algebra we obtain the optimal vector $\hat{w}^*$ of $d+1$ elements, i.e., the coefficient $w_i$ corresponding to each attribute $x_i$ (plus the bias):
$$\hat{w}^*=\left(X^TX\right)^{-1}X^Ty$$
The final multiple linear regression model is:
$$f\left(\hat{x}_i\right)=\hat{x}_i^T\left(X^TX\right)^{-1}X^Ty$$
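As a quick sketch (on made-up data, not the article's regression.txt), the closed-form solution above can be computed directly in NumPy; `np.linalg.pinv` is used here instead of an explicit inverse for numerical stability:

```python
import numpy as np

# Made-up sample data generated from y = 1 + 2*x1 + 3*x2
rng = np.random.default_rng(0)
X_raw = rng.random((50, 2))
y = 1 + 2 * X_raw[:, 0] + 3 * X_raw[:, 1]

# Build the m x (d+1) design matrix: a column of ones for the bias, then the attributes
X = np.hstack([np.ones((50, 1)), X_raw])

# Normal equation: w_hat = (X^T X)^{-1} X^T y
w_hat = np.linalg.pinv(X.T @ X) @ X.T @ y
print(w_hat)  # approximately [1, 2, 3]
```

Since the sketch's data are exactly linear, the recovered coefficients match the generating ones.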
Polynomial regression can be viewed as a special case of multiple regression. For example, $w_0+w_1x+w_2x^2$ can be treated as $w_0x_0+w_1x_1+w_2x_2$ (with $x_0=1$, $x_1=x$, $x_2=x^2$) and then solved as a multiple linear regression problem, giving $w^*=[w_0,w_1,w_2]$, which are exactly the coefficients of the original polynomial. We thus obtain the polynomial regression model, i.e., successfully solve the original polynomial regression problem.
Python implementations:
2. NumPy Implementation of Polynomial Regression
import numpy as np
from matplotlib import pyplot as plt

def _2score(y, fit_y):  # compute R^2
    y_mean = np.ones(len(y)) * y.mean()
    SST = sum(np.square(y - y_mean))
    SSE = sum(np.square(y - fit_y))
    R2 = 1 - SSE / SST
    return R2

def draw(x, y, fit_y):  # plot
    plt.figure(figsize=(20, 10), dpi=80)
    plt.scatter(x, y, color='blue', linewidths=10)
    plt.scatter(x, fit_y, color='red', linewidths=1)
    plt.show()

def polynomial_curve_fitting(x, y, dim):
    length = len(x)
    mat = np.ones((dim + 1, length))  # (number of attributes, number of samples)
    i = 0
    while i <= dim - 1:
        mat[i] = x ** (i + 1)
        i += 1
    mat = np.matrix(mat.transpose())
    w = np.dot(np.dot(np.dot(mat.transpose(), mat).I, mat.transpose()), y)
    fit_y = np.zeros(length)
    w = np.array(w)
    i = 0
    for xx in np.array(mat):  # iterate over the matrix row by row
        xx = np.squeeze(np.array(xx))  # squeeze fixes dimension mismatches in the dot product
        w = np.squeeze(np.array(w))
        fit_y[i] = np.dot(xx, w)
        i += 1
    return fit_y

def train(x, y):
    dim = 16  # highest degree
    fit_y = polynomial_curve_fitting(x, y, dim)  # fitted values from polynomial regression
    R2 = _2score(y, fit_y)  # compute R^2
    print(R2)
    draw(x, y, fit_y)  # plot

def test(x, y):
    dim = 16
    fit_y = polynomial_curve_fitting(x, y, dim)
    R2 = _2score(y, fit_y)
    print(R2)
    draw(x, y, fit_y)

file = "regression.txt"
file2 = "regression_test.txt"
data_x = np.loadtxt(file, delimiter=',', usecols=1)
data_y = np.loadtxt(file, delimiter=',', usecols=0)
test_x = np.loadtxt(file2, delimiter=',', usecols=1)
test_y = np.loadtxt(file2, delimiter=',', usecols=0)
train(data_x, data_y)
# test(test_x, test_y)
3. NumPy Library-Function Implementation of Polynomial Regression
import numpy as np
from sklearn.metrics import r2_score  # R^2 metric for the polynomial regression
from matplotlib import pyplot as plt

file = "regression.txt"
file2 = "regression_test.txt"
data_x = np.loadtxt(file, delimiter=',', usecols=1)
data_y = np.loadtxt(file, delimiter=',', usecols=0)
test_x = np.loadtxt(file2, delimiter=',', usecols=1)
test_y = np.loadtxt(file2, delimiter=',', usecols=0)
model16 = np.poly1d(np.polyfit(data_x, data_y, 16))  # degree-16 polynomial regression model, callable like a function
plt.scatter(data_x, data_y)
# plt.scatter(test_x, test_y)
myline = np.linspace(-10, 10, 1000)  # 1000 evenly spaced points from -10 to 10
print(r2_score(data_y, model16(data_x)))  # R^2 is in 0~1; the closer to 1, the better the fit
# print(r2_score(test_y, model16(test_x)))  # test
plt.plot(myline, model16(myline), color='red')
plt.show()
4. Solving for $w^*$ in Multiple Regression with sklearn Library Functions
import numpy as np
from sklearn import linear_model

file = "regression.txt"
data_x = np.loadtxt(file, delimiter=',', usecols=1)
data_y = np.loadtxt(file, delimiter=',', usecols=0)
dim = 6
length = len(data_x)
mat = np.ones((dim, length))
i = 0
while i <= dim - 1:
    mat[i] = data_x ** (i + 1)
    i += 1
mat = mat.transpose()
reg = linear_model.LinearRegression()  # create a linear regression object
reg.fit(mat, data_y)  # fit the regression object
w = reg.coef_  # reg.coef_ is w, the coefficient of each x_i
5. Batch Gradient Descent (BGD) Implementation
This failed... Gradient descent on this polynomial regression produces a very large cost and is slow and inefficient; driving the cost into a tolerable range would require many more iterations, and a short training run cannot meet that need. It would take a very long time to iterate to convergence, so on this sample set and this polynomial model, BGD is not viable.
import numpy as np
from matplotlib import pyplot as plt

file = "regression.txt"
data_x = np.loadtxt(file, delimiter=',', usecols=1)
data_y = np.loadtxt(file, delimiter=',', usecols=0)
m = len(data_x)  # number of samples
dim = 2
mat = np.ones((dim + 1, m))  # (number of attributes, number of samples)
i = 0
x = data_x
while i <= dim - 1:
    mat[i + 1] = x ** (i + 1)
    i += 1
xx = mat.transpose()
yy = data_y.reshape(m, 1)
theta = np.random.rand(dim + 1, 1)
alpha = 0.00000000002
for i in range(100000):
    grad = xx.T.dot(xx.dot(theta) - yy) / m
    theta = theta - alpha * grad

def draw(theta, x, y):
    w2 = theta[2]
    w1 = theta[1]
    w0 = theta[0]
    plt.scatter(x, y)
    plt.scatter(x, w2 * x ** 2 + w1 * x + w0, color='red')
    plt.show()

draw(theta, data_x, data_y)
6. Stochastic Gradient Descent (SGD) and Mini-Batch Gradient Descent (MBGD)
The difference between SGD and BGD: each BGD step computes the gradient from all the sample data, which makes BGD comparatively accurate (every iteration uses all the data) but very slow (each iteration touches far too much data).
SGD, by contrast, has two loops: the outer loop controls the number of epochs, and the inner loop randomly picks one sample from the whole dataset at each step, typically iterating as many times as there are samples. Although each update uses just one random sample, the sheer number of updates means the parameters still move toward the global optimum overall (the destination is clear even if the road is winding), and since each update touches very little data, SGD is much faster than BGD. Its obvious drawback is that, with only one random sample per update, it can easily end up stuck near a local optimum.
MBGD combines the strengths of SGD and BGD and trades off speed against accuracy: when randomly drawing data for an update, it takes not one sample as SGD does but several, i.e., batch size of them (still not many, typically 2~100), and updates on those. This both speeds things up and improves accuracy.
(In the figure: the red path is BGD, which moves fairly smoothly toward the global optimum; the pink path is SGD, which is more winding but, like BGD, eventually reaches the global optimum.)
The SGD update:
$$\frac{\partial J}{\partial \theta_j}=\left(\theta^Tx^{(i)}-y^{(i)}\right)x_j^{(i)}$$
$$\theta_j=\theta_j-\alpha\frac{\partial J}{\partial \theta_j}=\theta_j-\alpha\left(\theta^Tx^{(i)}-y^{(i)}\right)x_j^{(i)}$$
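The two-loop SGD update above can be sketched as follows (the synthetic data and variable names are mine, not the reference code's):

```python
import numpy as np

rng = np.random.default_rng(0)
m = 200
x = rng.uniform(-1, 1, m)
y = 3 * x + 2                         # made-up noiseless data from y = 3x + 2
X = np.column_stack([np.ones(m), x])  # column of ones for the bias term

theta = np.zeros(2)
alpha = 0.1
for epoch in range(50):               # outer loop: epochs
    for _ in range(m):                # inner loop: one random sample per update
        i = rng.integers(m)
        grad = (X[i] @ theta - y[i]) * X[i]  # (theta^T x^(i) - y^(i)) * x^(i)
        theta = theta - alpha * grad
print(theta)  # approximately [2, 3]
```

Because the data here are noiseless, the per-sample gradient vanishes at the optimum and SGD converges tightly; on noisy data it would hover around the optimum instead.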
The MBGD update:
$$\theta_j=\theta_j-\alpha\frac{1}{m}\sum_{i=1}^m\left(\theta^Tx^{(i)}-y^{(i)}\right)x_j^{(i)}$$
where the sum runs over the randomly drawn mini-batch, so $m$ here denotes the batch size.
Python implementation of MBGD:
import numpy as np
from numpy import random
from matplotlib import pyplot as plt

def learning_rate(t):  # learning-rate schedule
    return 1 / (t + 1000000000000000)  # without a very small learning rate the gradient blows up, even to nan

def draw(theta, x, y):
    plt.scatter(x, y)
    plt.scatter(x, theta[1] * x + theta[2] * x ** 2 + theta[3] * x ** 3 + theta[4] * x ** 4
                + theta[5] * x ** 5 + theta[6] * x ** 6 + theta[0], color='red')
    plt.show()

file = "regression.txt"
data_x = np.loadtxt(file, delimiter=',', usecols=1)
data_y = np.loadtxt(file, delimiter=',', usecols=0)
m = len(data_x)  # number of samples
dim = 6
mat = np.ones((dim + 1, m))  # (number of attributes, number of samples)
i = 0
x = data_x
while i <= dim - 1:
    mat[i + 1] = x ** (i + 1)
    i += 1
xx = mat.transpose()
yy = data_y.reshape(m, 1)  # y as a column vector
epochs = 20  # number of outer epochs
batch_size = 40
theta = np.random.rand(dim + 1, 1)  # random initial theta
for epoch in range(epochs):
    for i in range(m):  # inner loop runs m times
        random_index = random.randint(m, size=batch_size)  # draw batch_size random samples out of the m data points
        xi = xx[random_index]
        yi = yy[random_index]
        grad = np.dot(xi.T, np.dot(xi, theta) - yi) / batch_size  # mini-batch gradient
        theta = theta - learning_rate(epoch * m * 10000000000 + i * 10000000000) * grad  # update theta
draw(theta, data_x, data_y)
Possibly because of the data, with a degree-6 polynomial the gradient-descent fit is good in the middle, but the Runge phenomenon appears at both ends of the data, and adjusting the learning rate, epochs, and batch size does not help much.
Note: sklearn also ships a dedicated SGD module, SGDRegressor:
from sklearn.linear_model import SGDRegressor
import numpy as np
from matplotlib import pyplot as plt

def draw(theta, x, y):
    plt.scatter(x, y)
    plt.scatter(x, theta[1] * x + theta[2] * x ** 2 + theta[3] * x ** 3 + theta[4] * x ** 4
                + theta[5] * x ** 5 + theta[6] * x ** 6 + theta[0], color='red')
    plt.show()

file = "regression.txt"
data_x = np.loadtxt(file, delimiter=',', usecols=1)
data_y = np.loadtxt(file, delimiter=',', usecols=0)
m = len(data_x)  # number of samples
dim = 6
mat = np.ones((dim + 1, m))  # (number of attributes, number of samples)
i = 0
x = data_x
while i <= dim - 1:
    mat[i + 1] = x ** (i + 1)
    i += 1
xx = mat.transpose()
yy = data_y.reshape(m, 1)  # y as a column vector
s = SGDRegressor(max_iter=1000, tol=10, penalty=None, eta0=0.0000000001)
s.fit(xx, yy.ravel())
draw(s.coef_, data_x, data_y)
7. Regularized Linear Regression
A trained linear regression model often overfits when faced with new data: the fitted curve may match the training samples' distribution, yet its predictions on new sample points can be far off, especially because of the high-degree terms. In polynomial regression, the higher the degree of an attribute term, the more easily the model fits the training data, but correspondingly the more easily the Runge phenomenon appears: predictions swing wildly as the data change, so the model generalizes poorly to new data, i.e., it overfits. Since we do not want this, we penalize the high-degree terms so that their coefficients $\theta$ in the regression become as small as possible.
This is regularization, which can mitigate and improve the overfitting problem to some extent.
Concretely, when building the cost function we attach a fairly large factor to the high-degree coefficients. Since gradient descent minimizes the cost, a large factor in front of a coefficient penalizes it heavily, shrinking that high-degree coefficient as much as possible. For example, with $f\left(x\right)=\theta_0+\theta_1x+\theta_2x^2+\theta_3x^3$, we can write the cost function as
$$J\left(\Theta\right)=\frac{1}{2m}\left(\sum_{i=1}^{m}\left(h_{\theta}\left(x^{(i)}\right)-y^{(i)}\right)^2+1000\,\theta_2^2+10000\,\theta_3^2\right)$$
so the high-degree coefficients $\theta_2,\theta_3$ are driven down, which achieves the regularization effect.
The general form of regularized linear regression:
$$J\left(\Theta\right)=\frac{1}{2m}\left(\sum_{i=1}^{m}\left(h_{\theta}\left(x^{(i)}\right)-y^{(i)}\right)^2+\lambda\sum_{j=1}^{n}\theta_j^2\right)$$
Here $\lambda$ is called the regularization parameter. We generally do not regularize $\theta_0$. If $\lambda$ is too large, all the $\theta$ values except $\theta_0$ become too small; if $\lambda$ is too small, regularization has no effect; so a suitable value of $\lambda$ must be chosen.
The gradient-descent update then becomes:
$$\theta_j=\theta_j-\alpha\left[\frac{1}{m}\sum_{i=1}^m\left(h_{\theta}\left(x^{(i)}\right)-y^{(i)}\right)x_j^{(i)}+\frac{\lambda}{m}\theta_j\right]$$
Rearranging gives:
$$\theta_j=\theta_j\left(1-\alpha\frac{\lambda}{m}\right)-\alpha\frac{1}{m}\sum_{i=1}^m\left(h_{\theta}\left(x^{(i)}\right)-y^{(i)}\right)x_j^{(i)}$$
Remarkably, this differs from the original update only in the coefficient multiplying $\theta_j$; everything else is identical. The regularized update can therefore be read as first multiplying $\theta_j$ by a number slightly smaller than 1, then taking the usual gradient step.
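As a sketch of this regularized update (ridge regression by gradient descent) on made-up data; the synthetic data and names are mine, and $\theta_0$ is left unregularized as described above:

```python
import numpy as np

rng = np.random.default_rng(0)
m = 100
x = rng.uniform(-1, 1, m)
y = 1 + 2 * x                         # made-up data: theta_0 = 1, theta_1 = 2
X = np.column_stack([np.ones(m), x])  # column of ones for theta_0

theta = np.zeros(2)
alpha, lam = 0.1, 0.5
for _ in range(5000):
    grad = X.T @ (X @ theta - y) / m   # plain MSE gradient
    grad[1:] += (lam / m) * theta[1:]  # add (lambda/m)*theta_j for j >= 1 only
    theta = theta - alpha * grad       # equivalently: shrink theta_j slightly, then step
print(theta)
```

The penalty shrinks $\theta_1$ slightly below its unregularized value of 2, while the unpenalized $\theta_0$ stays near 1.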