I. Ridge Regression
1. Principle
If we place no constraint on the cost function $J(\theta)=(X\theta-y)^T(X\theta-y)$, then differentiating with respect to $\theta$ yields the closed-form solution $\theta=(X^TX)^{-1}X^Ty$. A problem arises here: if the number of samples is smaller than the number of features, i.e. the row dimension $m$ of the training matrix $X$ is smaller than its column dimension $d$, then $X^TX$ may be singular, and $\theta$ cannot be solved for. To avoid this, we add a term $\lambda I$ to $X^TX$, and the final solution becomes $\theta=(X^TX+\lambda I)^{-1}X^Ty$.
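As a quick numeric illustration (on randomly generated data, not the abalone set used later): when $m<d$ the matrix $X^TX$ is rank-deficient and has no inverse, while $X^TX+\lambda I$ does.

```python
import numpy as np

# Tiny example: fewer samples (m=3) than features (d=5)
rng = np.random.default_rng(0)
m, d = 3, 5
X = rng.standard_normal((m, d))
y = rng.standard_normal(m)

XtX = X.T @ X
print(np.linalg.matrix_rank(XtX))   # at most m = 3 < d = 5, so XtX is singular

lam = 0.1
# the ridge solution exists despite the singular XtX
theta = np.linalg.inv(XtX + lam * np.eye(d)) @ X.T @ y
print(theta.shape)
```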
In essence, ridge regression is a coefficient-shrinkage method, i.e. it removes or weakens unimportant parameters. Recall that the cost function under $L_2$ regularization is

$J(\theta)=\frac{1}{2m}\sum_{i=1}^m(h_\theta(x_i)-y_i)^2+\frac{\lambda}{2m}\sum_{j=1}^d\theta_j^2$

or, in matrix form,

$J(\theta)=\frac{1}{2m}(X\theta-y)^T(X\theta-y)+\frac{\lambda}{2m}\theta^T\theta$

Differentiating $J(\theta)$ with respect to $\theta$ and setting the result to 0:

$\frac{\partial J(\theta)}{\partial\theta}=\frac{1}{m}X^T(X\theta-y)+\frac{\lambda}{m}\theta=0 \;\Rightarrow\; \theta=(X^TX+\lambda I)^{-1}X^Ty$
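As a quick sanity check on random data, the closed-form solution above does make this gradient vanish:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((30, 4))
y = rng.standard_normal(30)
m, d = X.shape
lam = 1.0

# closed-form ridge solution
theta = np.linalg.inv(X.T @ X + lam * np.eye(d)) @ X.T @ y
# gradient of J(theta) as derived above
grad = X.T @ (X @ theta - y) / m + lam * theta / m
print(np.allclose(grad, 0))   # True
```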
In other words, ridge regression is exactly $L_2$ regularization, and the $L_2$-regularized problem

$\min J(\theta)\;\Rightarrow\;\min\Big(\sum_{i=1}^m(h_\theta(x_i)-y_i)^2+\lambda\sum_{j=1}^d\theta_j^2\Big)$

is equivalent to the constrained problem

$\min\sum_{i=1}^m(h_\theta(x_i)-y_i)^2 \quad s.t.\;\sum_{j=1}^d\theta_j^2\le t$

That is, we minimize the squared error subject to the condition $\sum_{j=1}^d\theta_j^2\le t$. This constraint compresses the coefficients $\theta_j$, driving some of them toward 0, which achieves exactly the goal of shrinkage.
In fact, shrinkage and regularization are not at odds: the purpose of regularization is to prevent overfitting, and overfitting partly arises when there are too few samples and too many features, so the model treats unimportant idiosyncrasies of the data as patterns common to all of it. Shrinking coefficients is precisely how unimportant features are removed or weakened.
2. Practical application
Here we apply ridge regression to a dataset to see its shrinkage effect.
① Inspect and preprocess the data
First, inspect the data:
file = open('F:/MachineLearning/data/abalone.txt')
print(file.read())
which gives:
F,0.67,0.51,0.155,1.278,0.5605,0.3045,0.358,11
M,0.67,0.54,0.195,1.217,0.532,0.2735,0.3315,11
F,0.67,0.54,0.2,1.46,0.6435,0.328,0.4165,9
F,0.675,0.535,0.185,1.5575,0.7035,0.402,0.4,11
M,0.675,0.51,0.17,1.527,0.809,0.318,0.341,11
F,0.675,0.53,0.195,1.4985,0.62,0.375,0.425,9
M,0.685,0.55,0.19,1.885,0.89,0.41,0.4895,10
M,0.685,0.535,0.175,1.432,0.637,0.247,0.46,11
M,0.705,0.55,0.21,1.4385,0.655,0.3255,0.462,11
F,0.705,0.53,0.17,1.564,0.612,0.394,0.44,10
M,0.71,0.555,0.175,2.14,1.2455,0.3725,0.434,11
F,0.725,0.56,0.185,1.792,0.873,0.367,0.435,11
M,0.78,0.6,0.21,2.548,1.1945,0.5745,0.6745,11
I,0.235,0.13,0.075,0.1585,0.0685,0.037,0.0465,5
I,0.35,0.25,0.1,0.4015,0.1725,0.063,0.1255,7
I,0.36,0.25,0.115,0.465,0.21,0.1055,0.128,7
I,0.38,0.28,0.095,0.2885,0.165,0.0435,0.067,7
F,0.38,0.32,0.115,0.6475,0.323,0.1325,0.164,7
M,0.43,0.31,0.13,0.6485,0.2735,0.163,0.184,9
I,0.465,0.36,0.105,0.452,0.22,0.159,0.1035,9
I,0.47,0.355,0.12,0.4915,0.1765,0.1125,0.1325,9
F,0.485,0.365,0.15,0.9145,0.4145,0.199,0.273,7
M,0.495,0.375,0.155,0.976,0.45,0.2285,0.2475,9
I,0.5,0.395,0.145,0.7865,0.332,0.1815,0.2455,8
M,0.505,0.4,0.15,0.775,0.3445,0.157,0.185,7
I,0.51,0.375,0.15,0.8415,0.3845,0.156,0.255,10
M,0.51,0.38,0.135,0.681,0.3435,0.142,0.17,9
M,0.515,0.37,0.115,0.6145,0.3415,0.155,0.146,9
F,0.55,0.415,0.18,1.1655,0.502,0.301,0.311,9
F,0.575,0.42,0.19,1.764,0.914,0.377,0.4095,10
M,0.605,0.455,0.16,1.1215,0.533,0.273,0.271,10
M,0.615,0.505,0.165,1.167,0.4895,0.2955,0.345,10
M,0.615,0.475,0.15,1.0375,0.476,0.2325,0.283,9
...
This dataset records a series of measurements of abalone, such as length and thickness. These measurements surely correlate with age: the older the abalone, the larger its length, weight and so on, so we can use linear regression for a rough estimate. Three things to note about the data: 1> the first feature is the abalone's sex, where M denotes male, F female, and I infant; we replace F, I and M with 1, 0 and -1 respectively; 2> the last column is the abalone's age, which is what we want to predict, so it should be split off; 3> the attributes are on rather different scales across samples, so the data should be normalized. The preprocessing function is therefore:
import numpy as np

def prepared_data(path):
    file = open(path)
    fr_list = file.readlines()
    data_list = []
    for data in fr_list:
        data = data.strip()
        data = data.split(',')
        if data[0] == 'F':
            data[0] = 1
        elif data[0] == 'I':
            data[0] = 0
        else:
            data[0] = -1
        data = [float(i) for i in data]
        data_list.append(data)
    data_mat = np.array(data_list)
    X = data_mat[:, :-1]
    y = data_mat[:, -1]
    # standardize the features and center the target
    X = (X - np.mean(X, axis=0)) / np.std(X, axis=0)
    y = y - np.mean(y)
    return X, y
Process and inspect the data:
X,y = prepared_data('F:/MachineLearning/data/abalone.txt')
print(X)
print(y)
which gives:
[[-1.15198011 -0.57455813 -0.43214879 ... -0.60768536 -0.72621157
-0.63821689]
[-1.15198011 -1.44898585 -1.439929 ... -1.17090984 -1.20522124
-1.21298732]
[ 1.28068972 0.05003309 0.12213032 ... -0.4634999 -0.35668983
-0.20713907]
...
[-1.15198011 0.6329849 0.67640943 ... 0.74855917 0.97541324
0.49695471]
[ 1.28068972 0.84118198 0.77718745 ... 0.77334105 0.73362741
0.41073914]
[-1.15198011 1.54905203 1.48263359 ... 2.64099341 1.78744868
1.84048058]]
[ 5.06631554 -2.93368446 -0.93368446 ... -0.93368446 0.06631554
2.06631554]
③ Ridge regression code
The ridge regression code is fairly simple; the one point to note is that $X^TX+\lambda I$ is not guaranteed to be invertible (for example when $\lambda=0$), so we should check:
def ridge_regression(X, y, lambd):
    m, d = X.shape
    X_X = X.T @ X + lambd * np.eye(d)
    if np.linalg.det(X_X) == 0:
        print('The matrix is singular and cannot be inverted')
        return
    theta = np.linalg.inv(X_X) @ X.T @ y
    return theta
④ Run and inspect the shrinkage results
import matplotlib.pyplot as plt

theta = ridge_regression(X, y, 100)
print(theta)
theta_inx = np.argsort(-np.abs(theta))   # indices sorted by decreasing |theta|
print(theta_inx)
print(theta[theta_inx])
fig1, ax1 = plt.subplots(figsize=(9, 6))
ax1.plot(np.abs(theta[theta_inx]))
plt.show()
Here we set $\lambda$ to 100 and plot the absolute values of the parameters of the different attributes under this same $\lambda$.
We can see that the parameter of the last attribute is driven almost to 0. If we set $\lambda$ larger, say 1000:
now nearly all parameter magnitudes fall below 1; that much shrinkage is bound to introduce bias and cause underfitting. If instead we set it smaller, say 0.01:
there is still some shrinkage, but the effect is far less pronounced than before; or rather, several of the fitted parameters are simply small to begin with.
As the above shows, the choice of $\lambda$ is crucial for a ridge regression model. In practice one tries a range of $\lambda$ values and validates on a test set, choosing the $\lambda$ that gives the lowest test error.
II. LASSO
Above we saw shrinkage via ridge regression, obtained from $L_2$ regularization:

$\min\sum_{i=1}^m(h_\theta(x_i)-y_i)^2+\lambda\sum_{j=1}^d\theta_j^2$

What happens if we replace the $L_2$ regularizer with an $L_1$ one? The problem becomes:

$\min\sum_{i=1}^m(h_\theta(x_i)-y_i)^2+\lambda\sum_{j=1}^d|\theta_j|$

(here the factor $\frac{1}{2m}$, which does not affect the minimizer, is omitted)
If we depict the two cases graphically, we get the following picture. To make the difference between LASSO and ridge regression intuitive, suppose the model has only two parameters $\theta_1$ and $\theta_2$. First draw the contour lines of the squared-error term, i.e. the curves in $(\theta_1,\theta_2)$ space along which the squared error is constant, then draw the contour lines of the $L_1$ and $L_2$ regularizers. The minimum must occur where a squared-error contour meets a regularizer contour. With $L_2$ regularization the intersection usually lies off the coordinate axes; with $L_1$ regularization it usually lies on an axis.
To summarize: ridge regression compresses the parameters, but almost never drives any of them exactly to 0; LASSO, by contrast, readily yields sparse solutions, i.e. $\theta$ with some components exactly 0. Because of this property, LASSO can also be used for feature selection.
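In the special case of an orthonormal design ($X^TX=I$) both solutions have well-known closed forms, which makes the contrast concrete: ridge rescales every least-squares coefficient by $1/(1+\lambda)$, so none reaches 0, while LASSO soft-thresholds each one, zeroing any coefficient with magnitude at most $\lambda$. A minimal numeric sketch (the coefficient values are made up for illustration):

```python
import numpy as np

theta_ls = np.array([3.0, 0.5, -0.2])   # least-squares coefficients under an orthonormal design
lam = 0.6

ridge = theta_ls / (1 + lam)                                        # uniform shrink
lasso = np.sign(theta_ls) * np.maximum(np.abs(theta_ls) - lam, 0)   # soft-threshold

print(ridge)   # all three stay nonzero
print(lasso)   # the two small coefficients become exactly 0
```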
How do we obtain the LASSO solution? Because the absolute-value function $|\theta|$ is not differentiable at 0, a closed-form solution like ridge regression's is hard to obtain, so we use the proximal gradient method. Consider the objective $\min f(\theta)+\lambda|\theta|$. Suppose $f(x)$ is differentiable and $\nabla f(x)$ satisfies the L-Lipschitz condition, i.e. there exists a constant $L>0$ such that:

$|\nabla f(x_1)-\nabla f(x_2)|\le L|x_1-x_2|$

$\Rightarrow\;\frac{|\nabla f(x_1)-\nabla f(x_2)|}{|x_1-x_2|}\le L$

$\Rightarrow\;\lim_{x_1\to x_2}\frac{|\nabla f(x_1)-\nabla f(x_2)|}{|x_1-x_2|}=\nabla^2 f(x)\le L$
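For the least-squares term $f(\theta)=\frac{1}{2m}\|X\theta-y\|^2$, the gradient $\frac{1}{m}X^T(X\theta-y)$ satisfies this condition with $L=\lambda_{max}(X^TX)/m$, the largest eigenvalue scaled by $1/m$. A quick numeric check of the bound on random data:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((50, 4))
y = rng.standard_normal(50)
m = X.shape[0]

grad = lambda t: X.T @ (X @ t - y) / m
L = np.linalg.eigvalsh(X.T @ X).max() / m   # Lipschitz constant of the gradient

t1, t2 = rng.standard_normal(4), rng.standard_normal(4)
lhs = np.linalg.norm(grad(t1) - grad(t2))
rhs = L * np.linalg.norm(t1 - t2)
print(lhs <= rhs + 1e-12)   # True: the Lipschitz bound holds
```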
Expanding $f(x)$ at $x_k$ in a Taylor series:

$f(x)=f(x_k)+\nabla f(x_k)(x-x_k)+\frac{\nabla^2 f(x_k)}{2}(x-x_k)^2$

$\le f(x_k)+\nabla f(x_k)(x-x_k)+\frac{L}{2}(x-x_k)^2$

$=\frac{L}{2}x^2-Lx_kx+\frac{L}{2}x_k^2+\nabla f(x_k)x-\nabla f(x_k)x_k+f(x_k)$

$=\frac{L}{2}\Big(x-\big(x_k-\frac{1}{L}\nabla f(x_k)\big)\Big)^2+const$

Hence

$f(x)+\lambda|x|\le\frac{L}{2}\Big(x-\big(x_k-\frac{1}{L}\nabla f(x_k)\big)\Big)^2+\lambda|x|+const$

By minimizing the right-hand side we keep the left-hand side within a relatively small range, so minimizing the right approximately minimizes the left. The problem therefore becomes:

$x_{k+1}=\arg\min_x\;\frac{L}{2}\Big(x-\big(x_k-\frac{1}{L}\nabla f(x_k)\big)\Big)^2+\lambda|x|$

Letting $z=x_k-\frac{1}{L}\nabla f(x_k)$, this is

$x_{k+1}=\arg\min_x\;\frac{L}{2}(x-z)^2+\lambda|x|$
The right-hand side splits into two branches:

for $x\ge 0$: $g_1(x)=\frac{L}{2}(x-z)^2+\lambda x=\frac{L}{2}\big(x-\frac{Lz-\lambda}{L}\big)^2+const_1$;

for $x<0$: $g_2(x)=\frac{L}{2}(x-z)^2-\lambda x=\frac{L}{2}\big(x-\frac{Lz+\lambda}{L}\big)^2+const_2$.

Since the objective contains no cross terms $x^ix^j$, the components of $x$ do not interact and can each be minimized independently.
This is clearly a piecewise function, and the two branches meet on the $y$-axis. (In the plot of $g_1$ and $g_2$, the red curve is $g_1$ and the blue one $g_2$.) Note that $\frac{Lz-\lambda}{L}<\frac{Lz+\lambda}{L}$. So if $\frac{Lz-\lambda}{L}>0$, the vertices of both parabolas lie at positive $x$; since $g_1$ is only defined for $x\ge 0$ and $g_2$ only for $x<0$, only $g_1$'s minimum is attainable. That is, when $\frac{Lz-\lambda}{L}>0\Rightarrow z>\frac{\lambda}{L}$, the minimizer is $x=z-\frac{\lambda}{L}$; symmetrically, when $z<-\frac{\lambda}{L}$, it is $x=z+\frac{\lambda}{L}$. When $\frac{Lz-\lambda}{L}\le 0$ and $\frac{Lz+\lambda}{L}\ge 0$, i.e. $|z|\le\frac{\lambda}{L}$, neither vertex is attainable and the minimum occurs at $x=0$. Written componentwise, the conclusion is:

$x_{k+1}^i=\begin{cases}z^i-\lambda/L, & z^i>\lambda/L\\ 0, & |z^i|\le\lambda/L\\ z^i+\lambda/L, & z^i<-\lambda/L\end{cases}$
From this we can see that when $\lambda/L<z^i$, the update subtracts $\lambda/L$ from the component; over the iterations, components with little explanatory power eventually fall into the band $|z^i|\le\lambda/L$, and the corresponding $x^i$ are set exactly to 0. The case $z^i<-\lambda/L$ is symmetric. This is how the sparse solution arises.
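The resulting iteration, a gradient step followed by componentwise soft-thresholding, is known as ISTA. A minimal sketch, assuming the standardized `X, y` from `prepared_data` above (the iteration count is an arbitrary choice):

```python
import numpy as np

def lasso_ista(X, y, lam, num_iters=500):
    m, d = X.shape
    L = np.linalg.eigvalsh(X.T @ X).max() / m   # Lipschitz constant of the gradient
    theta = np.zeros(d)
    for _ in range(num_iters):
        z = theta - (X.T @ (X @ theta - y) / m) / L   # gradient step
        # componentwise soft-thresholding with threshold lam / L
        theta = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0)
    return theta
```

With a sufficiently large `lam` every component is thresholded to 0, which is exactly the sparsity discussed above.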
III. Forward Stagewise Regression
Forward stagewise regression starts with all weights set to 0, and at each step the decision made is to increase or decrease one weight by a small amount. The pseudocode of the algorithm is as follows:
The code for the algorithm:
def stageWise_regression(X, y, NumOfIt, epsilon):
    X = (X - np.mean(X, axis=0)) / np.std(X, axis=0)
    y = y - np.mean(y)
    m, d = X.shape
    theta = np.zeros(d)
    thetaMat = np.zeros((NumOfIt, d))
    for i in range(NumOfIt):
        lowestError = np.inf
        for j in range(d):
            for sign in [-1, 1]:
                thetaTest = theta.copy()
                thetaTest[j] += sign * epsilon
                Error = cost(thetaTest, X, y)
                if Error < lowestError:
                    lowestError = Error
                    thetaBest = thetaTest
        theta = thetaBest
        thetaMat[i, :] = theta
    return thetaMat

def cost(theta, X, y):
    m, d = X.shape
    return (X @ theta - y).T @ (X @ theta - y) / (2 * m)
This is in effect a greedy algorithm: each outer iteration resets lowestError, so while we cannot guarantee that this iteration's error is smaller than the previous one's, we do guarantee the smallest error achievable within the iteration. We again experiment with the abalone data and its preprocessing function; but since stageWise_regression already normalizes the data, the normalization can be dropped from prepared_data:
def prepared_data(path):
    file = open(path)
    fr_list = file.readlines()
    data_list = []
    for data in fr_list:
        data = data.strip()
        data = data.split(',')
        if data[0] == 'F':
            data[0] = 1
        elif data[0] == 'I':
            data[0] = 0
        else:
            data[0] = -1
        data = [float(i) for i in data]
        data_list.append(data)
    data_mat = np.array(data_list)
    X = data_mat[:, :-1]
    y = data_mat[:, -1]
    return X, y
thetaMat=stageWise_regression(X,y,5000,0.001)
print(thetaMat)
which gives:
[[ 0.000e+00 0.000e+00 0.000e+00 ... 0.000e+00 0.000e+00 1.000e-03]
[ 0.000e+00 0.000e+00 0.000e+00 ... 0.000e+00 0.000e+00 2.000e-03]
[ 0.000e+00 0.000e+00 0.000e+00 ... 0.000e+00 0.000e+00 3.000e-03]
...
[ 0.000e+00 0.000e+00 6.440e-01 ... -1.552e+00 0.000e+00 2.351e+00]
[ 0.000e+00 0.000e+00 6.440e-01 ... -1.553e+00 0.000e+00 2.351e+00]
[ 0.000e+00 0.000e+00 6.450e-01 ... -1.553e+00 0.000e+00 2.351e+00]]
(The normalization used here is $x_{norm}=\frac{x-x_{mean}}{\sigma}$, where $\sigma$ is the standard deviation; many sources put the variance $\sigma^2$ in the denominator instead, which I believe is incorrect.)