Linear Regression Model
Suppose the training set is $T=\{(x^{(1)},y^{(1)}),\cdots,(x^{(m)},y^{(m)})\}$, where $x^{(i)}=(x^{(i)}_{1},x^{(i)}_{2},\cdots,x^{(i)}_{n})^{T}$.
Model
$$
\begin{aligned}
\widehat{y}&=h(x)=\omega_{0}+\omega_{1}x_{1}+\cdots+\omega_{n}x_{n}\\
&=\omega^{T}x=x^{T}\omega
\end{aligned}
$$
where
$$
\omega=\begin{bmatrix}\omega_{0}\\ \omega_{1}\\ \vdots\\ \omega_{n}\end{bmatrix},\qquad
x=\begin{bmatrix}1\\ x_{1}\\ \vdots\\ x_{n}\end{bmatrix}
$$
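As a quick illustration (a minimal sketch with made-up numbers, not taken from the text), prepending a constant 1 to the feature vector lets the intercept $\omega_{0}$ be absorbed into a single dot product:

import numpy as np

# hypothetical example values, for illustration only
w = np.array([2.0, 0.5, -1.0])       # [w0, w1, w2]
x_raw = np.array([3.0, 4.0])         # original features x1, x2
x = np.concatenate(([1.0], x_raw))   # prepend 1 so that h(x) = w^T x includes the intercept

y_hat = w @ x                        # 2 + 0.5*3 - 1*4 = -0.5
print(y_hat)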
Loss Function
$$
l(x^{(i)})=\frac{1}{2}\left(\widehat{y}^{(i)}-y^{(i)}\right)^{2}
$$
This is the gap between a single sample and its prediction; summing these gaps over all samples gives the final objective function, which is then minimized.
Objective Function
$$
J(\omega)=\frac{1}{2}\sum_{i=1}^{m}\left(\widehat{y}^{(i)}-y^{(i)}\right)^{2}
$$
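In vectorized form this is just half the squared Euclidean norm of the residual vector. A minimal sketch (the names X, y, w are placeholders, not from the text):

import numpy as np

def objective(X, y, w):
    # J(w) = 1/2 * ||X w - y||^2, with X already containing a leading column of ones
    residual = X @ w - y
    return 0.5 * residual @ residual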
Solving the model therefore amounts to solving for $\omega$; two methods are given below.
Least Squares
Here we solve in matrix form; the model above can be written as:
$$
\widehat{Y}=
\begin{bmatrix}\widehat{y}^{(1)}\\ \widehat{y}^{(2)}\\ \vdots\\ \widehat{y}^{(m)}\end{bmatrix}
=\begin{bmatrix}
1 & x^{(1)}_{1} & x^{(1)}_{2} & \cdots & x^{(1)}_{n}\\
1 & x^{(2)}_{1} & x^{(2)}_{2} & \cdots & x^{(2)}_{n}\\
 & & \vdots & & \\
1 & x^{(m)}_{1} & x^{(m)}_{2} & \cdots & x^{(m)}_{n}
\end{bmatrix}
\begin{bmatrix}\omega_{0}\\ \omega_{1}\\ \vdots\\ \omega_{n}\end{bmatrix}
=X\omega
$$
Objective function in matrix form:
$$
\begin{aligned}
J(\omega)&=\frac{1}{2}\|X\omega-Y\|^{2}\\
&=\frac{1}{2}(X\omega-Y)^{T}(X\omega-Y)\\
&=\frac{1}{2}(\omega^{T}X^{T}-Y^{T})(X\omega-Y)\\
&=\frac{1}{2}(\omega^{T}X^{T}X\omega-\omega^{T}X^{T}Y-Y^{T}X\omega+Y^{T}Y)\\
&=\frac{1}{2}(\omega^{T}X^{T}X\omega-2\omega^{T}X^{T}Y+Y^{T}Y)
\end{aligned}
$$
(The last step uses the fact that $\omega^{T}X^{T}Y$ and $Y^{T}X\omega$ are both scalars and therefore equal.)
The matrix differentiation here uses the following identities:
$$
\frac{dX^{T}X}{dX}=2X,\quad
\frac{dAX}{dX}=A^{T},\quad
\frac{dX^{T}A}{dX}=A,\quad
\frac{\partial X^{T}AX}{\partial X}=2AX\ \ (\text{for symmetric }A)
$$
Differentiating the objective function:
$$
\frac{\partial J(\omega)}{\partial \omega}=\frac{1}{2}\left(2X^{T}X\omega-2X^{T}Y\right)=X^{T}X\omega-X^{T}Y
$$
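This gradient can be sanity-checked against a numerical finite-difference approximation on random data (a minimal sketch; the names and values below are made up for illustration):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))   # 5 samples, 3 columns
Y = rng.normal(size=5)
w = rng.normal(size=3)

def J(w):
    r = X @ w - Y
    return 0.5 * r @ r

analytic = X.T @ (X @ w - Y)   # X^T X w - X^T Y

# central finite differences, one coordinate at a time
eps = 1e-6
numeric = np.array([(J(w + eps*e) - J(w - eps*e)) / (2*eps)
                    for e in np.eye(3)])

print(np.allclose(analytic, numeric, atol=1e-5))   # expected: True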
Setting $\frac{\partial J(\omega)}{\partial \omega}=0$ gives $\omega=(X^{T}X)^{-1}X^{T}Y$; substituting this $\omega$ back into $\widehat{y}=h(x)$ yields the fitted model.
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt

df = pd.read_csv('house_price.csv')
df['price'] = df['price'][:7].astype(int)   # use only the first 7 rows of the price column, cast to int
df = df.values
x = df[:7, 1]    # feature column
y = df[:7, 2]    # target column (price)
m = y.size

'''Least squares'''
# visualize the raw data
plt.scatter(x, y)

# prepend a column of ones to x (bias column)
a = np.ones((m, 1))
x = np.column_stack((a, x))

# closed-form solution: w = (X^T X)^{-1} X^T y
w = np.linalg.inv(x.T @ x) @ x.T @ y

# visualize the fitted line
b = x @ w                 # predictions for the training samples
plt.plot(x[:, 1], b)
plt.show()

# prediction for a new point
x_predict = 700
y_predict = np.array([1, x_predict]) @ w
print(y_predict)
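As a side note, forming $(X^{T}X)^{-1}$ explicitly can be numerically ill-conditioned; the same least-squares solution is usually obtained with a solver such as np.linalg.lstsq (sketch below, using the same x and y as above):

w_lstsq, *_ = np.linalg.lstsq(x, y, rcond=None)   # solves min ||x w - y||^2
print(w_lstsq)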
Gradient Descent
Idea: solve the optimization problem iteratively.
General form:
$$
X_{k+1}=X_{k}+\alpha d_{k}
$$
where $\alpha$ is the step size and $d_{k}$ is the gradient direction. The gradient is the direction in which the function value increases fastest, so to minimize the objective the updates below step against it (hence the minus sign in the iterations).
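A minimal one-dimensional illustration of this update rule (a toy example, not from the text): minimizing $f(x)=x^{2}$, whose gradient is $2x$.

# toy example: gradient descent on f(x) = x^2
x_k = 5.0         # starting point
alpha = 0.1       # step size
for _ in range(50):
    grad = 2 * x_k              # f'(x) = 2x
    x_k = x_k - alpha * grad    # step against the gradient
print(x_k)        # close to 0, the minimizer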
Model: differentiating the cost function gives
$$
d_{k}(\omega_{j})=\frac{\partial J(\omega)}{\partial \omega_{j}}=\sum_{i=1}^{m}\left(h(x^{(i)})-y^{(i)}\right)x^{(i)}_{j}
$$
Single-sample (stochastic) gradient descent
Step 1: Initialize $\alpha$, $\varepsilon$, $\omega_{j}\ (j=0,\cdots,n)$, $k$, $MaxN$.
Step 2: Pick a sample $(x^{(i)},y^{(i)})$ at random and compute
$$
d_{k}(\omega_{j})=\left(h(x^{(i)})-y^{(i)}\right)x^{(i)}_{j}
$$
Step 3: Update
$$
\omega_{j}:=\omega_{j}-\alpha d_{k}(\omega_{j}),\qquad k=k+1
$$
Step 4: If $\|d_{k}\|<\varepsilon$ or $k>MaxN$, output $\omega_{j}$; otherwise go back to Step 2.
Note:
$$
g_{k}=\begin{bmatrix}g_{k}(\omega_{0})\\ g_{k}(\omega_{1})\\ \vdots\\ g_{k}(\omega_{n})\end{bmatrix}
$$
i.e. the gradient components for all weights are collected into one vector; this is the $d_{k}$ whose norm appears in the stopping test of Step 4.
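The code further below only covers the matrix (batch) version, so here is a minimal sketch of the single-sample version under the same setup (X is the design matrix with a leading column of ones, y the targets; the function name and parameters are placeholders, not from the text):

import numpy as np

def sgd(X, y, alpha=0.01, eps=1e-5, max_iter=10000, seed=0):
    # Single-sample gradient descent for linear regression.
    # X: (m, n+1) design matrix with leading ones column, y: (m,) targets.
    rng = np.random.default_rng(seed)
    m, n1 = X.shape
    w = np.zeros(n1)
    for k in range(max_iter):
        i = rng.integers(m)               # pick one sample at random
        d = (X[i] @ w - y[i]) * X[i]      # d_k = (h(x^(i)) - y^(i)) * x^(i)
        w -= alpha * d                    # w_j := w_j - alpha * d_k(w_j)
        if np.linalg.norm(d) < eps:       # stopping test on the single-sample gradient
            break
    return w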
Gradient descent in matrix form (batch gradient descent)
From the least-squares derivation above we already have
$$
\frac{\partial J(\omega)}{\partial \omega}=X^{T}(X\omega-Y)
$$
Step 1: Initialize $\alpha$, $\varepsilon$, the initial weight vector $\omega$, set $k=1$, and choose $MaxN$.
Step 2: Compute
$$
d_{k}=X^{T}(X\omega_{k}-Y)
$$
Step 3: Update
$$
\omega_{k+1}:=\omega_{k}-\alpha d_{k},\qquad k=k+1
$$
Step 4: If $\|d_{k}\|<\varepsilon$ or $k>MaxN$, output $\omega_{k+1}$; otherwise go back to Step 2.
import numpy as np
import pandas as pd

df = pd.read_csv('house_price.csv')
df.drop(['No'], axis=1, inplace=True)
df['price'] = df['price'][:7].astype(int)
df = df.values
x = df[:7, 0]                    # feature column
y = df[:7, 1].reshape(-1, 1)     # target column (price)
m = y.size

'''Standardization'''
def norm(x):
    sigma = np.std(x, axis=0)
    mu = np.mean(x, axis=0)
    x = (x - mu) / sigma
    return x, mu, sigma

x, mu, sigma = norm(x)
x = np.c_[np.ones(m), x]         # prepend the bias column of ones

'''Gradient descent'''
alpha = 0.1                              # step size
w = np.random.randint(1, 1000, (2, 1))   # random initial weights
error = 10**(-5)                         # stopping tolerance on the gradient norm
k = 0

def gradient_descent(x, y, w, alpha, error, k):
    for i in range(300):
        g = x.T @ (x @ w - y)    # full-batch gradient X^T (X w - Y)
        w = w - alpha * g
        k += 1
        if np.linalg.norm(g) < error:
            break
    print('Stopped after {} iterations, w = {}'.format(k, w))
    return w

w = gradient_descent(x, y, w, alpha, error, k)

# prediction
x_predict = 700
x_predict = (x_predict - mu) / sigma     # apply the same standardization
y_predict = np.array([1, x_predict]) @ w
print('Prediction: {}'.format(y_predict))
Comparison of Least Squares and Gradient Descent
Gradient descent: requires choosing the learning rate $\alpha$ and needs many iterations, but it remains practical even when the number of features $n$ is large, and it applies to many types of model.
Least squares: when the number of features $n$ is large the computation becomes expensive, since computing the matrix inverse has time complexity $O(n^{3})$; it is usually acceptable when $n<100000$. It only applies to linear models and is not suitable for logistic regression or other models.
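As a quick check that the two approaches agree, one can run both on the same data and compare the weights (a minimal sketch on synthetic data; all names and values are made up for illustration):

import numpy as np

rng = np.random.default_rng(1)
m = 50
X = np.c_[np.ones(m), rng.normal(size=m)]              # bias column + one feature
y = X @ np.array([3.0, 2.0]) + 0.1 * rng.normal(size=m)

# closed-form least squares
w_ls = np.linalg.inv(X.T @ X) @ X.T @ y

# batch gradient descent
w_gd = np.zeros(2)
alpha = 0.01
for _ in range(5000):
    w_gd -= alpha * X.T @ (X @ w_gd - y)

print(w_ls, w_gd)   # the two weight vectors should be nearly identical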