Note: for the vectorized representation of linear regression, see: Linear Regression Vectorization.
In practice, matrices make computation far more convenient; in code, everything is done with matrix operations (see Programming Assignment (2): Logistic Regression). We can therefore vectorize the entire model.
For the whole training set:
1. Inputs, outputs, and parameters
As in linear regression, we use the feature matrix $X$ to describe all features, the parameter vector $\theta$ to describe all parameters, and the output vector $y$ to represent all output variables:
$$
X=\begin{bmatrix}
x_0^{(1)} & x_1^{(1)} & x_2^{(1)} & \cdots & x_n^{(1)} \\
x_0^{(2)} & x_1^{(2)} & x_2^{(2)} & \cdots & x_n^{(2)} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
x_0^{(m)} & x_1^{(m)} & x_2^{(m)} & \cdots & x_n^{(m)}
\end{bmatrix},\quad
\theta=\begin{bmatrix} \theta_0 \\ \theta_1 \\ \vdots \\ \theta_n \end{bmatrix},\quad
y=\begin{bmatrix} y^{(1)} \\ y^{(2)} \\ \vdots \\ y^{(m)} \end{bmatrix}
$$
$X$ has dimensions $m \times (n+1)$, with $x_0 = 1$; $\theta$ has dimensions $(n+1) \times 1$; and $y$ has dimensions $m \times 1$, with $y^{(i)} \in \{0, 1\}$.
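As a minimal illustration (not from the original post; the toy sizes, variable names, and random data are my own), this NumPy sketch builds the three arrays with exactly these dimensions, including the $x_0 = 1$ intercept column:

```python
import numpy as np

m, n = 5, 3                                # m training samples, n features (toy sizes)
X_raw = np.random.rand(m, n)               # raw features, shape (m, n)
X = np.hstack([np.ones((m, 1)), X_raw])    # prepend the x_0 = 1 column -> shape (m, n+1)
theta = np.zeros((n + 1, 1))               # parameter vector, shape (n+1, 1)
y = np.random.randint(0, 2, size=(m, 1))   # labels y^(i) in {0, 1}, shape (m, 1)

print(X.shape, theta.shape, y.shape)       # (5, 4) (4, 1) (5, 1)
```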
2. Hypothesis function
The hypothesis outputs for the entire training set can likewise be represented as an $m \times 1$ vector:
$$
h_\theta(x)=g(X\theta)=\begin{bmatrix}
g(x_0^{(1)}\theta_0+x_1^{(1)}\theta_1+x_2^{(1)}\theta_2+\cdots+x_n^{(1)}\theta_n)\\
g(x_0^{(2)}\theta_0+x_1^{(2)}\theta_1+x_2^{(2)}\theta_2+\cdots+x_n^{(2)}\theta_n)\\
\vdots\\
g(x_0^{(m)}\theta_0+x_1^{(m)}\theta_1+x_2^{(m)}\theta_2+\cdots+x_n^{(m)}\theta_n)
\end{bmatrix}=\begin{bmatrix}
h_\theta(x^{(1)})\\ h_\theta(x^{(2)})\\ \vdots\\ h_\theta(x^{(m)})
\end{bmatrix}=\hat{y}=\begin{bmatrix}
\hat{y}^{(1)}\\ \hat{y}^{(2)}\\ \vdots\\ \hat{y}^{(m)}
\end{bmatrix}
$$
The new symbol introduced here is $\hat{y}=h_\theta(x)$ (read "y-hat"). Some sources use $\hat{y}$ to denote a sample's predicted value; it means the same thing as the hypothesis function $h_\theta(x)$.
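Continuing the sketch above, the whole $m \times 1$ prediction vector $\hat{y}$ falls out of a single matrix product; `sigmoid` and `hypothesis` are hypothetical helper names standing in for $g$ and $h_\theta$:

```python
def sigmoid(z):
    """Element-wise logistic function g(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def hypothesis(X, theta):
    """Vectorized hypothesis h_theta(x) = g(X theta); returns the (m, 1) vector y_hat."""
    return sigmoid(X @ theta)

y_hat = hypothesis(X, theta)               # one prediction per training sample
```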
3. Cost function
The original formula:
$$
\begin{aligned}
J(\theta)&=-\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\log\bigl(h_\theta(x^{(i)})\bigr)+(1-y^{(i)})\log\bigl(1-h_\theta(x^{(i)})\bigr)\right]\\
&=-\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\log\bigl(\hat{y}^{(i)}\bigr)+(1-y^{(i)})\log\bigl(1-\hat{y}^{(i)}\bigr)\right]
\end{aligned}
$$
Its vectorized form is:
$$
\begin{aligned}
J(\theta)&=-\frac{1}{m}\,\mathrm{SUM}\left[y*\log(h_\theta(x))+(1-y)*\log(1-h_\theta(x))\right]\\
&=-\frac{1}{m}\,\mathrm{SUM}\left[y*\log(\hat{y})+(1-y)*\log(1-\hat{y})\right]
\end{aligned}
$$
The bracketed expression above still evaluates to a vector (here $*$ denotes element-wise multiplication), so $\mathrm{SUM}$ means summing all entries of that vector, which finally yields a scalar.
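As a hedged sketch reusing the arrays above, `np.sum` plays the role of $\mathrm{SUM}$ and `*` is NumPy's element-wise product. In practice one often clips `y_hat` slightly away from 0 and 1 to avoid `log(0)`; that is omitted here for clarity:

```python
def cost(theta, X, y):
    """Vectorized cost J(theta).

    The bracketed term is an (m, 1) vector; np.sum reduces it to a scalar.
    """
    m = X.shape[0]
    y_hat = sigmoid(X @ theta)
    return -np.sum(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat)) / m
```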
4. Gradient descent
The original formula is:
$$
\theta_j:=\theta_j-\alpha\frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)})-y^{(i)}\right)x_j^{(i)}
$$
We now express the update of all parameters as a single vector operation:
$$
\theta:=\theta-\alpha\delta
$$
where:
$$
\theta=\begin{bmatrix} \theta_0\\ \theta_1\\ \vdots\\ \theta_n \end{bmatrix},\quad
\delta=\frac{1}{m}\begin{bmatrix}
\sum_{i=1}^{m}\left(h_\theta(x^{(i)})-y^{(i)}\right)x_0^{(i)}\\
\sum_{i=1}^{m}\left(h_\theta(x^{(i)})-y^{(i)}\right)x_1^{(i)}\\
\vdots\\
\sum_{i=1}^{m}\left(h_\theta(x^{(i)})-y^{(i)}\right)x_n^{(i)}
\end{bmatrix}
$$
And since:
$$
\delta=\frac{1}{m}\begin{bmatrix}
x_0^{(1)}&x_0^{(2)}&\cdots&x_0^{(m)}\\
x_1^{(1)}&x_1^{(2)}&\cdots&x_1^{(m)}\\
\vdots&\vdots&\ddots&\vdots\\
x_n^{(1)}&x_n^{(2)}&\cdots&x_n^{(m)}
\end{bmatrix}
\begin{bmatrix}
h_\theta(x^{(1)})-y^{(1)}\\
h_\theta(x^{(2)})-y^{(2)}\\
\vdots\\
h_\theta(x^{(m)})-y^{(m)}
\end{bmatrix}
=\frac{1}{m}X^T\left[g(X\theta)-y\right]
$$

(The last row of the transposed matrix is $x_n^{(1)},\dots,x_n^{(m)}$; the original mistakenly repeated $x_0$.)
Therefore, gradient descent can be written as:
$$
\theta:=\theta-\alpha\frac{1}{m}X^T\left[g(X\theta)-y\right]
$$
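Putting it all together, here is a minimal batch gradient-descent loop under the assumptions above (the learning rate `alpha=0.1` and the iteration count are arbitrary illustrative choices, not values from the original):

```python
def gradient_descent(X, y, theta, alpha=0.1, num_iters=1000):
    """Repeatedly apply theta := theta - alpha * (1/m) * X^T [g(X theta) - y]."""
    m = X.shape[0]
    for _ in range(num_iters):
        delta = X.T @ (sigmoid(X @ theta) - y) / m   # the (n+1, 1) gradient vector
        theta = theta - alpha * delta
    return theta

theta = gradient_descent(X, y, theta)
```

Note that the single line computing `delta` updates all $n+1$ parameters simultaneously, with no loop over $j$; that is the whole point of the vectorized form.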