UFLDL is an early deep-learning introduction written by Andrew Ng's team. Its pacing of theory plus exercises is excellent; every time I wanted to rush through the theory to get hands-on with the exercise, because the whole code framework is already laid out for you, with detailed comments, so we only have to write a small amount of core code. Very quick to pick up!
I couldn't find a Chinese translation matching this part of the new version. -_-
Section 5 is Softmax Regression. Below I provide both vectorized and non-vectorized code.
Softmax regression is really an extension of logistic regression; put differently, logistic regression is a special case of softmax regression (namely softmax regression with K = 2; a quick check of this appears right after the softmax hypothesis below). Logistic regression is usually used as a two-class classifier, while softmax is used as a multi-class classifier (don't ask why it's still called "Regression").
Here is the hypothesis of logistic regression from before:
h_{\theta}(x)=\frac{1}{1+\exp \left(-\theta^{\top} x\right)}

and the corresponding cost function:
J(\theta)=-\left[\sum_{i=1}^{m} y^{(i)} \log h_{\theta}\left(x^{(i)}\right)+\left(1-y^{(i)}\right) \log \left(1-h_{\theta}\left(x^{(i)}\right)\right)\right]
And now the hypothesis of softmax regression:
h_{\theta}(x)=\left[\begin{array}{c}{P(y=1 | x ; \theta)} \\ {P(y=2 | x ; \theta)} \\ {\vdots} \\ {P(y=K | x ; \theta)}\end{array}\right]=\frac{1}{\sum_{j=1}^{K} \exp \left(\theta^{(j) \top} x\right)}\left[\begin{array}{c}{\exp \left(\theta^{(1) \top} x\right)} \\ {\exp \left(\theta^{(2) \top} x\right)} \\ {\vdots} \\ {\exp \left(\theta^{(K) \top} x\right)}\end{array}\right]
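As the quick check promised above that logistic regression is the K = 2 special case: divide the numerator and denominator of P(y=1 | x ; \theta) by \exp(\theta^{(1) \top} x) and set \theta=\theta^{(1)}-\theta^{(2)}, and the sigmoid hypothesis reappears:

P(y=1 | x ; \theta)=\frac{\exp \left(\theta^{(1) \top} x\right)}{\exp \left(\theta^{(1) \top} x\right)+\exp \left(\theta^{(2) \top} x\right)}=\frac{1}{1+\exp \left(-\left(\theta^{(1)}-\theta^{(2)}\right)^{\top} x\right)}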
and the corresponding cost function:
J(\theta)=-\left[\sum_{i=1}^{m} \sum_{k=1}^{K} 1\left\{y^{(i)}=k\right\} \log \frac{\exp \left(\theta^{(k) \top} x^{(i)}\right)}{\sum_{j=1}^{K} \exp \left(\theta^{(j) \top} x^{(i)}\right)}\right]
The intuitive reading is that each score's share of the total score acts as a "probability"; for the proper probabilistic interpretation, go read the cs229 notes. The 1{⋅} here is an indicator function (my own nickname for it is a "switch function"): it equals 1 when the expression inside is true and 0 when it is false. Once you vectorize the code you will notice that Y is actually a fairly sparse matrix.
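For example, the whole indicator matrix can be built in one line with sparse; this is exactly what the alternative version quoted in the comments at the end of my code below does (the sizes here are a made-up toy case):

% Build the m-by-num_classes indicator matrix Y, where Y(i,k) = 1{y(i) = k}.
y = [2 1 3 2];               % toy labels: m = 4 examples, 3 classes
m = numel(y);
Y = full(sparse(1:m, y, 1)); % each row has a single 1, in column y(i)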
And the gradient formula for softmax regression looks like this (I couldn't figure out the derivation myself at first; a sketch follows right after the formula):
\nabla_{\theta^{(k)}} J(\theta)=-\sum_{i=1}^{m}\left[x^{(i)}\left(1\left\{y^{(i)}=k\right\}-P\left(y^{(i)}=k | x^{(i)} ; \theta\right)\right)\right]
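Here is a sketch of the derivation, since I got stuck on it at first (my own working, so double-check it). For a single example x with label y, write p_k = P(y=k | x ; \theta) for the k-th softmax output. Taking logs of the hypothesis,

\log p_{c}=\theta^{(c) \top} x-\log \sum_{j=1}^{K} \exp \left(\theta^{(j) \top} x\right)

and differentiating with respect to \theta^{(k)} (the log-sum-exp term contributes p_{k} x),

\nabla_{\theta^{(k)}} \log p_{c}=\left(1\{c=k\}-p_{k}\right) x

The per-example loss is -\sum_{c} 1\{y=c\} \log p_{c}, in which only the c = y term survives, so its gradient with respect to \theta^{(k)} is -\left(1\{y=k\}-p_{k}\right) x. Summing over the m examples gives exactly the formula above.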
Unlike before, where the gradient with respect to θ was a vector, it is now an N × K matrix (N is the number of features, K the number of classes). (Actually, by that reasoning, two-class logistic regression should also give a matrix, but because of its redundant properties, which the tutorial also mentions, one dimension can be dropped, leaving a vector; that's my own understanding. A concrete check follows.)
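The redundant properties can be made concrete: subtracting any fixed vector \psi from every \theta^{(j)} leaves the hypothesis unchanged, because the factor \exp(-\psi^{\top} x) cancels between numerator and denominator:

\frac{\exp \left(\left(\theta^{(k)}-\psi\right)^{\top} x\right)}{\sum_{j=1}^{K} \exp \left(\left(\theta^{(j)}-\psi\right)^{\top} x\right)}=\frac{\exp \left(\theta^{(k) \top} x\right)}{\sum_{j=1}^{K} \exp \left(\theta^{(j) \top} x\right)}

Choosing \psi=\theta^{(K)} fixes \theta^{(K)}=0, which is exactly the convention in the starter code below (theta(:,num_classes) = 0), and is why only num_classes-1 columns get optimized.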
Talk is cheap, give me the code!
I learned a new function in this one: sub2ind.
function [f,g] = softmax_regression_vec(theta, X,y)
%
% Arguments:
% theta - A vector containing the parameter values to optimize.
% In minFunc, theta is reshaped to a long vector. So we need to
% resize it to an n-by-(num_classes-1) matrix.
% Recall that we assume theta(:,num_classes) = 0.
%
% X - The examples stored in a matrix.
% X(i,j) is the i'th coordinate of the j'th example.
% y - The label for each example. y(j) is the j'th example's label.
%
m=size(X,2);
n=size(X,1);
% theta is a vector; need to reshape to n x num_classes.
theta=reshape(theta, n, []);
num_classes=size(theta,2)+1;
% initialize objective value and gradient.
% f = 0;
% g = zeros(size(theta));
%
% TODO: Compute the softmax objective function and gradient using vectorized code.
% Store the objective function value in 'f', and the gradient in 'g'.
% Before returning g, make sure you form it back into a vector with g=g(:);
%
%%% YOUR CODE HERE %%%
A = exp([theta' * X; zeros(1,m)]);    % num_classes x m scores; the zero row is the implicit theta(:,num_classes) = 0
B = bsxfun(@rdivide, A, sum(A));      % softmax probabilities: each column sums to 1
C = log(B);
I = sub2ind(size(C), y, 1:size(C,2)); % linear indices picking log P(y(i) | x(i)) from each column
f = -sum(C(I));
%%%%%%% calculate g %%%%%%%%%%%%
Y = repmat(y', 1, num_classes);
for i = 1:num_classes
    Y(Y(:,i) ~= i, i) = 0;  % zero out rows whose label is not class i
end
Y(Y ~= 0) = 1;              % Y is now the m x num_classes indicator matrix 1{y(i) = k}
% Drop the last column of Y and the last row of B, because theta only has num_classes-1 columns.
g = -X * (Y(:, 1:end-1) - B(1:end-1, :)');
%%% Someone else's version. The two are equivalent; the main difference is how the %%%
%%% sparse indicator matrix is generated, and theirs is slightly faster: with      %%%
%%% num_classes this small, mine took 0.014272 s and theirs 0.014249 s.            %%%
% h = theta' * X;              % h(k,i): k'th theta, i'th example
% a = exp(h);
% a = [a; ones(1, size(a,2))]; % append a row of ones (exp(0) for the implicit last theta)
% p = bsxfun(@rdivide, a, sum(a));
% c = log(p);
% i = sub2ind(size(c), y, 1:size(c,2));
% values = c(i);
% f = -sum(values);
% d = full(sparse(1:m, y, 1)); % indicator matrix in one line
% d = d(:, 1:end-1);           % drop one column
% p = p(1:end-1, :);           % drop one row
% g = X * (p' - d);
g=g(:); % make gradient a vector for minFunc
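Before handing f and g to minFunc, a cheap finite-difference spot check can catch gradient bugs. This is my own sketch on a made-up tiny problem (I believe the exercise's starter code also ships a grad_check utility that does the same job):

% Finite-difference spot check of the analytic gradient on a tiny random problem.
n = 5; m = 20; num_classes = 4;
X = randn(n, m);
y = randi(num_classes, 1, m);  % labels in 1..num_classes, as a row vector
theta = 0.01 * randn(n * (num_classes - 1), 1);
[~, g] = softmax_regression_vec(theta, X, y);
epsilon = 1e-6;
for t = 1:5                    % check a few random coordinates
    j = randi(numel(theta));
    e = zeros(size(theta)); e(j) = epsilon;
    fp = softmax_regression_vec(theta + e, X, y);
    fm = softmax_regression_vec(theta - e, X, y);
    fprintf('g(%d): analytic %.6f vs numeric %.6f\n', j, g(j), (fp - fm) / (2 * epsilon));
end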
Running result:
After writing it I was afraid I'd made a mistake, so I compared against someone else's work; reference: https://blog.csdn.net/lingerlanlan/article/details/38425929
If I've misunderstood anything, please point it out; if you have better ideas, let's discuss in the comments below!