[Deep Learning Notes] 5: Softmax Regression

Mathematical Formulation of Softmax Regression

Consider a single sample $x^{(i)}=(x^{(i)}_1,x^{(i)}_2,x^{(i)}_3,x^{(i)}_4)$. We have

$$o^{(i)}_j=x^{(i)}_1 w_{1,j}+x^{(i)}_2 w_{2,j}+x^{(i)}_3 w_{3,j}+x^{(i)}_4 w_{4,j}+b_j,\qquad j=1,2,3$$

$$\begin{aligned} \begin{bmatrix} o^{(i)}_1 & o^{(i)}_2 & o^{(i)}_3 \end{bmatrix}&= \begin{bmatrix} x^{(i)}_1 w_{1,1}+x^{(i)}_2 w_{2,1}+x^{(i)}_3 w_{3,1}+x^{(i)}_4 w_{4,1}+b_1 & x^{(i)}_1 w_{1,2}+x^{(i)}_2 w_{2,2}+x^{(i)}_3 w_{3,2}+x^{(i)}_4 w_{4,2}+b_2 & x^{(i)}_1 w_{1,3}+x^{(i)}_2 w_{2,3}+x^{(i)}_3 w_{3,3}+x^{(i)}_4 w_{4,3}+b_3 \end{bmatrix} \\ &= \begin{bmatrix} x^{(i)}_1 & x^{(i)}_2 & x^{(i)}_3 & x^{(i)}_4 \end{bmatrix} \begin{bmatrix} w_{1,1} & w_{1,2} & w_{1,3} \\ w_{2,1} & w_{2,2} & w_{2,3} \\ w_{3,1} & w_{3,2} & w_{3,3} \\ w_{4,1} & w_{4,2} & w_{4,3} \end{bmatrix}+\begin{bmatrix} b_1 & b_2 & b_3 \end{bmatrix} \end{aligned}$$

If there are $n$ samples, then

$$\begin{bmatrix} o^{(1)}_1 & o^{(1)}_2 & o^{(1)}_3 \\ o^{(2)}_1 & o^{(2)}_2 & o^{(2)}_3 \\ \vdots & \vdots & \vdots \\ o^{(n)}_1 & o^{(n)}_2 & o^{(n)}_3 \end{bmatrix}=\begin{bmatrix} x^{(1)}_1 & x^{(1)}_2 & x^{(1)}_3 & x^{(1)}_4 \\ x^{(2)}_1 & x^{(2)}_2 & x^{(2)}_3 & x^{(2)}_4 \\ \vdots & \vdots & \vdots & \vdots \\ x^{(n)}_1 & x^{(n)}_2 & x^{(n)}_3 & x^{(n)}_4 \end{bmatrix}\begin{bmatrix} w_{1,1} & w_{1,2} & w_{1,3} \\ w_{2,1} & w_{2,2} & w_{2,3} \\ w_{3,1} & w_{3,2} & w_{3,3} \\ w_{4,1} & w_{4,2} & w_{4,3} \end{bmatrix}+\begin{bmatrix} b_1 & b_2 & b_3 \\ b_1 & b_2 & b_3 \\ \vdots & \vdots & \vdots \\ b_1 & b_2 & b_3 \end{bmatrix}$$

$$O=XW+B$$

where $O \in \mathbb{R}^{n\times 3}$, $X \in \mathbb{R}^{n\times 4}$, $W \in \mathbb{R}^{4\times 3}$, and $b \in \mathbb{R}^{3}$ (broadcast into $B$). More generally,

$$O \in \mathbb{R}^{n\times q},\quad X \in \mathbb{R}^{n\times p},\quad W \in \mathbb{R}^{p\times q},\quad b \in \mathbb{R}^{q}\ (\text{broadcast})$$
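As a quick check of these shapes (a minimal sketch, not from the original notes; it assumes PyTorch and uses random data), the whole batch output can be computed as one matrix product plus a broadcast bias:

```python
# Minimal shape check for O = XW + b (assumed PyTorch example, random data).
import torch

n, p, q = 5, 4, 3          # batch size, number of features, number of classes
X = torch.randn(n, p)      # X in R^{n x p}
W = torch.randn(p, q)      # W in R^{p x q}
b = torch.zeros(q)         # b in R^q, broadcast across the n rows
O = X @ W + b              # O in R^{n x q}
print(O.shape)             # torch.Size([5, 3])
```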

Softmax

$$\hat{y}^{(i)}_1,\hat{y}^{(i)}_2,\hat{y}^{(i)}_3=\mathrm{softmax}(o^{(i)}_1,o^{(i)}_2,o^{(i)}_3)$$

where

$$\hat{y}^{(i)}_1=\dfrac{\exp(o^{(i)}_1)}{\sum^3_{j=1}\exp(o^{(i)}_j)},\qquad \hat{y}^{(i)}_2=\dfrac{\exp(o^{(i)}_2)}{\sum^3_{j=1}\exp(o^{(i)}_j)},\qquad \hat{y}^{(i)}_3=\dfrac{\exp(o^{(i)}_3)}{\sum^3_{j=1}\exp(o^{(i)}_j)}$$

$$\hat{y}^{(i)}_1+\hat{y}^{(i)}_2+\hat{y}^{(i)}_3=1 \quad\text{and}\quad \hat{y}^{(i)}_1,\hat{y}^{(i)}_2,\hat{y}^{(i)}_3 \in [0,1]$$

so the outputs satisfy the requirements of a probability distribution, and the class with the largest probability is taken as the predicted class.
In matrix form,

$$O=XW+B,\qquad \hat{Y}=\mathrm{softmax}(O)$$

where

$$O \in \mathbb{R}^{n\times q},\quad X \in \mathbb{R}^{n\times p},\quad W \in \mathbb{R}^{p\times q},\quad b \in \mathbb{R}^{q}\ (\text{broadcast})$$

Here $n$ is the batch size, $p$ the number of features, and $q$ the number of classes; the $j$-th entry of row $\hat{Y}^{(i)}$ is the predicted probability that sample $i$ belongs to class $j$.
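Below is a minimal sketch of a row-wise softmax (assuming PyTorch; subtracting the row maximum is an extra numerical-stability trick, not something the formulas above require):

```python
import torch

def softmax(O):
    """Row-wise softmax: exponentiate each entry and normalize each row to sum to 1."""
    O = O - O.max(dim=1, keepdim=True).values   # stability trick; does not change the result
    exp_O = torch.exp(O)
    return exp_O / exp_O.sum(dim=1, keepdim=True)

O = torch.randn(5, 3)          # n = 5 samples, q = 3 classes
Y_hat = softmax(O)
print(Y_hat.sum(dim=1))        # every row sums to 1, so each row is a probability distribution
```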

Cross-Entropy Loss Function

One-hot Encoding

For the $i$-th sample

$$(x^{(i)}_1,x^{(i)}_2,x^{(i)}_3,\dots,x^{(i)}_p,\color{red}{y^{(i)}})$$

(in practice, note that indices start from 0), suppose $\color{red}{y^{(i)}}=3$. We can then construct a vector $\vec{y}^{(i)}$ of length $q$ (the number of classes), set its 3rd component (the $\color{red}{y^{(i)}}$-th component) to 1, and set all other components to 0.
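A hypothetical sketch of building these one-hot vectors with PyTorch's `torch.nn.functional.one_hot` (the labels below are made up for illustration):

```python
import torch
import torch.nn.functional as F

q = 3
y = torch.tensor([2, 0, 1])             # integer class labels, 0-based as noted above
Y_onehot = F.one_hot(y, num_classes=q)  # shape (3, q); row i has a 1 at column y[i]
print(Y_onehot)
```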

Cross Entropy

The cross entropy is defined as

$$H(\vec{y}^{(i)},\hat{y}^{(i)})=-\sum^q_{j=1}y^{(i)}_j \log \hat{y}^{(i)}_j=-\log\hat{y}^{(i)}_{\color{red}{y^{(i)}}}$$
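A small numerical check (made-up probabilities, assuming PyTorch) that with a one-hot label the full sum collapses to the single term $-\log\hat{y}^{(i)}_{y^{(i)}}$:

```python
import torch

y_onehot = torch.tensor([0., 0., 1.])        # one-hot label: the true class is index 2
y_hat    = torch.tensor([0.1, 0.3, 0.6])     # predicted probabilities for one sample

full_sum = -(y_onehot * y_hat.log()).sum()   # -sum_j y_j log y_hat_j
picked   = -y_hat[2].log()                   # -log y_hat at the true class: only it contributes
print(torch.allclose(full_sum, picked))      # True
```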

What does the cross entropy mean? First look at the KL divergence (which is non-negative):

$$\begin{aligned} D_{KL}(P(x),Q(x))&=\sum_jP(x_j)\log\dfrac{P(x_j)}{Q(x_j)} \\ &=\sum_j P(x_j)\log P(x_j)-\sum_jP(x_j)\log Q(x_j) \\ &=-\underbrace{H(P(x))}_{\text{entropy}}+\boxed{-\sum_jP(x_j)\log Q(x_j)} \end{aligned}$$

The KL divergence describes how similar two distributions are: the smaller it is, the closer they are. $P(x)$ usually denotes the distribution of the labels in our dataset, whose entropy is fixed, while $Q(x)$ is the predicted distribution produced by the model, which we naturally want to be as close to $P(x)$ as possible. Since the entropy term $H(P(x))$ is a constant, minimizing the KL divergence is equivalent to minimizing the boxed cross-entropy term.
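A tiny numerical illustration of the identity $D_{KL}=H(P,Q)-H(P)$ (the two distributions below are made up; assuming PyTorch):

```python
import torch

P = torch.tensor([0.7, 0.2, 0.1])    # "true" distribution, e.g. the label distribution
Q = torch.tensor([0.5, 0.3, 0.2])    # model's predicted distribution

entropy_P     = -(P * P.log()).sum()          # H(P)
cross_entropy = -(P * Q.log()).sum()          # H(P, Q)
kl            =  (P * (P / Q).log()).sum()    # D_KL(P || Q)

print(torch.allclose(kl, cross_entropy - entropy_P))  # True, and kl >= 0
```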

The loss function over $n$ samples is

$$l(\Theta)=\dfrac 1 n \sum^n_{i=1}H(\vec{y}^{(i)},\hat{y}^{(i)})$$

If each sample has exactly one label, this becomes

$$l(\Theta)=- \dfrac 1 n \sum^n_{i=1}\log\hat{y}^{(i)}_{\color{red}{y^{(i)}}}$$
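A minimal sketch of this averaged loss for a mini-batch, given predicted probabilities and integer labels (values are made up; assuming PyTorch):

```python
import torch

Y_hat = torch.tensor([[0.1, 0.3, 0.6],
                      [0.3, 0.2, 0.5]])    # predicted probabilities, n = 2, q = 3
y = torch.tensor([2, 0])                   # true class index for each sample

# Pick y_hat^{(i)} at the true class for every sample i, then average the negative logs.
picked = Y_hat[torch.arange(len(y)), y]    # tensor([0.6, 0.3])
loss = -picked.log().mean()
print(loss)                                # about 0.857
```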
From another point of view, minimizing $l(\Theta)$ is equivalent to maximizing

$$\exp(-n\,l(\Theta))= \prod_{i=1}^n \hat{y}^{(i)}_{\color{red}{y^{(i)}}}$$
That is, minimizing the cross-entropy loss function is equivalent to maximizing the joint predicted probability of all the label classes in the training set.
