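# The Cross-Entropy Cost Function (Loss Function) and the Derivation of Its Gradient

1. Preface
2. The cross-entropy loss function
3. Differentiating the cross-entropy loss function

## Preface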

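$$
J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\left[ y^{(i)}\log\left(h_\theta(x^{(i)})\right) + \left(1-y^{(i)}\right)\log\left(1-h_\theta(x^{(i)})\right) \right],
$$

$$
\frac{\partial}{\partial\theta_j}J(\theta) = \frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)})-y^{(i)}\right)x_j^{(i)}
$$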

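## The Cross-Entropy Loss Function

• In logistic regression (a yes/no problem), $y^{(i)}$ takes the value 0 or 1;
• In softmax regression (a multi-class problem), $y^{(i)}$ takes one of the values $1, 2, \dots, k$, a number that denotes the class label (assuming there are $k$ classes in total).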

$$
\theta^T x^{(i)} := \theta_0 + \theta_1 x_1^{(i)} + \dots + \theta_p x_p^{(i)}.
$$

$$
h_\theta(x^{(i)}) = \frac{1}{1+e^{-\theta^T x^{(i)}}}
$$
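The hypothesis above can be sketched in code as follows. This is a minimal illustration in plain Python and is not taken from the original article: the function names `sigmoid` and `hypothesis` are hypothetical, and the branch on the sign of `z` is an added numerical-safety choice so that `math.exp` never receives a large positive argument. It assumes each input vector starts with $x_0 = 1$, so that $\theta_0$ acts as the intercept.

```python
import math
from typing import Sequence


def sigmoid(z: float) -> float:
    """Logistic function 1 / (1 + e^{-z}), evaluated without overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # For negative z, e^{-z} could overflow, so use the equivalent form e^z / (1 + e^z).
    ez = math.exp(z)
    return ez / (1.0 + ez)


def hypothesis(theta: Sequence[float], x: Sequence[float]) -> float:
    """h_theta(x); assumes x[0] == 1 so that theta[0] is the intercept (an assumption)."""
    z = sum(t * xj for t, xj in zip(theta, x))
    return sigmoid(z)


print(hypothesis([0.0, 0.0], [1.0, 3.0]))  # 0.5, because theta^T x = 0
```

With all parameters at zero the model is maximally uncertain and returns 0.5 for any input.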

$$
\begin{aligned}
P(\hat{y}^{(i)}=1 \mid x^{(i)};\theta) &= h_\theta(x^{(i)}) \\
P(\hat{y}^{(i)}=0 \mid x^{(i)};\theta) &= 1-h_\theta(x^{(i)})
\end{aligned}
$$

$$
\begin{aligned}
\log P(\hat{y}^{(i)}=1 \mid x^{(i)};\theta) &= \log h_\theta(x^{(i)}) = \log\frac{1}{1+e^{-\theta^T x^{(i)}}}, \\
\log P(\hat{y}^{(i)}=0 \mid x^{(i)};\theta) &= \log\left(1-h_\theta(x^{(i)})\right) = \log\frac{e^{-\theta^T x^{(i)}}}{1+e^{-\theta^T x^{(i)}}}.
\end{aligned}
$$

$$
\begin{aligned}
& I\{y^{(i)}=1\}\log P(\hat{y}^{(i)}=1 \mid x^{(i)};\theta) + I\{y^{(i)}=0\}\log P(\hat{y}^{(i)}=0 \mid x^{(i)};\theta) \\
={}& y^{(i)}\log P(\hat{y}^{(i)}=1 \mid x^{(i)};\theta) + \left(1-y^{(i)}\right)\log P(\hat{y}^{(i)}=0 \mid x^{(i)};\theta) \\
={}& y^{(i)}\log\left(h_\theta(x^{(i)})\right) + \left(1-y^{(i)}\right)\log\left(1-h_\theta(x^{(i)})\right)
\end{aligned}
$$
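As a quick sanity check of the per-sample expression above (a worked instance added for illustration, not part of the original derivation), substitute the two possible labels. Only one of the two terms survives in each case:

$$
\begin{aligned}
y^{(i)}=1:&\quad 1\cdot\log\left(h_\theta(x^{(i)})\right) + 0\cdot\log\left(1-h_\theta(x^{(i)})\right) = \log h_\theta(x^{(i)}), \\
y^{(i)}=0:&\quad 0\cdot\log\left(h_\theta(x^{(i)})\right) + 1\cdot\log\left(1-h_\theta(x^{(i)})\right) = \log\left(1-h_\theta(x^{(i)})\right).
\end{aligned}
$$

So the expression is simply the log-probability that the model assigns to the label that was actually observed.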

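$$
\sum_{i=1}^{m}\left[ y^{(i)}\log\left(h_\theta(x^{(i)})\right) + \left(1-y^{(i)}\right)\log\left(1-h_\theta(x^{(i)})\right) \right]
$$

$$
J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\left[ y^{(i)}\log\left(h_\theta(x^{(i)})\right) + \left(1-y^{(i)}\right)\log\left(1-h_\theta(x^{(i)})\right) \right]
$$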
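The loss $J(\theta)$ can be transcribed almost literally into code. The sketch below is illustrative only: the names `cross_entropy_loss`, `xs`, and `ys` are hypothetical, the toy data is made up for the example, and each input is assumed to carry a leading 1 for the intercept. It is a direct transcription, so it assumes $0 < h_\theta(x^{(i)}) < 1$ and would fail on `math.log(0)` if the sigmoid saturates.

```python
import math
from typing import Sequence


def sigmoid(z: float) -> float:
    """Logistic function 1 / (1 + e^{-z})."""
    return 1.0 / (1.0 + math.exp(-z))


def cross_entropy_loss(
    theta: Sequence[float],
    xs: Sequence[Sequence[float]],
    ys: Sequence[int],
) -> float:
    """J(theta) = -(1/m) * sum_i [ y_i log h_i + (1 - y_i) log(1 - h_i) ]."""
    m = len(xs)
    total = 0.0
    for x, y in zip(xs, ys):
        z = sum(t * xj for t, xj in zip(theta, x))  # theta^T x
        h = sigmoid(z)
        # Direct transcription: assumes 0 < h < 1 so both logs are finite.
        total += y * math.log(h) + (1 - y) * math.log(1.0 - h)
    return -total / m


# Toy data (made up): first component of each x is the constant 1.
xs = [[1.0, 0.5], [1.0, -1.5], [1.0, 2.0]]
ys = [1, 0, 1]
print(cross_entropy_loss([0.0, 0.0], xs, ys))  # about 0.693 (= log 2), since every h is 0.5
```

With $\theta = 0$ every prediction is 0.5, so each sample contributes $\log 2$ to the loss regardless of its label.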

## Differentiating the Cross-Entropy Loss Function

$$
\begin{aligned}
\log\frac{a}{b} &= \log a - \log b && \text{①} \\
\log a + \log b &= \log(ab) && \text{②} \\
a &= \log e^{a} && \text{③}
\end{aligned}
$$

$$
J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\left[ y^{(i)}\log\left(h_\theta(x^{(i)})\right) + \left(1-y^{(i)}\right)\log\left(1-h_\theta(x^{(i)})\right) \right]
$$

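$$
\begin{aligned}
\log h_\theta(x^{(i)}) &= \log\frac{1}{1+e^{-\theta^T x^{(i)}}} = -\log\left(1+e^{-\theta^T x^{(i)}}\right), \\
\log\left(1-h_\theta(x^{(i)})\right) &= \log\left(1-\frac{1}{1+e^{-\theta^T x^{(i)}}}\right) \\
&= \log\left(\frac{e^{-\theta^T x^{(i)}}}{1+e^{-\theta^T x^{(i)}}}\right) \\
&= \log\left(e^{-\theta^T x^{(i)}}\right) - \log\left(1+e^{-\theta^T x^{(i)}}\right) && \text{(by ①)} \\
&= -\theta^T x^{(i)} - \log\left(1+e^{-\theta^T x^{(i)}}\right). && \text{(by ③)}
\end{aligned}
$$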
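The two closed forms above are also convenient numerically, because they avoid taking the logarithm of a probability that has rounded to 0. The following sketch evaluates them in plain Python. The function names are hypothetical, and the `softplus` helper (with its `max`/`log1p` rearrangement) is an implementation choice added here for overflow safety, not something stated in the original derivation.

```python
import math


def softplus(t: float) -> float:
    """log(1 + e^t), rearranged as max(t, 0) + log1p(e^{-|t|}) to avoid overflow."""
    return max(t, 0.0) + math.log1p(math.exp(-abs(t)))


def log_h(z: float) -> float:
    """log h = -log(1 + e^{-z}), where z stands for theta^T x."""
    return -softplus(-z)


def log_one_minus_h(z: float) -> float:
    """log(1 - h) = -z - log(1 + e^{-z}), where z stands for theta^T x."""
    return -z - softplus(-z)


# Compare against the naive computation for a moderate value of z.
z = 2.5
h = 1.0 / (1.0 + math.exp(-z))
print(abs(log_h(z) - math.log(h)) < 1e-12)                  # True
print(abs(log_one_minus_h(z) - math.log(1.0 - h)) < 1e-12)  # True

# For extreme z the naive form would hit log(0); the closed forms stay finite.
print(log_h(-1000.0))           # -1000.0
print(log_one_minus_h(1000.0))  # -1000.0
```

One caveat of this sketch: for strongly negative `z`, `log_one_minus_h` subtracts two nearly equal numbers, so tiny values round to zero. That is harmless for illustration but worth knowing.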

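$$
\begin{aligned}
J(\theta) &= -\frac{1}{m}\sum_{i=1}^{m}\left[ -y^{(i)}\left(\log\left(1+e^{-\theta^T x^{(i)}}\right)\right) + \left(1-y^{(i)}\right)\left(-\theta^T x^{(i)} - \log\left(1+e^{-\theta^T x^{(i)}}\right)\right) \right] \\
&= -\frac{1}{m}\sum_{i=1}^{m}\left[ y^{(i)}\theta^T x^{(i)} - \theta^T x^{(i)} - \log\left(1+e^{-\theta^T x^{(i)}}\right) \right] \\
&= -\frac{1}{m}\sum_{i=1}^{m}\left[ y^{(i)}\theta^T x^{(i)} - \log e^{\theta^T x^{(i)}} - \log\left(1+e^{-\theta^T x^{(i)}}\right) \right] && \text{(by ③)} \\
&= -\frac{1}{m}\sum_{i=1}^{m}\left[ y^{(i)}\theta^T x^{(i)} - \left(\log e^{\theta^T x^{(i)}} + \log\left(1+e^{-\theta^T x^{(i)}}\right)\right) \right] \\
&= -\frac{1}{m}\sum_{i=1}^{m}\left[ y^{(i)}\theta^T x^{(i)} - \log\left(1+e^{\theta^T x^{(i)}}\right) \right] && \text{(by ②)}
\end{aligned}
$$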
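The simplification above can be checked numerically by evaluating the first and last forms of $J(\theta)$ on the same inputs. This is an illustrative check under stated assumptions: the names `loss_original` and `loss_simplified` are hypothetical, the values in `zs` stand in for $\theta^T x^{(i)}$, and the numbers are made up for the example.

```python
import math
from typing import Sequence


def loss_original(zs: Sequence[float], ys: Sequence[int]) -> float:
    """-(1/m) * sum_i [ y_i log h_i + (1 - y_i) log(1 - h_i) ], h_i = sigmoid(z_i)."""
    total = 0.0
    for z, y in zip(zs, ys):
        h = 1.0 / (1.0 + math.exp(-z))
        total += y * math.log(h) + (1 - y) * math.log(1.0 - h)
    return -total / len(zs)


def loss_simplified(zs: Sequence[float], ys: Sequence[int]) -> float:
    """-(1/m) * sum_i [ y_i z_i - log(1 + e^{z_i}) ], the last line of the derivation."""
    total = 0.0
    for z, y in zip(zs, ys):
        total += y * z - math.log(1.0 + math.exp(z))
    return -total / len(zs)


# Made-up values of theta^T x for three samples, with their labels.
zs = [-2.0, 0.5, 3.0]
ys = [0, 1, 1]
difference = abs(loss_original(zs, ys) - loss_simplified(zs, ys))
print(difference < 1e-12)  # True: the two forms agree up to rounding
```

The simplified form needs one logarithm and one exponential per sample instead of two logarithms, which is part of why it is the more convenient starting point for differentiation.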

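$$
\begin{aligned}
\frac{\partial}{\partial\theta_j}J(\theta) &= \frac{\partial}{\partial\theta_j}\left( \frac{1}{m}\sum_{i=1}^{m}\left[ \log\left(1+e^{\theta^T x^{(i)}}\right) - y^{(i)}\theta^T x^{(i)} \right] \right) \\
&= \frac{1}{m}\sum_{i=1}^{m}\left[ \frac{\partial}{\partial\theta_j}\log\left(1+e^{\theta^T x^{(i)}}\right) - \frac{\partial}{\partial\theta_j}\left(y^{(i)}\theta^T x^{(i)}\right) \right] \\
&= \frac{1}{m}\sum_{i=1}^{m}\left( \frac{x_j^{(i)} e^{\theta^T x^{(i)}}}{1+e^{\theta^T x^{(i)}}} - y^{(i)} x_j^{(i)} \right) \\
&= \frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_j^{(i)}
\end{aligned}
$$

The last step uses $\frac{e^{\theta^T x^{(i)}}}{1+e^{\theta^T x^{(i)}}} = \frac{1}{1+e^{-\theta^T x^{(i)}}} = h_\theta(x^{(i)})$, obtained by dividing numerator and denominator by $e^{\theta^T x^{(i)}}$.

$$
\frac{\partial}{\partial\theta_j}J(\theta) = \frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_j^{(i)}
$$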
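One way to gain confidence in the final gradient formula is to compare it with a finite-difference approximation of $J(\theta)$. The sketch below does this in plain Python. Everything in it is illustrative: the function names, the toy data, and the step size `eps` are assumptions made for the example, and each input is assumed to carry a leading 1 for the intercept.

```python
import math
from typing import List, Sequence


def loss(theta: Sequence[float], xs: Sequence[Sequence[float]], ys: Sequence[int]) -> float:
    """J(theta) in its simplified form: -(1/m) * sum_i [ y_i z_i - log(1 + e^{z_i}) ]."""
    total = 0.0
    for x, y in zip(xs, ys):
        z = sum(t * xj for t, xj in zip(theta, x))  # theta^T x
        total += y * z - math.log(1.0 + math.exp(z))
    return -total / len(xs)


def gradient(theta: Sequence[float], xs: Sequence[Sequence[float]], ys: Sequence[int]) -> List[float]:
    """Analytic gradient: dJ/dtheta_j = (1/m) * sum_i (h_i - y_i) * x_ij."""
    grad = [0.0] * len(theta)
    for x, y in zip(xs, ys):
        z = sum(t * xj for t, xj in zip(theta, x))
        h = 1.0 / (1.0 + math.exp(-z))
        for j, xj in enumerate(x):
            grad[j] += (h - y) * xj
    return [g / len(xs) for g in grad]


def numerical_gradient(theta, xs, ys, eps: float = 1e-6) -> List[float]:
    """Central finite differences, used only as an independent check."""
    approx = []
    for j in range(len(theta)):
        plus = list(theta)
        minus = list(theta)
        plus[j] += eps
        minus[j] -= eps
        approx.append((loss(plus, xs, ys) - loss(minus, xs, ys)) / (2.0 * eps))
    return approx


# Toy data (made up): first component of each x is the constant 1.
xs = [[1.0, 0.5], [1.0, -1.5], [1.0, 2.0]]
ys = [1, 0, 1]
theta = [0.1, -0.2]

analytic = gradient(theta, xs, ys)
numeric = numerical_gradient(theta, xs, ys)
print(max(abs(a - n) for a, n in zip(analytic, numeric)) < 1e-6)  # True
```

A single gradient-descent update would then be `theta[j] -= learning_rate * analytic[j]` for each `j`, where `learning_rate` is a step size you choose.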