Reference:
https://en.wikipedia.org/wiki/Cross_entropy
https://d2l.ai/chapter_linear-networks/softmax-regression.html#loss-function
Definition: Cross-Entropy
The cross-entropy of the distribution $q$ relative to a distribution $p$ over a given set is defined as follows:
$$H(p,q)=-E_p[\log q]\tag{1}$$
where $E_p[\cdot]$ is the expected value operator with respect to the distribution $p$.
For discrete probability distributions $p$ and $q$ with the same support $\mathcal X$ this means:
$$H(p,q)=-\sum_{x\in\mathcal X}p(x)\log q(x)\tag{2}$$
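As a quick sanity check of Eq. (2), here is a minimal NumPy sketch; the distributions `p` and `q` below are made-up illustrative values, not taken from the text:

```python
# Minimal sketch of Eq. (2): cross-entropy of two discrete distributions
# with the same support. p and q are illustrative values (assumptions).
import numpy as np

def cross_entropy(p, q, eps=1e-12):
    """H(p, q) = -sum_x p(x) * log q(x); eps guards against log(0)."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return -np.sum(p * np.log(q + eps))

p = np.array([0.7, 0.2, 0.1])   # "true" distribution p
q = np.array([0.5, 0.3, 0.2])   # estimated distribution q
print(cross_entropy(p, q))       # H(p, q)
print(cross_entropy(p, p))       # reduces to the entropy H(p) when q = p
```

Note that $H(p,q)\ge H(p,p)=H(p)$, with equality only when $q=p$ (Gibbs' inequality).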
The situation for continuous distributions is analogous:
$$H(p,q)=-\int_{\mathcal X}P(x)\log Q(x)\,dr(x)\tag{3}$$
where $P$ and $Q$ are the densities of $p$ and $q$ with respect to a reference measure $r$ on $\mathcal X$.
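Eq. (3) can also be checked numerically. The sketch below assumes two normal densities (chosen purely for illustration) and the Lebesgue measure as the reference measure $r$:

```python
# Rough numerical check of Eq. (3), with the Lebesgue measure as r and
# two normal densities chosen purely for illustration (assumptions).
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

p = norm(loc=0.0, scale=1.0)   # density P(x)
q = norm(loc=1.0, scale=2.0)   # density Q(x)

# H(p, q) = -integral of P(x) * log Q(x) dx, truncated to a wide interval
h_pq, _ = quad(lambda x: -p.pdf(x) * q.logpdf(x), -20, 20)
print(h_pq)
```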
N.B.: The notation $H(p,q)$ is also used for the joint entropy of $p$ and $q$.
Relation to Log-likelihood
In classification problems we want to estimate the probability of different outcomes. Suppose that the entire dataset $\{\mathbf X,\mathbf y\}$ has $N$ samples, where the sample indexed by $i$ consists of a feature vector $\mathbf x^{(i)}$ and a label $y^{(i)}$. Let the estimated probability of outcome $k\in\mathcal K$ be $\hat p(y=k\mid\mathbf x;\mathbf w)$ and let the frequency (empirical probability) of outcome $k$ in the training set be $q(y=k\mid\mathbf x)$. The likelihood of the parameters $\mathbf w$ is
$$L(\mathbf w)=\prod_{k\in\mathcal K}(\text{est. prob. of }k)^{\text{num. of occurrences of }k}\tag{4}$$
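To make the connection concrete, the sketch below fixes a single $\mathbf x$ (so $\hat p$ depends only on the class $k$) and checks numerically that the per-sample negative log-likelihood equals the cross-entropy $H(q,\hat p)$; the labels and probabilities are illustrative assumptions, not values from the text:

```python
# Sketch of the likelihood/cross-entropy link for a single fixed x,
# so \hat p depends only on the class k. Data below are assumptions.
import numpy as np

labels = np.array([0, 0, 1, 2, 0, 1])          # y^{(i)}, K = 3 classes
p_hat  = np.array([0.6, 0.3, 0.1])             # estimated \hat p(y = k)

N = len(labels)
counts = np.bincount(labels, minlength=3)      # num. of occurrences of each k
q = counts / N                                 # empirical frequencies q(y = k)

# Log of Eq. (4): sum_k (num. of occurrences of k) * log(est. prob. of k)
log_likelihood = np.sum(counts * np.log(p_hat))

# Per-sample negative log-likelihood equals the cross-entropy H(q, \hat p)
nll_per_sample = -log_likelihood / N
cross_entropy  = -np.sum(q * np.log(p_hat))
print(nll_per_sample, cross_entropy)           # the two values match
```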