Introduction
- We propose a novel, projection-based way to incorporate the conditional information into the discriminator of GANs that respects the role of the conditional information in the underlying probabilistic model (i.e., a function that measures the information-theoretic distance between the generative distribution and the target distribution).
- By construction, any assumption about the form of the distribution would act as a regularization on the choice of the discriminator. In this paper, we propose a specific form of the discriminator, a form motivated by a probabilistic model in which the distribution of the conditional variable $\boldsymbol y$ given $\boldsymbol x$ is discrete or a unimodal continuous distribution.
- As we will explain in the next section, adhering to this assumption will give rise to a structure of the discriminator that requires us to take an inner product between the embedded condition vector $\boldsymbol y$ and the feature vector (Figure 1d).
The Architecture of the cGAN Discriminator with a Probabilistic Model Assumption
Notation
- $\boldsymbol x$: input vector
- $\boldsymbol y$: conditional information (when $\boldsymbol y$ is discrete label information, we can assume it is encoded as a one-hot vector)
- $D(\boldsymbol x, \boldsymbol y; \theta) := \mathcal A(f(\boldsymbol x, \boldsymbol y; \theta))$: cGAN discriminator, where $\mathcal A$ is an activation function
- $q$: the true distribution
- $p$: the generated distribution
$f^*(\boldsymbol x, \boldsymbol y)$
- The standard adversarial loss for the discriminator is given by:
$$\mathcal L(D) = -\mathbb E_{q(\boldsymbol y)}\left[\mathbb E_{q(\boldsymbol x \mid \boldsymbol y)}\left[\log D(\boldsymbol x, \boldsymbol y)\right]\right] - \mathbb E_{p(\boldsymbol y)}\left[\mathbb E_{p(\boldsymbol x \mid \boldsymbol y)}\left[\log (1 - D(\boldsymbol x, \boldsymbol y))\right]\right],$$
with $\mathcal A$ in $D$ representing the sigmoid function.
- As in the derivation for the original GAN, if we assume that $D$ can represent an arbitrary function, we can derive the optimal discriminator $D^*(x,y)$:
$$D^*(x,y)=\frac{q(x,y)}{q(x,y)+p(x,y)}$$
Since the activation function is assumed to be the sigmoid, we have
$$\mathcal A(f(x,y;\theta))=\frac{1}{1+\exp(-f^*(x,y))}=D^*(x,y)=\frac{q(x,y)}{q(x,y)+p(x,y)},$$
and therefore
$$f^*(\boldsymbol x,\boldsymbol y)=\log\frac{q(\boldsymbol x,\boldsymbol y)}{p(\boldsymbol x,\boldsymbol y)}=\log \frac{q(\boldsymbol{x} \mid \boldsymbol{y}) q(\boldsymbol{y})}{p(\boldsymbol{x} \mid \boldsymbol{y}) p(\boldsymbol{y})}=\log \frac{q(\boldsymbol{y} \mid \boldsymbol{x})}{p(\boldsymbol{y} \mid \boldsymbol{x})}+\log \frac{q(\boldsymbol{x})}{p(\boldsymbol{x})}:=r(\boldsymbol{y} \mid \boldsymbol{x})+r(\boldsymbol{x})$$
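The two identities above can be checked numerically on small discrete toy distributions (the table sizes and seed below are arbitrary choices for illustration, not from the paper):

```python
import numpy as np

# Toy joint distributions q(x, y) and p(x, y) over 4 x-values and 3 y-values.
rng = np.random.default_rng(0)
q = rng.random((4, 3)); q /= q.sum()
p = rng.random((4, 3)); p /= p.sum()

# Marginals q(x), p(x), kept as column vectors for broadcasting.
q_x, p_x = q.sum(axis=1, keepdims=True), p.sum(axis=1, keepdims=True)

# f*(x, y) = log q(x,y)/p(x,y) should equal r(y|x) + r(x).
f_star = np.log(q / p)
decomp = np.log((q / q_x) / (p / p_x)) + np.log(q_x / p_x)
assert np.allclose(f_star, decomp)

# And sigmoid(f*) recovers the optimal discriminator q / (q + p).
assert np.allclose(1.0 / (1.0 + np.exp(-f_star)), q / (q + p))
```

Both assertions hold exactly (up to floating point), since they are algebraic identities in $q$ and $p$.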
Motivation behind the Projection Discriminator
Log-linear model
- The log-linear model is the most popular model for $p(y|x)$. Assume that $y$ is a categorical variable taking a value in $\{1, \dots, C\}$.
- If we use a softmax to compute the probability that $x$ belongs to each class, then
$$p(y=c|x)=\frac{\exp(o_c)}{\sum_{j=1}^C\exp(o_j)},$$
where $o_j$ is the output of the final fully-connected layer of a neural network. We can decompose this output into the product of the layer's weight matrix $V^{pT}$ (size $C \times d^L$; the superscript $p$ indicates that these weights parameterize the distribution $p$) with the input vector $\phi(x)$ (size $d^L \times 1$, the feature extracted from $x$). The vector $o$ of class logits can then be written as $o = V^{pT}\phi(x)$, with each $o_j$ given by:
$$o_j=v_j^{pT}\phi(x).$$
Therefore,
$$\begin{aligned}\log p(y=c|x)&=\log \frac{\exp(v_c^{pT}\phi(x))}{\sum_{j=1}^C\exp(v_j^{pT}\phi(x))} \\&=v_c^{pT}\phi(x)-\log\Big(\sum_{j=1}^C\exp(v_j^{pT}\phi(x))\Big) \end{aligned}$$
Defining $Z^p(\phi(x)):=\sum_{j=1}^C\exp(v_j^{pT}\phi(x))$, we get
$$\log p(y=c|x)=v_c^{pT}\phi(x)-\log Z^p(\phi(x))$$
- Assuming that $\log q(y=c|x)$ can also be expressed in the form above, with the same $\phi$, the log-likelihood ratio becomes
$$\begin{aligned}\log\frac{q(y=c|x)}{p(y=c|x)}&=v_c^{qT}\phi(x)-\log Z^q(\phi(x))-v_c^{pT}\phi(x)+\log Z^p(\phi(x)) \\&=(v_c^{q}-v_c^{p})^T\phi(x)-\big(\log Z^q(\phi(x))-\log Z^p(\phi(x))\big) \end{aligned}$$
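This ratio identity can also be verified numerically under the shared-$\phi$ assumption (the matrices `Vq`, `Vp` and the feature `phi` below are random stand-ins, not values from the paper):

```python
import numpy as np

rng = np.random.default_rng(1)
C, dL = 3, 5
Vq = rng.normal(size=(C, dL))   # rows are v_c^{qT}, the weights modeling q(y|x)
Vp = rng.normal(size=(C, dL))   # rows are v_c^{pT}, the weights modeling p(y|x)
phi = rng.normal(size=dL)       # shared feature vector phi(x)

def log_softmax(logits):
    # Numerically stable log-softmax: logits - logsumexp(logits).
    z = logits - logits.max()
    return z - np.log(np.exp(z).sum())

log_q = log_softmax(Vq @ phi)   # log q(y=c | x) for all c
log_p = log_softmax(Vp @ phi)   # log p(y=c | x) for all c

# log Z^q and log Z^p, the log-partition functions.
log_Zq = np.log(np.exp(Vq @ phi).sum())
log_Zp = np.log(np.exp(Vp @ phi).sum())

# Right-hand side: (v_c^q - v_c^p)^T phi(x) - (log Z^q - log Z^p).
rhs = (Vq - Vp) @ phi - (log_Zq - log_Zp)
assert np.allclose(log_q - log_p, rhs)
```

The assertion holds because both sides are the same expression after the $\log$ of the softmax is expanded; the partition terms are what $\psi$ will later absorb.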
- Substituting the log-linear model into $f^*(x,y)$ gives
$$\begin{aligned}f^*(x,y=c)&=\log \frac{q({y=c} \mid {x})}{p({y=c} \mid {x})}+\log \frac{q({x})}{p({x})} \\&=(v_c^{q}-v_c^{p})^T\phi(x)-\big(\log Z^q(\phi(x))-\log Z^p(\phi(x))\big)+\log \frac{q({x})}{p({x})} \end{aligned}$$
- Let $v_c := v_c^{q}-v_c^{p}$ and $\psi(\phi(x)) := -\big(\log Z^q(\phi(x))-\log Z^p(\phi(x))\big)+\log \frac{q(x)}{p(x)}$; then
$$f^*(x,y=c)=v_c^T\phi(x)+\psi(\phi(x))$$
- Let $V$ be the matrix whose $c$-th row is $v_c^T$. Since $y$ is a one-hot vector,
$$f^*(x,y)=y^TV\phi(x)+\psi(\phi(x))=(V^Ty)\cdot \phi(x)+\psi(\phi(x))$$
This yields the structure shown in the figure below (the left path can be seen as judging whether $x$ is real, and the right path as judging whether $x$ belongs to class $y$):
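As a minimal numpy sketch of this output head (not the paper's code: in the actual architecture $\phi$ is a convolutional feature extractor, $V$ a learned class-embedding matrix, and $\psi$ a scalar-valued network on $\phi(x)$; here all three are random placeholders):

```python
import numpy as np

rng = np.random.default_rng(2)
C, dL = 10, 64
V = rng.normal(size=(C, dL))    # class-embedding matrix, rows are v_c^T
w_psi = rng.normal(size=dL)     # weights of a linear stand-in for psi

def f_projection(phi_x, y_onehot):
    # (V^T y) . phi(x): inner product between the embedded condition
    # and the feature vector -- the "does x belong to class y?" branch.
    projection = (V.T @ y_onehot) @ phi_x
    # psi(phi(x)): the unconditional "is x real?" branch.
    psi = w_psi @ phi_x
    return projection + psi

phi_x = rng.normal(size=dL)     # pretend output of the feature extractor
y = np.zeros(C); y[3] = 1.0     # one-hot condition for class 3
out = f_projection(phi_x, y)

# Equivalent matrix form y^T V phi(x) + psi(phi(x)):
assert np.isclose(out, y @ V @ phi_x + w_psi @ phi_x)
```

Because $y$ is one-hot, `(V.T @ y)` simply selects the embedding row $v_c$ for the conditioned class, which is why this head is cheap regardless of $C$.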
We refer to this model of the discriminator as projection for short.