GAN (Generative Adversarial Network)

连理o

Last modified: 2023-03-17 15:24:41

Category: Generative Models. Tags: GAN

First published: 2021-07-30 16:34:22

Article link: https://blog.csdn.net/weixin_42437114/article/details/118935205

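This article gives an accessible introduction to generative adversarial networks (GANs): the basic concepts, the theory behind them, and applications such as conditional generation and image translation. It discusses challenges in GAN training, such as mode collapse and the choice of divergence measure, and introduces techniques for improving GANs, including Wasserstein GAN and spectral normalization. It also covers unsupervised conditional generation and criteria for evaluating GANs.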

Basic Idea of GAN

All Kinds of GAN…

Basic Idea of GAN

Unconditional Generation

Conditional Generation

We control what is generated (e.g. generate an image from a text description, or generate one image from another, as in style transfer…)

Generator

Generator: a neural network (NN), or a function
- Input: a vector; Each dimension of input vector may represent some characteristics
- Output: a high dimensional vector (image / sentence)
Generator outputs a complex distribution: The data we want to generate has a distribution $P_{data}(x)$ ; A generator $G$ is a network. The network defines a probability distribution $P_G(x)$ .
But it is difficult to compute $P_G(x)$ . We can only sample from the distribution. $\Rightarrow$ Hard to measure the closeness between $P_G(x)$ and $P_{data}(x)$ $\Rightarrow$ We need discriminator!

Discriminator

Discriminator: a neural network (NN), or a function
- Input: a high dimensional vector (image / sentence)
- Output: a scalar (Larger value means real, smaller value means fake)

Generator and Discriminator

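First we prepare a dataset of real images. Generator v1 then generates a batch of images from input vectors; since its parameters are randomly initialized, these images are essentially arbitrary outputs. We then train Discriminator v1 to tell generated images from real ones. After training Discriminator v1, we switch to training Generator v1 so that its images fool Discriminator v1 as much as possible (i.e. it generates images that Discriminator v1 scores highly), which gives Generator v2…
Repeating this process yields better and better Generators and Discriminators… Eventually the Generator can produce realistic images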

This is where the term “adversarial” comes from.

Algorithm

Initialize generator and discriminator
In each training iteration:
- Step 1: Fix generator $G$ , and update discriminator $D$ (Discriminator learns to assign high scores to real objects and low scores to generated objects)
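- Step 2: Fix discriminator $D$ , and update generator $G$ (Generator learns to “fool” the discriminator); How to implement? Chain the Generator and the Discriminator together and treat them as one large network whose final output should be as large as possible. Note that during this update only the parameters of the first several hidden layers, those belonging to the Generator, are adjusted
Mathematical description of the algorithm: training amounts to binary classification, so the loss function is the cross-entropy loss

Note the form of $\tilde V$ : for the argument of $\log$ to be valid, a sigmoid must be added as the last layer of $D$

Note: the input vector is sampled from some distribution (Uniform distribution, Gaussian distribution…); which distribution to sample from is itself a hyperparameter to tune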

GAN as structured learning

Structured Learning / Prediction

Output is composed of components with dependency (e.g. output a sequence, a matrix, a graph, a tree …)

Why is Structured Learning Challenging?

One-shot / Zero-shot Learning:
- In classification, each class has some examples.
- In structured learning, if you consider each possible output as a “class”, since the output space is huge, most “classes” do not have any training data. So machine has to create new stuff during testing.
Machine has to learn to do planning: Machine generates objects component-by-component, but it should have a big picture in its mind. (Because the output components have dependency, they should be considered globally.)

Structured Learning Approach

GAN: Anime Character Face Generation

Source of images: http://zhuanlan.zhihu.com/p/24767059
DCGAN (Deep CNN GAN): http://github.com/carpedm20/DCGAN-tensorflow

In 2019, with StyleGAN ……
Progressive GAN: Progressive Growing of GANs for Improved Quality, Stability, and Variation
Today …… BigGAN: Large Scale GAN Training for High Fidelity Natural Image Synthesis

Theory behind GAN

Generator (Defines a distribution $P_G(x)$ )

A generator $G$ is a network. The network defines a probability distribution $P_G(x;\theta)$

Our objective
$G^*=\argmin_GDiv(P_G,P_{data})$

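Given a data distribution $P_{data}(x)$ . We have a distribution $P_G(x;\theta)$ parameterized by $\theta$ . We want to find $\theta$ such that $P_G(x;\theta)$ is close to $P_{data}(x)$
Ideally, we would make the KL divergence between $P_{data}(x)$ and $P_G(x;\theta)$ as small as possible; minimizing the KL divergence is equivalent to maximizing the likelihood. However, all we can do with $P_{data}(x)$ and $P_G(x;\theta)$ is sample from them. We cannot easily obtain their actual probabilities, so the KL divergence is hard to compute directly. This is why the Discriminator is introduced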

Discriminator (Evaluates the “difference” between $P_G(x)$ and $P_{data}(x)$ )

Discriminator $D$ evaluates the “difference” between $P_G(x)$ and $P_{data}(x)$

intuition: small divergence $\Rightarrow$ hard to discriminate (cannot make objective large); large divergence $\Rightarrow$ easy to discriminate

How to compute the divergence? - Sampling is good enough ……

Although we do not know the distributions of $P_G$ and $P_{data}$ , we can sample from them.

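Our objective:
$V(G,D)=E_{x \sim P_{data}}[\log D(x)]+E_{x \sim P_G}[\log(1-D(x))]$ , $D^*=\argmax_DV(G,D)$

Given $G$ , what is the optimal $D^*$ maximizing $V(G,D)=\int_x\left[P_{data}(x)\log D(x)+P_G(x)\log(1-D(x))\right]dx$ ? (Assume that $D (x)$ can be any function)
Given $x$ , the optimal $D^*$ maximizes $P_{data}(x)\log D(x)+P_G(x)\log(1-D(x))$ (Since $D (x)$ can be any function, each $x$ can be treated separately)
i.e. Find $D^*$ maximizing: $f(D)=a\log(D)+b\log(1-D)$ with $a=P_{data}(x)$ and $b=P_G(x)$ (a concave function). Setting $\frac{df}{dD}=\frac{a}{D}-\frac{b}{1-D}=0$ gives $D^*(x)=\frac{a}{a+b}=\frac{P_{data}(x)}{P_{data}(x)+P_G(x)}$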
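Note that the output of $D^*$ lies between 0 and 1, which matches the practice of adding a sigmoid after $D (x)$ . Moreover, when $G$ is optimal so that $P_G(x)=P_{data}(x)$ , we have $D(x)=\frac{1}{2}$ , and the Discriminator cannot tell real images from generated ones. Training a Discriminator this way is also a common recipe for telling two distributions apart: sample data from both distributions and train a binary classifier $D$ ; if $D$ can tell which distribution a sample came from, the two distributions are not the same (e.g. this can be used to check whether training data and test data follow the same distribution)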
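As a quick numerical sanity check of the pointwise maximization discussed here, the sketch below brute-forces the maximizer of $a\log D+b\log(1-D)$ on a grid and compares it with $a/(a+b)$. The values of `a` and `b` are made-up density values at a single point, and the helper names are hypothetical.

```python
import math

def objective(d, a, b):
    """f(D) = a*log(D) + b*log(1 - D) for one fixed x."""
    return a * math.log(d) + b * math.log(1.0 - d)

def argmax_on_grid(a, b, steps=10000):
    """Brute-force maximizer of f over a grid in the open interval (0, 1)."""
    candidates = [i / steps for i in range(1, steps)]
    return max(candidates, key=lambda d: objective(d, a, b))

# Hypothetical density values at one point x: a = P_data(x), b = P_G(x).
a, b = 0.3, 0.1
d_grid = argmax_on_grid(a, b)
d_closed_form = a / (a + b)
print(d_grid, d_closed_form)
```

The grid maximizer should agree with the closed form up to the grid spacing, and it always lies strictly between 0 and 1.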
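Next we substitute $D^*$ into $V (G, D)$ to obtain $\max_DV(G,D)$ , which turns out to be related to the Jensen-Shannon divergence. In other words, substituting the $D^*$ obtained by training $D$ into $V$ (the objective function) gives the JS divergence, up to a constant factor and offset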
$\begin{aligned} &\max _{D} V(G, D)=V\left(G, D^{*}\right) \\ =&E_{x \sim P_{\text {data }}}\left[\log \frac{P_{\text {data }}(x)}{P_{\text {data }}(x)+P_{G}(x)}\right]+E_{x \sim P_{\text {G}}}\left[\log \frac{P_{\text {G}}(x)}{P_{\text {data }}(x)+P_{G}(x)}\right] \\=&\int_{x} P_{d a t a}(x) \log \frac{P_{\text {data }}(x)}{P_{\text {data }}(x)+P_{G}(x)} d x+\int_{x} P_{G}(x) \log \frac{P_{\text {G}}(x)}{P_{\text {data }}(x)+P_{G}(x)} d x \\=&\int_{x} P_{d a t a}(x) \log \frac{\frac{1}{2}P_{\text {data }}(x)}{(P_{\text {data }}(x)+P_{G}(x))/2} d x+\int_{x} P_{G}(x) \log \frac{\frac{1}{2}P_{\text {G}}(x)}{(P_{\text {data }}(x)+P_{G}(x))/2} d x \\=&\int_{x} P_{d a t a}(x)[ \log \frac{P_{\text {data }}(x)}{(P_{\text {data }}(x)+P_{G}(x))/2} -\log2]d x+\int_{x} P_{G}(x) [\log \frac{P_{\text {G}}(x)}{(P_{\text {data }}(x)+P_{G}(x))/2}-\log2] d x \\=&-2\log2+\int_{x} P_{d a t a}(x) \log \frac{P_{\text {data }}(x)}{(P_{\text {data }}(x)+P_{G}(x))/2} d x+\int_{x} P_{G}(x) \log \frac{P_{\text {G}}(x)}{(P_{\text {data }}(x)+P_{G}(x))/2} d x \\=&-2 \log 2+\mathrm{KL}\left(P_{\text {data }} \| \frac{P_{\text {data }}+P_{G}}{2}\right)+\mathrm{KL}\left(P_{G} \| \frac{P_{\text {data }}+P_{G}}{2}\right) \\ =&-2 \log 2+2 J S D\left(P_{\text {data }} \| P_{G}\right) \quad\text{{Jensen-Shannon divergence}} \end{aligned}$

JS divergence $\in[0,\log2]$
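The identity $V(G,D^*)=-2\log2+2JSD(P_{data}\|P_G)$ and the range $[0,\log 2]$ can be checked numerically on small discrete distributions. This is a minimal sketch with made-up four-point distributions; the function names are illustrative only.

```python
import math

def kl(p, q):
    """KL(p || q) for discrete distributions; terms with p(x) = 0 contribute 0.
    Only used here with q = mixture, so q(x) > 0 whenever p(x) > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jsd(p, q):
    """Jensen-Shannon divergence: average KL to the mixture (p + q) / 2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def value_at_optimal_d(p_data, p_g):
    """V(G, D*) with D*(x) = p_data(x) / (p_data(x) + p_g(x))."""
    total = 0.0
    for pd, pg in zip(p_data, p_g):
        if pd > 0:
            total += pd * math.log(pd / (pd + pg))
        if pg > 0:
            total += pg * math.log(pg / (pd + pg))
    return total

p_data = [0.5, 0.5, 0.0, 0.0]
p_same = [0.5, 0.5, 0.0, 0.0]          # identical to p_data
p_disjoint = [0.0, 0.0, 0.5, 0.5]      # no overlap with p_data
p_overlap = [0.25, 0.25, 0.25, 0.25]   # partial overlap

for name, p_g in [("same", p_same), ("disjoint", p_disjoint), ("overlap", p_overlap)]:
    print(name, jsd(p_data, p_g), value_at_optimal_d(p_data, p_g))
```

Identical distributions give a JS divergence of 0, and distributions with disjoint support give $\log 2$, no matter how far apart they are.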

Algorithm

Our objective
$G^*=\argmin_GDiv(P_G,P_{data})=\argmin_G\ \max _{D} V(G, D)$

How to find $G^*$ ?

(1) Initialize generator and discriminator
(2) In each training iteration:
- Step 1: Fix generator $G$ , and update discriminator $D$ $\Rightarrow$ Given a generator $G$ , $\max _{D} V(G, D)$ evaluates the “difference” between $P_G$ and $P_{data}$
- Step 2: Fix discriminator $D$ , and update generator $G$ $\Rightarrow$ Pick the $G$ defining $P_G$ most similar to $P_{data}$

Notation

$L(G)=\max_DV(G,D)$ ; i.e. loss function for generator

Algorithm

Given $G_0$
Find $D_0^*$ maximizing $V(G_0,D)$ (Using Gradient Ascent): $V(G_0,D_0^*)$ is the JS divergence between $P_{data}(x)$ and $P_{G_0}(x)$
Obtain $G_1$ : Decrease JS divergence (?)
Find $D_1^*$ maximizing $V(G_1,D)$ : $V(G_1,D_1^*)$ is the JS divergence between $P_{data}(x)$ and $P_{G_1}(x)$
Obtain $G_2$ : Decrease JS divergence (?)
…

Decrease JS divergence (?)

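As flagged in the Algorithm above, training the Generator (gradient descent on $L (G)$ ) does not necessarily decrease the JS divergence. The reason is that once the Generator changes, the $V (G, D)$ computed with the same Discriminator no longer measures the JS divergence
Why, then, can gradient descent on $L (G)$ be viewed as decreasing the JS divergence? Because we add an assumption: $D_0^*\approx D^*_1$
- This assumption requires: Don’t update $G$ too much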

In practice, how to compute $\max_DV(G,D)$

We can use sampling to approximate expectation
Sample $\{x^1,x^2,...,x^m\}$ from $P_{data}(x)$ , sample $\{\tilde x^1,\tilde x^2,...,\tilde x^m\}$ from generator $P_G(x)$ , and maximize
$\tilde V=\frac{1}{m}\sum_{i=1}^m\log D(x^i)+\frac{1}{m}\sum_{i=1}^m\log(1-D(\tilde x^i))$
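The sample estimate of $V(G,D)$ (the quantity this article calls $\tilde V$) can be sketched as follows. The sketch assumes the discriminator produces a raw score (logit) that is passed through a sigmoid; the logits below are invented for illustration and do not come from a trained model.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def v_tilde(real_logits, fake_logits):
    """Sample estimate of V(G, D): mean log D(x) + mean log(1 - D(x~))."""
    real_term = sum(math.log(sigmoid(v)) for v in real_logits) / len(real_logits)
    fake_term = sum(math.log(1.0 - sigmoid(v)) for v in fake_logits) / len(fake_logits)
    return real_term + fake_term

# Hypothetical pre-sigmoid discriminator scores.
confident = v_tilde([4.0, 3.5, 5.0], [-4.0, -3.0, -5.0])  # D separates real from fake well
confused = v_tilde([0.0, 0.0, 0.0], [0.0, 0.0, 0.0])      # D outputs 1/2 everywhere
print(confident, confused)
```

A discriminator that outputs $\frac{1}{2}$ everywhere gives exactly $-2\log 2$, the value of $V(G,D^*)$ when the JS divergence is 0; a discriminator that separates the two batches pushes the estimate toward its upper limit of 0.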

Summary

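The Discriminator is trained in order to measure the JS divergence, so in theory we would train it to convergence in every iteration. In practice, $k$ steps of Gradient Ascent are enough to obtain a rough lower bound on the JS divergence; $D$ need not be trained to convergence (even if it were, it could still converge to a local optimum, or fail to reach the global optimum because of the limited capacity of $D$ ). Updating $D$ only once per iteration can already work well (the GAN paper uses $k = 1$ )
Recall the assumption made in the discussion of “Decrease JS divergence (?)”. To keep it valid, the Generator must not be updated too much, so in each iteration we take only 1 gradient descent step on the parameters of $G$ . When the parameters of $G$ change slowly, $D$ is also more likely to reach its optimum
Note that when training $G$ , the parameters of $D$ are fixed, so the first term of $\tilde V$ does not depend on $G$ ; only the second term of $\tilde V$ needs to be optimized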

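Note that the objective for $G$ here differs from the one introduced at the beginning. In the Basic Idea of GAN section, $D (G (z))$ is treated as a positive example and $G$ is trained with the cross-entropy loss, whereas here we directly minimize $V(G,D^*)=-2 \log 2+2 J S D\left(P_{\text {data }} \| P_{G}\right)$ , which has a clearer mathematical meaning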

Objective Function for Generator in Real Implementation

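Minimax GAN (MMGAN): at the start of Generator training, $D (G (z))$ is small, meaning the generated images cannot fool the Discriminator. In this regime the derivative of $\log(1-D(G(z)))$ is small (it saturates), so training is slow
Non-saturating GAN (NSGAN): to fix this, replace $\log(1-D(x))$ with $-\log(D(x))$ . The two have the same trend, but the latter trains faster at the start (this is in fact the same loss as in the Basic Idea of GAN section; it can be seen as training $G$ with the cross-entropy loss, but it lacks theoretical justification)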
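The difference between the two generator objectives is easiest to see in their gradients. The sketch below assumes $D=\sigma(v)$ for a raw discriminator score $v$ and compares the derivatives with respect to $v$ (the choice to differentiate with respect to the logit rather than $D$ itself is an assumption of this example).

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def grad_saturating(v):
    """d/dv of log(1 - sigmoid(v)), the MMGAN generator term."""
    return -sigmoid(v)

def grad_non_saturating(v):
    """d/dv of -log(sigmoid(v)), the NSGAN generator term."""
    return -(1.0 - sigmoid(v))

# Early in training the discriminator rejects generated samples: D(G(z)) is near 0.
for v in (-5.0, 0.0, 5.0):
    print(v, sigmoid(v), grad_saturating(v), grad_non_saturating(v))
```

Both gradients are negative, so minimizing either term pushes $v$ up (the same trend), but when $D(G(z))\approx 0$ the saturating gradient is close to 0 while the non-saturating one is close to $-1$.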

fGAN: General Framework of GAN

paper: f-GAN: Training Generative Neural Samplers using Variational Divergence Minimization
- One sentence: you can use any f-divergence (f-GAN lets us minimize many different divergences, although in practice the differences between them are fairly small)

f-divergence

f-divergence

$D_f(P||Q)=\int q(x) f\left(\frac{p(x)}{q(x)}\right)dx$ evaluates the difference of $P$ and $Q$
This form covers many of the measures between probability distributions that we have already seen:
$\begin{array}{c|c|c}\hline \textbf{Name} & \textbf{Formula} & \textbf{Corresponding }f\\ \hline \text{Total variation} & \frac{1}{2}\int | p(x) - q(x)| dx & \frac{1}{2}|u - 1|\\ \hline \text{KL divergence} & \int p(x)\log \frac{p(x)}{q(x)} dx & u \log u\\ \hline \text{Reverse KL divergence} & \int q(x)\log \frac{q(x)}{p(x)} dx & - \log u\\ \hline \text{Pearson }\chi^2 & \int \frac{(q(x) - p(x))^{2}}{p(x)} dx & \frac{(1 - u)^{2}}{u}\\ \hline \text{Neyman }\chi^2 & \int \frac{(p(x) - q(x))^{2}}{q(x)} dx & (u - 1)^{2}\\ \hline \text{Hellinger distance} & \int \left(\sqrt{p(x)} - \sqrt{q(x)}\right)^{2} dx & (\sqrt{u} - 1)^{2}\\ \hline \text{Jeffrey distance} & \int (p(x) - q(x))\log \left(\frac{p(x)}{q(x)}\right) dx & (u - 1)\log u\\ \hline \text{JS divergence} & \frac{1}{2}\int \left[p(x)\log \frac{2 p(x)}{p(x) + q(x)} + q(x)\log \frac{2 q(x)}{p(x) + q(x)}\right] dx & -\frac{u + 1}{2}\log \frac{1 + u}{2} + \frac{u}{2} \log u\\ \hline \end{array}$

$f$ 的性质

(1) They all map non-negative reals to reals ( $R^*\to R$ )
(2) $\boldsymbol{f(1) = 0}$ ; this property guarantees $D_f(P||P)=0$
(3) $f$ is convex; this property guarantees $D_f(P||Q)\geq0$ . Proof: when $f$ is convex, Jensen's inequality gives
$\begin{aligned}\int q(x) f\left(\frac{p(x)}{q(x)}\right)dx =& \mathbb{E}_{x\sim q(x)} \left[f\left(\frac{p(x)}{q(x)}\right)\right]\\ \geq& f\left(\mathbb{E}_{x\sim q(x)} \left[\frac{p(x)}{q(x)}\right]\right)\\ =& f\left(\int q(x) \frac{p(x)}{q(x)}dx\right)\\ =& f\left(\int p(x)dx\right)\\ =& f(1) = 0 \end{aligned}$ Of course, an $f$-divergence does not in principle guarantee $D_f(P\|Q)>0$ when $P \neq Q$ . But we usually choose a strictly convex $f$ (i.e. $f'' (u)$ is always greater than 0), which does guarantee $D_f(P\|Q)>0$ when $P \neq Q$ , so in that case $D_f(P\|Q)=0\Leftrightarrow P=Q$
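For discrete distributions the integral $\int q(x) f\left(\frac{p(x)}{q(x)}\right)dx$ becomes a sum, which makes it easy to check that a given $f$ recovers the familiar formula. A minimal sketch with made-up three-point distributions (it assumes $q(x)>0$ and $p(x)>0$ everywhere):

```python
import math

def f_divergence(p, q, f):
    """D_f(P || Q) = sum_x q(x) * f(p(x) / q(x)); assumes q(x) > 0 everywhere."""
    return sum(qi * f(pi / qi) for pi, qi in zip(p, q))

def f_kl(u):
    return u * math.log(u)

def f_reverse_kl(u):
    return -math.log(u)

def f_total_variation(u):
    return 0.5 * abs(u - 1.0)

p = [0.1, 0.4, 0.5]
q = [0.3, 0.3, 0.4]

# The same quantities computed directly from their usual definitions.
kl_direct = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
reverse_kl_direct = sum(qi * math.log(qi / pi) for pi, qi in zip(p, q))
tv_direct = 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))

print(f_divergence(p, q, f_kl), kl_direct)
print(f_divergence(p, q, f_reverse_kl), reverse_kl_direct)
print(f_divergence(p, q, f_total_variation), tv_direct)
```

Each pair of printed values should agree, and every divergence is 0 when both arguments are the same distribution, as properties (2) and (3) require.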

Fenchel Conjugate (凸共轭)

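Convex conjugate

Let $\mathbb{D}$ be the domain of a convex function $f$ . Pick any point $\xi$ ; the tangent line of $f (u)$ at $u = \xi$ always lies below the convex function, so
$f(u)\geq f(\xi) + f'(\xi)(u - \xi)$ Since the inequality always holds and equality can be attained, it follows that
$f(u)=\max_{\xi\in\mathbb{D}}\big\{f(\xi) - f'(\xi) \xi + f'(\xi)u\big\}$ Because $f (u)$ is convex, $\xi$ can be solved for once $f'(\xi)$ is known, so we can write $t=f'(\xi)$ ; then $g(t)=f'(\xi) \xi-f(\xi)$ can be viewed as a function of $t$ , and substituting into the expression above gives
$f(u)=\max_{t\in f'(\mathbb{D})}\big\{t u - g(t)\big\}$ Here $g(t)=f'(\xi) \xi-f(\xi)$ is the conjugate function of $f (u)$ . Look at the expression inside the braces: once $f$ is given, $g$ is determined too, and the whole expression is linear in $u$ . So overall, we have built a linear approximation of a convex function, and maximizing over the parameter inside recovers the original value
Note that for each given $u$ we must maximize over $t$ to get as close to $f (u)$ as possible; plugging in an arbitrary $t$ only guarantees a lower bound, with no control over the error. This is why it is called a “local variational method”: the maximization (the variational step) has to be carried out at every point (locally). We can therefore regard $t$ as a function of $u$ , i.e.
$f(u)=\max_{T\text{ is a function with range }f'(\mathbb{D})}\big\{T(u) u - g(T(u))\big\}$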

Conjugates of convex functions

$\begin{array}{c|c|c|c}\hline f(u) & \textbf{Conjugate }g(t) & f'(\mathbb{D}) & \textbf{Activation function}\\ \hline \frac{1}{2}|u - 1| & t & \left[-\frac{1}{2},\frac{1}{2}\right] & \frac{1}{2}\tanh(x)\\ \hline u \log u & e^{t-1} & \mathbb{R} & x\\ \hline -\log u & -1 - \log(-t) & \mathbb{R}_- & -e^{x}\\ \hline \frac{(1 - u)^{2}}{u} & 2 - 2\sqrt{1-t} & (-\infty, 1) & 1-e^x\\ \hline (u - 1)^{2} & \frac{1}{4}t^2+t & (-2,+\infty) & e^x-2\\ \hline (\sqrt{u} - 1)^{2} & \frac{t}{1-t} & (-\infty, 1) & 1-e^x\\ \hline (u - 1)\log u & W(e^{1-t})+\frac{1}{W(e^{1-t})}+t-2 & \mathbb{R} & x\\ \hline -\frac{u + 1}{2}\log \frac{1 + u}{2} + \frac{u}{2} \log u & -\frac{1}{2}\log(2-e^{2t}) & \left(-\infty,\frac{\log 2}{2}\right) & \frac{\log 2}{2}-\frac{1}{2}\log(1+e^{-x})\\ \hline \end{array}$

e.g. for $f(u)=u\log u$ , $g(t)=f'(\xi) \xi-f(\xi)=\xi$ and $t=f'(\xi)=1+\log\xi$ , so the conjugate function is $g(t)=\xi=e^{t-1}$
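The conjugate pair $f(u)=u\log u$ , $g(t)=e^{t-1}$ can be checked numerically: $tu-g(t)$ should never exceed $f(u)$ and should touch it at $t=f'(u)=1+\log u$ . A minimal sketch, with an arbitrary test point `u` chosen for illustration:

```python
import math

def f(u):
    return u * math.log(u)

def g(t):
    """Conjugate of f(u) = u log u."""
    return math.exp(t - 1.0)

def lower_bound(u, t):
    """The linear-in-u expression t*u - g(t), a lower bound on f(u) for any t."""
    return t * u - g(t)

u = 2.5
t_star = 1.0 + math.log(u)  # t = f'(u), where the bound is tight
trial_ts = [-1.0, 0.0, 0.5, 1.0, 3.0]

print(f(u), lower_bound(u, t_star))
print([lower_bound(u, t) for t in trial_ts])
```

The first line prints two equal values; every entry of the second line is smaller, which is the “only a lower bound unless $t$ is maximized” behavior described above.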

Estimating f-divergence

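What makes computing an $f$-divergence difficult? By definition,
${D}_f(P\Vert Q) = \int q(x) f\left(\frac{p(x)}{q(x)}\right)dx$ we need to know both probability distributions $P, Q$ to compute their $f$-divergence. In machine learning this is rarely possible: sometimes we know the analytic form of at most one of the distributions and have only samples from the other, and in many cases we know neither distribution and have only samples from each (i.e. we want to compare the similarity of two batches of samples). So the $f$-divergence cannot be computed directly from its definition
To solve this problem, replace $f$ in the definition with its conjugate form:
$\begin{aligned}{D}_f(P\Vert Q) =& \int q(x) \max_{T}\left[\frac{p(x)}{q(x)}T\left(\frac{p(x)}{q(x)}\right)-g\left(T\left(\frac{p(x)}{q(x)}\right)\right)\right]dx\\ =& \max_{T}\int\left[p(x)\cdot T\left(\frac{p(x)}{q(x)}\right)-q(x)\cdot g\left(T\left(\frac{p(x)}{q(x)}\right)\right)\right]dx\end{aligned}$ Writing $T\left(\frac{p(x)}{q(x)}\right)$ as $D (x)$ , we get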
$\begin{aligned}{D}_f(P\Vert Q) &= \max_{D}\int\left[p(x)\cdot D(x)-q(x)\cdot g\left(D(x)\right)\right]dx\\ &=\max_{D}\Big(\mathbb{E}_{x\sim p(x)}[D(x)]-\mathbb{E}_{x\sim q(x)}[g(D(x))]\Big) \end{aligned}$
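To see the variational form at work, the sketch below evaluates $\mathbb{E}_{x\sim p}[D(x)]-\mathbb{E}_{x\sim q}[g(D(x))]$ for the KL case ( $g(t)=e^{t-1}$ ) on made-up discrete distributions. With $D(x)=f'\left(\frac{p(x)}{q(x)}\right)=1+\log\frac{p(x)}{q(x)}$ the objective equals the KL divergence; any other choice of $D$ gives a smaller value. Exact expectations are used here instead of samples to keep the check deterministic.

```python
import math

def variational_objective(p, q, d_values):
    """E_{x~p}[D(x)] - E_{x~q}[g(D(x))] with g(t) = exp(t - 1) (the KL case)."""
    e_p = sum(pi * d for pi, d in zip(p, d_values))
    e_q = sum(qi * math.exp(d - 1.0) for qi, d in zip(q, d_values))
    return e_p - e_q

p = [0.1, 0.4, 0.5]
q = [0.3, 0.3, 0.4]
kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

d_optimal = [1.0 + math.log(pi / qi) for pi, qi in zip(p, q)]  # D(x) = f'(p(x)/q(x))
d_arbitrary = [0.0, 1.0, 2.0]                                   # some other discriminator

print(kl, variational_objective(p, q, d_optimal), variational_objective(p, q, d_arbitrary))
```

In a real f-GAN the optimal $D$ is unknown and is approached by gradient ascent on a neural network, so the value obtained is a lower bound on the divergence.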

f-GAN

Discriminator

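In $f$-divergence estimation we can approximate $D (x)$ with a NN; $D (x)$ is then the Discriminator, and its goal is to approximate $D_f(P_{data}||P_G)$
$\begin{aligned}D_f(P_{data}||P_G) &= \max_{D}\Big(\mathbb{E}_{x\sim P_{data}}[D(x)]-\mathbb{E}_{x\sim P_G}[g(D(x))]\Big) \end{aligned}$ where $g$ is the conjugate function of $f$ . In words: sample from the two distributions, compute the averages of $D (x)$ and $g (D (x))$ respectively, and optimize $D$ to make their difference as large as possible; the final result is an approximation of the $f$-divergence
Note that in the discussion of convex functions, the range of $D$ is restricted when maximizing the objective (i.e. to $f'(\mathbb D)$ ). So the last layer of $D$ must use a suitable activation function so that $D$ has the required range. The choice of activation function is not unique; reference choices are listed in the table above. Although in theory any such activation function will do, for ease of optimization a few principles should be followed: 1. its domain is $\mathbb{R}$ and its range is the required range (boundary points can be ignored); 2. prefer globally smooth functions over simple truncation, e.g. if the required range is $\mathbb{R}^+$ , do not use $\text{relu}(x)$ directly but consider $e^x$ or $\log(1+e^x)$ ; 3. choose the activation function so that its composition with $g$ , i.e. $g (D (x))$ , is relatively simple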

Generator

We can further write down the expression for $G^*$ : (Original GAN has different $V (G, D)$ )
$\begin{aligned} G^*&=\argmin_GD_f(P_{data}||P_G) \\&= \argmin_G\max_{D}\Big(\mathbb{E}_{x\sim P_{data}}[D(x)]-\mathbb{E}_{x\sim P_G}[g(D(x))]\Big) \\&= \argmin_G\max_{D}V(G,D) \end{aligned}$

f-GAN

Now, given the $f$-divergence we want to minimize, we can find the conjugate function $g$ of $f$ , obtain $V (G, D)$ , and train a GAN to minimize that $f$-divergence!
For example, for the JS divergence (vanilla GAN), $f(u)=-\frac{u + 1}{2}\log \frac{1 + u}{2} + \frac{u}{2} \log u$ , $g(t)=-\frac{1}{2}\log(2-e^{2t})$ , and the activation function is $\frac{\log 2}{2}-\frac{1}{2}\log(1+e^{-x})$ . Let $V (x)$ be the output of $D (x)$ before the activation function is applied; then
$\begin{aligned} D(x)&=\frac{\log 2}{2}-\frac{1}{2}\log\left(1+e^{-V(x)}\right) \\&=\frac{\log 2}{2} + \frac{1}{2}\log \sigma(V(x))\\ g(D(x))&=-\frac{1}{2}\log(2-e^{\log 2 +\log \sigma(V(x))}) \\&=-\frac{1}{2}\log(2-2 \sigma(V(x))) \\&=-\frac{1}{2}\log\frac{2e^{-V(x)}}{1+e^{-V(x)}} \\&=-\frac{\log 2}{2} - \frac{1}{2}\log \left(1-\sigma(V(x))\right) \end{aligned}$ Therefore,
$\begin{aligned} V(G,D)&=\mathbb{E}_{x\sim P_{data}}[D(x)]-\mathbb{E}_{x\sim P_G}[g(D(x))] \\&=\log2+ \frac{1}{2}\left\{\mathbb{E}_{x\sim P_{data}}[\log \sigma(V(x))]+\mathbb{E}_{x\sim P_G}[\log \left(1-\sigma(V(x))\right)] \right\} \end{aligned}$

So what problem was the $f$-divergence framework hoped to solve?

Mode Collapse, Mode Dropping

Mode Collapse

Mode Collapse: when training a GAN, the distribution of the real data is broad, but the distribution of the generated data is very narrow
- e.g. in image generation, the output keeps cycling through the same few images
Mode collapse is easy to detect.

Mode Dropping

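Mode Dropping: the real data distribution may have several modes, but the generated data covers only some of them. On the surface the generated data looks fine and seems diverse enough, but it actually covers only part of the real data
Mode missing: if the Discriminator gives an especially high score to some image in the database, that image may belong to a missing mode (the Generator never produces images like it)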

Why?

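Intuitively, Mode Collapse and Mode Dropping are easy to understand: once the Generator learns to produce a kind of image that always fools the Discriminator, it keeps generating that kind of image
Dive deeper: Flaw in Optimization? (just a guess…): when $P_{data}>0, P_G=0$ , the KL divergence $\rightarrow\infty$ , so minimizing the KL divergence pushes $P_G$ to cover all of $P_{data}$ ; Mode collapse does not occur, but the generated images tend to be of lower quality. If we minimize the Reverse KL instead, then when $P_{G}>0, P_{data}=0$ , the Reverse KL divergence $\rightarrow\infty$ , so the Generator may become quite conservative, which leads to Mode collapse
- However, experiments show that choosing a different divergence does not effectively alleviate Mode Collapse or Mode Dropping. The main value of f-GAN is therefore “unification”; from the generative-modeling point of view it is not a breakthrough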

Ensemble

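Ensembling can effectively avoid Mode Collapse and Mode Dropping. For example, to produce 25 images we can train 25 GANs and let each generate 1 image. Even if every GAN suffers from Mode Collapse or Mode Dropping, the 25 generated images will still differ from one another (to produce a single image, pick one Generator at random)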

Double-loop v.s. Single-step

Tips for Improving GAN

DCGANs

LSGAN

Another problem in the original GAN is that the final activation of $D$ is a sigmoid, which easily leads to vanishing gradients $\Rightarrow$ Least Squares GAN (LSGAN): Replace sigmoid with linear (replace classification with regression)

Wasserstein GAN (WGAN)

Improved WGAN (WGAN-GP)

$V (G, D)$ no longer contains a $\log$ , so a sigmoid is no longer needed to restrict the range of $D (x)$

Spectral Normalization (SNGAN)

paper: Spectral Normalization for Generative Adversarial Networks

Spectral Normalization → Keep gradient norm smaller than 1 everywhere
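One common way to realize this constraint is to divide each weight matrix by its largest singular value (its spectral norm), estimated by power iteration. The sketch below does this in plain Python on a small made-up matrix; it illustrates the idea only and is not the paper's exact training procedure (all function names are hypothetical).

```python
import math

def matvec(w, v):
    return [sum(wij * vj for wij, vj in zip(row, v)) for row in w]

def transpose(w):
    return [list(col) for col in zip(*w)]

def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def spectral_norm(w, iterations=100):
    """Estimate the largest singular value of w by power iteration."""
    wt = transpose(w)
    v = normalize([1.0] * len(w[0]))
    for _ in range(iterations):
        u = normalize(matvec(w, v))   # left singular vector estimate
        v = normalize(matvec(wt, u))  # right singular vector estimate
    return math.sqrt(sum(x * x for x in matvec(w, v)))

def spectral_normalize(w):
    """Rescale w so that its largest singular value is 1."""
    sigma = spectral_norm(w)
    return [[wij / sigma for wij in row] for row in w]

w = [[2.0, 0.0], [1.0, 3.0]]  # made-up weight matrix of one linear layer
print(spectral_norm(w), spectral_norm(spectral_normalize(w)))
```

After normalization the linear map cannot stretch any input by more than a factor of 1, which bounds how fast that layer's output can change with its input.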

Energy-based GAN (EBGAN)

paper: Energy-based Generative Adversarial Network
video: https://www.youtube.com/watch?v=gFaqKdcCdOE

Using an autoencoder as discriminator $D$
- Using the negative reconstruction error of auto-encoder to determine the goodness (the lower the reconstruction error, the higher the image quality is taken to be)
- Benefit: The auto-encoder can be pre-trained by real images without generator. (By contrast, an ordinary NN-based Discriminator needs negative examples for training, so it cannot be pre-trained)
Auto-encoder based discriminator only gives limited region large value.

GAN is still challenging …

GANs are very hard to train; getting the network to train usually requires careful hyperparameter tuning (GAN training is dynamic, and sensitive to nearly every aspect of its setup (from optimization parameters to model architecture).)
A simple structural argument: the Generator and the Discriminator need to match each other. That is, if either one stops improving during training, the other stops improving as well

More Tips

Conditional Generation by GAN

Conditional GAN

paper:
- Conditional GAN: Conditional Generative Adversarial Nets
- Class conditional image generation: Conditional Image Synthesis With Auxiliary Classifier GANs

Text-to-Image

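Traditional supervised approach: Problem: the same description may correspond to many images, and the NN tries to minimize the distance to all of them, so it may end up producing a blurry image (It is blurry because it is the average of several images).
- e.g. Text: “train”; Annotation: photos of many kinds of trains from many angles; the network output may be a blurry blend of several trains (A blurry image!)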

Conditional GAN

Generator: besides a vector $\boldsymbol z$ , it is also given a piece of text (the condition) and generates a matching image. Since $\boldsymbol z$ is drawn from a distribution, $\boldsymbol x$ also follows a distribution (Generator learns to approximate $P (x \mid c)$ )
- Why output a distribution?
  - The same input has different outputs $\Rightarrow$ Especially for the tasks that need “creativity” (For Conditional Generation)
  - avoid generating blurry image
- To keep the Generator from ignoring the condition, one can also add dropout to the Generator and omit $z$ ; the output still has randomness
Discriminator: if we reuse the earlier Discriminator, the Generator only learns to produce realistic images (But completely ignore the input conditions); so the following changes are needed:
- Training data: $(\hat c,\hat x)$
- Positive example: $(\hat c,\hat x)$ ; Negative example: $(\hat c,G(\hat c)),(\hat {c'},\hat x)$
Training algorithm

Note that when training the Discriminator, the objective being maximized covers two kinds of errors (a fake image, and a condition that does not match a real image)

In the last line of the algorithm, the update should be $\theta_g\leftarrow\theta_g+\eta\nabla\tilde V(\theta_g)$ (gradient ascent)
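The discriminator objective with both kinds of negative examples can be sketched as below. This is an assumed form built from the positive and negative pairs listed above (one $\log D$ term for matched real pairs, one $\log(1-D)$ term each for generated pairs and mismatched real pairs); the discriminator outputs are made-up numbers in $(0,1)$ and the function name is hypothetical.

```python
import math

def mean(values):
    return sum(values) / len(values)

def conditional_d_objective(matched_real, generated, mismatched_real):
    """Assumed objective maximized by the discriminator, given its outputs in (0, 1)
    on (c, x) real pairs, (c, G(c)) generated pairs and (c', x) mismatched pairs."""
    return (mean([math.log(d) for d in matched_real])
            + mean([math.log(1.0 - d) for d in generated])
            + mean([math.log(1.0 - d) for d in mismatched_real]))

# A discriminator that checks both realism and the condition.
checks_condition = conditional_d_objective([0.9, 0.95], [0.1, 0.05], [0.1, 0.2])
# A discriminator that only checks realism: mismatched real pairs still score high.
ignores_condition = conditional_d_objective([0.9, 0.95], [0.1, 0.05], [0.9, 0.95])
print(checks_condition, ignores_condition)
```

The third term is what penalizes a discriminator for ignoring the condition, which in turn forces the generator to respect it.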

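Different Discriminator architectures

The lower architecture in the original figure is better at distinguishing the two kinds of error (the generated image is not realistic enough; the condition does not match the image)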

StackGAN

paper: StackGAN: Text to Photo-realistic Image Synthesis with Stacked Generative Adversarial Networks

idea: first generate a low-resolution image, then progressively generate higher-resolution images

Image-to-image (Patch GAN)

paper: Image-to-Image Translation with Conditional Adversarial Networks

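The goal is to generate realistic buildings from geometric layouts. In the original figure,
- close denotes traditional supervised learning (make the output as close as possible to the real image); the images it produces are rather blurry
- GAN denotes using a GAN (Conditional GAN); its images are sharper, but they also contain some odd extra structures
- GAN + close adds one more objective when training the Generator on top of the GAN: besides making the Discriminator score higher, the generated image must also be as close as possible to the real image (the red arrow in the figure); the images it produces look quite good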

Patch GAN

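The Image-to-image paper above also proposes Patch GAN, which improves the model by changing the structure of the Discriminator. A traditional Discriminator takes the whole image as input and outputs a single score; for large images this may require many parameters, which is costly and prone to overfitting during training. The main idea of Patch GAN is to look at only one part (patch) of a large image at a time and output a score for that part (the patch size is a hyperparameter)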

Speech Enhancement

e.g. removing noise from speech

The speech is represented as a spectrum, so image-processing network architectures can be applied directly
Conditional GAN

Video Generation

Generator: show the Generator a video clip and have it predict what happens next

Unsupervised Conditional Generation

Unsupervised Conditional Generation

Transform an object from one domain to another without paired data (e.g. style transfer; we only have a set of landscape photos and a set of paintings, with no one-to-one correspondence between them)

Approach 1: Direct Transformation (For texture or color change)

Direct Transformation

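Problem: ignore input (the Discriminator only judges whether an image is a painting, so the Generator may learn to output just a few paintings that are completely unrelated to the input photo)
- The issue can be avoided by network design. Simpler generator makes the input and output more closely related. (a shallow network is less affected by this problem and can be trained directly)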

Encoder Network

CycleGAN

paper: Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks

We can also learn two Generators and two Discriminators at the same time

Issue of Cycle Consistency

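paper: CycleGAN: a Master of Steganography
CycleGAN hides information about the input and reveals it again at reconstruction time (the Generator hides the information where humans cannot see it) (e.g. in the original figure the black dots on the roof disappear)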

Related Work

Dual GAN
Disco GAN

Same method as CycleGAN (proposed independently at about the same time and published at different conferences…)

StarGAN (multiple domains)

paper: StarGAN: Unified Generative Adversarial Networks for Multi-Domain Image-to-Image Translation

StarGAN needs only 1 Generator and 1 Discriminator to translate among multiple domains (it also uses Cycle Consistency)

Approach 2: Projection to Common Space (Larger change, only keep the semantics)

Compared with Direct Transformation, Projection to Common Space supports larger changes

Projection to Common Space

Target

Training

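Following the Auto-Encoder idea, this amounts to training two Auto-Encoders (the red and blue arrows in the original figure)
If we only learn auto-encoders, the images output by the decoder are blurry, so we can also add Discriminators, which amounts to training two VAE-GANs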

Problem

Because we train two auto-encoders separately, the images with the same attribute may not project to the same position in the latent space.
- latent space: the role of the latent space is to learn data features and simplify the data representation in order to find patterns

Sharing the parameters of encoders and decoders

Coupled GAN [Ming-Yu Liu, et al., NIPS, 2016]; UNIT [Ming-Yu Liu, et al., NIPS, 2017]

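Make the two Encoders and the two Decoders share parameters (dashed lines in the original figure): the Encoders share their last few layers, and the Decoders share their first few layers
- In the extreme case all parameters are shared; the Encoder then also needs to read a flag indicating which domain the image comes from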

Domain Discriminator

Domain Discriminator: The domain discriminator forces the output of $EN_X$ and $EN_Y$ to have the same distribution. [Guillaume Lample, et al., NIPS, 2017]
- input: latent vector; output: which domain the latent vector belongs to

Cycle Consistency:

ComboGAN [Asha Anoosheh, et al., arXiv, 2017]

Similar to CycleGAN

Semantic Consistency

Used in DTN [Yaniv Taigman, et al., ICLR, 2017] and XGAN [Amélie Royer, et al., arXiv, 2017]

Compute the similarity of latent vectors $\Rightarrow$ semantic similarity

U-GAT-IT

Ref: U-GAT-IT: Unsupervised Generative Attentional Networks with Adaptive Layer-Instance Normalization for Image-to-Image Translation

SELFIE2ANIME

To learn more…

How to Train a GAN? Tips and tricks to make GANs work

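Ref: How to Train a GAN? Tips and tricks to make GANs work; “How to train a GAN? Some tips to make GANs work better” (in Chinese); “Unstable training and hard-to-tune hyperparameters: 7 rules for avoiding the pitfalls of GAN training” (in Chinese)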

(1) Normalize the inputs:
- normalize the images between -1 and 1: img / 127.5 - 1
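- Tanh as the last layer of the generator output: generated images are also fed to the discriminator, so the generator output should also lie between -1 and 1 (the same range as the normalized real images)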
(2) Avoid Sparse Gradients: ReLU, MaxPool
- the stability of the GAN game suffers if you have sparse gradients
- LeakyReLU = good (in both G and D)
- For Downsampling, use: Average Pooling, Conv2d + stride
- For Upsampling, use: PixelShuffle, ConvTranspose2d + stride
(3) Use stability tricks from RL
- Experience Replay
  - Keep a replay buffer of past generations and occasionally show them
  - Keep checkpoints from the past of G and D and occasionally swap them out for a few iterations
- All stability tricks that work for deep deterministic policy gradients
- See Pfau & Vinyals (2016)
(4) Use the ADAM Optimizer
(5) Track failures early
- $D$ loss goes to 0: failure mode
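- check norms of gradients: if they are over 100 things are screwing up; ideally, the generator should receive large gradients early in training, because it has to learn how to generate realistic-looking data. The discriminator, on the other hand, should not always receive large gradients early in training, because it can easily tell real images from generated ones. Once the generator is trained well enough, the discriminator can no longer separate them so easily; it keeps making mistakes and receives larger gradients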
- when things are working, $D$ loss has low variance and goes down over time vs having huge variance and spiking
- if loss of generator steadily decreases, then it’s fooling D with garbage
(6) Don't balance loss via statistics (unless you have a good reason to)
- Don't try to find a (number of G / number of D) schedule to uncollapse training. It’s hard and we’ve all tried it.
- If you do try it, have a principled approach to it, rather than intuition. For example

while lossD > A:
  train D
while lossG > B:
  train G

(7) Use Dropouts in G in both train and test phase
- Provide noise in the form of dropout (50%).
- Apply on several layers of our generator at both training and test time
- https://arxiv.org/pdf/1611.07004v1.pdf

Feature Extraction

InfoGAN

paper: InfoGAN: Interpretable Representation Learning by Information Maximizing Generative Adversarial Nets

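In a GAN, the input is a vector sampled from some distribution, and we hope that after training each dimension of this vector represents some characteristic
- Regular GAN: Modifying a specific dimension, no clear meaning (in the original figure the horizontal axis corresponds to changing one dimension of the input)

The colors represent the characteristics. (taking a 2-D input vector as an example, we hope objects with different characteristics are distributed in the latent space with some regularity)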

What is InfoGAN?

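The input $z$ is split into two parts, $c$ and $z^{'}$ : each dimension of $c$ represents some characteristic of the image, and $z^{'}$ represents the random, unexplainable part
Besides the GAN structure, InfoGAN adds a Classifier that must recover $c$ from $x$ (The classifier can recover $c$ from $x$ . $c$ must have clear influence on $x$ ). Note that the Generator and the Classifier together form an Auto-encoder structure
Since the Classifier and the Discriminator both take $x$ as input, they can share some parameters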

VAE-GAN

paper: Autoencoding beyond pixels using a learned similarity metric

VAE-GAN
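- (1) Using the GAN to strengthen the VAE: the first two parts, the Encoder and the Generator (Decoder), form a VAE. Without a Discriminator, minimizing only the reconstruction error tends to give blurry images, because it is hard to define a good loss between two images. With a Discriminator, the generated images can additionally be made more realistic by fooling it
- (2) Using the VAE to strengthen the GAN: the last two parts, the Generator (Decoder) and the Discriminator, form a GAN. VAE-GAN adds an Encoder, so minimizing the reconstruction error also helps make the generated images more realistic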

BiGAN

paper: Adversarial Feature Learning

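BiGAN also consists of three parts: Encoder, Decoder and Discriminator. But the Encoder and Decoder are not connected as an Auto-encoder; instead the Discriminator links them. The Discriminator takes an image $x$ and a code $z$ together and decides whether $(x, z)$ comes from the Encoder or the Decoder
What is this good for? Let the (input, output) pairs of the Encoder follow a joint distribution $P (x, z)$ and those of the Decoder follow a joint distribution $Q (x, z)$ . The Discriminator does the same thing as in a GAN: it measures the difference between the two distributions. The Encoder and the Decoder both try to fool the Discriminator, and iterating brings $P (x, z)$ and $Q (x, z)$ closer and closer, finally yielding the optimal Encoder and Decoder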

Algorithm

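Here the Discriminator raises the score of $(x, z)$ pairs from the Encoder and lowers the score of $(x, z)$ pairs from the Decoder. The opposite also works (i.e. raising the score of pairs from the Decoder and lowering the score of pairs from the Encoder), because the Discriminator only serves to measure the difference between $P$ and $Q$

Note that the optimal encoder and decoder are formally equivalent to training two Auto-encoders. Although the two approaches coincide at the optimal solution, training never reaches the optimum, and in experiments their behavior differs a lot (BiGAN captures the semantic content of images more easily and generates sharp images: given an image of a bird it returns a different-looking bird, whereas an Auto-encoder returns a blurry copy of the original)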

Triple GAN

paper: Triple Generative Adversarial Nets

$D$ : Discriminator, $G$ : Generator, $C$ : Classifier
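Ignoring $C$ , $D$ and $G$ form a Conditional GAN. $G$ takes the condition $Y_g$ as input and outputs $X_g$ . The pair $(X_g,Y_g)$ is then fed to $D$ , which must distinguish data generated by $G$ from real samples
Triple GAN is a Semi-supervised learning method: a small part of the training data is labeled, but most of it is unlabeled ( $x$ without a matching $y$ ). We can train $C$ on the labeled data and the data generated by $G$ , so that $C$ learns to take $x$ as input and output $y$

For the reasoning behind this design, see the paper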

Domain-adversarial training

paper: Domain-Adversarial Training of Neural Networks

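Training and testing data are in different domains: e.g. the distributions of the training data and the testing data differ, so a model trained on the training data does not perform well when tested directly on the testing data. We can therefore use a Generator to extract features from both, such that the extracted features follow the same distribution

The feature extractor is the Generator; the Domain classifier is the Discriminator, which measures the distribution divergence between the testing data and the training data; the Label predictor is a classifier, e.g. for classifying digits

The three parts can be trained jointly (as in the original paper), or trained in alternation as in a GAN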

Photo Editing

video, ppt

Evaluation of GAN

i.e. how to evaluate objectively how good the objects generated by a GAN are

We don’t want memory GAN

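When training a GAN, we do not want it to memorize and reproduce images from the database. If the GAN outputs an exact copy, we can detect it by computing the Euclidean distance to the images in the database. But the GAN may also output a database image shifted up/down/left/right by 1 / 2 / 3… pixels, or flipped horizontally. Such images are very similar to the original, yet under the Euclidean distance their nearest database image may be some other image rather than the original. In that case it is hard to tell whether a generated image is a copy of a database image
- e.g. in the original figure, after a truck image is shifted by 3 pixels, its nearest neighbor becomes an airplane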

Solution: Using k-nearest neighbor to check whether the generator generates new objects

Likelihood

In traditional evaluation of generative models, we take real samples $x^i$ that were not used in training and compute their log-likelihood under the model
But we cannot compute $P_G(x^i)$ (in GAN). We can only sample from $P_G$ .

Likelihood - Kernel Density Estimation

Estimate the distribution of $P_G(x)$ from sampling. Each sample is the mean of a Gaussian with the same covariance. (i.e. approximate $P_G$ with a Gaussian Mixture Model)
Now we have an approximation of $P_G$ , so we can compute $P_G(x^i)$ for each real data $x^i$ . Then we can compute the likelihood.
- This method has many problems, e.g. how to choose the number of samples, and the fact that high likelihood does not necessarily mean high quality
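The kernel density estimate described above can be sketched in one dimension as follows. The data points, the generated samples, and the shared bandwidth `sigma` are all made up for illustration; real images would need a high-dimensional Gaussian and far more samples.

```python
import math

def gaussian_pdf(x, mean, sigma):
    return math.exp(-0.5 * ((x - mean) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def kde_log_likelihood(real_points, generated_samples, sigma):
    """Average log-likelihood of real points under a Gaussian KDE
    centered on the generated samples (equal weights, shared sigma)."""
    total = 0.0
    for x in real_points:
        density = sum(gaussian_pdf(x, s, sigma) for s in generated_samples) / len(generated_samples)
        total += math.log(density)
    return total / len(real_points)

real = [0.0, 0.1, -0.1, 0.05]                 # held-out real data (scalar toy example)
close_samples = [0.02, -0.05, 0.08, 0.0, -0.02]  # generator A: samples near the real data
far_samples = [3.0, 3.1, 2.9, 3.05, 2.95]        # generator B: samples far from the real data

print(kde_log_likelihood(real, close_samples, sigma=0.5))
print(kde_log_likelihood(real, far_samples, sigma=0.5))
```

The result depends heavily on the bandwidth and on the number of generated samples, which is one of the problems with this estimate noted above.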

Likelihood v.s. Quality

Low likelihood, high quality?: Considering a model generating good images (small variance)
High likelihood, low quality?: in the example below, $G_2$ produces high-quality images only one hundredth as often as $G_1$ , yet its likelihood drops by only about 4.6
$L_{G_1}=\frac{1}{N}\sum_i\log P_{G_1}(x^i)\\ L_{G_2}=\frac{1}{N}\sum_i\log \frac{P_{G_1}(x^i)}{100}=-\log100+L_{G_1}\approx-4.6+L_{G_1}$

Inception Score (IS)

paper: Salimans, Tim, et al. “Improved techniques for training gans.” Advances in neural information processing systems 29 (2016).

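Use a pretrained classifier to evaluate generation quality

(1) Concentrated distribution (lower entropy) means higher visual quality (the classifier output for each image is a distribution giving the probability of each class). If we generate images, we can use a pretrained image classifier to judge the quality: if the classifier assigns a very high probability to one class, the generated image can be considered good
(2) Uniform distribution means higher variety. We can also evaluate the diversity of the generated images. For example, sample 3 images, let a CNN classify them to get 3 distributions, and average them into a single distribution. If this averaged distribution is fairly uniform, every class is being generated, and the objects generated by the GAN are diverse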

Inception Score

The classifier is an Inception Net trained on ImageNet, hence the name Inception Score

Inception Score:
$\begin{aligned} & {\exp \left(\mathbb{E}_{{x}} \operatorname{KL}(P(y \mid {x}) \| P(y))\right)} \end{aligned}$ Since only relative magnitudes matter, the $\exp$ can be dropped; in practice the expectation is replaced by $\sum_x$
$\begin{aligned}&\mathbb{E}_{{x}} \operatorname{KL}(P(y \mid {x}) \| P(y))\\ =&\sum_{x} \sum_{y} P(y \mid x) \log \frac{P(y \mid x)}{P(y)} \\ =&\sum_{{x}} \sum_{y} P(y \mid x) \log P(y \mid x)\quad\quad (\text{Negative entropy})\\ &\quad\quad- \sum_{x} \sum_{y} P(y \mid x) \log P(y)\quad\quad (\text{Cross entropy}) \end{aligned}$
The larger the Inception Score, the better. From the expression above: (1) the negative entropy should be as large as possible $\Rightarrow$ higher visual quality; (2) the cross-entropy term $-\sum_x\sum_yP(y\mid x)\log P(y)$ enters with a positive sign, so it should also be as large as possible, which happens when the marginal $P(y)$ is spread evenly over the classes $\Rightarrow$ higher diversity
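The score can be computed directly from the classifier outputs $P(y\mid x)$ . The sketch below uses hand-written three-class probability vectors in place of real Inception Net outputs, takes $P(y)$ to be the average of the $P(y\mid x)$ over the batch, and averages over $x$ (rather than summing) so that the result is comparable across batch sizes; these choices are assumptions of the example.

```python
import math

def inception_score(cond_probs):
    """exp(mean_x KL(P(y|x) || P(y))), with P(y) the average of the P(y|x)."""
    n = len(cond_probs)
    num_classes = len(cond_probs[0])
    marginal = [sum(p[k] for p in cond_probs) / n for k in range(num_classes)]
    mean_kl = 0.0
    for p in cond_probs:
        mean_kl += sum(pk * math.log(pk / marginal[k]) for k, pk in enumerate(p) if pk > 0) / n
    return math.exp(mean_kl)

# Made-up classifier outputs for three generated images.
sharp_and_diverse = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
sharp_but_collapsed = [[1.0, 0.0, 0.0]] * 3    # confident, but always the same class
blurry = [[1 / 3, 1 / 3, 1 / 3]] * 3           # classifier is unsure about every image

for name, probs in [("sharp and diverse", sharp_and_diverse),
                    ("sharp but collapsed", sharp_but_collapsed),
                    ("blurry", blurry)]:
    print(name, inception_score(probs))
```

Only the batch that is both confidently classified and spread over all classes gets a high score (equal to the number of classes here); the other two collapse to the minimum value of 1.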

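Limitations: the Inception Score depends on the training data of the classifier. If the Generator produces realistic images that resemble nothing in that training data, the Inception Score will not be high; or if everything generated is an anime face and the Inception Net classifies them all as human faces, IS cannot be used to assess the quality of the generated images
Remedy: FID: extract features from the GAN outputs and from real images and compare the two (smaller is better); this addresses some of the shortcomings of the Inception Score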

Fréchet Inception Distance (FID)

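FID measures the distance between the feature vectors of real and generated images; it summarizes the distance between the Inception feature vectors of real and generated images from the same domain. FID takes the output of the last hidden layer of the Inception Net as the extracted feature. Assuming the features of generated and real images each follow a Gaussian Distribution, FID is the Fréchet distance between the two distributions, so the smaller the FID, the better (the best possible FID is 0.0, meaning the two sets of images are identical)
$\text{FID}(x,y)=\left\|\mu_x-\mu_y\right\|_2^2+\operatorname{Tr}\left(\Sigma_x+\Sigma_y-2\left(\Sigma_x \Sigma_y\right)^{1/2}\right)$ where $x, y$ are two different Gaussian distributions and $\mu,\Sigma$ are their mean vectors and covariance matrices. The smaller the FID, the closer the means and covariances of the two distributions
Limitations: (1) the features of generated and real images do not necessarily follow a Gaussian Distribution; (2) computing the Fréchet distance requires sampling a large number of images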

Computing FID with numpy

import numpy
from numpy import cov, iscomplexobj, trace
from scipy.linalg import sqrtm

def calculate_fid(act1, act2):
    # act1, act2: 2-D arrays of feature activations, one row per image
    # calculate mean and covariance statistics
    mu1, sigma1 = act1.mean(axis=0), cov(act1, rowvar=False)
    mu2, sigma2 = act2.mean(axis=0), cov(act2, rowvar=False)
    # calculate sum squared difference between means
    ssdiff = numpy.sum((mu1 - mu2) ** 2.0)
    # calculate sqrt of product between cov
    covmean = sqrtm(sigma1.dot(sigma2))
    # check and correct imaginary numbers from sqrt
    if iscomplexobj(covmean):
        covmean = covmean.real
    # calculate score
    fid = ssdiff + trace(sigma1 + sigma2 - 2.0 * covmean)
    return fid