[Dive into Deep Learning] Generative Adversarial Network (GAN) Study Notes

Original paper: Generative Adversarial Nets (neurips.cc)

Li Mu's paragraph-by-paragraph reading of the GAN paper: GAN论文逐段精读【论文精读】_哔哩哔哩_bilibili

Paper code: http://www.github.com/goodfeli/adversarial

Goodfellow, I. J. et al. (2014) 'Generative Adversarial Nets', NIPS'14: Proceedings of the 27th International Conference on Neural Information Processing Systems, Vol. 2, pp. 2672–2680. doi: https://doi.org/10.48550/arXiv.1406.2661

Contents

1. GAN论文原文学习

1.1. Abstract

1.2. Introduction

1.3. Related work

1.4. Adversarial nets

1.5. Theoretical Results

1.5.1. Global Optimality of p_g = p_data

1.5.2. Convergence of Algorithm 1

1.6. Experiments

1.7. Advantages and disadvantages

1.8. Conclusions and future work

2. Supplementary Notes

2.1. Divergence

2.2. Some Casual Remarks


1. GAN论文原文学习

1.1. Abstract

        ①They combined a generative model G and a discriminative model D into a single framework. G is the "cheating" part, which focuses on imitating the data, and D is the "distinguishing" part, which focuses on telling where the data comes from.

        ②The model is trained with a "minimax" objective

        ③GAN does not need Markov chains or unrolled approximate inference nets

        ④They designed qualitative and quantitative evaluation to analyse the feasibility of GAN

1.2. Introduction

        ①The authors praised deep learning and briefly mentioned its prospects

        ②Because it is difficult to fit or approximate the true data distribution directly, they designed a new generative model

        ③They compare the generative model to a counterfeiter making fake money and the discriminative model to the police. The two sides push each other to improve, and the authors ultimately hope that the counterfeits become indistinguishable from genuine currency

        ④Both G and D are MLPs, and G takes random noise as its input

        ⑤Training only requires backpropagation and dropout

corpora  n. plural of corpus; a collection (of texts or data)

counterfeiter  n. a person who forges or counterfeits

1.3. Related work

        ①Recent work has concentrated on models with an explicit likelihood function, such as the successful deep Boltzmann machine. However, their likelihood functions are intractable and require numerous approximations.

        ②This motivated "generative machines", which only generate samples without explicitly representing the likelihood; generative stochastic networks are a classic example.

        ③They backpropagate derivatives through the generative process using the observation that:

\lim_{\sigma\to0}\nabla_{\boldsymbol{x}}\mathbb{E}_{\epsilon\sim\mathcal{N}(0,\sigma^2\boldsymbol{I})}f(\boldsymbol{x}+\epsilon)=\nabla_{\boldsymbol{x}}f(\boldsymbol{x})

        ④Variational autoencoders (VAEs), from Kingma and Welling and from Rezende et al., do similar work. However, VAEs require differentiation through the hidden units, whereas GANs require differentiation through the visible units, so the two have opposite constraints.

        ⑤Some other methods also train a generative model with a discriminative criterion, but they are hard to scale; noise-contrastive estimation (NCE), for example, discriminates data from a fixed noise distribution, but is limited by the form of its discriminator.

        ⑥The most closely related work is predictability minimization (PM), in which each hidden unit is trained against predictions made by the other hidden units. However, PM differs from GAN in that (a) PM is a pure minimization of an objective function, whereas GAN is a minimax game over a value function, (b) in PM the competition is only a regularizer, while in GAN it is the sole training criterion, and (c) the nature of the competition differs: in PM the two networks try to make a scalar output similar or different, whereas in GAN one network produces a rich sample that the other network takes as input

        ⑦Adversarial examples are inputs found by gradient-based optimization so that they resemble the data yet are misclassified; they are not a mechanism for training a generative model

1.4. Adversarial nets

        ①They designed a two-player minimax game with the following value function:

\min_G\max_DV(D,G)=\mathbb{E}_{\boldsymbol{x}\sim p_{\mathrm{data}}(\boldsymbol{x})}[\log D(\boldsymbol{x})]+\mathbb{E}_{\boldsymbol{z}\sim p_{\boldsymbol{z}}(\boldsymbol{z})}[\log(1-D(G(\boldsymbol{z})))]

where p_{g} denotes the generator's distribution,

\boldsymbol{x} represents data,

p_{\boldsymbol{z}}(\boldsymbol{z}) denotes a prior over the input noise,

G\left ( \boldsymbol{z};\theta _{g} \right ) denotes a differentiable function, namely an MLP, with parameters \theta _{g} ,

D\left ( \boldsymbol{x};\theta _{d} \right ) also denotes an MLP whose output is a single scalar, namely the probability that \boldsymbol{x} came from the data rather than from p_{g}
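
To make the notation concrete, here is a minimal PyTorch-style sketch (not the authors' original Theano code) of an MLP generator, an MLP discriminator, and a minibatch estimate of V(D, G); the layer sizes, noise dimension, and batch size are arbitrary choices for illustration:

```python
import torch
import torch.nn as nn

noise_dim, data_dim, hidden = 100, 784, 256   # illustrative sizes only

# G(z; theta_g): maps prior noise z to a point in data space
G = nn.Sequential(
    nn.Linear(noise_dim, hidden), nn.ReLU(),
    nn.Linear(hidden, data_dim), nn.Sigmoid(),
)

# D(x; theta_d): outputs a scalar in (0, 1), the probability that x
# came from the data distribution rather than from p_g
D = nn.Sequential(
    nn.Linear(data_dim, hidden), nn.ReLU(),
    nn.Linear(hidden, 1), nn.Sigmoid(),
)

def value_fn(x_real, z):
    """Minibatch estimate of V(D, G) = E_x[log D(x)] + E_z[log(1 - D(G(z)))]."""
    term_real = torch.log(D(x_real)).mean()
    term_fake = torch.log(1.0 - D(G(z))).mean()
    return term_real + term_fake

# toy usage: x_real stands in for a data minibatch, z ~ N(0, I) is the noise prior
x_real = torch.rand(64, data_dim)
z = torch.randn(64, noise_dim)
print(value_fn(x_real, z).item())   # D tries to maximize this, G to minimize it
```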

        ②They train D and G simultaneously: D is trained to maximize V(D,G) while G is trained to minimize it

        ③Optimizing D to completion in the inner loop would be computationally prohibitive and would overfit on a finite dataset. Hence they alternate k steps of optimizing D with one step of optimizing G

        ④Early in training G is weak, so D can reject generated samples with high confidence and \log(1-D(G(\boldsymbol{z}))) saturates; instead of minimizing that term, they train G to maximize \log D(G(\boldsymbol{z})) , which provides much stronger gradients early on

pedagogical  adj. relating to teaching or education

1.5. Theoretical Results

        ①Their fitting diagram:

where D is the blue dashed line, the generative distribution p_g is the green solid line, and the real data distribution is the black dotted line,

D converges to \frac{p_\mathrm{data}(\boldsymbol{x})}{p_\mathrm{data}(\boldsymbol{x})+p_g(\boldsymbol{x})}. When this equals \frac{1}{2}, i.e. when p_{\mathrm{data}}=p_{g}, D can no longer discriminate between real and generated data

        ②Pseudocode of GAN:
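
The pseudocode figure (Algorithm 1 of the paper) is not reproduced here; the following is a rough sketch of its structure, reusing the hypothetical G, D, value_fn, noise_dim and data_dim from the sketch above. Plain SGD stands in for the momentum-based updates the authors actually used, and k, the batch size, and the learning rate are illustrative values, not the paper's hyperparameters:

```python
import torch

opt_D = torch.optim.SGD(D.parameters(), lr=0.01)   # any gradient-based rule works;
opt_G = torch.optim.SGD(G.parameters(), lr=0.01)   # the paper used momentum

k, batch = 1, 64   # k steps of D per step of G (k = 1 in the paper's experiments)

def sample_data(n):
    # stand-in for drawing a minibatch from p_data (e.g. MNIST images)
    return torch.rand(n, data_dim)

for iteration in range(10000):
    # k steps: update D by ascending its stochastic gradient of V(D, G)
    for _ in range(k):
        x_real, z = sample_data(batch), torch.randn(batch, noise_dim)
        loss_D = -value_fn(x_real, z)          # maximizing V == minimizing -V
        opt_D.zero_grad(); loss_D.backward(); opt_D.step()

    # one step: update G by descending E_z[log(1 - D(G(z)))]
    z = torch.randn(batch, noise_dim)
    loss_G = torch.log(1.0 - D(G(z))).mean()
    # In practice (see ④ above) one maximizes log D(G(z)) instead, because
    # log(1 - D(G(z))) saturates early in training:
    # loss_G = -torch.log(D(G(z))).mean()
    opt_G.zero_grad(); loss_G.backward(); opt_G.step()
```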

1.5.1. Global Optimality of p_g = p_data

        ①For a fixed G, training D means maximizing V(G,D) :

\begin{gathered} V(G,D) =\int_{\boldsymbol{x}}p_{\mathrm{data}}(\boldsymbol{x})\log(D(\boldsymbol{x}))dx+\int_{\boldsymbol{z}}p_{\boldsymbol{z}}(\boldsymbol{z})\log(1-D(g(\boldsymbol{z})))dz \\ =\int_{\boldsymbol{x}}p_{\mathrm{data}}(\boldsymbol{x})\log(D(\boldsymbol{x}))+p_{g}(\boldsymbol{x})\log(1-D(\boldsymbol{x}))dx \end{gathered}

for any (a,b)\in\mathbb{R}^2\setminus\{(0,0)\} with a,b\geq0 , the function y\mapsto a\log\left ( y \right )+b\log\left ( 1-y \right ) achieves its maximum on [0,1] at y=\frac{a}{a+b} . Thus the optimal discriminator is D_G^*(\boldsymbol{x})=\frac{p_\mathrm{data}(\boldsymbol{x})}{p_\mathrm{data}(\boldsymbol{x})+p_g(\boldsymbol{x})}
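
Checking this maximization directly: setting the derivative with respect to y to zero gives the claimed maximizer, and the second derivative is negative on (0,1), so it is indeed a maximum:

\frac{d}{dy}\left[a\log y+b\log(1-y)\right]=\frac{a}{y}-\frac{b}{1-y}=0\;\Rightarrow\;a(1-y)=by\;\Rightarrow\;y=\frac{a}{a+b},\qquad\frac{d^2}{dy^2}\left[a\log y+b\log(1-y)\right]=-\frac{a}{y^2}-\frac{b}{(1-y)^2}<0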

        ②With the optimal discriminator substituted in, the generator's criterion becomes:

\begin{aligned} C(G)& =\operatorname*{max}_{D}V(G,D) \\ &=\mathbb{E}_{\boldsymbol{x}\sim p_{\mathrm{data}}}[\operatorname{log}D_{G}^{*}(\boldsymbol{x})]+\mathbb{E}_{\boldsymbol{z}\sim p_{\boldsymbol{z}}}[\operatorname{log}(1-D_{G}^{*}(G(\boldsymbol{z})))] \\ &=\mathbb{E}_{\boldsymbol{x}\sim p_{\mathrm{data}}}[\operatorname{log}D_{G}^{*}(\boldsymbol{x})]+\mathbb{E}_{\boldsymbol{x}\sim p_{g}}[\operatorname{log}(1-D_{G}^{*}(\boldsymbol{x}))] \\ &=\mathbb{E}_{\boldsymbol{x}\sim p_\mathrm{data}}\left[\log\frac{p_\mathrm{data}(\boldsymbol{x})}{p_\mathrm{data}(\boldsymbol{x})+p_g(\boldsymbol{x})}\right]+\mathbb{E}_{\boldsymbol{x}\sim p_g}\left[\log\frac{p_g(\boldsymbol{x})}{p_\mathrm{data}(\boldsymbol{x})+p_g(\boldsymbol{x})}\right] \end{aligned}

        ③The global minimum of C\left ( G \right ) is -\log 4 , attained exactly when p_g=p_\mathrm{data} , at which point D_G^*(\boldsymbol{x})=\frac12
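
This value follows by substituting D_G^*(\boldsymbol{x})=\frac12 (the p_g=p_\mathrm{data} case) into C(G) :

C(G)\big|_{p_g=p_\mathrm{data}}=\mathbb{E}_{\boldsymbol{x}\sim p_\mathrm{data}}\left[\log\tfrac12\right]+\mathbb{E}_{\boldsymbol{x}\sim p_g}\left[\log\tfrac12\right]=-\log2-\log2=-\log4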

        ④Rewriting C\left ( G \right ) with KL divergences:

C(G)=-\log(4)+KL\left(p_\text{data}\left\Vert\frac{p_\text{data}+p_g}2\right.\right)+KL\left(p_g\left\Vert\frac{p_\text{data}+p_g}2\right.\right)

        ⑤Equivalently, in terms of the Jensen–Shannon divergence:

C(G)=-\log(4)+2\cdot JSD\left(p_\text{data}\left\|p_g\right.\right)

and since the JS divergence between two distributions is always non-negative, and zero only when they are equal, C^*=-\log 4 is the global minimum of C\left ( G \right ) , achieved only when p_g=p_\text{data}
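
The ④ and ⑤ forms are the same statement, since by definition the JS divergence is the average of the two KL terms against the mixture \frac{p_\text{data}+p_g}2 :

JSD\left(p_\text{data}\|p_g\right)=\frac12KL\left(p_\text{data}\left\|\frac{p_\text{data}+p_g}2\right.\right)+\frac12KL\left(p_g\left\|\frac{p_\text{data}+p_g}2\right.\right)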

1.5.2. Convergence of Algorithm 1

        ①If G and D have enough capacity and D is allowed to reach its optimum for each G, then V(G,D) , viewed as a function of p_g , is convex in p_g , so gradient updates on p_g converge to the global optimum p_g=p_\mathrm{data}

        ②Parzen window-based log-likelihood estimates:

\begin{array}{c|c|c}\text{Model}&\text{MNIST}&\text{TFD}\\\hline\text{DBN [3]}&138\pm2&1909\pm66\\\text{Stacked CAE [3]}&121\pm1.6&\mathbf{2110\pm50}\\\text{Deep GSN [5]}&214\pm1.1&1890\pm29\\\text{Adversarial nets}&\mathbf{225\pm2}&\mathbf{2057\pm26}\end{array}

where the values are mean log-likelihoods of samples; the reported error is the standard error of the mean across examples on MNIST and the standard error across folds on TFD

supremum  n. least upper bound

1.6. Experiments

        ①Datasets: MNIST, Toronto Face Database (TFD), CIFAR-10

        ②Activations: a mixture of ReLU and sigmoid in the generator, maxout in the discriminator

        ③Dropout was applied when training the discriminator

        ④Noise is used only as the input to the bottommost layer of the generator

        ⑤Their Gaussian Parzen window estimate of the log-likelihood has high variance and performs somewhat poorly in high-dimensional spaces
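
As a rough illustration of how such a Gaussian Parzen window estimate works (my own NumPy sketch, not the authors' code; in the paper the kernel width σ is chosen by cross-validation on a validation set):

```python
import numpy as np
from scipy.special import logsumexp

def parzen_log_likelihood(test_x, gen_samples, sigma):
    """Mean log-likelihood of test points under a Gaussian Parzen window
    (isotropic kernels of width sigma) fitted to generated samples."""
    n, d = gen_samples.shape
    # squared distance between every test point and every generated sample
    diffs = test_x[:, None, :] - gen_samples[None, :, :]        # (m, n, d)
    sq_dist = np.sum(diffs ** 2, axis=-1)                        # (m, n)
    # log p(x) = logsumexp_i( -||x - x_i||^2 / (2 sigma^2) ) - log normalizer
    log_kernel = -sq_dist / (2.0 * sigma ** 2)
    log_norm = np.log(n) + 0.5 * d * np.log(2.0 * np.pi * sigma ** 2)
    return np.mean(logsumexp(log_kernel, axis=1) - log_norm)

# toy usage with random stand-ins for generated samples and held-out test data
gen = np.random.randn(500, 64)
test = np.random.randn(100, 64)
print(parzen_log_likelihood(test, gen, sigma=0.2))
```

Because each test point is compared against only a finite set of generated samples, the estimate has high variance, and in high-dimensional spaces the kernels cover the space poorly, which is the weakness noted above.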

1.7. Advantages and disadvantages

(1)Disadvantages

        ①There is no explicit representation of p_{g}\left ( \boldsymbol{x} \right )

        ②D must be kept well synchronized with G during training; in particular, G must not be trained too much without updating D, or G may collapse too many values of \boldsymbol{z} to the same \boldsymbol{x}

(2)Advantages

        ①No need for Markov chain

        ②The generator is updated only with gradients flowing back through the discriminator, not directly with data examples

        ③Adversarial nets can represent very sharp, even degenerate distributions, whereas Markov chain-based methods require the distribution to be somewhat blurry

1.8. Conclusions and future work

        ①Model samples for (a) MNIST, (b) TFD, (c) CIFAR-10 (fully connected model), and (d) CIFAR-10 (convolutional discriminator and "deconvolutional" generator); the rightmost, yellow-outlined column shows the nearest training example to the neighboring sample, to demonstrate that the model has not memorized the training set:

        ②"Digits obtained by linearly interpolating between coordinates in z space of the full model":

        ③Their summary of challenges in different parts:

interpolate  v. (math) to estimate intermediate values between known values; to insert (words, remarks)

2. Supplementary Notes

2.1. Divergence

(1)Kullback–Leibler divergence (KL divergence)

        ①Related link: 机器学习:Kullback-Leibler Divergence (KL 散度)_kullback-leibler散度-CSDN博客

        ②关于KL散度(Kullback-Leibler Divergence)的笔记 - 知乎 (zhihu.com)

(2)Jensen–Shannon divergence (JS divergence)

        ①理解JS散度(Jensen–Shannon divergence)-CSDN博客
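
For reference, a small NumPy sketch of both divergences on discrete distributions (illustrative only), showing that KL is asymmetric while JS is symmetric and bounded:

```python
import numpy as np

def kl(p, q):
    """KL(p || q) for discrete distributions; asymmetric and non-negative,
    and it blows up when q puts (near-)zero mass where p does not."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def js(p, q):
    """JS(p || q) = 1/2 KL(p || m) + 1/2 KL(q || m) with m = (p + q) / 2;
    symmetric, bounded by log 2, and zero iff p == q."""
    m = (np.asarray(p, float) + np.asarray(q, float)) / 2.0
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p = [0.4, 0.6, 0.0]
q = [0.3, 0.3, 0.4]
print(kl(p, q))             # finite, but kl(q, p) would be infinite
print(js(p, q), js(q, p))   # equal, and between 0 and log 2
```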

2.2. Some Casual Remarks

(1)The paper itself is concise and easy to follow, especially with Li Mu's walkthrough alongside it. However, the source code released by the authors is still rather overwhelming, and the README does not say much. It is hard to get started with, and not recommended for newcomers at all.
