
Mathematical Principle of GAN


Introduction

The potential advantage of deep learning lies in representing complicated probability densities of data with large-scale hierarchical models. So far, the great success of deep learning has mostly been in discriminative models, which usually map high-dimensional sensory inputs to categorical labels. Training discriminative models benefits greatly from the back-propagation algorithm, dropout, and ReLU activations with well-defined gradients. In contrast, deep generative models have performed comparatively poorly, largely because of the intractability of optimising the joint likelihood.

Based upon the observation above, Goodfellow proposed the generative adversarial net (GAN). As the name implies, a GAN consists of two networks: a generator and a discriminator. The generator produces fake data meant to look as real as the genuine data, which are assumed to be drawn from some unknown distribution (only God knows what), while the discriminator tries to improve itself in distinguishing the genuine from the fake. The whole process is like a two-player game.

This formulation is actually a general framework, since both the discriminator and the generator can be deep models of any kind. For the sake of simplicity, in this work we only consider multi-layer perceptrons (MLPs). Besides, the data produced by the generator are obtained by transforming a random noise vector. With this approach, the entire model can be trained using back-propagation and dropout, without approximate inference or Markov chains.

Generative Adversarial Net

In this section, we look at the GAN in detail and derive its optimisation objective.

As mentioned above, both the generator and the discriminator are MLPs for simplicity. Conceptually, the generator is expected to be a mapping that turns a random noise vector into a generated sample; the discriminator takes as input either a genuine or a fake sample and outputs the probability that it is genuine (i.e. drawn from the unknown distribution that only God knows). The whole framework is shown as follows,

[Figure: the GAN framework — the generator G maps noise z to fake samples, and the discriminator D outputs the probability of a sample being genuine]
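To make the two MLPs concrete, here is a minimal sketch (my own illustration, not part of the original formulation), assuming PyTorch; the noise dimension, data dimension and hidden width are purely illustrative.

import torch.nn as nn

noise_dim, data_dim, hidden = 100, 784, 128  # illustrative sizes (assumptions)

# Generator: maps a noise vector z ~ p_z(z) to a fake sample G(z) in data space.
G = nn.Sequential(
    nn.Linear(noise_dim, hidden), nn.ReLU(),
    nn.Linear(hidden, data_dim), nn.Tanh(),
)

# Discriminator: maps a sample (genuine or fake) to the probability of it being genuine.
D = nn.Sequential(
    nn.Linear(data_dim, hidden), nn.ReLU(),
    nn.Dropout(0.3),
    nn.Linear(hidden, 1), nn.Sigmoid(),
)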

Suppose we have a training set of $m$ samples, $S=\{x^{(1)},\dots,x^{(m)}\}$. Besides, given any probability density $p_z(z)$ (with a sufficiently powerful model, the simpler the better), we can draw $m$ noise samples $\{z^{(1)},\dots,z^{(m)}\}$ from the random variable $Z\sim p_z(z)$. Consequently, we can obtain the likelihood

$$\begin{aligned}L(x^{(1)},\dots,x^{(m)},z^{(1)},\dots,z^{(m)}\mid\theta_g,\theta_d)&=\prod_{i=1}^{m}D(x^{(i)})^{\mathbb{I}\{x^{(i)}\in\text{Data}\}}\bigl(1-D(x^{(i)})\bigr)^{\mathbb{I}\{x^{(i)}\notin\text{Data}\}}\;\prod_{j=1}^{m}D\bigl(G(z^{(j)})\bigr)^{\mathbb{I}\{G(z^{(j)})\in\text{Data}\}}\bigl(1-D\bigl(G(z^{(j)})\bigr)\bigr)^{\mathbb{I}\{G(z^{(j)})\notin\text{Data}\}}\\&=\prod_{i=1}^{m}D(x^{(i)})\prod_{j=1}^{m}\bigl(1-D\bigl(G(z^{(j)})\bigr)\bigr)\end{aligned}$$

Further, the log likelihood is obtained

$$\log L=\log\Bigl[\prod_{i=1}^{m}D(x^{(i)})\prod_{j=1}^{m}\bigl(1-D\bigl(G(z^{(j)})\bigr)\bigr)\Bigr]=\sum_{i=1}^{m}\log D(x^{(i)})+\sum_{j=1}^{m}\log\bigl(1-D\bigl(G(z^{(j)})\bigr)\bigr)$$

According to the law of large numbers, as $m\to\infty$, the empirical average (i.e. the log-likelihood scaled by $1/m$) approximates the corresponding expectations,

$$\frac{1}{m}\log L\;\approx\;\mathbb{E}_{x\sim p_{\text{data}}(x)}[\log D(x)]+\mathbb{E}_{z\sim p_z(z)}\bigl[\log\bigl(1-D(G(z))\bigr)\bigr]$$

Back to our aim: the whole process is a two-player game. The generator produces fake data meant to look as real as the genuine data drawn from the unknown distribution (only God knows what), while the discriminator tries to improve itself in distinguishing the genuine from the fake. Note that the log-likelihood above is the objective with respect to the discriminator $D(\cdot)$. Therefore, on one hand, we want to optimize the learnable parameters of the discriminator by maximizing the log-likelihood; on the other hand, we want to optimize those of the generator by minimizing it. Our optimization objective thus becomes

$$\min_G\max_D V(D,G)=\mathbb{E}_{x\sim p_{\text{data}}(x)}[\log D(x)]+\mathbb{E}_{z\sim p_z(z)}\bigl[\log\bigl(1-D(G(z))\bigr)\bigr]$$
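On finite minibatches the two expectations are estimated by sample averages; the following sketch (assuming the PyTorch G and D defined earlier, with a small eps added only for numerical stability) shows how the value function V(D, G) would be estimated:

import torch

def value_function(D, G, x_real, z_noise, eps=1e-8):
    # Monte Carlo estimate of V(D, G) on one minibatch:
    # E_{x~p_data}[log D(x)] + E_{z~p_z}[log(1 - D(G(z)))]
    term_real = torch.log(D(x_real) + eps).mean()
    term_fake = torch.log(1.0 - D(G(z_noise)) + eps).mean()
    return term_real + term_fake

# The discriminator is trained to ascend this quantity, the generator to descend it.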

Theoretical Results

The generator $G(\cdot)$ implicitly defines a probability distribution with pdf $p_g$. By "implicitly", I mean that $G$ is a mapping from the noise $z\sim p_z$ to the fake sample $G(z)$. It is then natural to ask: will $p_g$ reach $p_{\text{data}}$, the pdf of the unknown distribution only God knows? And if not, how far apart are they? The following proposition and theorem answer this question.

Firstly, given any generator G , we consider the optimal discriminator D.

Proposition 1. For a given $G$, the optimal discriminator is $D^*_G(x)=\dfrac{p_{\text{data}}(x)}{p_{\text{data}}(x)+p_g(x)}$.

Proof.
For the given G , we maximize the quantity V(G,D)

$$V(G,D)=\int_{\chi}p_{\text{data}}(x)\log\bigl(D(x)\bigr)\,dx+\int_{\Omega}p_z(z)\log\bigl(1-D(G(z))\bigr)\,dz$$

where $\chi$ and $\Omega$ are the sample spaces of $x$ and $z$, respectively.

The second term, under the change of variable $x=G(z)$, becomes

$$\int_{\Omega}p_z(z)\log\bigl(1-D(G(z))\bigr)\,dz=\int_{\chi}p_g(x)\log\bigl(1-D(x)\bigr)\,dx$$

This equality is not that straightforward. (Refer to the details in Appendix C: Change of Random Variable in Measure Theory)

Consequently,

$$V(G,D)=\int_{\chi}\Bigl[p_{\text{data}}(x)\log\bigl(D(x)\bigr)+p_g(x)\log\bigl(1-D(x)\bigr)\Bigr]dx$$

Since $G$ is fixed, $p_g$ is fixed, and $p_{\text{data}}$, the distribution that God created, is of course fixed as well; both are non-zero functions. Our aim is to find the $D$ that maximizes $V(G,D)$. To this end, we take the variational derivative with respect to $D$ (see Appendix B), set it to zero, and obtain the maximizer

$$D^*_G(x)=\frac{p_{\text{data}}(x)}{p_{\text{data}}(x)+p_g(x)}$$

This completes the proof. #
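The same result also follows from an elementary pointwise argument (a worked step added here for clarity): for each fixed $x$, write $a=p_{\text{data}}(x)$ and $b=p_g(x)$ and maximize the integrand $a\log D+b\log(1-D)$ over $D\in(0,1)$:

$$\frac{d}{dD}\bigl[a\log D+b\log(1-D)\bigr]=\frac{a}{D}-\frac{b}{1-D}=0\quad\Longrightarrow\quad D=\frac{a}{a+b},$$

and the second derivative $-a/D^2-b/(1-D)^2<0$ confirms that this critical point is indeed a maximum.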

By this proposition, we can reformulate the min-max game as minimizing the quantity $C(G)$, where

$$C(G)=\max_D V(G,D)=\mathbb{E}_{x\sim p_{\text{data}}}\Bigl[\log\frac{p_{\text{data}}(x)}{p_{\text{data}}(x)+p_g(x)}\Bigr]+\mathbb{E}_{x\sim p_g}\Bigl[\log\frac{p_g(x)}{p_{\text{data}}(x)+p_g(x)}\Bigr]$$

Now, by presenting Theorem 1, we answer the question raised at the beginning: will $p_g$ reach $p_{\text{data}}$, the pdf of the unknown distribution only God knows? And if not, how far apart are they?
Theorem 1. The global minimum of $C(G)$ is achieved if and only if $p_g=p_{\text{data}}$, and its value is $-\log 4$.

Proof.

$$\begin{aligned}C(G)&=\mathbb{E}_{x\sim p_{\text{data}}}\Bigl[\log\frac{p_{\text{data}}(x)}{p_{\text{data}}(x)+p_g(x)}\Bigr]+\mathbb{E}_{x\sim p_g}\Bigl[\log\frac{p_g(x)}{p_{\text{data}}(x)+p_g(x)}\Bigr]\\&=\mathbb{E}_{x\sim p_{\text{data}}}\Bigl[\log\frac{2\,p_{\text{data}}(x)}{p_{\text{data}}(x)+p_g(x)}-\log 2\Bigr]+\mathbb{E}_{x\sim p_g}\Bigl[\log\frac{2\,p_g(x)}{p_{\text{data}}(x)+p_g(x)}-\log 2\Bigr]\\&=-\log 4+D_{KL}\Bigl(p_{\text{data}}\,\Big\|\,\frac{p_{\text{data}}+p_g}{2}\Bigr)+D_{KL}\Bigl(p_g\,\Big\|\,\frac{p_{\text{data}}+p_g}{2}\Bigr)\\&\ge-\log 4\end{aligned}$$

The equality holds iff $p_{\text{data}}=p_g=\frac{p_{\text{data}}+p_g}{2}$ (refer to Appendix A).
This completes the proof. #
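The decomposition above is easy to sanity-check numerically. Below is a small sketch of mine using two arbitrary discrete distributions as stand-ins for $p_{\text{data}}$ and $p_g$; it confirms that $C(G)>-\log 4$ whenever the two differ and $C(G)=-\log 4$ when they coincide.

import numpy as np

def kl(p, q):
    # Discrete K-L divergence D_KL(p || q); assumes q > 0 wherever p > 0.
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

def C(p_data, p_g):
    # C(G) = -log 4 + D_KL(p_data || m) + D_KL(p_g || m), with m = (p_data + p_g) / 2.
    m = (p_data + p_g) / 2
    return -np.log(4) + kl(p_data, m) + kl(p_g, m)

p_data = np.array([0.1, 0.4, 0.3, 0.2])
p_g = np.array([0.25, 0.25, 0.25, 0.25])

print(C(p_data, p_g))     # strictly greater than -log 4 ≈ -1.386
print(C(p_data, p_data))  # exactly -log 4 when p_g = p_data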

As derived above, at the global optimum the generator exactly reproduces God's data-generating distribution.

Formally, we present the GAN training algorithm below.

[Algorithm: minibatch training of GANs — alternate k gradient steps on the discriminator with one gradient step on the generator]

Noticeably, the algorithm differs from our optimization objective. The objective asks us to optimize with respect to $D$ to completion and only then to optimize with respect to $G$; in practice, however, this incurs prohibitive computational cost and usually results in overfitting. Instead, the algorithm optimizes $D$ for $k$ steps, then takes one step on $G$, and iterates this procedure until convergence. In this way, as long as $G$ changes sufficiently slowly, $D$ always stays near its optimal solution. A sketch of this alternating procedure is given below; whether it still converges to the right distribution is answered by Proposition 2.
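Here is a hedged sketch of the alternating procedure, assuming the PyTorch G and D defined earlier; the Adam optimiser, the hyper-parameters, and the helper sample_minibatch (standing in for drawing a minibatch from the training set) are illustrative assumptions of mine, not part of the original algorithm, which uses plain minibatch stochastic gradient methods.

import torch

k, batch_size, num_iters, lr = 1, 64, 10000, 2e-4
eps = 1e-8
opt_D = torch.optim.Adam(D.parameters(), lr=lr)
opt_G = torch.optim.Adam(G.parameters(), lr=lr)

for it in range(num_iters):
    # --- k steps on the discriminator: ascend V(D, G) with G held fixed ---
    for _ in range(k):
        x_real = sample_minibatch(batch_size)    # hypothetical helper: a minibatch of real data
        z = torch.randn(batch_size, noise_dim)   # noise drawn from p_z
        loss_D = -(torch.log(D(x_real) + eps).mean()
                   + torch.log(1.0 - D(G(z).detach()) + eps).mean())
        opt_D.zero_grad()
        loss_D.backward()
        opt_D.step()

    # --- one step on the generator: descend V(D, G) with D held fixed ---
    z = torch.randn(batch_size, noise_dim)
    loss_G = torch.log(1.0 - D(G(z)) + eps).mean()
    opt_G.zero_grad()
    loss_G.backward()
    opt_G.step()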

We state the convergence result without proof.

Proposition 2. Suppose $G$ and $D$ both have sufficient capacity, and assume that, for each fixed $G$, the discriminator is allowed to reach its optimum $D^*_G$, and that $p_g$ is updated so as to improve the criterion

$$\mathbb{E}_{x\sim p_{\text{data}}}\bigl[\log D^*_G(x)\bigr]+\mathbb{E}_{x\sim p_g}\bigl[\log\bigl(1-D^*_G(x)\bigr)\bigr]$$

then $p_g$ converges to $p_{\text{data}}$.

For a better understanding, let us take a look at the following pedagogical illustration.

[Figure: evolution of $p_g$ (green), $p_{\text{data}}$ (black) and the discriminator output $D$ (blue) over panels (a)–(d) of training]

The lower horizontal line represents the domain from which the noise $z$ is sampled, and the upper horizontal line represents the data domain of $x$. As shown, $G(z)$ maps the uniform noise distribution to a non-uniform distribution. The green curve is the pdf of $x=G(z)$, namely $p_g$; the black curve is the pdf of the unknown distribution created by God, $p_{\text{data}}$; the blue curve is the output of $D$. (a) Before convergence, $D$ is only a partially accurate classifier. (b) In the inner loop, with the current $G$ fixed, $D$ is optimized; the result is the updated blue curve. (c) After the inner optimisation completes, $G$ is updated with $D$ fixed; the gradient of $D$ drives $G(z)$ towards regions that $D$ is more likely to classify as genuine. (d) Finally, if $G$ and $D$ have sufficient capacity, they reach an equilibrium, i.e. $p_g=p_{\text{data}}$. Now $D$ cannot distinguish the genuine from the fake, that is, $D(x)=\frac{1}{2}$.

Appendix

A. K-L Divergence

In probability and information theory, K-L divergence, also called information gain, is a “measure” of the distance between two probability densities. Here its formal definition is presented, in discrete form and continuous form, respectively.

Given the probability mass functions of two discrete random variables, P and Q, the discrete form of the K-L divergence is defined as

$$D_{KL}(P\|Q):=\sum_i P(i)\ln\frac{P(i)}{Q(i)}=\mathbb{E}_{i\sim P}\Bigl[\ln\frac{P(i)}{Q(i)}\Bigr]$$

Given two probability densities of two continuous random variables, p and q, the continuous form of K-L divergence is defined as

$$D_{KL}(p\|q):=\int p(x)\ln\frac{p(x)}{q(x)}\,dx=\mathbb{E}_{x\sim p}\Bigl[\ln\frac{p(x)}{q(x)}\Bigr]$$

Mathematically, the K-L divergence is not a true metric, since it violates one of the axioms of a metric, namely symmetry: $D_{KL}(p\|q)\ne D_{KL}(q\|p)$ in general. The first argument, P (or p), is usually construed as the real data distribution, while the second argument, Q (or q), is regarded as an approximation of it. An alternative way to understand the K-L divergence is that $D_{KL}(P\|Q)$ represents the information gained in moving from the prior Q to the posterior P.

There are a few important properties of K-L divergence:

(1) The K-L divergence is well-defined if and only if, for every x such that q(x)=0, p(x)=0 also holds;

(2) For any x such that p(x)=0, the contribution $p(x)\ln\frac{p(x)}{q(x)}$ is taken to be 0 (since $\lim_{t\to 0^+} t\ln t=0$);

(3) $D_{KL}(p\|q)\ge 0$, with equality iff $p=q$.

Now we prove the third property.

$$D_{KL}(p\|q)=-\int p(x)\ln\frac{q(x)}{p(x)}\,dx\;\ge\;-\int p(x)\Bigl(\frac{q(x)}{p(x)}-1\Bigr)dx=-\int q(x)\,dx+\int p(x)\,dx=0$$

where we used the inequality $\ln t\le t-1$ for $t>0$, with equality iff $t=1$, i.e. iff $p=q$.

This completes the proof.#
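As a quick numerical illustration of property (3) and of the asymmetry noted above (a sketch of mine with two arbitrary discrete distributions):

import numpy as np

p = np.array([0.7, 0.2, 0.1])
q = np.array([0.5, 0.3, 0.2])

kl_pq = np.sum(p * np.log(p / q))
kl_qp = np.sum(q * np.log(q / p))

print(kl_pq, kl_qp)               # both non-negative, and in general kl_pq != kl_qp
print(np.sum(p * np.log(p / p)))  # 0: the divergence vanishes when the arguments coincide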

B. Variational Calculus

Variational calculus is a natural extension of differentiation.

Given a functional $F[y]$ that maps a function $y(x)$ to a scalar in $\mathbb{F}$, where $\mathbb{F}=\mathbb{R}$ or $\mathbb{C}$, in parallel with the Taylor expansion, the functional expansion is defined as follows: for any test function $\eta(\cdot)$,

$$F[y(x)+\epsilon\eta(x)]=F[y(x)]+\epsilon\int\Bigl(\frac{\delta F}{\delta y}\Bigr)(x)\,\eta(x)\,dx+O(\epsilon^2)$$
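For the functional used in Proposition 1, $F[D]=\int_{\chi}\bigl[p_{\text{data}}(x)\log D(x)+p_g(x)\log(1-D(x))\bigr]dx$, this definition gives (a worked instance added here for clarity)

$$\Bigl(\frac{\delta F}{\delta D}\Bigr)(x)=\frac{p_{\text{data}}(x)}{D(x)}-\frac{p_g(x)}{1-D(x)},$$

and setting it to zero for every $x$ recovers $D^*_G(x)=p_{\text{data}}(x)/\bigl(p_{\text{data}}(x)+p_g(x)\bigr)$.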

C. Change of Random Variable in Measure Theory

Proof:

$$\int_{\Omega}p_z(z)\log\bigl(1-D(G(z))\bigr)\,dz=\int_{\chi}p_g(x)\log\bigl(1-D(x)\bigr)\,dx$$

where x=G(z) .

First, we define a measure space $(\Omega,\mathcal{F},\mu)$, where $\Omega$ is the sample space of $z$, $\mathcal{F}$ is a $\sigma$-field on $\Omega$, and $\mu$ is the probability measure induced by $p_z$. Further, note that $G(\cdot)$ is a measurable function $(\Omega,\mathcal{F})\to(\chi,\mathcal{B})$, where $\chi$ is the sample space of $x$ and $\mathcal{B}$ is a $\sigma$-field on $\chi$. Therefore, we have

$$\int_{\Omega}p_z(z)\log\bigl(1-D(G(z))\bigr)\,dz=\int_{\Omega}\log\bigl(1-D(G(z))\bigr)\,d\mu(z)=\int_{\chi}\log\bigl(1-D(x)\bigr)\,d\bigl(\mu\circ G^{-1}\bigr)(x)=\int_{\chi}\log\bigl(1-D(x)\bigr)\,d\mu_G(x)=\int_{\chi}p_g(x)\log\bigl(1-D(x)\bigr)\,dx$$

where $\mu_G=\mu\circ G^{-1}$ is the distribution (pushforward measure) of $x=G(z)$, whose density is $p_g$.

This completes the proof.#
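As a concrete illustration of this change of variable (a simple one-dimensional example of mine, covering only the special case where $G$ is invertible and differentiable): take $z\sim\mathrm{Uniform}(0,1)$ and $G(z)=z^2$; then for $x\in(0,1)$,

$$p_g(x)=p_z\bigl(G^{-1}(x)\bigr)\Bigl|\frac{dG^{-1}(x)}{dx}\Bigr|=1\cdot\frac{1}{2\sqrt{x}},$$

so integrating $\log(1-D(x))$ against $p_g$ over $\chi=(0,1)$ agrees with integrating $\log(1-D(G(z)))$ against $p_z$ over $\Omega=(0,1)$, as claimed.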
