Mathematical Principle of GAN


The potential advantage of deep learning lies in representing complicated probability densities of data with large-scale hierarchical models. So far, the great successes of deep learning have mostly come from discriminative models, which usually map high-dimensional sensory inputs to categorical labels. Training discriminative models benefits greatly from the back-propagation algorithm, dropout, and units such as ReLU whose gradients are well behaved. Deep generative models, by contrast, have been less successful. This is largely due to the intractability of optimising the joint likelihood.

Based on this observation, Goodfellow et al. proposed the generative adversarial net (GAN). As the name implies, a GAN consists of two networks: a generator and a discriminator. The generator tries to produce fake data that look as real as the genuine data, which are assumed to be drawn from some unknown distribution (only God knows what), while the discriminator tries to get better at distinguishing the genuine from the fake. The whole process is like a gambling game.

This formulation is in fact a general framework, since both the discriminator and the generator can be deep models of any kind. For the sake of simplicity, we only consider multi-layer perceptrons (MLPs) in this work. The generator produces its data by transforming random noise. With this approach, the entire model can be trained using back-propagation and dropout, without approximate inference or Markov chains.

Generative Adversarial Net

In this section, we look at GAN in detail and derive its optimisation objective.

As mentioned above, both the generator and the discriminator are MLPs for simplicity. Conceptually, the generator maps random noise to generated data; the discriminator takes an input, either genuine or fake, and outputs the probability that it is genuine (drawn from the unknown distribution that only God knows). The whole framework is shown as follows.

[Figure: the GAN framework]


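Suppose we have a training set of $m$ samples, $S=\{x^{(1)},\dots,x^{(m)}\}$. Besides, given any probability density $p_z(z)$ (with a sufficiently powerful model, the simpler, the better), we can draw $m$ noise samples $\{z^{(1)},\dots,z^{(m)}\}$ from the random variable $Z\sim p_z(z)$. Labelling the genuine samples as 1 and the generated samples as 0, we obtain the likelihood

$$L=\prod_{i=1}^{m}D\left(x^{(i)}\right)\prod_{j=1}^{m}\left(1-D\left(G\left(z^{(j)}\right)\right)\right).$$

Further, taking the logarithm, the log likelihood is obtained

$$\log L=\sum_{i=1}^{m}\log D\left(x^{(i)}\right)+\sum_{j=1}^{m}\log\left(1-D\left(G\left(z^{(j)}\right)\right)\right).$$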


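According to the law of large numbers, as $m\to\infty$, the empirical loss (the log likelihood divided by $m$) approximates the expected loss,

$$\frac{1}{m}\sum_{i=1}^{m}\log D\left(x^{(i)}\right)+\frac{1}{m}\sum_{j=1}^{m}\log\left(1-D\left(G\left(z^{(j)}\right)\right)\right)\approx\mathbb{E}_{x\sim p_{data}}\left[\log D(x)\right]+\mathbb{E}_{z\sim p_z}\left[\log\left(1-D(G(z))\right)\right].$$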


Back to our aim: the whole process is tantamount to a gambling game between the generator, which produces fake data, and the discriminator, which tries to tell the genuine from the fake. Note that the log likelihood above is the objective with respect to the discriminator $D(\cdot)$. Therefore, on the one hand, we want to optimize the learnable parameters of the discriminator by maximizing the log likelihood; on the other, we want to optimize those of the generator by minimizing it. Our optimization objective thus becomes

$$\min_G\max_D V(G,D)=\mathbb{E}_{x\sim p_{data}}\left[\log D(x)\right]+\mathbb{E}_{z\sim p_z}\left[\log\left(1-D(G(z))\right)\right].$$
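To make the objective concrete, here is a minimal sketch that estimates the value $V(G,D)$ from samples by replacing the two expectations with sample averages. The logistic `discriminator`, the affine `generator`, the Gaussian "genuine" data and all parameter values are assumptions chosen only for illustration; they are not part of the original derivation.

```python
import math
import random

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def discriminator(x, w=1.0, b=0.0):
    """Hypothetical logistic discriminator D(x) = sigmoid(w*x + b)."""
    return sigmoid(w * x + b)

def generator(z, a=1.0, c=0.0):
    """Hypothetical affine generator G(z) = a*z + c."""
    return a * z + c

def empirical_value(xs, zs, D, G):
    """Sample-average estimate of E[log D(x)] + E[log(1 - D(G(z)))]."""
    real_term = sum(math.log(D(x)) for x in xs) / len(xs)
    fake_term = sum(math.log(1.0 - D(G(z))) for z in zs) / len(zs)
    return real_term + fake_term

rng = random.Random(0)
xs = [rng.gauss(2.0, 1.0) for _ in range(10000)]  # stand-in for genuine data
zs = [rng.gauss(0.0, 1.0) for _ in range(10000)]  # noise samples z ~ p_z

v = empirical_value(xs, zs, discriminator, generator)
print("estimated V(G, D):", v)

# A discriminator that always answers 1/2 gives exactly log(1/2) + log(1/2).
v_blind = empirical_value(xs, zs, lambda x: 0.5, generator)
print("V with D = 1/2:", v_blind, "vs -log 4 =", -math.log(4))
```

Both terms are logarithms of probabilities, so the estimate is always negative; the discriminator tries to push it towards 0, while the generator tries to push it down.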


Theoretical Results

The generator $G(\cdot)$ implicitly defines a probability distribution with pdf $p_g$. By “implicitly”, I mean that $G$ is a mapping from the noise $z\sim p_z$ to the fake sample $G(z)$, so that $p_g$ is the density of $G(z)$ and is never written down explicitly. It is then natural to ask: will $p_g$ reach $p_{data}$, the pdf of the unknown distribution only God knows? And if not, how far apart are they? The following proposition and theorem answer this question.

Firstly, given any generator $G$, we consider the optimal discriminator $D$.

Proposition 1. For a given $G$, the optimal discriminator is given by $D_G^*(x)=\frac{p_{data}(x)}{p_{data}(x)+p_g(x)}$.

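For the given $G$, we maximize the quantity $V(G,D)$

$$V(G,D)=\int_{\chi}p_{data}(x)\log D(x)\,dx+\int_{\Omega}p_z(z)\log\left(1-D(G(z))\right)dz,$$

where $\chi$ and $\Omega$ are the sample spaces of $x$ and $z$, respectively.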

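The second term, by $x=G(z)$, becomes

$$\int_{\Omega}p_z(z)\log\left(1-D(G(z))\right)dz=\int_{\chi}p_g(x)\log\left(1-D(x)\right)dx.$$

This equality is not that straightforward. (Refer to the details in Appendix C: Change of Random Variable in Measure Theory) Consequently,

$$V(G,D)=\int_{\chi}\left[p_{data}(x)\log D(x)+p_g(x)\log\left(1-D(x)\right)\right]dx.$$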



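Since $G$ is fixed, $p_g$ is fixed, and so is the unknown data distribution that God created. We can therefore treat $p_{data}$ and $p_g$ as given functions that are not both zero. Our aim is to find the $D$ that maximizes $V(G,D)$. To this end, we take the variational derivative with respect to $D$ (see Appendix B) and set it to 0,

$$\frac{\delta V}{\delta D(x)}=\frac{p_{data}(x)}{D(x)}-\frac{p_g(x)}{1-D(x)}=0,$$

which yields the maximizer

$$D_G^*(x)=\frac{p_{data}(x)}{p_{data}(x)+p_g(x)}.$$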


This completes the proof. #
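Proposition 1 can be checked numerically at a single point $x$. Writing $a=p_{data}(x)$ and $b=p_g(x)$, the integrand of $V(G,D)$ at that point is $a\log y+b\log(1-y)$ with $y=D(x)$. The sketch below maximizes it over a grid and compares the result with $a/(a+b)$. The values of `a` and `b` and the grid resolution are arbitrary assumptions for illustration.

```python
import math

def pointwise_objective(y, a, b):
    """Integrand of V(G, D) at one point x, with a = p_data(x), b = p_g(x), y = D(x)."""
    return a * math.log(y) + b * math.log(1.0 - y)

def grid_argmax(a, b, steps=10000):
    """Brute-force maximizer of the pointwise objective over y in (0, 1)."""
    best_y, best_val = None, -math.inf
    for i in range(1, steps):
        y = i / steps
        val = pointwise_objective(y, a, b)
        if val > best_val:
            best_y, best_val = y, val
    return best_y

a, b = 0.3, 0.1  # assumed density values at some point x
y_star = grid_argmax(a, b)
print("grid maximizer:", y_star)
print("a / (a + b):   ", a / (a + b))
```

The grid search only confirms the closed form at one point; the proposition itself follows from the variational argument above.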

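By this proposition, we can reformulate the min-max gambling game as minimizing the quantity $C(G)$, where

$$C(G)=\max_D V(G,D)=\mathbb{E}_{x\sim p_{data}}\left[\log\frac{p_{data}(x)}{p_{data}(x)+p_g(x)}\right]+\mathbb{E}_{x\sim p_g}\left[\log\frac{p_g(x)}{p_{data}(x)+p_g(x)}\right].$$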


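Now, by presenting Theorem 1, we answer the question raised at the beginning: will $p_g$ reach $p_{data}$, the pdf of the unknown distribution only God knows? And if not, how far apart are they?

Theorem 1. The global minimum of $C(G)$ is attained if and only if $p_g=p_{data}$, and its value is $-\log 4$.

Rewriting each logarithm relative to the mixture $\frac{p_{data}+p_g}{2}$ gives

$$C(G)=-\log 4+D_{KL}\left(p_{data}\,\middle\|\,\frac{p_{data}+p_g}{2}\right)+D_{KL}\left(p_g\,\middle\|\,\frac{p_{data}+p_g}{2}\right)\ge-\log 4.$$

The equality holds iff $p_{data}=p_g=\frac{p_{data}+p_g}{2}$, since the K-L divergence is non-negative and vanishes only when its two arguments coincide. (Refer to Appendix A) The gap $C(G)+\log 4$ is twice the Jensen-Shannon divergence between $p_{data}$ and $p_g$, which quantifies how far apart the two distributions are.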
This completes the proof. #
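As a sanity check on Theorem 1, the following sketch evaluates $C(G)$ for discrete distributions, where the expectations reduce to finite sums, and compares it with $-\log 4$ plus twice the Jensen-Shannon divergence. The two probability vectors are arbitrary assumptions for illustration; the helper names `kl`, `jsd` and `c_of_g` are hypothetical.

```python
import math

def kl(p, q):
    """Discrete K-L divergence D_KL(p || q), using the convention 0 * ln(0/q) = 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jsd(p, q):
    """Jensen-Shannon divergence: average K-L divergence to the mixture (p + q) / 2."""
    mid = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, mid) + 0.5 * kl(q, mid)

def c_of_g(p_data, p_g):
    """C(G) = E_{p_data}[log D*(x)] + E_{p_g}[log(1 - D*(x))] with the optimal D*."""
    total = 0.0
    for pd, pg in zip(p_data, p_g):
        if pd > 0:
            total += pd * math.log(pd / (pd + pg))
        if pg > 0:
            total += pg * math.log(pg / (pd + pg))
    return total

p_data = [0.5, 0.3, 0.2]  # assumed "genuine" distribution
p_g = [0.2, 0.2, 0.6]     # assumed generator distribution

c_mismatch = c_of_g(p_data, p_g)
c_match = c_of_g(p_data, p_data)
print("C(G), p_g != p_data:", c_mismatch)
print("-log 4 + 2 * JSD:   ", -math.log(4) + 2 * jsd(p_data, p_g))
print("C(G), p_g == p_data:", c_match, "vs -log 4 =", -math.log(4))
```

When the two distributions differ, $C(G)$ sits strictly above $-\log 4$; when they coincide, it equals $-\log 4$ up to rounding.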

As derived above, the generator is, in principle, able to duplicate God's process of generating data.

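Formally, the GAN training algorithm is as follows. In each training iteration:

(1) For $k$ steps: sample a minibatch of $m$ noise samples $\{z^{(1)},\dots,z^{(m)}\}$ from $p_z(z)$ and a minibatch of $m$ samples $\{x^{(1)},\dots,x^{(m)}\}$ from the training set, then update $D$ by ascending the stochastic gradient of $\frac{1}{m}\sum_{i=1}^{m}\left[\log D\left(x^{(i)}\right)+\log\left(1-D\left(G\left(z^{(i)}\right)\right)\right)\right]$ with respect to its parameters;

(2) Sample a minibatch of $m$ noise samples $\{z^{(1)},\dots,z^{(m)}\}$ from $p_z(z)$, then update $G$ by descending the stochastic gradient of $\frac{1}{m}\sum_{i=1}^{m}\log\left(1-D\left(G\left(z^{(i)}\right)\right)\right)$ with respect to its parameters.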
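The alternating scheme, $k$ discriminator updates followed by one generator update, can be sketched on a one-dimensional toy problem. Everything specific here is an assumption made for illustration: the "genuine" data are drawn from $N(3,1)$, the generator is affine, $G(z)=az+c$, the discriminator is logistic, $D(x)=\sigma(wx+b)$, and the gradients are written out by hand instead of using back-propagation through an MLP. A linear discriminator can only react to a shift in the mean, so this sketch illustrates the update order and makes no claim about convergence.

```python
import math
import random

def sigmoid(t):
    # Numerically stable logistic function.
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)

def train_toy_gan(iterations=100, k=1, m=64, lr=0.05, seed=0):
    """Alternate k ascent steps on D with one descent step on G (toy 1-D setting)."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0  # discriminator parameters: D(x) = sigmoid(w*x + b)
    a, c = 1.0, 0.0  # generator parameters:     G(z) = a*z + c
    for _ in range(iterations):
        for _ in range(k):
            xs = [rng.gauss(3.0, 1.0) for _ in range(m)]          # genuine minibatch
            gs = [a * rng.gauss(0.0, 1.0) + c for _ in range(m)]  # fake minibatch
            # Gradient of (1/m) * sum[log D(x) + log(1 - D(G(z)))] w.r.t. (w, b).
            grad_w = (sum((1.0 - sigmoid(w * x + b)) * x for x in xs)
                      - sum(sigmoid(w * g + b) * g for g in gs)) / m
            grad_b = (sum(1.0 - sigmoid(w * x + b) for x in xs)
                      - sum(sigmoid(w * g + b) for g in gs)) / m
            w += lr * grad_w  # ascent: D maximizes
            b += lr * grad_b
        zs = [rng.gauss(0.0, 1.0) for _ in range(m)]
        ds = [sigmoid(w * (a * z + c) + b) for z in zs]
        # Gradient of (1/m) * sum log(1 - D(G(z))) w.r.t. (a, c):
        # d/dt log(1 - sigmoid(t)) = -sigmoid(t), with t = w*(a*z + c) + b.
        grad_a = sum(-d * w * z for d, z in zip(ds, zs)) / m
        grad_c = sum(-d * w for d in ds) / m
        a -= lr * grad_a  # descent: G minimizes
        c -= lr * grad_c
    return w, b, a, c

w, b, a, c = train_toy_gan()
print("discriminator (w, b):", w, b)
print("generator     (a, c):", a, c)
```

In this setting the generator offset `c` starts at 0 and is pushed in the direction of the data mean, because the discriminator learns a positive slope `w` while the fake samples lie to the left of the genuine ones.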


Note that the algorithm presented differs from our optimization objective. The objective asks us to optimize with respect to $D$ to completion and only then optimize with respect to $G$. In practice, however, this is computationally prohibitive and, on a finite data set, usually results in overfitting. Instead, the algorithm alternates between $k$ steps of optimizing $D$ and one step of optimizing $G$, and repeats this until convergence. In this way, as long as $G$ changes sufficiently slowly, $D$ stays near its optimal solution. Naturally, we are curious whether this procedure still reaches the same solution. This is answered by the following proposition.

We state the convergence result for this algorithm without proof.

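Proposition 2. Suppose $G$ and $D$ both have sufficient capacity, and assume that, at each step, $D$ is allowed to reach its optimum for the given $G$, and that $p_g$ is updated so as to improve the criterion

$$\mathbb{E}_{x\sim p_{data}}\left[\log D_G^*(x)\right]+\mathbb{E}_{x\sim p_g}\left[\log\left(1-D_G^*(x)\right)\right];$$

then $p_g$ converges to $p_{data}$.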

For a better understanding, we take a look at the following pedagogical illustration.

[Figure: four panels, (a) to (d), showing $p_{data}$, $p_g$ and $D$ during training]


The lower horizontal line represents the range of the noise $z$, and $x$ represents the data range. As shown, $G(z)$ maps the uniform distribution to a non-uniform distribution. The green curve represents $p_g$, the pdf of $x=G(z)$; the black curve represents $p_{data}$, the pdf of the unknown distribution created by God; the blue curve is the output of the discriminator $D$. (a) Before convergence, $D$ distinguishes the data only partly correctly. (b) In the inner optimisation, $D$ is optimized with the current $G$ fixed; the result is shown by the blue curve. (c) After the inner optimisation is completed, $G$ is optimized with $D$ fixed; the gradient of $D$ drives $G(z)$ towards regions that $D$ is more likely to classify as genuine. (d) At last, provided $G$ and $D$ have sufficient capacity, they reach an equilibrium, i.e. $p_g=p_{data}$. Now $D$ cannot distinguish the genuine from the fake, that is, $D(x)=\frac{1}{2}$.


A. K-L Divergence

In probability and information theory, the K-L divergence, also called information gain, is a “measure” of the distance between two probability distributions. Its formal definition is presented here in discrete and continuous form, respectively.

Given the probability mass functions $P$ and $Q$ of two discrete random variables, the discrete form of the K-L divergence is defined as

$$D_{KL}(P\|Q)=\sum_{i}P(i)\ln\frac{P(i)}{Q(i)}.$$


Given the probability densities $p$ and $q$ of two continuous random variables, the continuous form of the K-L divergence is defined as

$$D_{KL}(p\|q)=\int_{-\infty}^{\infty}p(x)\ln\frac{p(x)}{q(x)}\,dx.$$


Mathematically, the K-L divergence is not a metric, since it violates one of the axioms of a metric, namely symmetry: in general, $D_{KL}(p\|q)\neq D_{KL}(q\|p)$. The first argument, e.g. $P$ (or $p$), can be construed as the real distribution of the data, while the second argument, e.g. $Q$ (or $q$), can be regarded as an approximation of the real distribution. An alternative way to understand the K-L divergence is that $D_{KL}(P\|Q)$ represents the information gain from the prior $Q$ to the posterior $P$.

There are a few important properties of K-L divergence:

(1) The K-L divergence is well-defined if and only if $p(x)=0$ holds for every $x$ such that $q(x)=0$;

(2) For any $x$ such that $p(x)=0$, we take $p(x)\ln\frac{p(x)}{q(x)}=0$ (since $\lim_{x\to 0^+}x\ln x=0$);

(3) $D_{KL}(p\|q)\ge 0$. The equality holds iff $p=q$.

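Now we prove the third property. Since $\ln t\le t-1$ for all $t>0$, with equality iff $t=1$, we have

$$-D_{KL}(p\|q)=\int p(x)\ln\frac{q(x)}{p(x)}\,dx\le\int p(x)\left(\frac{q(x)}{p(x)}-1\right)dx=\int q(x)\,dx-\int p(x)\,dx\le 0,$$

where the integrals run over the set on which $p(x)>0$. The equality holds iff $\frac{q(x)}{p(x)}=1$ almost everywhere, i.e. $p=q$.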


This completes the proof.#
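The asymmetry and the non-negativity of the K-L divergence are easy to observe numerically in the discrete case. In this sketch the two probability vectors are arbitrary assumptions for illustration.

```python
import math

def kl(p, q):
    """Discrete K-L divergence D_KL(p || q), using the convention 0 * ln(0/q) = 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.5]  # assumed "real" distribution
q = [0.9, 0.1]  # assumed approximation

kl_pq = kl(p, q)
kl_qp = kl(q, p)
print("D_KL(p || q) =", kl_pq)
print("D_KL(q || p) =", kl_qp)
print("D_KL(p || p) =", kl(p, p))
```

The two directions give different positive values, while the divergence of a distribution from itself is zero.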

B. Variational Calculus

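Variational calculus is a natural extension of differential calculus from functions to functionals.

Given a functional $F[y]: y(x)\mapsto\mathbb{F}$, where $\mathbb{F}=\mathbb{R}$ or $\mathbb{C}$, in parallel with the Taylor expansion, the functional expansion is defined as follows: for any $\eta(\cdot)$,

$$F[y+\epsilon\eta]=F[y]+\epsilon\int\frac{\delta F}{\delta y(x)}\,\eta(x)\,dx+O(\epsilon^2),$$

where $\frac{\delta F}{\delta y(x)}$ is the variational derivative of $F$ with respect to $y$. The functional $F$ is stationary at $y$ when this derivative vanishes for all $x$.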


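C. Change of Random Variable in Measure Theory

We want to show that

$$\int_{\Omega}p_z(z)\log\left(1-D(G(z))\right)dz=\int_{\chi}p_g(x)\log\left(1-D(x)\right)dx,$$

where $x=G(z)$.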

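First, we define a measure space $(\Omega,\mathcal{F},\mathbb{P})$, where $\Omega$ is the sample space of $z$, $\mathcal{F}$ is a $\sigma$-field on $\Omega$, and $\mathbb{P}$ is the distribution of $z$ with density $p_z$. Further, note that $G(\cdot)$ is a measurable function $(\Omega,\mathcal{F})\to(\chi,\mathcal{B})$, where $\chi$ is the sample space of $x$ and $\mathcal{B}$ is a $\sigma$-field on $\chi$. Therefore, we have

$$\int_{\Omega}\log\left(1-D(G(z))\right)d\mathbb{P}(z)=\int_{\chi}\log\left(1-D(x)\right)d\mathbb{P}_G(x),$$

where $\mathbb{P}_G=\mathbb{P}\circ G^{-1}$ is the distribution of $x$, whose density is $p_g$.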

This completes the proof.#
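The change of variable can be checked numerically in a case where $p_g$ is known in closed form. The sketch assumes $z$ uniform on $(0,1)$ and $G(z)=e^z$, so that $x\in(1,e)$ and $p_g(x)=1/x$; the discriminator $D(x)=\sigma(x-2)$ is likewise a hypothetical choice. Both sides of the identity are evaluated with a simple midpoint rule.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def f(x):
    """log(1 - D(x)) for a hypothetical discriminator D(x) = sigmoid(x - 2)."""
    return math.log(1.0 - sigmoid(x - 2.0))

def G(z):
    """Assumed generator: G(z) = exp(z)."""
    return math.exp(z)

def p_z(z):
    """Assumed noise density: uniform on (0, 1)."""
    return 1.0

def p_g(x):
    """Density of x = G(z) on (1, e): p_z(ln x) * |d(ln x)/dx| = 1 / x."""
    return 1.0 / x

def midpoint(func, lo, hi, n=20000):
    """Midpoint-rule approximation of the integral of func over [lo, hi]."""
    h = (hi - lo) / n
    return h * sum(func(lo + (i + 0.5) * h) for i in range(n))

lhs = midpoint(lambda z: p_z(z) * f(G(z)), 0.0, 1.0)     # integral over the noise space
rhs = midpoint(lambda x: p_g(x) * f(x), 1.0, math.e)     # integral over the data space
print("integral over z:", lhs)
print("integral over x:", rhs)
```

The two integrals agree up to quadrature error, which is what the measure-theoretic argument guarantees in general without requiring a closed form for $p_g$.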

