
Mathematical Principle of GAN


Introduction

The potential advantage of deep learning lies in representing complicated probability densities of data with large-scale hierarchical models. So far, the great success of deep learning has mostly been in discriminative models, which usually map high-dimensional sensory inputs to categorical labels. Training discriminative models benefits greatly from the back-propagation algorithm, dropout, and ReLU activations with well-defined gradients. In contrast, deep generative models have performed comparatively poorly, largely because of the intractability of optimising the joint likelihood.

Based upon the observation above, Goodfellow proposed the generative adversarial net (GAN). As the name implies, a GAN consists of two networks: a generator and a discriminator. The generator produces fake data meant to look as real as the genuine data, which are assumed to be drawn from some unknown distribution (only God knows what), while the discriminator tries to improve itself in distinguishing the genuine from the fake. The whole process is like a two-player game.

This formulation is actually a general framework, since both the discriminator and the generator can be deep models of any kind. For the sake of simplicity, in this work we only consider multi-layer perceptrons (MLPs). Besides, the data produced by the generator are obtained by transforming a random noise vector. With this approach, the entire model can be trained using back-propagation and dropout, without approximate inference or Markov chains.

Generative Adversarial Net

In this section, we look at the GAN in detail and derive its optimisation objective.

As mentioned above, both the generator and the discriminator are MLPs for simplicity. Conceptually, the generator is expected to be a mapping that turns a random noise vector into a generated sample; the discriminator takes as input either a genuine or a fake sample and outputs the probability that it is genuine (i.e. drawn from the unknown distribution that only God knows). The whole framework is shown as follows,

[Figure: the GAN framework — the generator G maps noise z to fake samples, and the discriminator D outputs the probability of a sample being genuine]
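To make the two MLPs concrete, here is a minimal sketch (my own illustration, not part of the original formulation), assuming PyTorch; the noise dimension, data dimension and hidden width are purely illustrative.

import torch.nn as nn

noise_dim, data_dim, hidden = 100, 784, 128  # illustrative sizes (assumptions)

# Generator: maps a noise vector z ~ p_z(z) to a fake sample G(z) in data space.
G = nn.Sequential(
    nn.Linear(noise_dim, hidden), nn.ReLU(),
    nn.Linear(hidden, data_dim), nn.Tanh(),
)

# Discriminator: maps a sample (genuine or fake) to the probability of it being genuine.
D = nn.Sequential(
    nn.Linear(data_dim, hidden), nn.ReLU(),
    nn.Dropout(0.3),
    nn.Linear(hidden, 1), nn.Sigmoid(),
)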

Suppose we have a training set of $m$ samples, $S=\{x^{(1)},\dots,x^{(m)}\}$. Besides, given any probability density $p_z(z)$ (with a sufficiently powerful model, the simpler the better), we can draw $m$ noise samples $\{z^{(1)},\dots,z^{(m)}\}$ from the random variable $Z\sim p_z(z)$. Consequently, we can obtain the likelihood

$$\begin{aligned}L(x^{(1)},\dots,x^{(m)},z^{(1)},\dots,z^{(m)}\mid\theta_g,\theta_d)&=\prod_{i=1}^{m}D(x^{(i)})^{\mathbb{I}\{x^{(i)}\in\text{Data}\}}\bigl(1-D(x^{(i)})\bigr)^{\mathbb{I}\{x^{(i)}\notin\text{Data}\}}\;\prod_{j=1}^{m}D\bigl(G(z^{(j)})\bigr)^{\mathbb{I}\{G(z^{(j)})\in\text{Data}\}}\bigl(1-D\bigl(G(z^{(j)})\bigr)\bigr)^{\mathbb{I}\{G(z^{(j)})\notin\text{Data}\}}\\&=\prod_{i=1}^{m}D(x^{(i)})\prod_{j=1}^{m}\bigl(1-D\bigl(G(z^{(j)})\bigr)\bigr)\end{aligned}$$

Further, the log likelihood is obtained

$$\log L=\log\Bigl[\prod_{i=1}^{m}D(x^{(i)})\prod_{j=1}^{m}\bigl(1-D\bigl(G(z^{(j)})\bigr)\bigr)\Bigr]=\sum_{i=1}^{m}\log D(x^{(i)})+\sum_{j=1}^{m}\log\bigl(1-D\bigl(G(z^{(j)})\bigr)\bigr)$$

According to the law of large numbers, as $m\to\infty$, the empirical average (i.e. the log-likelihood scaled by $1/m$) approximates the corresponding expectations,

$$\frac{1}{m}\log L\;\approx\;\mathbb{E}_{x\sim p_{\text{data}}(x)}[\log D(x)]+\mathbb{E}_{z\sim p_z(z)}\bigl[\log\bigl(1-D(G(z))\bigr)\bigr]$$

Back to our aim: the whole process is a two-player game. The generator produces fake data meant to look as real as the genuine data drawn from the unknown distribution (only God knows what), while the discriminator tries to improve itself in distinguishing the genuine from the fake. Note that the log-likelihood above is the objective with respect to the discriminator $D(\cdot)$. Therefore, on one hand, we want to optimize the learnable parameters of the discriminator by maximizing the log-likelihood; on the other hand, we want to optimize those of the generator by minimizing it. Our optimization objective thus becomes

$$\min_G\max_D V(D,G)=\mathbb{E}_{x\sim p_{\text{data}}(x)}[\log D(x)]+\mathbb{E}_{z\sim p_z(z)}\bigl[\log\bigl(1-D(G(z))\bigr)\bigr]$$
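On finite minibatches the two expectations are estimated by sample averages; the following sketch (assuming the PyTorch G and D defined earlier, with a small eps added only for numerical stability) shows how the value function V(D, G) would be estimated:

import torch

def value_function(D, G, x_real, z_noise, eps=1e-8):
    # Monte Carlo estimate of V(D, G) on one minibatch:
    # E_{x~p_data}[log D(x)] + E_{z~p_z}[log(1 - D(G(z)))]
    term_real = torch.log(D(x_real) + eps).mean()
    term_fake = torch.log(1.0 - D(G(z_noise)) + eps).mean()
    return term_real + term_fake

# The discriminator is trained to ascend this quantity, the generator to descend it.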

Theoretical Results

The generator $G(\cdot)$ implicitly defines a probability distribution with pdf $p_g$. By "implicitly", I mean that $G$ is a mapping from the noise $z\sim p_z$ to the fake sample $G(z)$. It is then natural to ask: will $p_g$ reach $p_{\text{data}}$, the pdf of the unknown distribution only God knows? And if not, how far apart are they? The following proposition and theorem answer this question.

Firstly, given any generator G , we consider the optimal discriminator D.

Proposition 1. For a given $G$, the optimal discriminator is $D^*_G(x)=\dfrac{p_{\text{data}}(x)}{p_{\text{data}}(x)+p_g(x)}$.

Proof.
For the given G , we maximize the quantity V(G,D)

$$V(G,D)=\int_{\chi}p_{\text{data}}(x)\log\bigl(D(x)\bigr)\,dx+\int_{\Omega}p_z(z)\log\bigl(1-D(G(z))\bigr)\,dz$$

where $\chi$ and $\Omega$ are the sample spaces of $x$ and $z$, respectively.

The second term, under the change of variable $x=G(z)$, becomes

$$\int_{\Omega}p_z(z)\log\bigl(1-D(G(z))\bigr)\,dz=\int_{\chi}p_g(x)\log\bigl(1-D(x)\bigr)\,dx$$

This equality is not that straightforward. (Refer to the details in Appendix C: Change of Random Variable in Measure Theory)

Consequently,

$$V(G,D)=\int_{\chi}\Bigl[p_{\text{data}}(x)\log\bigl(D(x)\bigr)+p_g(x)\log\bigl(1-D(x)\bigr)\Bigr]dx$$

Since $G$ is fixed, $p_g$ is fixed, and $p_{\text{data}}$, the distribution that God created, is of course fixed as well; both are non-zero functions. Our aim is to find the $D$ that maximizes $V(G,D)$. To this end, we take the variational derivative with respect to $D$ (see Appendix B), set it to zero, and obtain the maximizer

$$D^*_G(x)=\frac{p_{\text{data}}(x)}{p_{\text{data}}(x)+p_g(x)}$$

This completes the proof. #
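The same result also follows from an elementary pointwise argument (a worked step added here for clarity): for each fixed $x$, write $a=p_{\text{data}}(x)$ and $b=p_g(x)$ and maximize the integrand $a\log D+b\log(1-D)$ over $D\in(0,1)$:

$$\frac{d}{dD}\bigl[a\log D+b\log(1-D)\bigr]=\frac{a}{D}-\frac{b}{1-D}=0\quad\Longrightarrow\quad D=\frac{a}{a+b},$$

and the second derivative $-a/D^2-b/(1-D)^2<0$ confirms that this critical point is indeed a maximum.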

By this proposition, we can reformulate the min-max game as minimizing the quantity $C(G)$, where

$$C(G)=\max_D V(G,D)=\mathbb{E}_{x\sim p_{\text{data}}}\Bigl[\log\frac{p_{\text{data}}(x)}{p_{\text{data}}(x)+p_g(x)}\Bigr]+\mathbb{E}_{x\sim p_g}\Bigl[\log\frac{p_g(x)}{p_{\text{data}}(x)+p_g(x)}\Bigr]$$

Now, by presenting Theorem 1, we answer the question raised at the beginning: will $p_g$ reach $p_{\text{data}}$, the pdf of the unknown distribution only God knows? And if not, how far apart are they?
Theorem 1. The global minimum of $C(G)$ is achieved if and only if $p_g=p_{\text{data}}$, and its value is $-\log 4$.

Proof.

$$\begin{aligned}C(G)&=\mathbb{E}_{x\sim p_{\text{data}}}\Bigl[\log\frac{p_{\text{data}}(x)}{p_{\text{data}}(x)+p_g(x)}\Bigr]+\mathbb{E}_{x\sim p_g}\Bigl[\log\frac{p_g(x)}{p_{\text{data}}(x)+p_g(x)}\Bigr]\\&=\mathbb{E}_{x\sim p_{\text{data}}}\Bigl[\log\frac{2\,p_{\text{data}}(x)}{p_{\text{data}}(x)+p_g(x)}-\log 2\Bigr]+\mathbb{E}_{x\sim p_g}\Bigl[\log\frac{2\,p_g(x)}{p_{\text{data}}(x)+p_g(x)}-\log 2\Bigr]\\&=-\log 4+D_{KL}\Bigl(p_{\text{data}}\,\Big\|\,\frac{p_{\text{data}}+p_g}{2}\Bigr)+D_{KL}\Bigl(p_g\,\Big\|\,\frac{p_{\text{data}}+p_g}{2}\Bigr)\\&\ge-\log 4\end{aligned}$$

The equality holds iff $p_{\text{data}}=p_g=\frac{p_{\text{data}}+p_g}{2}$ (refer to Appendix A).
This completes the proof. #
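The decomposition above is easy to sanity-check numerically. Below is a small sketch of mine using two arbitrary discrete distributions as stand-ins for $p_{\text{data}}$ and $p_g$; it confirms that $C(G)>-\log 4$ whenever the two differ and $C(G)=-\log 4$ when they coincide.

import numpy as np

def kl(p, q):
    # Discrete K-L divergence D_KL(p || q); assumes q > 0 wherever p > 0.
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

def C(p_data, p_g):
    # C(G) = -log 4 + D_KL(p_data || m) + D_KL(p_g || m), with m = (p_data + p_g) / 2.
    m = (p_data + p_g) / 2
    return -np.log(4) + kl(p_data, m) + kl(p_g, m)

p_data = np.array([0.1, 0.4, 0.3, 0.2])
p_g = np.array([0.25, 0.25, 0.25, 0.25])

print(C(p_data, p_g))     # strictly greater than -log 4 ≈ -1.386
print(C(p_data, p_data))  # exactly -log 4 when p_g = p_data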

As derived above, at the global optimum the generator exactly reproduces God's data-generating distribution.

Formally, we present the GAN training algorithm below.

[Algorithm: minibatch training of GANs — alternate k gradient steps on the discriminator with one gradient step on the generator]

Noticeably, the algorithm differs from our optimization objective. The objective asks us to optimize with respect to $D$ to completion and only then to optimize with respect to $G$; in practice, however, this incurs prohibitive computational cost and usually results in overfitting. Instead, the algorithm optimizes $D$ for $k$ steps, then takes one step on $G$, and iterates this procedure until convergence. In this way, as long as $G$ changes sufficiently slowly, $D$ always stays near its optimal solution. A sketch of this alternating procedure is given below; whether it still converges to the right distribution is answered by Proposition 2.
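Here is a hedged sketch of the alternating procedure, assuming the PyTorch G and D defined earlier; the Adam optimiser, the hyper-parameters, and the helper sample_minibatch (standing in for drawing a minibatch from the training set) are illustrative assumptions of mine, not part of the original algorithm, which uses plain minibatch stochastic gradient methods.

import torch

k, batch_size, num_iters, lr = 1, 64, 10000, 2e-4
eps = 1e-8
opt_D = torch.optim.Adam(D.parameters(), lr=lr)
opt_G = torch.optim.Adam(G.parameters(), lr=lr)

for it in range(num_iters):
    # --- k steps on the discriminator: ascend V(D, G) with G held fixed ---
    for _ in range(k):
        x_real = sample_minibatch(batch_size)    # hypothetical helper: a minibatch of real data
        z = torch.randn(batch_size, noise_dim)   # noise drawn from p_z
        loss_D = -(torch.log(D(x_real) + eps).mean()
                   + torch.log(1.0 - D(G(z).detach()) + eps).mean())
        opt_D.zero_grad()
        loss_D.backward()
        opt_D.step()

    # --- one step on the generator: descend V(D, G) with D held fixed ---
    z = torch.randn(batch_size, noise_dim)
    loss_G = torch.log(1.0 - D(G(z)) + eps).mean()
    opt_G.zero_grad()
    loss_G.backward()
    opt_G.step()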

We state the convergence result without proof.

Proposition 2. Suppose $G$ and $D$ both have sufficient capacity, and assume that, for each fixed $G$, the discriminator is allowed to reach its optimum $D^*_G$, and that $p_g$ is updated so as to improve the criterion

$$\mathbb{E}_{x\sim p_{\text{data}}}\bigl[\log D^*_G(x)\bigr]+\mathbb{E}_{x\sim p_g}\bigl[\log\bigl(1-D^*_G(x)\bigr)\bigr]$$

then $p_g$ converges to $p_{\text{data}}$.

For a better understanding, let us take a look at the following pedagogical illustration.

[Figure: evolution of $p_g$ (green), $p_{\text{data}}$ (black) and the discriminator output $D$ (blue) over panels (a)–(d) of training]

The lower horizontal line represents the domain from which the noise $z$ is sampled, and the upper horizontal line represents the data domain of $x$. As shown, $G(z)$ maps the uniform noise distribution to a non-uniform distribution. The green curve is the pdf of $x=G(z)$, namely $p_g$; the black curve is the pdf of the unknown distribution created by God, $p_{\text{data}}$; the blue curve is the output of $D$. (a) Before convergence, $D$ is only a partially accurate classifier. (b) In the inner loop, with the current $G$ fixed, $D$ is optimized; the result is the updated blue curve. (c) After the inner optimisation completes, $G$ is updated with $D$ fixed; the gradient of $D$ drives $G(z)$ towards regions that $D$ is more likely to classify as genuine. (d) Finally, if $G$ and $D$ have sufficient capacity, they reach an equilibrium, i.e. $p_g=p_{\text{data}}$. Now $D$ cannot distinguish the genuine from the fake, that is, $D(x)=\frac{1}{2}$.

Appendix

A. K-L Divergence

In probability and information theory, K-L divergence, also called information gain, is a “measure” of the distance between two probability densities. Here its formal definition is presented, in discrete form and continuous form, respectively.

Given the probability mass functions of two discrete random variables, P and Q, the discrete form of the K-L divergence is defined as

$$D_{KL}(P\|Q):=\sum_i P(i)\ln\frac{P(i)}{Q(i)}=\mathbb{E}_{i\sim P}\Bigl[\ln\frac{P(i)}{Q(i)}\Bigr]$$

Given two probability densities of two continuous random variables, p and q, the continuous form of K-L divergence is defined as

$$D_{KL}(p\|q):=\int p(x)\ln\frac{p(x)}{q(x)}\,dx=\mathbb{E}_{x\sim p}\Bigl[\ln\frac{p(x)}{q(x)}\Bigr]$$

Mathematically, the K-L divergence is not a true metric, since it violates one of the axioms of a metric, namely symmetry: $D_{KL}(p\|q)\ne D_{KL}(q\|p)$ in general. The first argument, P (or p), is usually construed as the real data distribution, while the second argument, Q (or q), is regarded as an approximation of it. An alternative way to understand the K-L divergence is that $D_{KL}(P\|Q)$ represents the information gained in moving from the prior Q to the posterior P.

There are a few important properties of K-L divergence:

(1) The K-L divergence is well-defined if and only if, for every x such that q(x)=0, p(x)=0 also holds;

(2) For any x such that p(x)=0, the contribution $p(x)\ln\frac{p(x)}{q(x)}$ is taken to be 0 (since $\lim_{t\to 0^+} t\ln t=0$);

(3) $D_{KL}(p\|q)\ge 0$, with equality iff $p=q$.

Now we prove the third property.

$$D_{KL}(p\|q)=-\int p(x)\ln\frac{q(x)}{p(x)}\,dx\;\ge\;-\int p(x)\Bigl(\frac{q(x)}{p(x)}-1\Bigr)dx=-\int q(x)\,dx+\int p(x)\,dx=0$$

where we used the inequality $\ln t\le t-1$ for $t>0$, with equality iff $t=1$, i.e. iff $p=q$.

This completes the proof.#
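As a quick numerical illustration of property (3) and of the asymmetry noted above (a sketch of mine with two arbitrary discrete distributions):

import numpy as np

p = np.array([0.7, 0.2, 0.1])
q = np.array([0.5, 0.3, 0.2])

kl_pq = np.sum(p * np.log(p / q))
kl_qp = np.sum(q * np.log(q / p))

print(kl_pq, kl_qp)               # both non-negative, and in general kl_pq != kl_qp
print(np.sum(p * np.log(p / p)))  # 0: the divergence vanishes when the arguments coincide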

B. Variational Calculus

Variational calculus is a natural extension of differentiation.

Given a functional $F[y]$ that maps a function $y(x)$ to a scalar in $\mathbb{F}$, where $\mathbb{F}=\mathbb{R}$ or $\mathbb{C}$, in parallel with the Taylor expansion, the functional expansion is defined as follows: for any test function $\eta(\cdot)$,

$$F[y(x)+\epsilon\eta(x)]=F[y(x)]+\epsilon\int\Bigl(\frac{\delta F}{\delta y}\Bigr)(x)\,\eta(x)\,dx+O(\epsilon^2)$$
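For the functional used in Proposition 1, $F[D]=\int_{\chi}\bigl[p_{\text{data}}(x)\log D(x)+p_g(x)\log(1-D(x))\bigr]dx$, this definition gives (a worked instance added here for clarity)

$$\Bigl(\frac{\delta F}{\delta D}\Bigr)(x)=\frac{p_{\text{data}}(x)}{D(x)}-\frac{p_g(x)}{1-D(x)},$$

and setting it to zero for every $x$ recovers $D^*_G(x)=p_{\text{data}}(x)/\bigl(p_{\text{data}}(x)+p_g(x)\bigr)$.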

C. Change of Random Variable in Measure Theory

Proof:

$$\int_{\Omega}p_z(z)\log\bigl(1-D(G(z))\bigr)\,dz=\int_{\chi}p_g(x)\log\bigl(1-D(x)\bigr)\,dx$$

where x=G(z) .

First, we define a measure space $(\Omega,\mathcal{F},\mu)$, where $\Omega$ is the sample space of $z$, $\mathcal{F}$ is a $\sigma$-field on $\Omega$, and $\mu$ is the probability measure induced by $p_z$. Further, note that $G(\cdot)$ is a measurable function $(\Omega,\mathcal{F})\to(\chi,\mathcal{B})$, where $\chi$ is the sample space of $x$ and $\mathcal{B}$ is a $\sigma$-field on $\chi$. Therefore, we have

$$\int_{\Omega}p_z(z)\log\bigl(1-D(G(z))\bigr)\,dz=\int_{\Omega}\log\bigl(1-D(G(z))\bigr)\,d\mu(z)=\int_{\chi}\log\bigl(1-D(x)\bigr)\,d\bigl(\mu\circ G^{-1}\bigr)(x)=\int_{\chi}\log\bigl(1-D(x)\bigr)\,d\mu_G(x)=\int_{\chi}p_g(x)\log\bigl(1-D(x)\bigr)\,dx$$

where $\mu_G=\mu\circ G^{-1}$ is the distribution (pushforward measure) of $x=G(z)$, whose density is $p_g$.

This completes the proof.#
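As a concrete illustration of this change of variable (a simple one-dimensional example of mine, covering only the special case where $G$ is invertible and differentiable): take $z\sim\mathrm{Uniform}(0,1)$ and $G(z)=z^2$; then for $x\in(0,1)$,

$$p_g(x)=p_z\bigl(G^{-1}(x)\bigr)\Bigl|\frac{dG^{-1}(x)}{dx}\Bigr|=1\cdot\frac{1}{2\sqrt{x}},$$

so integrating $\log(1-D(x))$ against $p_g$ over $\chi=(0,1)$ agrees with integrating $\log(1-D(G(z)))$ against $p_z$ over $\Omega=(0,1)$, as claimed.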
