Product of experts

Max Welling (2007), Scholarpedia, 2(10):3879. doi:10.4249/scholarpedia.3879, revision #91672

Curator: Max Welling

The Product of Experts model (PoE) (Hinton 2002) combines a number of individual component models (the experts) by taking their product and normalizing the result. Each expert is defined as a possibly unnormalized probabilistic model $f(x)$ over its input space,

 

$$P(x \mid \{\theta_j\}) = \frac{1}{Z} \prod_{j=1}^{M} f_j(x \mid \theta_j) \qquad (1)$$

 

with 
$$Z = \int dx \, \prod_{j=1}^{M} f_j(x \mid \theta_j)$$

PoEs stand in contrast to Mixture Models which combine expert models additively,

 

$$P(x \mid \{\theta_j\}) = \sum_{j=1}^{M} \alpha_j \, p_j(x \mid \theta_j) \qquad (2)$$

 

Figure 1: Graphical model representation of the restricted Boltzmann machine (RBM) and its generalization, the exponential family harmonium (EFH). Top layer nodes represent hidden variables while bottom layer nodes represent observed variables. The architecture is that of a bipartite Markov Random Field (MRF).

where each component model $p_j(x)$ is normalized over $x$ and $\sum_{j=1}^{M} \alpha_j = 1$.

Note that Mixture of Experts models are usually associated with conditional models where the experts are of the form $p(y \mid x)$ and the mixture coefficients $\alpha_j(x)$ (known as gating functions) may depend on $x$ as well. Conditional PoEs can be defined analogously.

One can qualitatively understand the difference between mixtures and products by observing that a mixture distribution can have high probability for event x when only a single expert assigns high probability to that event. In contrast, a product can only have high probability for an event x when no expert assigns an especially low probability to that event. Hence, metaphorically speaking, a single expert in a mixture has the power to pass a bill while a single expert in a product has the power to veto it.

Put another way, each component in a product represents a soft constraint, while each expert in a mixture represents a soft template or prototype. For an event to be likely under a product model, all constraints must be (approximately) satisfied, while an event is likely under a mixture model if it (approximately) matches with any single template. Hence, to sculpt the overall probability distribution of a mixture, each expert adds a lump of probability mass which is usually localised to one region, while to sculpt the overall probability of a product, each expert scales the probability at each point by a different factor. A product can also be viewed as adding together lumps in the log probability domain. This essential difference can result in much sharper boundaries, especially for high-dimensional input spaces (Hinton 2002).
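The veto metaphor can be made concrete with a small numerical example (two hypothetical one-dimensional Gaussian experts, not taken from the original article): an event that one expert strongly favors stays likely under a mixture but is effectively vetoed in a product.

```python
import numpy as np

def gaussian(x, mu, sigma):
    """Normalized one-dimensional Gaussian density."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

x = 0.0  # the event to score: expert 1 likes it, expert 2 vetoes it
p1 = gaussian(x, mu=0.0, sigma=1.0)   # expert 1: high density at x
p2 = gaussian(x, mu=10.0, sigma=1.0)  # expert 2: essentially zero at x

mixture = 0.5 * p1 + 0.5 * p2   # one "yes" vote is enough to pass the bill
product_unnorm = p1 * p2        # one veto kills the event

print(mixture)          # ~0.199: still quite likely under the mixture
print(product_unnorm)   # ~3e-23: vetoed by expert 2
```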

 


Training a Product of Experts

Given data $\{x_n\}$, $n = 1..N$ with $x_n \in \mathbb{R}^d$, one can use the log-likelihood as an objective function to train a PoE,

 

$$L(\{\theta_j\} \mid \{x_n\}) = \sum_{n=1}^{N} \sum_{j=1}^{M} \log f_j(x_n \mid \theta_j) \; - \; N \log Z \qquad (3)$$

 


Denoting the gradient of the objective w.r.t. $\theta_j$ with $\nabla_j L$, one can compute the following gradient,

 

$$\nabla_j L = \sum_{n=1}^{N} \nabla_j \log f_j(x_n) \; - \; N \left\langle \nabla_j \log f_j(x) \right\rangle_{P(x)} \qquad (4)$$

 


where $\langle \cdot \rangle_{P(x)}$ denotes taking the average w.r.t. $P(x)$. Learning is achieved by changing the parameters incrementally according to the following update rule,

 

$$\theta_j \leftarrow \theta_j + \eta \, \nabla_j L \qquad (5)$$

 


where η represents the learning rate. Learning efficiency can usually be improved by using a stochastic approximation of the full gradient based on a single data-case or a few data-cases (a mini-batch). Another effective heuristic to speed up learning is to add a "momentum term" to the gradient (Plaut et al, 1986).

The first term of the gradient (Eqn. (4)) can be interpreted as increasing the probability of expert $j$ on the dataset. The second term, on the other hand, can be interpreted as decreasing the probability of expert $j$ in regions of input space where the model assigns high probability. When these terms balance, learning has converged to a local maximum of the log-likelihood.
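The balance between the data term and the model term can be watched on a toy PoE where the model average in Eqn. (4) is available in closed form. The setup below is an illustrative assumption, not from the article: two unit-variance Gaussian experts $f_j(x) = \exp(-(x - \mu_j)^2/2)$, whose product is a Gaussian with mean $(\mu_1 + \mu_2)/2$ and variance $1/2$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy PoE: two unit-variance Gaussian experts f_j(x) = exp(-(x - mu_j)^2 / 2).
# Their product is Gaussian with mean (mu_1 + mu_2)/2 and variance 1/2, so
# the model average in Eqn. (4) needs no sampling here.
data = rng.normal(loc=3.0, scale=np.sqrt(0.5), size=1000)
mu = np.array([0.0, 0.0])   # expert parameters theta_j = mu_j
eta = 0.1                   # learning rate

for _ in range(200):
    # grad_j log f_j(x) = x - mu_j; the mu_j pieces cancel between the data
    # term and the model term of Eqn. (4), leaving <x>_data - <x>_model.
    grad = data.mean() - mu.mean()
    mu += eta * grad        # Eqn. (5): theta_j <- theta_j + eta * grad_j L

print(mu.mean())  # the two gradient terms balance at the data mean, ~3.0
```

At convergence the model mean equals the sample mean, which is exactly the point where the two terms of Eqn. (4) cancel.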

Contrastive Divergence learning

 

The simplicity of the gradient in Eqn. (4) is deceptive: it requires the evaluation of an intractable average over $P(x)$. For most interesting models this average requires methods like MCMC sampling to approximate it. But MCMC sampling is computationally expensive and results in high-variance estimates of the required averages. A cheaper, lower-variance alternative was proposed by (Hinton 2002) under the name contrastive divergence (CD). The idea is to run $N$ samplers in parallel, one for each data-case in the (mini-)batch. These samplers are initialized at the respective data-cases and move towards equilibrium using MCMC sampling. After only a few steps of sampling, long before the MCMC converges, there is usually sufficient signal in the population of samples to change the parameters. A surrogate learning rule can be derived by replacing the log-likelihood with a new objective: the contrastive divergence $KL(P_{data} \| P_{model}) - KL(P_k \| P_{model})$, where $P_{data}$ is the empirical distribution $P_{data} = \frac{1}{N} \sum_n \delta(x - x_n)$, $P_{model}$ is the current estimate of the model distribution, and $P_k$ is the distribution based on $k$ steps of sampling. Taking gradients and ignoring a term which is usually very small, the new learning rule is almost identical to the one based on the log-likelihood, but using the approximation

$$\left\langle \nabla_j \log f_j(x) \right\rangle_{P(x)} \approx \frac{1}{N} \sum_n \nabla_j \log f_j(x_n^k)$$
where $x_n^k$ is the sample obtained from MCMC sampler $n$ after $k$ steps of sampling.

 

One can view this approximation as trading variance for bias. Thus, at convergence, we expect that the estimates of the parameters will not be equal to those of maximum likelihood learning, but will be slightly biased. To correct this, one can increase k close to convergence (Carreira-Perpinan and Hinton 2005).

Restricted Boltzmann Machines and Exponential Family Harmoniums

 

Perhaps the simplest PoE is given by a restricted Boltzmann machine (see Figure 1). In this model there are two layers of binary (0/1) variables where the bottom layer is observed while the top layer remains unobserved or hidden. The joint probability distribution over hidden and observed variables is given as,

 

$$P(x, h) = \frac{1}{Z} \exp\left( \sum_i \alpha_i x_i + \sum_j \beta_j h_j + \sum_{ij} W_{ij} x_i h_j \right) \qquad (6)$$

 


where the undirected edges in the graphical model in Figure 1 represent $\{W_{ij}\}$. The bias terms are parameterized by $\{\alpha_i, \beta_j\}$. Marginalizing over $\{h_j\}$, the PoE structure becomes evident,

 

$$P(x) = \frac{1}{\tilde{Z}} \prod_i \exp(\alpha_i x_i) \prod_j \left( 1 + \exp\left( \beta_j + \sum_i W_{ij} x_i \right) \right) \qquad (7)$$

 

 

Figure 2: Sampling process for the RBM. Given data at $t=0$ we sample the hidden variables independently and compute the necessary sufficient statistics for the learning rule. We then reconstruct the data by sampling the visible variables independently given the values for the hidden variables, and subsequently sample the hidden variables one more time. If we kept sampling for a very long time we would obtain samples from the equilibrium distribution.

where elements in the first product represent single-variable experts and elements in the second product represent constraints between the input variables.

The conditional Bernoulli expert distributions can be generalized to distributions in the exponential family. The resulting joint model is called an exponential family harmonium (EFH) (Welling et al. 2004). The joint distribution can be obtained by replacing $x_i \rightarrow f_i(x_i)$ and $h_j \rightarrow g_j(h_j)$, where $f(\cdot)$ and $g(\cdot)$ are the features for the corresponding exponential family distributions.

The special bipartite structure of the RBM and EFH results in a very efficient Gibbs sampler that alternates between sampling all hidden variables independently given values for the observed variables and, vice versa, sampling all visible variables independently given values for the hidden variables. The efficient Gibbs sampler directly translates into an efficient contrastive divergence learning algorithm (see the previous section and Figure 2).
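A minimal sketch of this procedure, assuming a small binary RBM trained with CD-1 on a synthetic toy dataset (the sizes, learning rate, and data below are illustrative choices, not from the article):

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# Minimal binary RBM: d visible units, m hidden units, as in Eqn. (6).
d, m, N, eta = 6, 4, 100, 0.05
W = 0.01 * rng.standard_normal((d, m))  # weights W_ij
a = np.zeros(d)                          # visible biases alpha_i
b = np.zeros(m)                          # hidden biases beta_j

# Toy data: each row is all-zeros or all-ones (strongly correlated inputs).
X = (rng.random((N, 1)) > 0.5).astype(float) * np.ones((N, d))

for _ in range(50):
    # Positive phase: sample all hidden units independently given the data.
    ph0 = sigmoid(X @ W + b)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)
    # Negative phase: one block-Gibbs step (reconstruct visibles, resample hiddens).
    pv1 = sigmoid(h0 @ W.T + a)
    v1 = (rng.random(pv1.shape) < pv1).astype(float)
    ph1 = sigmoid(v1 @ W + b)
    # CD-1 updates: data statistics minus reconstruction statistics.
    W += eta * (X.T @ ph0 - v1.T @ ph1) / N
    a += eta * (X - v1).mean(axis=0)
    b += eta * (ph0 - ph1).mean(axis=0)
```

Each Gibbs half-step samples an entire layer in parallel, which is exactly what the bipartite structure buys.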

Relation to Independent and Extreme Components Analysis

 

Noiseless Independent Components Analysis (ICA) (Comon 1994) with an equal number of input dimensions and source distributions can be written as a PoE model as follows,

 

$$P(x \mid \{w_j\}) = |\det(W)| \, \prod_{j=1}^{M} p_j\left( \sum_i w_{ji} x_i \right) \qquad (8)$$

 

where $W$ is the matrix with elements $w_{ji}$.

Note that each expert, $p_j$, is defined as a distribution on a one-dimensional projection of the input space. One can think of each projection as a "source", and a linear combination of these sources generates the signal. Unless $W$ is rank deficient, the product is a well-defined distribution over the entire input space.
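As a sketch, the log-density in Eqn. (8) can be evaluated directly once the experts are chosen; logistic source densities are used below as an illustrative choice, not one mandated by the model.

```python
import numpy as np

def ica_log_density(x, W):
    """Log-density of the noiseless-ICA PoE of Eqn. (8), with logistic
    source experts p_j(s) = 1 / (4 cosh^2(s/2)) as an illustrative choice."""
    s = W @ x                                        # one "source" per expert
    log_pj = -2.0 * np.log(2.0 * np.cosh(s / 2.0))   # log p_j(s_j)
    return np.log(abs(np.linalg.det(W))) + log_pj.sum()

W = np.array([[2.0, 0.3],
              [-0.1, 1.5]])   # square, full-rank mixing matrix
print(ica_log_density(np.array([0.1, -0.2]), W))
```

Because $W$ is square and full rank, the $|\det(W)|$ factor is exactly the normalizer, so no intractable $Z$ appears.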

Choosing heavy-tailed Student-t distributions as the experts, one obtains the general form of the "Product of Student-t" distributions (PoT) (Welling et al. 2002). The PoT can be represented with the help of auxiliary variables (taking the role of hidden variables) as follows,

 

$$P(x, h) = \frac{1}{Z} \prod_{j=1}^{M} \exp\left( -h_j \left[ 1 + \tfrac{1}{2} \left( \sum_i w_{ji} x_i \right)^2 \right] + (1 - \alpha_j) \log h_j \right) \qquad (9)$$

 


where P(x|h) is a full covariance Gaussian distribution and P(h|x) a product of Gamma distributions.
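One Gibbs sweep over these two conditionals can be sketched as follows. The Gamma shape and rate are read off the exponent of Eqn. (9) under its conventions, and the filters, dimensions, and $\alpha$ values below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# One sweep of the two tractable PoT conditionals (a sketch; requires
# alpha_j < 2 so that the Gamma shape 2 - alpha_j stays positive).
W = rng.standard_normal((3, 3))   # filter matrix, rows w_j
alpha = np.full(3, 1.0)           # illustrative expert parameters
x = rng.standard_normal(3)

# P(h|x): independent Gamma distributions, one per expert.
s = W @ x                         # filter responses sum_i w_ji x_i
h = rng.gamma(shape=2.0 - alpha, scale=1.0 / (1.0 + 0.5 * s**2))

# P(x|h): zero-mean Gaussian with precision matrix W^T diag(h) W.
precision = W.T @ np.diag(h) @ W
x_new = rng.multivariate_normal(np.zeros(3), np.linalg.inv(precision))
```

Note there is no explaining away here: given $x$, the $h_j$ are sampled independently.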

Figure 3: Latent representation for an exponential family harmonium fit to text data. Each point represents a document, while its color codes for the hand-labelled topic of that document. Each dimension in latent space corresponds to the "activity" of a latent variable. The EFH did not see the labels but managed to organize the documents according to their topics.

The PoT becomes different from ICA if one chooses the number of experts to be larger than the number of input dimensions (a.k.a. an over-complete representation). In this case marginal independence between the hidden variables is lost, but conditional independence between the hidden variables is retained. Over-complete variants of ICA that retain marginal independence have also been proposed (Lewicki and Sejnowski 2000). Over-complete ICA models have conditional dependencies between the hidden variables, known as explaining away, which makes inference difficult. In contrast, for the over-complete PoT model, inference over the hidden variables given observations is trivial due to the absence of such conditional dependencies (Teh et al. 2003).

Instead of the non-Gaussian experts used for ICA, one can also choose an under-complete ($M < d$) set of one-dimensional Gaussian experts, i.e. $p_j(\sum_i w_{ji} x_i)$ with $p_j(\cdot)$ Gaussian. Using the fact that the inverse-covariance of the product is equal to the sum of the inverse-covariances of the individual Gaussian experts, one can formulate probabilistic principal component analysis (Roweis, 1997; Tipping and Bishop, 1999) or probabilistic minor component analysis (Williams and Agakov, 2002). However, it is also possible to formulate a model that extracts the optimal combination of principal and minor components in the spectrum of the sample covariance matrix. The probabilistic model, known as "eXtreme Components Analysis" (XCA), is described in (Welling et al. 2003).
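The fact about Gaussian products used above can be checked numerically in one dimension: the precision (inverse variance) of the product of two Gaussian densities is the sum of the experts' precisions.

```python
import numpy as np

# Verify numerically: multiplying two zero-mean Gaussian densities gives a
# Gaussian whose precision is the sum of the individual precisions.
x = np.linspace(-10, 10, 20001)
dx = x[1] - x[0]

def gauss(x, var):
    return np.exp(-0.5 * x**2 / var) / np.sqrt(2 * np.pi * var)

v1, v2 = 1.0, 0.5
prod = gauss(x, v1) * gauss(x, v2)
prod /= prod.sum() * dx                 # renormalize the product
var_prod = (prod * x**2).sum() * dx     # its variance, computed numerically

print(var_prod, 1.0 / (1/v1 + 1/v2))    # both ~0.3333
```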

Applications of PoEs

Variants of PoEs have been applied under different names to various data domains: for example, the rate-coded RBM to face recognition (Teh and Hinton 2001), the dual-wing harmonium to video-track data (Xing et al. 2005), the rate adapting Poisson model (see Figure 1) to text and image data (Gehler et al. 2006), the product of HMMs model to language data (Brown and Hinton 2001), and hierarchical versions of PoEs to digits (Hinton et al. 2006) and to text and collaborative filtering data (Salakhutdinov et al. 2007).

References

  • C.K.I. Williams and F.V. Agakov, Products of Gaussians and probabilistic minor components analysis, Neural Computation, 14(5):1169-1182, 2002.
  • A. Brown and G.E. Hinton, Products of hidden Markov models, Proceedings of the Conference on Artificial Intelligence and Statistics, 2001.
  • M. Carreira-Perpinan and G.E. Hinton, On contrastive divergence learning, Tenth International Workshop on Artificial Intelligence and Statistics, Barbados, 2005.
  • P. Comon, Independent component analysis, a new concept?, Signal Processing, 36:287-314, 1994.
  • M.E. Tipping and C.M. Bishop, Probabilistic principal component analysis, Journal of the Royal Statistical Society, Series B, 61(3):611-622, 1999.
  • P.V. Gehler, A.D. Holub, and M. Welling, The rate adapting Poisson model for information retrieval and object recognition, Proceedings of the 23rd International Conference on Machine Learning, 2006.
  • G.E. Hinton, Training products of experts by minimizing contrastive divergence, Neural Computation, 14:1771-1800, 2002.
  • G.E. Hinton, S. Osindero, and Y.W. Teh, A fast learning algorithm for deep belief nets, Neural Computation, 18:1527-1554, 2006.
  • M.S. Lewicki and T.J. Sejnowski, Learning overcomplete representations, Neural Computation, 12:337-365, 2000.
  • D.S. Plaut, S. Nowlan, and G.E. Hinton, Experiments on learning by back-propagation, Technical Report CMU-CS-86-126, Dept. of Computer Science, CMU, Pittsburgh, PA, 1986.
  • S. Roweis, EM algorithms for PCA and SPCA, Advances in Neural Information Processing Systems 10, pp. 626-632, 1997.
  • R.R. Salakhutdinov, A. Mnih, and G.E. Hinton, Restricted Boltzmann machines for collaborative filtering, Proceedings of the 24th International Conference on Machine Learning, 2007.
  • Y.W. Teh and G.E. Hinton, Rate-coded restricted Boltzmann machines for face recognition, Advances in Neural Information Processing Systems, volume 13, 2001.
  • Y.W. Teh, M. Welling, S. Osindero, and G.E. Hinton, Energy-based models for sparse overcomplete representations, Journal of Machine Learning Research (Special Issue on ICA), 4:1235-1260, 2003.
  • M. Welling, F. Agakov, and C.K.I. Williams, Extreme components analysis, Advances in Neural Information Processing Systems, volume 16, Vancouver, Canada, 2003.
  • M. Welling, G.E. Hinton, and S. Osindero, Learning sparse topographic representations with products of Student-t distributions, Advances in Neural Information Processing Systems, volume 15, Vancouver, Canada, 2002.
  • M. Welling, M. Rosen-Zvi, and G.E. Hinton, Exponential family harmoniums with an application to information retrieval, Advances in Neural Information Processing Systems, volume 17, Vancouver, Canada, 2004.
  • E. Xing, R. Yan, and A. Hauptmann, Mining associated text and images with dual-wing harmoniums, Proceedings of Uncertainty in Artificial Intelligence, 2005.


See also

Independent Component Analysis, Mixture of Experts, Pattern Recognition

Reposted from: https://www.cnblogs.com/daleloogn/p/4442379.html
