Product of experts

A Product of Experts model (PoE) (Hinton 2002) combines a number of individual component models (the experts) by taking their product and normalizing the result. Each expert is defined as a possibly unnormalized probabilistic model f(x) over its input space.

P(x|\{\theta_j\}) = \frac{1}{Z} \prod_{j=1}^{M} f_j(x|\theta_j) \qquad (1)

with

Z = \int dx\; \prod_{j=1}^{M} f_j(x|\theta_j)

PoEs stand in contrast to Mixture Models, which combine expert models additively,

P(x|\{\theta_j\}) = \sum_{j=1}^{M} \alpha_j\, p_j(x|\theta_j) \qquad (2)

where each component model p_j(x) is normalized over x and ∑_{j=1}^{M} α_j = 1.

Figure 1: Graphical model representation of the restricted Boltzmann machine (RBM) and its generalization, the exponential family harmonium (EFH). Top-layer nodes represent hidden variables while bottom-layer nodes represent observed variables. The architecture is that of a bipartite Markov random field (MRF).

Note that mixture of experts models are usually associated with conditional models, where the experts are of the form p(y|x) and the mixture coefficients α(x) (known as gating functions) may depend on x as well. Conditional PoEs can be defined analogously.

One can qualitatively understand the difference between mixtures and products by observing that a mixture distribution can have high probability for an event x when only a single expert assigns high probability to that event. In contrast, a product can only have high probability for an event x when no expert assigns an especially low probability to that event. Hence, metaphorically speaking, a single expert in a mixture has the power to pass a bill while a single expert in a product has the power to veto it.

Put another way, each component in a product represents a soft constraint, while each expert in a mixture represents a soft template or prototype. For an event to be likely under a product model, all constraints must be (approximately) satisfied, while an event is likely under a mixture model if it (approximately) matches any single template. Hence, to sculpt the overall probability distribution of a mixture, each expert adds a lump of probability mass which is usually localised to one region, while to sculpt the overall probability of a product, each expert scales the probability at each point by a different factor. A product can also be viewed as adding together lumps in the log probability domain. This essential difference can result in much sharper boundaries, especially for high-dimensional input spaces (Hinton 2002).
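To make this concrete, here is a small numerical illustration (a toy example, not taken from the article) comparing a mixture and a normalized product of two one-dimensional Gaussian experts: the mixture keeps probability mass near both experts, while the product concentrates mass only where neither expert objects.

```python
import numpy as np

def gauss_pdf(x, mu, sigma):
    """Density of a one-dimensional Gaussian expert."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

x = np.linspace(-6.0, 6.0, 1201)
dx = x[1] - x[0]

p1 = gauss_pdf(x, -2.0, 1.0)   # expert 1 "votes" for values near -2
p2 = gauss_pdf(x, +2.0, 1.0)   # expert 2 "votes" for values near +2

# Mixture (Eqn. 2): add the experts; a single expert suffices to give mass.
mixture = 0.5 * p1 + 0.5 * p2

# Product (Eqn. 1): multiply the experts and renormalize numerically;
# only points that no expert vetoes keep appreciable probability.
product = p1 * p2
product /= product.sum() * dx

print("mixture mode near:", x[np.argmax(mixture)])   # close to one of the two means
print("product mode near:", x[np.argmax(product)])   # close to 0, where both experts agree
```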


Training a Product of Experts

Given data {x_n}, n = 1..N, with x ∈ ℝ^d, one can use the log-likelihood as an objective function to train a PoE,

L(\{\theta_j\}|\{x_n\}) = \sum_{n=1}^{N} \sum_{j=1}^{M} \log f_j(x_n|\theta_j) - N \log Z \qquad (3)


Denoting the gradient of the objective w.r.t. θ_j with ∇_j L, one can compute the following gradient,

\nabla_j L = \sum_{n=1}^{N} \nabla_j \log f_j(x_n) - N \left\langle \nabla_j \log f_j(x) \right\rangle_{P(x)} \qquad (4)


where ⟨·⟩_{P(x)} denotes taking the average w.r.t. P(x). Learning is achieved by changing the parameters incrementally according to the following update rule,

\theta_j \leftarrow \theta_j + \eta\, \nabla_j L \qquad (5)


where η represents the learning rate. Learning efficiency can usually be improved by using a stochastic approximation of the full gradient based on a single data-case or a few data-cases (a mini-batch). Another effective heuristic to speed up learning is to add a "momentum term" to the gradient (Plaut et al. 1986).

The first term of the gradient (Eqn. (4)) can be interpreted as increasing the probability of expert j on the dataset. The second term, on the other hand, can be interpreted as decreasing the probability of expert j in regions of input space where the model assigns high probability. When these terms balance, learning has converged to a local maximum of the log-likelihood.
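As a minimal sketch of the update rule in Eqn. (5) combined with the mini-batch and momentum heuristics mentioned above (the gradient estimate passed in is a placeholder standing in for Eqn. (4) evaluated on a mini-batch):

```python
import numpy as np

def momentum_update(theta, grad, velocity, lr=0.01, momentum=0.9):
    """One incremental update of Eqn. (5) with a momentum term (Plaut et al. 1986).

    theta, grad and velocity are arrays of the same shape; grad is a stochastic
    estimate of the log-likelihood gradient for one expert, e.g. from a mini-batch.
    """
    velocity = momentum * velocity + lr * grad   # running average of recent gradients
    theta = theta + velocity                     # gradient *ascent* on the log-likelihood
    return theta, velocity

# Hypothetical usage with a made-up gradient estimate:
theta = np.zeros(5)
velocity = np.zeros(5)
grad_estimate = np.random.randn(5)               # stands in for Eqn. (4) on a mini-batch
theta, velocity = momentum_update(theta, grad_estimate, velocity)
```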

Contrastive Divergence learning

The simplicity of the gradient in Eqn. (4) is deceptive: it requires the evaluation of an intractable average over P(x). For most interesting models this average requires methods like MCMC sampling to approximate it. But MCMC sampling is computationally expensive and results in high-variance estimates of the required averages. A cheaper, lower-variance alternative was proposed by (Hinton 2002) under the name contrastive divergence (CD). The idea is to run N samplers in parallel, one for each data-case in the (mini-)batch. These samplers must be initialized at the respective data-cases and will move towards equilibrium using MCMC sampling. After only a few steps of sampling, long before the MCMC converges, there is usually sufficient signal in the population of samples to change the parameters. A surrogate learning rule can be derived by replacing the log-likelihood with a new objective, the contrastive divergence KL(P_data || P_model) − KL(P_k || P_model), where P_data is the empirical distribution P_data = (1/N) ∑_n δ(x − x_n), P_model is the current estimate of the model distribution and P_k is the distribution based on k steps of sampling. Taking gradients and ignoring a term which is usually very small, the new learning rule is almost identical to the one based on the log-likelihood, but using the approximation

\left\langle \nabla_j \log f_j(x) \right\rangle_{P(x)} \approx \frac{1}{N} \sum_n \nabla_j \log f_j(x_n^k)

where x_n^k is the sample obtained from MCMC sampler n after k steps of sampling.

One can view this approximation as trading variance for bias. Thus, at convergence, we expect that the estimates of the parameters will not be equal to those of maximum likelihood learning, but will be slightly biased. To correct this, one can increase k close to convergence (Carreira-Perpinan and Hinton 2005).
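The CD-k gradient estimate can be sketched generically as follows; grad_log_f and mcmc_step are hypothetical callables standing in for the expert gradients and a single MCMC transition of the current model (an illustrative sketch, not code from the article).

```python
import numpy as np

def cd_k_gradient(batch, grad_log_f, mcmc_step, k=1):
    """Contrastive divergence estimate of the gradient in Eqn. (4) for one expert.

    batch:      array of shape (N, d) holding the data-cases in the (mini-)batch
    grad_log_f: hypothetical function x -> gradient of log f_j(x) w.r.t. theta_j
    mcmc_step:  hypothetical function applying one MCMC transition that leaves
                the current model distribution invariant
    """
    # Positive phase: average the expert gradient over the data-cases.
    data_term = np.mean([grad_log_f(x) for x in batch], axis=0)

    # Negative phase: one chain per data-case, run for only k steps.
    samples = np.array(batch, copy=True)
    for _ in range(k):
        samples = np.array([mcmc_step(x) for x in samples])
    model_term = np.mean([grad_log_f(x) for x in samples], axis=0)

    # The N factor of Eqn. (4) is usually absorbed into the learning rate,
    # so the per-case difference is returned.
    return data_term - model_term
```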

Restricted Boltzmann Machines and Exponential Family Harmoniums

Perhaps the simplest PoE is given by a restricted Boltzmann machine (see Figure 1). In this model there are two layers of binary (0/1) variables where the bottom layer is observed while the top layer remains unobserved or hidden. The joint probability distribution over hidden and observed variables is given as,

P(x,h) = \frac{1}{Z} \exp\left( \sum_i \alpha_i x_i + \sum_j \beta_j h_j + \sum_{ij} W_{ij} x_i h_j \right) \qquad (6)


where the undirected edges in the graphical model in Figure 1 represent the weights {W_ij}. The bias terms are parameterized by {α_i, β_j}. Marginalizing over {h_j}, the PoE structure becomes evident,

P(x) = \frac{1}{\tilde{Z}} \prod_i \exp(\alpha_i x_i) \prod_j \left( 1 + \exp\left( \beta_j + \sum_i W_{ij} x_i \right) \right) \qquad (7)


Figure 2: Sampling process for the RBM. Given data at t=0, we sample the hidden variables independently and compute the necessary sufficient statistics for the learning rule. We then reconstruct the data by sampling the visible variables independently given the values for the hidden variables, and subsequently sample the hidden variables one more time. If we kept sampling for a very long time, we would obtain samples from the equilibrium distribution.

where elements in the first product represent single-variable experts and elements in the second product represent constraints between the input variables.
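Evaluating the unnormalized marginal in Eqn. (7) reduces to a sum of softplus terms. A short sketch, assuming NumPy arrays for the parameters (shapes chosen here for illustration):

```python
import numpy as np

def rbm_unnorm_log_prob(x, alpha, beta, W):
    """Unnormalized log P(x) from Eqn. (7), i.e. the log of the product of experts.

    x:     binary visible vector, shape (d,)
    alpha: visible biases, shape (d,)
    beta:  hidden biases, shape (M,)
    W:     weights, shape (d, M)
    """
    # First product in Eqn. (7): single-variable experts exp(alpha_i * x_i).
    single_experts = alpha @ x
    # Second product: one constraint expert per hidden unit,
    # log(1 + exp(beta_j + sum_i W_ij x_i)) = softplus(beta_j + W[:, j] . x).
    constraint_experts = np.logaddexp(0.0, beta + W.T @ x).sum()
    return single_experts + constraint_experts
```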

The conditional Bernoulli expert distributions can be generalized to distributions in the exponential family. The resulting joint model is called an exponential family harmonium (EFH) (Welling et al. 2004). The joint distribution can be obtained by replacing x_i → f_i(x_i) and h_j → g_j(h_j), where f(·) and g(·) are the features (sufficient statistics) of the corresponding exponential family distributions.

The special bipartite structure of the RBM and EFH results in a very efficient Gibbs sampler that alternates between sampling all hidden variables independently given values for the observed variables and, vice versa, sampling all visible variables independently given values for the hidden variables. This efficient Gibbs sampler directly translates into an efficient contrastive divergence learning algorithm (see the previous section and Figure 2).
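Because both conditionals factorize into logistic (sigmoid) units, one block-Gibbs reconstruction step plus the CD-1 parameter update (cf. Figure 2) can be sketched as follows; the parameter names follow Eqn. (6), while the learning rate and the use of hidden probabilities in the statistics are conventional choices rather than prescriptions from the article.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cd1_update(X, alpha, beta, W, lr=0.1):
    """One CD-1 update for the RBM of Eqn. (6) on a mini-batch X of shape (N, d)."""
    # Sample all hidden units independently given the data (positive phase).
    ph_data = sigmoid(beta + X @ W)                         # shape (N, M)
    h = (rng.random(ph_data.shape) < ph_data).astype(float)

    # Reconstruct the visible units given h, then recompute the hidden probabilities.
    pv = sigmoid(alpha + h @ W.T)                           # shape (N, d)
    v = (rng.random(pv.shape) < pv).astype(float)
    ph_recon = sigmoid(beta + v @ W)                        # shape (N, M)

    # Positive minus negative sufficient statistics (Eqn. (4) with the CD-1 samples).
    N = X.shape[0]
    dW = (X.T @ ph_data - v.T @ ph_recon) / N
    dalpha = (X - v).mean(axis=0)
    dbeta = (ph_data - ph_recon).mean(axis=0)

    return alpha + lr * dalpha, beta + lr * dbeta, W + lr * dW
```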

Relation to Independent and Extreme Components Analysis

Noiseless Independent Components Analysis (ICA) (Comon 1994) with an equal number of input dimensions and source distributions can be written as a PoE model as follows,

P(x|\{w_j\}) = |\det(W)| \prod_{j=1}^{M} p_j\left( \sum_i w_{ji} x_i \right) \qquad (8)

where W is the matrix with elements w_{ji}.

Note that each expert, p_j, is defined as a distribution on a one-dimensional projection of the input space. One can think of each projection as a "source", with a linear combination of these sources generating the signal. Unless W is rank deficient, the product is a well-defined distribution over the entire input space.
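Under Eqn. (8) the log density is just a log-determinant plus the sum of the source log densities evaluated at the projections. A sketch with standard logistic source densities, a common heavy-tailed choice assumed here for illustration:

```python
import numpy as np

def logistic_log_pdf(s):
    """Log density of the standard logistic distribution (an illustrative source prior)."""
    return -s - 2.0 * np.logaddexp(0.0, -s)

def ica_poe_log_prob(x, W):
    """Log density of the noiseless-ICA PoE in Eqn. (8).

    Each row of W defines one one-dimensional expert p_j(sum_i w_ji x_i).
    """
    _, logabsdet = np.linalg.slogdet(W)   # log |det(W)| computed stably
    sources = W @ x                        # the projections ("sources")
    return logabsdet + logistic_log_pdf(sources).sum()

# Hypothetical usage with a random square mixing matrix:
d = 3
W = np.random.default_rng(1).normal(size=(d, d))
print(ica_poe_log_prob(np.zeros(d), W))
```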

Choosing heavy-tailed Student-t distributions as the experts, one obtains the general form of the "Products of Student-t" distribution (PoT) (Welling et al. 2002). The PoT can be represented with the help of auxiliary variables (taking the role of hidden variables) as follows,

P(x,h) = \frac{1}{Z} \prod_{j=1}^{M} \exp\left( -h_j \left[ 1 + \frac{1}{2} \left( \sum_i w_{ji} x_i \right)^2 \right] + (1 - \alpha_j) \log h_j \right) \qquad (9)


where P(x|h) is a full-covariance Gaussian distribution and P(h|x) a product of Gamma distributions.
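Integrating out the auxiliary variables h_j yields Student-t shaped experts, so the (unnormalized) log density of the PoT only needs the filter responses. The sketch below assumes the commonly used form ∏_j (1 + (w_j·x)^2/2)^(−α_j), which is an assumption about the exact parameterization rather than a quote from Eqn. (9).

```python
import numpy as np

def pot_unnorm_log_prob(x, W, alpha):
    """Unnormalized log density of a Product of Student-t (PoT) model.

    Assumes the common form  prod_j (1 + 0.5 * (w_j . x)^2) ** (-alpha_j),
    obtained by marginalizing the Gamma-distributed auxiliary variables h_j.
    W has shape (M, d), one filter w_j per row; alpha has shape (M,).
    """
    s = W @ x                                        # one projection per expert
    return -(alpha * np.log1p(0.5 * s ** 2)).sum()
```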

Figure 3: Latent representation for an exponential family harmonium fit to text data. Each point represents a document while its color encodes the hand-labelled topic of that document. Each dimension in latent space corresponds to the "activity" of a latent variable. The EFH did not see the labels but managed to organize the documents according to their topics.

The PoT becomes different from ICA if one chooses the number of experts to be larger than the number of input dimensions (a.k.a. an over-complete representation). In this case marginal independence between the hidden variables is lost, but conditional independence between the hidden variables is retained. Over-complete variants of ICA that retain marginal independence have also been proposed (Lewicki and Sejnowski 2000). Over-complete ICA models have conditional dependencies between the hidden variables, known as explaining away, which make inference difficult. In contrast, for the over-complete PoT model inference over the hidden variables given observations is trivial due to the absence of such conditional dependencies (Teh et al. 2003).

Instead of the non-Gaussian experts used for ICA, one can also choose an under-complete (M < d) set of one-dimensional Gaussian experts, i.e. p_j(∑_i w_{ji} x_i) with p_j(·) Gaussian. Using the fact that the inverse-covariance of the product is equal to the sum of the inverse-covariances of the individual Gaussian experts, one can formulate probabilistic principal component analysis (Roweis 1997; Tipping and Bishop 1999) or probabilistic minor component analysis (Williams and Agakov 2002). However, it is also possible to formulate a model that extracts the optimal combination of principal and minor components in the spectrum of the sample covariance matrix. The probabilistic model, known as eXtreme Components Analysis (XCA), is described in (Welling et al. 2003).
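The inverse-covariance additivity used above is easy to check numerically: each one-dimensional Gaussian expert on the projection w_j·x with variance σ_j^2 contributes the rank-one term w_j w_j^T / σ_j^2 to the precision of the product. A toy check (an illustration, with enough experts that the product is a proper density over R^d):

```python
import numpy as np

rng = np.random.default_rng(2)
d, M = 3, 5                                   # M >= d so the summed precision can be full rank
W = rng.normal(size=(M, d))                   # one projection vector w_j per row
sigma2 = rng.uniform(0.5, 2.0, size=M)        # variance of each one-dimensional Gaussian expert

# Each expert contributes a rank-one term to the precision of the product Gaussian.
precision = sum(np.outer(W[j], W[j]) / sigma2[j] for j in range(M))
covariance = np.linalg.inv(precision)         # covariance of the zero-mean product Gaussian

print(np.linalg.eigvalsh(precision))          # all positive: a proper Gaussian over R^d
```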

Applications of PoEs

Variants of PoEs have been applied under different names to various data-domains: for example the rate-coded RBM to face recognition (Teh and Hinton 2001), the dual-wing harmonium to video-track data (Xing et al. 2005), the rate adapting Poisson model (see Figure 1) to text and image data (Gehler et al. 2006), the product of HMMs model to language data (Brown and Hinton 2001), and hierarchical versions of PoEs to digits (Hinton et al. 2006), text and collaborative filtering data (Salakhutdinov et al. 2007).

References

  • C.K.I. Williams and F.V. Agakov. Products of gaussians and probabilistic minor components analysis.Neural Computation, 14(5):1169--1182, 2002.
  • A. Brown and G. Hinton, Products of hidden Markov models, In Proceedings of the Conference on Artificial Intelligence and Statistics, 2001.
  • M. Carreira-Perpinan and G.E. Hinton, On contrastive divergence learning, Tenth International Workshop on Artificial Intelligence and Statistics, Barbados, 2005.
  • P. Comon, Independent component analysis, a new concept? Signal Processing, 36:287-314, 1994.
  • M.E. Tipping and C.M. Bishop, Probabilistic principal component analysis. Journal of the Royal Statistical Society, Series B 61(3), 611–622, 1999.
  • P.V. Gehler, A.D. Holub, and M. Welling, The rate adapting Poisson model for information retrieval and object recognition, ACM, 06 2006.
  • G.E. Hinton, Training products of experts by minimizing contrastive divergence, Neural Computation, 14:1771--1800, 2002.
  • G.E. Hinton, S. Osindero, and Y.W. Teh, A fast learning algorithm for deep belief nets, Neural Computation, 18:1527-1554, 2006.
  • M.S. Lewicki and T.J. Sejnowski, Learning overcomplete representations, Neural Computation, 12:p.337-365, 2000.
  • D.S. Plaut, S. Nowlan and G.E. Hinton, Experiments on learning by back-propagation, Technical report CMU-CS-86-126, Dept. Comp. Science, CMU, Pittsburgh, PA, 1986.
  • S. Roweis, EM Algorithms for PCA and SPCA, Advances in Neural Information Processing Systems 10, pp.626-632, 1997.
  • R.R. Salakhutdinov, A. Mnih, and G.E. Hinton, Restricted Boltzmann machines for collaborative filtering, Proceedings of the 24th International Conference on Machine Learning, 2007.
  • Y.W. Teh and G.E. Hinton, Rate-coded restricted Boltzmann machines for face recognition, Advances in Neural Information Processing Systems, volume 13, 2001.
  • Y.W. Teh, M. Welling, S. Osindero, and G.E. Hinton, Energy-based models for sparse overcomplete representations, Journal of Machine Learning Research - Special Issue on ICA, 4:1235--1260, 2003.
  • M. Welling, F. Agakov, and C.K.I. Williams, Extreme components analysis, Advances in Neural Information Processing Systems, volume 16, Vancouver, Canada, 2003.
  • M. Welling, G.E. Hinton, and S. Osindero, Learning sparse topographic representations with products of Student-t distributions, Advances in Neural Information Processing Systems, volume 15, Vancouver, Canada, 2002.
  • M. Welling, M. Rosen-Zvi, and G.E. Hinton, Exponential family harmoniums with an application to information retrieval, Advances in Neural Information Processing Systems, volume 17, Vancouver, Canada, 2004.
  • E. Xing, R. Yan, and A. Hauptman, Mining associated text and images with dual-wing harmoniums, Proc. Uncertainty in Artificial Intelligence 2005.

Original source: http://www.scholarpedia.org/article/Product_of_experts
