Product of experts

Max Welling (2007), Scholarpedia, 2(10):3879. doi:10.4249/scholarpedia.3879, revision #91672

Curator: Max Welling

The Product of Experts model (PoE) (Hinton 2002) combines a number of individual component models (the experts) by taking their product and normalizing the result. Each expert is defined as a possibly unnormalized probabilistic model $f(x)$ over its input space,

 

$$P(x \mid \{\theta_j\}) = \frac{1}{Z} \prod_{j=1}^{M} f_j(x \mid \theta_j) \qquad (1)$$

 

with 
$$Z = \int dx \, \prod_{j=1}^{M} f_j(x \mid \theta_j)$$

PoEs stand in contrast to Mixture Models which combine expert models additively,

 

$$P(x \mid \{\theta_j\}) = \sum_{j=1}^{M} \alpha_j \, p_j(x \mid \theta_j) \qquad (2)$$

 

Figure 1: Graphical model representation of the restricted Boltzmann machine (RBM) and its generalization, the exponential family harmonium (EFH). Top layer nodes represent hidden variables while bottom layer nodes represent observed variables. The architecture is that of a bipartite Markov Random Field (MRF).

where each component model $p_j(x)$ is normalized over $x$ and $\sum_{j=1}^{M} \alpha_j = 1$.

Note that Mixture of Experts models are usually associated with conditional models where the experts are of the form $p(y \mid x)$ and the mixture coefficients $\alpha_j(x)$ (known as gating functions) may depend on $x$ as well. Conditional PoEs can be defined analogously.

One can qualitatively understand the difference between mixtures and products by observing that a mixture distribution can have high probability for event x when only a single expert assigns high probability to that event. In contrast, a product can only have high probability for an event x when no expert assigns an especially low probability to that event. Hence, metaphorically speaking, a single expert in a mixture has the power to pass a bill while a single expert in a product has the power to veto it.

Put another way, each component in a product represents a soft constraint, while each expert in a mixture represents a soft template or prototype. For an event to be likely under a product model, all constraints must be (approximately) satisfied, while an event is likely under a mixture model if it (approximately) matches with any single template. Hence, to sculpt the overall probability distribution of a mixture, each expert adds a lump of probability mass which is usually localised to one region, while to sculpt the overall probability of a product, each expert scales the probability at each point by a different factor. A product can also be viewed as adding together lumps in the log probability domain. This essential difference can result in much sharper boundaries, especially for high-dimensional input spaces (Hinton 2002).
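The veto metaphor can be made concrete with a small numerical example (two hypothetical one-dimensional Gaussian experts, not taken from the original article): an event that one expert strongly favors stays likely under a mixture but is effectively vetoed in a product.

```python
import numpy as np

def gaussian(x, mu, sigma):
    """Normalized one-dimensional Gaussian density."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

x = 0.0  # the event to score: expert 1 likes it, expert 2 vetoes it
p1 = gaussian(x, mu=0.0, sigma=1.0)   # expert 1: high density at x
p2 = gaussian(x, mu=10.0, sigma=1.0)  # expert 2: essentially zero at x

mixture = 0.5 * p1 + 0.5 * p2   # one "yes" vote is enough to pass the bill
product_unnorm = p1 * p2        # one veto kills the event

print(mixture)          # ~0.199: still quite likely under the mixture
print(product_unnorm)   # ~3e-23: vetoed by expert 2
```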

 


Training a Product of Experts

Given data $\{x_n\}$, $n = 1..N$ with $x_n \in \mathbb{R}^d$, one can use the log-likelihood as an objective function to train a PoE,

 

$$L(\{\theta_j\} \mid \{x_n\}) = \sum_{n=1}^{N} \sum_{j=1}^{M} \log f_j(x_n \mid \theta_j) \; - \; N \log Z \qquad (3)$$

 


Denoting the gradient of the objective w.r.t. $\theta_j$ with $\nabla_j L$, one can compute the following gradient,

 

$$\nabla_j L = \sum_{n=1}^{N} \nabla_j \log f_j(x_n) \; - \; N \left\langle \nabla_j \log f_j(x) \right\rangle_{P(x)} \qquad (4)$$

 


where $\langle \cdot \rangle_{P(x)}$ denotes taking the average w.r.t. $P(x)$. Learning is achieved by changing the parameters incrementally according to the following update rule,

 

$$\theta_j \leftarrow \theta_j + \eta \, \nabla_j L \qquad (5)$$

 


where η represents the learning rate. Learning efficiency can usually be improved by using a stochastic approximation of the full gradient based on a single data-case or a few data-cases (a mini-batch). Another effective heuristic to speed up learning is to add a "momentum term" to the gradient (Plaut et al, 1986).

The first term of the gradient (Eqn. (4)) can be interpreted as increasing the probability of expert $j$ on the dataset. The second term, on the other hand, can be interpreted as decreasing the probability of expert $j$ in regions of input space where the model assigns high probability. When these terms balance, learning has converged to a local maximum of the log-likelihood.
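The balance between the data term and the model term can be watched on a toy PoE where the model average in Eqn. (4) is available in closed form. The setup below is an illustrative assumption, not from the article: two unit-variance Gaussian experts $f_j(x) = \exp(-(x - \mu_j)^2/2)$, whose product is a Gaussian with mean $(\mu_1 + \mu_2)/2$ and variance $1/2$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy PoE: two unit-variance Gaussian experts f_j(x) = exp(-(x - mu_j)^2 / 2).
# Their product is Gaussian with mean (mu_1 + mu_2)/2 and variance 1/2, so
# the model average in Eqn. (4) needs no sampling here.
data = rng.normal(loc=3.0, scale=np.sqrt(0.5), size=1000)
mu = np.array([0.0, 0.0])   # expert parameters theta_j = mu_j
eta = 0.1                   # learning rate

for _ in range(200):
    # grad_j log f_j(x) = x - mu_j; the mu_j pieces cancel between the data
    # term and the model term of Eqn. (4), leaving <x>_data - <x>_model.
    grad = data.mean() - mu.mean()
    mu += eta * grad        # Eqn. (5): theta_j <- theta_j + eta * grad_j L

print(mu.mean())  # the two gradient terms balance at the data mean, ~3.0
```

At convergence the model mean equals the sample mean, which is exactly the point where the two terms of Eqn. (4) cancel.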

Contrastive Divergence learning

 

The simplicity of the gradient in Eqn. (4) is deceptive: it requires the evaluation of an intractable average over $P(x)$. For most interesting models this average requires methods like MCMC sampling to approximate it. But MCMC sampling is computationally expensive and results in high-variance estimates of the required averages. A cheaper, lower-variance alternative was proposed by (Hinton 2002) under the name contrastive divergence (CD). The idea is to run $N$ samplers in parallel, one for each data-case in the (mini-)batch. These samplers are initialized at the respective data-cases and move towards equilibrium using MCMC sampling. After only a few steps of sampling, long before the MCMC converges, there is usually sufficient signal in the population of samples to change the parameters. A surrogate learning rule can be derived by replacing the log-likelihood with a new objective: the contrastive divergence $KL(P_{data} \| P_{model}) - KL(P_k \| P_{model})$, where $P_{data}$ is the empirical distribution $P_{data} = \frac{1}{N} \sum_n \delta(x - x_n)$, $P_{model}$ is the current estimate of the model distribution, and $P_k$ is the distribution based on $k$ steps of sampling. Taking gradients and ignoring a term which is usually very small, the new learning rule is almost identical to the one based on the log-likelihood, but using the approximation

$$\left\langle \nabla_j \log f_j(x) \right\rangle_{P(x)} \approx \frac{1}{N} \sum_n \nabla_j \log f_j(x_n^k)$$
where $x_n^k$ is the sample obtained from MCMC sampler $n$ after $k$ steps of sampling.

 

One can view this approximation as trading variance for bias. Thus, at convergence, we expect that the estimates of the parameters will not be equal to those of maximum likelihood learning, but will be slightly biased. To correct this, one can increase k close to convergence (Carreira-Perpinan and Hinton 2005).

Restricted Boltzmann Machines and Exponential Family Harmoniums

 

Perhaps the simplest PoE is given by a restricted Boltzmann machine (see Figure 1). In this model there are two layers of binary (0/1) variables where the bottom layer is observed while the top layer remains unobserved or hidden. The joint probability distribution over hidden and observed variables is given as,

 

$$P(x, h) = \frac{1}{Z} \exp\left( \sum_i \alpha_i x_i + \sum_j \beta_j h_j + \sum_{ij} W_{ij} x_i h_j \right) \qquad (6)$$

 


where the undirected edges in the graphical model in Figure 1 represent $\{W_{ij}\}$. The bias terms are parameterized by $\{\alpha_i, \beta_j\}$. Marginalizing over $\{h_j\}$, the PoE structure becomes evident,

 

$$P(x) = \frac{1}{\tilde{Z}} \prod_i \exp(\alpha_i x_i) \prod_j \left( 1 + \exp\left( \beta_j + \sum_i W_{ij} x_i \right) \right) \qquad (7)$$

 

 

Figure 2: Sampling process for the RBM. Given data at $t=0$ we sample the hidden variables independently and compute the necessary sufficient statistics for the learning rule. We then reconstruct the data by sampling the visible variables independently given the values for the hidden variables, and subsequently sample the hidden variables one more time. If we kept sampling for a very long time we would obtain samples from the equilibrium distribution.

where elements in the first product represent single-variable experts and elements in the second product represent constraints between the input variables.

The conditional Bernoulli expert distributions can be generalized to distributions in the exponential family. The resulting joint model is called an exponential family harmonium (EFH) (Welling et al. 2004). The joint distribution can be obtained by replacing $x_i \rightarrow f_i(x_i)$ and $h_j \rightarrow g_j(h_j)$, where $f(\cdot)$ and $g(\cdot)$ are the features for the corresponding exponential family distributions.

The special bipartite structure of the RBM and EFH results in a very efficient Gibbs sampler that alternates between sampling all hidden variables independently given values for the observed variables and, vice versa, sampling all visible variables independently given values for the hidden variables. The efficient Gibbs sampler directly translates into an efficient contrastive divergence learning algorithm (see the previous section and Figure 2).
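A minimal sketch of this procedure, assuming a small binary RBM trained with CD-1 on a synthetic toy dataset (the sizes, learning rate, and data below are illustrative choices, not from the article):

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# Minimal binary RBM: d visible units, m hidden units, as in Eqn. (6).
d, m, N, eta = 6, 4, 100, 0.05
W = 0.01 * rng.standard_normal((d, m))  # weights W_ij
a = np.zeros(d)                          # visible biases alpha_i
b = np.zeros(m)                          # hidden biases beta_j

# Toy data: each row is all-zeros or all-ones (strongly correlated inputs).
X = (rng.random((N, 1)) > 0.5).astype(float) * np.ones((N, d))

for _ in range(50):
    # Positive phase: sample all hidden units independently given the data.
    ph0 = sigmoid(X @ W + b)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)
    # Negative phase: one block-Gibbs step (reconstruct visibles, resample hiddens).
    pv1 = sigmoid(h0 @ W.T + a)
    v1 = (rng.random(pv1.shape) < pv1).astype(float)
    ph1 = sigmoid(v1 @ W + b)
    # CD-1 updates: data statistics minus reconstruction statistics.
    W += eta * (X.T @ ph0 - v1.T @ ph1) / N
    a += eta * (X - v1).mean(axis=0)
    b += eta * (ph0 - ph1).mean(axis=0)
```

Each Gibbs half-step samples an entire layer in parallel, which is exactly what the bipartite structure buys.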

Relation to Independent and Extreme Components Analysis

 

Noiseless Independent Components Analysis (ICA) (Comon 1994) with an equal number of input dimensions and source distributions can be written as a PoE model as follows,

 

$$P(x \mid \{w_j\}) = |\det(W)| \, \prod_{j=1}^{M} p_j\left( \sum_i w_{ji} x_i \right) \qquad (8)$$

 

where $W$ is the matrix with elements $w_{ji}$.

Note that each expert, $p_j$, is defined as a distribution on a one-dimensional projection of the input space. One can think of each projection as a "source", and a linear combination of these sources generates the signal. Unless $W$ is rank deficient, the product is a well-defined distribution over the entire input space.
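As a sketch, the log-density in Eqn. (8) can be evaluated directly once the experts are chosen; logistic source densities are used below as an illustrative choice, not one mandated by the model.

```python
import numpy as np

def ica_log_density(x, W):
    """Log-density of the noiseless-ICA PoE of Eqn. (8), with logistic
    source experts p_j(s) = 1 / (4 cosh^2(s/2)) as an illustrative choice."""
    s = W @ x                                        # one "source" per expert
    log_pj = -2.0 * np.log(2.0 * np.cosh(s / 2.0))   # log p_j(s_j)
    return np.log(abs(np.linalg.det(W))) + log_pj.sum()

W = np.array([[2.0, 0.3],
              [-0.1, 1.5]])   # square, full-rank mixing matrix
print(ica_log_density(np.array([0.1, -0.2]), W))
```

Because $W$ is square and full rank, the $|\det(W)|$ factor is exactly the normalizer, so no intractable $Z$ appears.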

Choosing heavy-tailed Student-t distributions as the experts, one obtains the general form of the "Product of Student-t" distributions (PoT) (Welling et al. 2002). The PoT can be represented with the help of auxiliary variables (taking the role of hidden variables) as follows,

 

$$P(x, h) = \frac{1}{Z} \prod_{j=1}^{M} \exp\left( -h_j \left[ 1 + \tfrac{1}{2} \left( \sum_i w_{ji} x_i \right)^2 \right] + (1 - \alpha_j) \log h_j \right) \qquad (9)$$

 


where P(x|h) is a full covariance Gaussian distribution and P(h|x) a product of Gamma distributions.
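One Gibbs sweep over these two conditionals can be sketched as follows. The Gamma shape and rate are read off the exponent of Eqn. (9) under its conventions, and the filters, dimensions, and $\alpha$ values below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# One sweep of the two tractable PoT conditionals (a sketch; requires
# alpha_j < 2 so that the Gamma shape 2 - alpha_j stays positive).
W = rng.standard_normal((3, 3))   # filter matrix, rows w_j
alpha = np.full(3, 1.0)           # illustrative expert parameters
x = rng.standard_normal(3)

# P(h|x): independent Gamma distributions, one per expert.
s = W @ x                         # filter responses sum_i w_ji x_i
h = rng.gamma(shape=2.0 - alpha, scale=1.0 / (1.0 + 0.5 * s**2))

# P(x|h): zero-mean Gaussian with precision matrix W^T diag(h) W.
precision = W.T @ np.diag(h) @ W
x_new = rng.multivariate_normal(np.zeros(3), np.linalg.inv(precision))
```

Note there is no explaining away here: given $x$, the $h_j$ are sampled independently.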

Figure 3: Latent representation for an exponential family harmonium fit to text data. Each point represents a document, while its color codes for the hand-labelled topic of that document. Each dimension in latent space corresponds to the "activity" of a latent variable. The EFH did not see the labels but managed to organize the documents according to their topics.

The PoT becomes different from ICA if one chooses the number of experts to be larger than the number of input dimensions (a.k.a. an over-complete representation). In this case marginal independence between the hidden variables is lost, but conditional independence between the hidden variables is retained. Over-complete variants of ICA that retain marginal independence have also been proposed (Lewicki and Sejnowski 2000). Over-complete ICA models have conditional dependencies between the hidden variables, known as explaining away, which makes inference difficult. In contrast, for the over-complete PoT model, inference over the hidden variables given observations is trivial due to the absence of such conditional dependencies (Teh et al. 2003).

Instead of the non-Gaussian experts used for ICA, one can also choose an under-complete ($M < d$) set of one-dimensional Gaussian experts, i.e. $p_j(\sum_i w_{ji} x_i)$ with $p_j(\cdot)$ Gaussian. Using the fact that the inverse-covariance of the product is equal to the sum of the inverse-covariances of the individual Gaussian experts, one can formulate probabilistic principal component analysis (Roweis, 1997; Tipping and Bishop, 1999) or probabilistic minor component analysis (Williams and Agakov, 2002). However, it is also possible to formulate a model that extracts the optimal combination of principal and minor components in the spectrum of the sample covariance matrix. The probabilistic model, known as "eXtreme Components Analysis" (XCA), is described in (Welling et al. 2003).
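The fact about Gaussian products used above can be checked numerically in one dimension: the precision (inverse variance) of the product of two Gaussian densities is the sum of the experts' precisions.

```python
import numpy as np

# Verify numerically: multiplying two zero-mean Gaussian densities gives a
# Gaussian whose precision is the sum of the individual precisions.
x = np.linspace(-10, 10, 20001)
dx = x[1] - x[0]

def gauss(x, var):
    return np.exp(-0.5 * x**2 / var) / np.sqrt(2 * np.pi * var)

v1, v2 = 1.0, 0.5
prod = gauss(x, v1) * gauss(x, v2)
prod /= prod.sum() * dx                 # renormalize the product
var_prod = (prod * x**2).sum() * dx     # its variance, computed numerically

print(var_prod, 1.0 / (1/v1 + 1/v2))    # both ~0.3333
```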

Applications of PoEs

Variants of PoEs have been applied under different names to various data domains: for example, the rate-coded RBM to face recognition (Teh and Hinton 2001), the dual-wing harmonium to video-track data (Xing et al. 2005), the rate adapting Poisson model (see Figure 1) to text and image data (Gehler et al. 2006), the product of HMMs model to language data (Brown and Hinton 2001), and hierarchical versions of PoEs to digits (Hinton et al. 2006) and to text and collaborative filtering data (Salakhutdinov et al. 2007).

References

  • C.K.I. Williams and F.V. Agakov, Products of Gaussians and probabilistic minor components analysis, Neural Computation, 14(5):1169-1182, 2002.
  • A. Brown and G.E. Hinton, Products of hidden Markov models, Proceedings of the Conference on Artificial Intelligence and Statistics, 2001.
  • M. Carreira-Perpinan and G.E. Hinton, On contrastive divergence learning, Tenth International Workshop on Artificial Intelligence and Statistics, Barbados, 2005.
  • P. Comon, Independent component analysis, a new concept?, Signal Processing, 36:287-314, 1994.
  • M.E. Tipping and C.M. Bishop, Probabilistic principal component analysis, Journal of the Royal Statistical Society, Series B, 61(3):611-622, 1999.
  • P.V. Gehler, A.D. Holub, and M. Welling, The rate adapting Poisson model for information retrieval and object recognition, Proceedings of the 23rd International Conference on Machine Learning, 2006.
  • G.E. Hinton, Training products of experts by minimizing contrastive divergence, Neural Computation, 14:1771-1800, 2002.
  • G.E. Hinton, S. Osindero, and Y.W. Teh, A fast learning algorithm for deep belief nets, Neural Computation, 18:1527-1554, 2006.
  • M.S. Lewicki and T.J. Sejnowski, Learning overcomplete representations, Neural Computation, 12:337-365, 2000.
  • D.S. Plaut, S. Nowlan, and G.E. Hinton, Experiments on learning by back-propagation, Technical Report CMU-CS-86-126, Dept. of Computer Science, CMU, Pittsburgh, PA, 1986.
  • S. Roweis, EM algorithms for PCA and SPCA, Advances in Neural Information Processing Systems 10, pp. 626-632, 1997.
  • R.R. Salakhutdinov, A. Mnih, and G.E. Hinton, Restricted Boltzmann machines for collaborative filtering, Proceedings of the 24th International Conference on Machine Learning, 2007.
  • Y.W. Teh and G.E. Hinton, Rate-coded restricted Boltzmann machines for face recognition, Advances in Neural Information Processing Systems, volume 13, 2001.
  • Y.W. Teh, M. Welling, S. Osindero, and G.E. Hinton, Energy-based models for sparse overcomplete representations, Journal of Machine Learning Research (Special Issue on ICA), 4:1235-1260, 2003.
  • M. Welling, F. Agakov, and C.K.I. Williams, Extreme components analysis, Advances in Neural Information Processing Systems, volume 16, Vancouver, Canada, 2003.
  • M. Welling, G.E. Hinton, and S. Osindero, Learning sparse topographic representations with products of Student-t distributions, Advances in Neural Information Processing Systems, volume 15, Vancouver, Canada, 2002.
  • M. Welling, M. Rosen-Zvi, and G.E. Hinton, Exponential family harmoniums with an application to information retrieval, Advances in Neural Information Processing Systems, volume 17, Vancouver, Canada, 2004.
  • E. Xing, R. Yan, and A. Hauptmann, Mining associated text and images with dual-wing harmoniums, Proceedings of Uncertainty in Artificial Intelligence, 2005.


See also

Independent Component Analysis, Mixture of Experts, Pattern Recognition

Reposted from: https://www.cnblogs.com/daleloogn/p/4442379.html
