Nonlinear Dimensionality Reduction by Locally Linear Embedding [Paper Translation]

Nonlinear Dimensionality Reduction by Locally Linear Embedding

Sam T. Roweis¹ and Lawrence K. Saul²

Many areas of science depend on exploratory data analysis and visualization. The need to analyze large amounts of multivariate data raises the fundamental problem of dimensionality reduction: how to discover compact representations of high-dimensional data. Here, we introduce locally linear embedding (LLE), an unsupervised learning algorithm that computes low-dimensional, neighborhood-preserving embeddings of high-dimensional inputs. Unlike clustering methods for local dimensionality reduction, LLE maps its inputs into a single global coordinate system of lower dimensionality, and its optimizations do not involve local minima. By exploiting the local symmetries of linear reconstructions, LLE is able to learn the global structure of nonlinear manifolds, such as those generated by images of faces or documents of text.


How do we judge similarity? Our mental representations of the world are formed by processing large numbers of sensory inputs, including, for example, the pixel intensities of images, the power spectra of sounds, and the joint angles of articulated bodies. While complex stimuli of this form can be represented by points in a high-dimensional vector space, they typically have a much more compact description. Coherent structure in the world leads to strong correlations between inputs (such as between neighboring pixels in images), generating observations that lie on or close to a smooth low-dimensional manifold. To compare and classify such observations, and in effect to reason about the world, depends crucially on modeling the nonlinear geometry of these low-dimensional manifolds.

Scientists interested in exploratory analysis or visualization of multivariate data (1) face a similar problem in dimensionality reduction. The problem, as illustrated in Fig. 1, involves mapping high-dimensional inputs into a low-dimensional "description" space with as many coordinates as observed modes of variability. Previous approaches to this problem, based on multidimensional scaling (MDS) (2), have computed embeddings that attempt to preserve pairwise distances [or generalized disparities (3)] between data points; these distances are measured along straight lines or, in more sophisticated usages of MDS such as Isomap (4), along shortest paths confined to the manifold of observed inputs. Here, we take a different approach, called locally linear embedding (LLE), that eliminates the need to estimate pairwise distances between widely separated data points. Unlike previous methods, LLE recovers global nonlinear structure from locally linear fits.

Fig. 1. The problem of nonlinear dimensionality reduction, as illustrated (10) for three-dimensional data (B) sampled from a two-dimensional manifold (A). An unsupervised learning algorithm must discover the global internal coordinates of the manifold without signals that explicitly indicate how the data should be embedded in two dimensions. The color coding illustrates the neighborhood-preserving mapping discovered by LLE; black outlines in (B) and (C) show the neighborhood of a single point. Unlike LLE, projections of the data by principal component analysis (PCA) (28) or classical MDS (2) map faraway data points to nearby points in the plane, failing to identify the underlying structure of the manifold. Note that mixture models for local dimensionality reduction (29), which cluster the data and perform PCA within each cluster, do not address the problem considered here: namely, how to map high-dimensional data into a single global coordinate system.

The LLE algorithm, summarized in Fig. 2, is based on simple geometric intuitions. Suppose the data consist of $N$ real-valued vectors $\vec{X}_i$, each of dimensionality $D$, sampled from some underlying manifold. Provided there is sufficient data (such that the manifold is well-sampled), we expect each data point and its neighbors to lie on or close to a locally linear patch of the manifold. We characterize the local geometry of these patches by linear coefficients that reconstruct each data point from its neighbors. Reconstruction errors are measured by the cost function

$$\varepsilon(W) = \sum_i \Bigl|\vec{X}_i - \sum_j W_{ij}\,\vec{X}_j\Bigr|^2 \qquad (1)$$

which adds up the squared distances between all the data points and their reconstructions. The weights $W_{ij}$ summarize the contribution of the $j$th data point to the $i$th reconstruction. To compute the weights $W_{ij}$, we minimize the cost function subject to two constraints: first, that each data point $\vec{X}_i$ is reconstructed only from its neighbors (5), enforcing $W_{ij} = 0$ if $\vec{X}_j$ does not belong to the set of neighbors of $\vec{X}_i$; second, that the rows of the weight matrix sum to one: $\sum_j W_{ij} = 1$. The optimal weights $W_{ij}$ subject to these constraints (6) are found by solving a least-squares problem (7).

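To make step 2 concrete, here is a minimal NumPy/SciPy sketch of the constrained least-squares fit in Eq. 1, under the assumption that neighbors are the K nearest points in Euclidean distance. The function name `lle_weights`, the value of `K`, and the scale of the regularizer are illustrative choices, not part of the paper; the sum-to-one constraint lets the error for each point be rewritten in terms of neighbor differences, and the small identity term conditions the local Gram matrix in the spirit of note (7).

```python
import numpy as np
from scipy.spatial import cKDTree

def lle_weights(X, K=12, reg=1e-3):
    """Steps 1-2 of Fig. 2: assign neighbors and solve Eq. 1 for the weights W.

    With each row of W summing to one, |X_i - sum_j W_ij X_j|^2 equals
    |sum_j W_ij (X_i - X_j)|^2, so each row is found from a small K x K system.
    """
    N = X.shape[0]
    # Step 1: K nearest neighbors by Euclidean distance (drop the point itself).
    idx = cKDTree(X).query(X, k=K + 1)[1][:, 1:]
    W = np.zeros((N, N))
    for i in range(N):
        Z = X[idx[i]] - X[i]                    # difference vectors to neighbors
        G = Z @ Z.T                             # local Gram matrix
        G += reg * np.trace(G) / K * np.eye(K)  # condition G (cf. note 7)
        w = np.linalg.solve(G, np.ones(K))
        W[i, idx[i]] = w / w.sum()              # enforce the sum-to-one constraint
    return W
```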

Fig. 2. Steps of locally linear embedding: (1) Assign neighbors to each data point $\vec{X}_i$ (for example, by using the K nearest neighbors). (2) Compute the weights $W_{ij}$ that best linearly reconstruct $\vec{X}_i$ from its neighbors, solving the constrained least-squares problem in Eq. 1. (3) Compute the low-dimensional embedding vectors $\vec{Y}_i$ best reconstructed by $W_{ij}$, minimizing Eq. 2 by finding the smallest eigenmodes of the sparse symmetric matrix in Eq. 3. Although the weights $W_{ij}$ and vectors $\vec{Y}_i$ are computed by methods in linear algebra, the constraint that points are only reconstructed from neighbors can result in highly nonlinear embeddings.

The constrained weights that minimize these reconstruction errors obey an important symmetry: for any particular data point, they are invariant to rotations, rescalings, and translations of that data point and its neighbors. By symmetry, it follows that the reconstruction weights characterize intrinsic geometric properties of each neighborhood, as opposed to properties that depend on a particular frame of reference (8). Note that the invariance to translations is specifically enforced by the sum-to-one constraint on the rows of the weight matrix.

Suppose the data lie on or near a smooth nonlinear manifold of lower dimensionality $d \ll D$. To a good approximation, then, there exists a linear mapping, consisting of a translation, rotation, and rescaling, that maps the high-dimensional coordinates of each neighborhood to global internal coordinates on the manifold. By design, the reconstruction weights $W_{ij}$ reflect intrinsic geometric properties of the data that are invariant to exactly such transformations. We therefore expect their characterization of local geometry in the original data space to be equally valid for local patches on the manifold. In particular, the same weights $W_{ij}$ that reconstruct the $i$th data point in $D$ dimensions should also reconstruct its embedded manifold coordinates in $d$ dimensions.

LLE constructs a neighborhood-preserving mapping based on the above idea. In the final step of the algorithm, each high-dimensional observation $\vec{X}_i$ is mapped to a low-dimensional vector $\vec{Y}_i$ representing global internal coordinates on the manifold. This is done by choosing $d$-dimensional coordinates $\vec{Y}_i$ to minimize the embedding cost function

$$\Phi(Y) = \sum_i \Bigl|\vec{Y}_i - \sum_j W_{ij}\,\vec{Y}_j\Bigr|^2 \qquad (2)$$

This cost function, like the previous one, is based on locally linear reconstruction errors, but here we fix the weights $W_{ij}$ while optimizing the coordinates $\vec{Y}_i$. The embedding cost in Eq. 2 defines a quadratic form in the vectors $\vec{Y}_i$. Subject to constraints that make the problem well-posed, it can be minimized by solving a sparse $N \times N$ eigenvalue problem (9), whose bottom $d$ nonzero eigenvectors provide an ordered set of orthogonal coordinates centered on the origin.

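The following sketch carries out step 3 for weights such as those returned by `lle_weights` above, using a dense eigendecomposition for clarity (note 9 explains how the same computation exploits sparsity for large N). The function name and the $\sqrt{N}$ rescaling are choices made for this example.

```python
import numpy as np

def lle_embed(W, d=2):
    """Step 3 of Fig. 2: minimize the embedding cost of Eq. 2 for fixed weights W.

    The cost is a quadratic form with matrix M = (I - W)^T (I - W) (Eq. 3).
    The bottom eigenvector of M is the constant translation mode and is discarded;
    the next d eigenvectors give the embedding coordinates.
    """
    N = W.shape[0]
    I = np.eye(N)
    M = (I - W).T @ (I - W)
    eigvals, eigvecs = np.linalg.eigh(M)   # eigenvalues in ascending order
    Y = eigvecs[:, 1:d + 1]                # skip the zero-eigenvalue constant mode
    return np.sqrt(N) * Y                  # unit-covariance scaling (cf. note 9)
```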

Implementation of the algorithm is straightforward. In our experiments, data points were reconstructed from their K nearest neighbors, as measured by Euclidean distance or normalized dot products. For such implementations of LLE, the algorithm has only one free parameter: the number of neighbors, K. Once neighbors are chosen, the optimal weights $W_{ij}$ and coordinates $\vec{Y}_i$ are computed by standard methods in linear algebra. The algorithm involves a single pass through the three steps in Fig. 2 and finds global minima of the reconstruction and embedding costs in Eqs. 1 and 2.

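For readers who want to try the full pipeline without writing the linear algebra themselves, scikit-learn provides an implementation of standard LLE; the sketch below is one reasonable setup, with the swiss-roll surface serving as a stand-in for the manifold of Fig. 1 rather than the exact surface used there.

```python
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import LocallyLinearEmbedding

# A two-dimensional manifold embedded in three dimensions, similar in spirit to Fig. 1.
X, color = make_swiss_roll(n_samples=2000, random_state=0)

# K (n_neighbors) is the single free parameter; d = 2 recovers internal coordinates.
lle = LocallyLinearEmbedding(n_neighbors=20, n_components=2, method="standard")
Y = lle.fit_transform(X)   # N x 2 embedding, the output of step 3 in Fig. 2
print(Y.shape)             # (2000, 2)
```

Plotting `Y` colored by the manifold parameter `color` should reproduce the kind of neighborhood-preserving unrolling illustrated in Fig. 1.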

In addition to the example in Fig. 1, for which the true manifold structure was known (10), we also applied LLE to images of faces (11) and vectors of word-document counts (12). Two-dimensional embeddings of faces and words are shown in Figs. 3 and 4. Note how the coordinates of these embedding spaces are related to meaningful attributes, such as the pose and expression of human faces and the semantic associations of words.
Fig. 4. Arranging words in a continuous semantic space. Each word was initially represented by a high-dimensional vector that counted the number of times it appeared in different encyclopedia articles. LLE was applied to these word-document count vectors (12), resulting in an embedding location for each word. Shown are words from two different bounded regions (A) and (B) of the embedding space discovered by LLE. Each panel shows a two-dimensional projection onto the third and fourth coordinates of LLE; in these two dimensions, the regions (A) and (B) are highly overlapped. The inset in (A) shows a three-dimensional projection onto the third, fourth, and fifth coordinates, revealing an extra dimension along which regions (A) and (B) are more separated. Words that lie in the intersection of both regions are capitalized. Note how LLE colocates words with similar contexts in this continuous semantic space.


Many popular learning algorithms for nonlinear dimensionality reduction do not share the favorable properties of LLE. Iterative hill-climbing methods for autoencoder neural networks (13, 14), self-organizing maps (15), and latent variable models (16) do not have the same guarantees of global optimality or convergence; they also tend to involve many more free parameters, such as learning rates, convergence criteria, and architectural specifications. Finally, whereas other nonlinear methods rely on deterministic annealing schemes (17) to avoid local minima, the optimizations of LLE are especially tractable.

LLE scales well with the intrinsic manifold dimensionality, $d$, and does not require a discretized gridding of the embedding space. As more dimensions are added to the embedding space, the existing ones do not change, so that LLE does not have to be rerun to compute higher dimensional embeddings. Unlike methods such as principal curves and surfaces (18) or additive component models (19), LLE is not limited in practice to manifolds of extremely low dimensionality or codimensionality. Also, the intrinsic value of $d$ can itself be estimated by analyzing a reciprocal cost function, in which reconstruction weights derived from the embedding vectors $\vec{Y}_i$ are applied to the data points $\vec{X}_i$.
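The reciprocal cost is only described in passing, so the following is one plausible reading, sketched under stated assumptions: for each candidate $d$, compute an embedding, derive sum-to-one weights from neighborhoods in the embedding space, apply them to the original data points, and watch where the resulting error stops improving. The function name, the choice of `K`, and the regularization are illustrative, not the authors' prescription.

```python
import numpy as np
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import LocallyLinearEmbedding
from sklearn.neighbors import NearestNeighbors

def reciprocal_cost(X, Y, K=12, reg=1e-3):
    """Reconstruction error of the data X using weights derived from the embedding Y."""
    nbrs = NearestNeighbors(n_neighbors=K + 1).fit(Y)
    idx = nbrs.kneighbors(Y, return_distance=False)[:, 1:]   # drop the point itself
    cost = 0.0
    for i in range(X.shape[0]):
        Z = Y[idx[i]] - Y[i]                    # neighbor differences in embedding space
        G = Z @ Z.T
        G += reg * np.trace(G) / K * np.eye(K)  # condition the local Gram matrix
        w = np.linalg.solve(G, np.ones(K))
        w /= w.sum()                            # sum-to-one weights computed from Y ...
        cost += np.sum((X[i] - w @ X[idx[i]]) ** 2)   # ... applied to the data X
    return cost

X, _ = make_swiss_roll(n_samples=1000, random_state=0)
for d in (1, 2, 3):
    Y = LocallyLinearEmbedding(n_neighbors=12, n_components=d).fit_transform(X)
    print(d, reciprocal_cost(X, Y))   # the error should flatten near the intrinsic d
```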

LLE illustrates a general principle of manifold learning, elucidated by Martinetz and Schulten (20) and Tenenbaum (4), that overlapping local neighborhoods, collectively analyzed, can provide information about global geometry. Many virtues of LLE are shared by Tenenbaum's algorithm, Isomap, which has been successfully applied to similar problems in nonlinear dimensionality reduction. Isomap's embeddings, however, are optimized to preserve geodesic distances between general pairs of data points, which can only be estimated by computing shortest paths through large sublattices of data. LLE takes a different approach, analyzing local symmetries, linear coefficients, and reconstruction errors instead of global constraints, pairwise distances, and stress functions. It thus avoids the need to solve large dynamic programming problems, and it also tends to accumulate very sparse matrices, whose structure can be exploited for savings in time and space.

LLE is likely to be even more useful in combination with other methods in data analysis and statistical learning. For example, a parametric mapping between the observation and embedding spaces could be learned by supervised neural networks (21) whose target values are generated by LLE. LLE can also be generalized to harder settings, such as the case of disjoint data manifolds (22), and specialized to simpler ones, such as the case of time-ordered observations (23).
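As a sketch of the first suggestion, one can treat the LLE outputs as regression targets for any supervised function approximator; the specific regressor, architecture, and hyperparameters below are assumptions, not the authors' prescription.

```python
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import LocallyLinearEmbedding
from sklearn.neural_network import MLPRegressor

X, _ = make_swiss_roll(n_samples=2000, random_state=0)
Y = LocallyLinearEmbedding(n_neighbors=20, n_components=2).fit_transform(X)

# Learn a parametric map f: observation space -> embedding space, with LLE
# supplying the target values; new observations can then be embedded directly.
net = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=2000, random_state=0)
net.fit(X, Y)

X_new, _ = make_swiss_roll(n_samples=100, random_state=1)
Y_new = net.predict(X_new)   # out-of-sample embedding through the learned map
```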

Perhaps the greatest potential lies in applying LLE to diverse problems beyond those considered here. Given the broad appeal of traditional methods, such as PCA and MDS, the algorithm should find widespread use in many areas of science.

References and Notes
1. M. L. Littman, D. F. Swayne, N. Dean, A. Buja, in Computing Science and Statistics: Proceedings of the 24th Symposium on the Interface, H. J. N. Newton, Ed. (Interface Foundation of North America, Fairfax Station, VA, 1992), pp. 208-217.
2. T. Cox, M. Cox, Multidimensional Scaling (Chapman & Hall, London, 1994).
3. Y. Takane, F. W. Young, Psychometrika 42, 7 (1977).
4. J. Tenenbaum, in Advances in Neural Information Processing 10, M. Jordan, M. Kearns, S. Solla, Eds. (MIT Press, Cambridge, MA, 1998), pp. 682-688.
5. The set of neighbors for each data point can be assigned in a variety of ways: by choosing the K nearest neighbors in Euclidean distance, by considering all data points within a ball of fixed radius, or by using prior knowledge. Note that for a fixed number of neighbors, the maximum number of embedding dimensions LLE can be expected to recover is strictly less than the number of neighbors.
6. For certain applications, one might also constrain the weights to be positive, thus requiring the reconstruction of each data point to lie within the convex hull of its neighbors.
7. Fits: The constrained weights that best reconstruct each data point from its neighbors can be computed in closed form. Consider a particular data point $\vec{x}$ with neighbors $\vec{\eta}_j$ and sum-to-one reconstruction weights $w_j$. The reconstruction error $\bigl|\vec{x} - \sum_{j=1}^{K} w_j \vec{\eta}_j\bigr|^2$ is minimized in three steps. First, evaluate inner products between neighbors to compute the neighborhood correlation matrix $C_{jk} = \vec{\eta}_j \cdot \vec{\eta}_k$ and its matrix inverse $C^{-1}$. Second, compute the Lagrange multiplier $\lambda = \alpha/\beta$ that enforces the sum-to-one constraint, where $\alpha = 1 - \sum_{jk} C^{-1}_{jk}(\vec{x} \cdot \vec{\eta}_k)$ and $\beta = \sum_{jk} C^{-1}_{jk}$. Third, compute the reconstruction weights: $w_j = \sum_k C^{-1}_{jk}(\vec{x} \cdot \vec{\eta}_k + \lambda)$. If the correlation matrix $C$ is nearly singular, it can be conditioned (before inversion) by adding a small multiple of the identity matrix. This amounts to penalizing large weights that exploit correlations beyond some level of precision in the data sampling process.
8. Indeed, LLE does not require the original data to be described in a single coordinate system, only that each data point be located in relation to its neighbors.
9. The embedding vectors $\vec{Y}_i$ are found by minimizing the cost function $\Phi(Y) = \sum_i \bigl|\vec{Y}_i - \sum_j W_{ij}\vec{Y}_j\bigr|^2$ over $\vec{Y}_i$ with fixed weights $W_{ij}$. This optimization is performed subject to constraints that make the problem well posed. It is clear that the coordinates $\vec{Y}_i$ can be translated by a constant displacement without affecting the cost $\Phi(Y)$. We remove this degree of freedom by requiring the coordinates to be centered on the origin: $\sum_i \vec{Y}_i = \vec{0}$. Also, to avoid degenerate solutions, we constrain the embedding vectors to have unit covariance, with outer products that satisfy $\frac{1}{N}\sum_i \vec{Y}_i \otimes \vec{Y}_i = I$, where $I$ is the $d \times d$ identity matrix. Now the cost defines a quadratic form, $\Phi(Y) = \sum_{ij} M_{ij}(\vec{Y}_i \cdot \vec{Y}_j)$, involving inner products of the embedding vectors and the symmetric $N \times N$ matrix

$$M_{ij} = \delta_{ij} - W_{ij} - W_{ji} + \sum_k W_{ki}W_{kj} \qquad (3)$$

where $\delta_{ij}$ is 1 if $i = j$ and 0 otherwise. The optimal embedding, up to a global rotation of the embedding space, is found by computing the bottom $d + 1$ eigenvectors of this matrix (24). The bottom eigenvector of this matrix, which we discard, is the unit vector with all equal components; it represents a free translation mode of eigenvalue zero. (Discarding it enforces the constraint that the embeddings have zero mean.) The remaining $d$ eigenvectors form the $d$ embedding coordinates found by LLE. Note that the matrix $M$ can be stored and manipulated as the sparse matrix $(I - W)^{\mathsf{T}}(I - W)$, giving substantial computational savings for large values of $N$. Moreover, its bottom $d + 1$ eigenvectors (those corresponding to its smallest $d + 1$ eigenvalues) can be found efficiently without performing a full matrix diagonalization (25).
10. Manifold: Data points in Fig. 1B (N = 2000) were sampled from the manifold (D = 3) shown in Fig. 1A. Nearest neighbors (K = 20) were determined by Euclidean distance. This particular manifold was introduced by Tenenbaum (4), who showed that its global structure could be learned by the Isomap algorithm.
11. Faces: Multiple photographs (N = 2000) of the same face were digitized as 20 × 28 grayscale images. Each image was treated by LLE as a data vector with D = 560 elements corresponding to raw pixel intensities. Nearest neighbors (K = 12) were determined by Euclidean distance in pixel space.
12. Words: Word-document counts were tabulated for N = 5000 words from D = 31,000 articles in Grolier's Encyclopedia (26). Nearest neighbors (K = 20) were determined by dot products between count vectors normalized to unit length.
13. D. DeMers, G. W. Cottrell, in Advances in Neural Information Processing Systems 5, D. Hanson, J. Cowan, L. Giles, Eds. (Kaufmann, San Mateo, CA, 1993), pp. 580-587.
14. M. Kramer, AIChE J. 37, 233 (1991).
15. T. Kohonen, Self-Organization and Associative Memory (Springer-Verlag, Berlin, 1988).
16. C. Bishop, M. Svensen, C. Williams, Neural Comput. 10, 215 (1998).
17. H. Klock, J. Buhmann, Pattern Recognition 33, 651 (1999).
18. T. Hastie, W. Stuetzle, J. Am. Stat. Assoc. 84, 502 (1989).
19. D. J. Donnell, A. Buja, W. Stuetzle, Ann. Stat. 22, 1635 (1994).
20. T. Martinetz, K. Schulten, Neural Networks 7, 507 (1994).
21. D. Beymer, T. Poggio, Science 272, 1905 (1996).
22. Although in all the examples considered here the data had a single connected component, it is possible to formulate LLE for data that lies on several disjoint manifolds, possibly of different underlying dimensionality. Suppose we form a graph by connecting each data point to its neighbors. The number of connected components (27) can be detected by examining powers of its adjacency matrix. Different connected components of the data are essentially decoupled in the eigenvector problem for LLE. Thus, they are best interpreted as lying on distinct manifolds, and are best analyzed separately by LLE.
23. If neighbors correspond to nearby observations in time, then the reconstruction weights can be computed online (as the data itself is being collected) and the embedding can be found by diagonalizing a sparse banded matrix.
24. R. A. Horn, C. R. Johnson, Matrix Analysis (Cambridge Univ. Press, Cambridge, 1990).
25. Z. Bai, J. Demmel, J. Dongarra, A. Ruhe, H. van der Vorst, Eds., Templates for the Solution of Algebraic Eigenvalue Problems: A Practical Guide (Society for Industrial and Applied Mathematics, Philadelphia, PA, 2000).
26. D. D. Lee, H. S. Seung, Nature 401, 788 (1999).
27. R. Tarjan, Data Structures and Network Algorithms, CBMS 44 (Society for Industrial and Applied Mathematics, Philadelphia, PA, 1983).
28. I. T. Jolliffe, Principal Component Analysis (Springer-Verlag, New York, 1989).
29. N. Kambhatla, T. K. Leen, Neural Comput. 9, 1493 (1997).
30. We thank G. Hinton and M. Revow for sharing their unpublished work (at the University of Toronto) on segmentation and pose estimation that motivated us to "think globally, fit locally"; J. Tenenbaum (Stanford University) for many stimulating discussions about his work (4) and for sharing his code for the Isomap algorithm; D. D. Lee (Bell Labs) and B. Frey (University of Waterloo) for making available word and face data from previous work (26); and C. Brody, A. Buja, P. Dayan, Z. Ghahramani, G. Hinton, T. Jaakkola, D. Lee, F. Pereira, and M. Sahani for helpful comments. S.T.R. acknowledges the support of the Gatsby Charitable Foundation, the U.S. National Science Foundation, and the Natural Sciences and Engineering Research Council of Canada.
