An article on visualizing multi-dimensional data

From SNE to t-SNE to LargeVis
Original post: http://bindog.github.io/blog/2016/06/04/from-sne-to-tsne-to-largevis

Essentially, this is about projecting multi-dimensional data into 2D for display. The algorithm for constructing the KNN graph is particularly interesting, and the article explains it very clearly.

Abstract
1. Compute similarity structure of data points.
2. Project into a low-dimensional space.
Both steps are computationally expensive, preventing state-of-the-art methods such as t-SNE from scaling to large-scale, high-dimensional data.

LargeVis (algorithm):
1. Constructs an accurately approximated K-nearest neighbor graph.
2. Lays out the graph in a low-dimensional space.

Advantages:
1. Reduces the cost of the graph-construction step.
2. Employs a principled probabilistic model for the visualization step, whose objective can be effectively optimized through asynchronous stochastic gradient descent with linear time complexity.
3. Easily scales to millions of high dimensional data points.

Introduction
Essential idea: project high-dimensional data into space with fewer dimensions.
Preserve the intrinsic structure of the high-dimensional data (i.e., keep similar data points close and dissimilar data points far apart).

There are many dimensionality reduction techniques:
1. Linear mapping (e.g., Principal Component Analysis, multidimensional scaling).
Since most high-dimensional data usually lie on or near a low-dimensional non-linear manifold, linear mapping methods are usually not satisfactory.
2. Non-linear mapping (e.g., Locally Linear Embedding, Laplacian Eigenmaps).
Effective on small, laboratory data sets, but they do not perform well on high-dimensional real data, as they are typically unable to preserve both the local and the global structure of the high-dimensional data.
3. t-SNE:
3.1. Constructing a K-nearest neighbor (KNN) graph of the data points.
3.2. Projecting the graph into low-dimensional spaces with tree-based algorithms.
Reasons t-SNE remains unsatisfactory when applied to data with millions of points and hundreds of dimensions:
1) Constructing the K-nearest neighbor graph is a computational bottleneck for large-scale, high-dimensional data (t-SNE uses vantage-point trees).
2) The efficiency of the graph visualization step significantly deteriorates when the size of the data becomes large.
3) The parameters of t-SNE are very sensitive across different data sets.
4. LargeVis
4.1. Propose a very efficient algorithm to construct an approximate K-nearest neighbor graph from large-scale, high-dimensional data.
4.2. Propose a principled probabilistic model for graph visualization. The model preserves the structures of the graph in the low-dimensional space, keeping similar data points close and dissimilar data points far away from each other. The objective function of the model can be effectively optimized through asynchronous stochastic gradient descent with a time complexity of O(N).
4.3. The parameters are stable across different data sets, and the method is effective on real data.
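As a rough illustration of step 3.1 above: t-SNE's high-dimensional similarities are Gaussian affinities over pairwise distances. The sketch below is a simplification, using a single fixed bandwidth `sigma` instead of the per-point, perplexity-calibrated bandwidths real t-SNE computes:

```python
# Sketch of t-SNE's first step: Gaussian affinities in the high-dimensional
# space. Illustrative only; real t-SNE tunes one bandwidth per point so that
# each conditional distribution matches a target perplexity.
import numpy as np

def gaussian_affinities(X, sigma=1.0):
    # Squared Euclidean distances between all pairs of rows.
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    P = np.exp(-d2 / (2.0 * sigma**2))
    np.fill_diagonal(P, 0.0)   # a point is not its own neighbor
    return P / P.sum()         # normalize into a joint probability distribution

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))    # 5 points in 3 dimensions
P = gaussian_affinities(X)
print(P.shape)                 # (5, 5)
```

The low-dimensional layout is then found by matching a similar distribution in 2D/3D to `P`, which is where most of t-SNE's cost and parameter sensitivity lives.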

Related Work
2.1 K-nearest neighbor Graph construction
Exact computation of the KNN graph has complexity O(N²d) (where N is the number of data points and d the number of dimensions), which is too costly at scale. Approximate approaches include:
1) Space-partitioning trees
2) Locality sensitive hashing
3) Neighbor exploring
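To make the O(N²d) bottleneck concrete, here is a brute-force exact KNN sketch: every pair of points requires a d-dimensional distance computation, which is exactly the cost the three approximate approaches above avoid.

```python
# Brute-force exact KNN: the full N x N distance matrix costs O(N^2 * d),
# which becomes prohibitive for millions of points.
import numpy as np

def knn_brute_force(X, k):
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T  # all pairwise distances
    np.fill_diagonal(d2, np.inf)                    # exclude each point itself
    return np.argsort(d2, axis=1)[:, :k]            # indices of the k nearest

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 8))
nbrs = knn_brute_force(X, k=5)
print(nbrs.shape)  # (100, 5)
```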

LargeVis
A neighbor of my neighbor is also likely to be my neighbor.
1. Build a few random projection trees to construct an approximate K-nearest neighbor graph.
2. Then, for each node of the graph, search the neighbors of its neighbors; this may be repeated for multiple iterations to improve the accuracy of the graph.
3. A principled probabilistic model is used for projecting the graph into a 2D/3D space.
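The neighbor-exploring refinement in step 2 can be sketched as below. The function name and the random initial graph are illustrative assumptions of mine; LargeVis itself builds the initial approximate graph with random projection trees.

```python
# Neighbor-exploring refinement: "a neighbor of my neighbor is also likely
# to be my neighbor" -- for each node, take its current neighbors plus its
# neighbors' neighbors as candidates, and keep the k closest.
import numpy as np

def explore_neighbors(X, nbrs, k):
    n = len(X)
    refined = np.empty_like(nbrs)
    for i in range(n):
        cand = set(nbrs[i])          # current approximate neighbors
        for j in nbrs[i]:
            cand.update(nbrs[j])     # plus neighbors of neighbors
        cand.discard(i)
        cand = np.fromiter(cand, dtype=int)
        d2 = np.sum((X[cand] - X[i]) ** 2, axis=1)
        refined[i] = cand[np.argsort(d2)[:k]]
    return refined

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 4))
k = 4
# A random initial graph stands in for the projection-tree output here.
init = np.array([rng.choice([j for j in range(50) if j != i],
                            size=k, replace=False) for i in range(50)])
refined = explore_neighbors(X, init, k)
print(refined.shape)  # (50, 4)
```

Each iteration only examines O(k²) candidates per node instead of all N points, which is why repeating this a few times is far cheaper than exact KNN while quickly improving graph accuracy.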
