A High-Dimensional Data Visualization Tool: t-SNE

Introduction

Based on the paper Visualizing Data using t-SNE and the article How to Use t-SNE Effectively, together with example code.

t-distributed Stochastic Neighbor Embedding (t-SNE) is a visualization tool for high-dimensional (high-dim) data, and in practice also a way to separate classes (it performs well for dimensionality N < 100): it embeds the high-dim data into a 2- or 3-dim map in which similar points form clusters. It is an unsupervised Machine Learning (ML) technique.

Two examples are worked out below: classification of 64-pixel handwritten digits, and classification of flower species from 4 input features.

Classifying data is essentially a kind of dimensionality reduction; ML here is used to minimize something (a cost function) that is non-linear and has many parameters.

Stochastic Neighbor Embedding (SNE): basic idea and its math

SNE is expected to find a faithful low-dim representation of high-dim data, one that preserves the small-distance (local) structure while still reflecting the large distances of the high-dim data.

conditional probability

Assume one sample of the high-dim data is written as a vector $x_i$ and its mapping point in the low-dim space is $y_i$. Then define the conditional probabilities $p_{j|i}$ and $q_{j|i}$:

$x_i$: high-dim vector, a fixed point representing the real data in the high-dim space
$y_i$: low-dim vector, the movable mapping point in the low-dim space

$$p_{j \mid i}=\frac{\exp\left(-\|x_i-x_j\|^2 / 2\sigma_i^2\right)}{\sum_{k \neq i} \exp\left(-\|x_i-x_k\|^2 / 2\sigma_i^2\right)}, \qquad p_{i \mid i}=0$$

$$q_{j \mid i}=\frac{\exp\left(-\|y_i-y_j\|^2\right)}{\sum_{k \neq i} \exp\left(-\|y_i-y_k\|^2\right)}, \qquad \sigma=\frac{1}{\sqrt{2}}, \qquad q_{i \mid i}=0$$

where $\|x_i-x_j\|$ is the Euclidean distance. The similarity of datapoint $x_j$ to datapoint $x_i$ is the conditional probability, $p_{j|i}$, that $x_i$ would pick $x_j$ as its neighbor if neighbors were picked in proportion to their probability density under a Gaussian centered at $x_i$. For example, for a large distance, $p_{j|i} \sim 0$. The $\sigma_i$ here acts as a rescaling factor.

If y i y_i yi faithful to x i x_i xi, that should be q j ∣ i ≈ p j ∣ i q_{j \mid i} \approx p_{j \mid i} qjipji

The Cost function

The cost function is the Kullback-Leibler divergence (also named relative entropy; in this case it equals the cross-entropy up to an additive constant, and it is widely used elsewhere, for example in AdS/CFT holography). It is the cross-entropy of the two probability distributions minus the Shannon entropy of the first:
$$C=\sum_i KL\left(P_i \| Q_i\right)=\sum_i \sum_j p_{j \mid i} \log \frac{p_{j \mid i}}{q_{j \mid i}}$$
The Shannon entropy is $S = -\sum_j p_j \log p_j$, and
$$\sum_j p_{j \mid i} \log \frac{p_{j \mid i}}{q_{j \mid i}} = \sum_j p_{j \mid i} \log p_{j \mid i} - \sum_j p_{j \mid i} \log q_{j \mid i}$$

if y i y_i yi faithful to x i x_i xi, have q j ∣ i ≈ p j ∣ i q_{j \mid i} \approx p_{j \mid i} qjipji, that C = 0 C=0 C=0.

Why minimize the cost function?

The cost is large when nearby $x_i$ points are mapped to widely separated $y_i$ points.

For simplicity, consider the following scenario:
[figure omitted]
Right panel: the high-dim $x$ plane, where $x_j$ is near $x_i$ but all other points are far away.
Left panel: the low-dim $y$ plane, where $y_k$ is near $y_i$ but all other points, including $y_j$, are far away.

For the $x$ plane, set $\sigma = 1/\sqrt{2}$; then
$$p_{j \mid i} \approx \frac{\exp\left(-\|x_i-x_j\|^2\right)}{\exp\left(-\|x_i-x_j\|^2\right)} = 1$$
because the other points are far away, so $\exp\left(-\|x_i-x_k\|^2\right) \sim 0$ and only the $j$ term survives in the denominator.

For the $y$ plane,
$$q_{j \mid i} \approx \frac{\exp\left(-\|y_i-y_j\|^2\right)}{\exp\left(-\|y_i-y_k\|^2\right)} \approx 0$$
because $\|y_i-y_k\|$ is small (so the denominator is of order one) while $\|y_i-y_j\|$ and the other distances are large.

The contribution to the cost function from the pair $(j, i)$ is then
$$C_{ji} = p_{j \mid i} \log \frac{p_{j \mid i}}{q_{j \mid i}} = \log \frac{1}{0} \sim \infty$$
so after the cost function is minimized, this configuration will not survive. In the mapped $y$ plane, the point near $y_i$ will be $y_j$, because then $q_{j \mid i} \approx 1$ and $C_{ji} \approx 0$.

Entropy and perplexity

Once we use probabilities to describe things, it is natural to relate them to information entropy (or Bayesian reasoning). The Shannon entropy is defined as:
$$H\left(P_i\right)=-\sum_j p_{j \mid i} \log_2 p_{j \mid i}$$
The perplexity is defined as
$$\operatorname{Perp}\left(P_i\right)=2^{H\left(P_i\right)}$$
How complex can one thing be?

$$\text{entropy} \approx \text{number of bits}, \qquad \text{perplexity} \approx \text{number of states}$$
It is decided by how many bits are needed to describe all of the possible states. For example, a coin toss has two states, "heads" and "tails", so its entropy is 1 bit.

Example:

Consider 4 fair coins (all $2^4 = 16$ outcomes equally probable):
$$H = 16 \times \left(-\frac{1}{2^4} \log_2 \frac{1}{2^4}\right) = 4 \text{ bits}, \qquad \operatorname{Perp} = 2^4 = 16 \text{ possible states}$$

SNE performs a binary search for the value of $\sigma_i$ that produces a $P_i$ with a fixed perplexity (typically $5 \sim 50$) specified by the user.

In dense regions, a smaller value of $\sigma_i$ is usually more appropriate.
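A sketch of that binary search for one point, written (as is common) in terms of the precision $\beta_i = 1/(2\sigma_i^2)$; the function name and loop limits are my own choices:

```python
def sigma_for_perplexity(dist_sq_row, target_perplexity, tol=1e-5, max_iter=50):
    """Binary-search beta = 1/(2 sigma_i^2) so that Perp(P_i) matches the target.

    dist_sq_row: squared distances from x_i to the other points (self excluded).
    """
    beta, beta_lo, beta_hi = 1.0, 0.0, np.inf
    target_entropy = np.log2(target_perplexity)
    for _ in range(max_iter):
        p = np.exp(-beta * dist_sq_row)
        p /= p.sum()
        entropy = -np.sum(p * np.log2(p + 1e-12))
        if abs(entropy - target_entropy) < tol:
            break
        if entropy > target_entropy:      # distribution too flat: shrink sigma (raise beta)
            beta_lo = beta
            beta = beta * 2.0 if np.isinf(beta_hi) else 0.5 * (beta + beta_hi)
        else:                             # too peaked: enlarge sigma (lower beta)
            beta_hi = beta
            beta = 0.5 * (beta + beta_lo)
    return np.sqrt(1.0 / (2.0 * beta))    # the sigma_i that hits the target perplexity
```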

Further understanding of Perplexity and $\sigma_i$

$$\text{Perplexity of } x_i = \text{effective number of points neighboring } x_i$$
$$\text{The range of a neighborhood is set by } \sigma_i$$
Example:

If a point $x_i$ has 16 equidistant neighbors, then for a suitable $\sigma_i$ we get $p_{k|i} = 1/16$ for each of them, so the perplexity of $x_i$ is 16. In practice the perplexity is specified by the user, and the algorithm finds the corresponding $\sigma_i$.

Effect of changing $\sigma_i$

This distribution has an entropy that increases as $\sigma_i$ increases.
Consider $N$ equidistant points (separation $d$) and count only the points within a distance of $3\sigma_i$; assume there are $M$ of them:
$$p_{j \mid i} \approx \frac{\exp\left(-d^2/2\sigma_i^2\right)}{(M-1)\exp\left(-d^2/2\sigma_i^2\right)} \approx \frac{1}{M}, \qquad H = \log_2 M$$
Now increase $\sigma_i \to \sigma_i'$: the number of points within $3\sigma_i'$ also increases, $M \to M'$, so $H = \log_2 M'$ grows.

Minimize the Cost Function

Use the gradient descent method; the gradient of the cost function is
$$\frac{\delta C}{\delta y_i}=2 \sum_j\left(p_{j \mid i}-q_{j \mid i}+p_{i \mid j}-q_{i \mid j}\right)\left(y_i-y_j\right)$$
which is straightforward to derive from the definitions.
After the algorithm has found the $\sigma_i$, moving the mapping points $y_i$ in the direction opposite to the gradient drives $C$ toward its minimum.

The dynamics of the $y_i$ points can be viewed as an $N$-body spring system: each pair acts like a spring with stiffness
$$k_{ij} = 2\left(p_{j \mid i}-q_{j \mid i}+p_{i \mid j}-q_{i \mid j}\right), \qquad \frac{\delta C}{\delta y_i} = \sum_j k_{ij}\left(y_i-y_j\right)$$
If $y_i$ is faithful to $x_i$, then $q_{j \mid i} \approx p_{j \mid i}$ and $q_{i \mid j} \approx p_{i \mid j}$ (the similarities agree), so $k_{ij} = 0$ and $\frac{\delta C}{\delta y_i}=0$. Until then, the spring forces keep pulling the map into clusters.

Update of $y_i$

$$\mathcal{Y}^{(t)}=\mathcal{Y}^{(t-1)}+\eta \frac{\delta C}{\delta \mathcal{Y}}+\alpha(t)\left(\mathcal{Y}^{(t-1)}-\mathcal{Y}^{(t-2)}\right)$$
where $\mathcal{Y}^{(t)}$ indicates the solution at iteration $t$, $\eta$ is the learning rate, and $\alpha(t)$ is the momentum at iteration $t$.
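A rough sketch of one such update step, reusing the conditional_q helper from above (the learning-rate and momentum values are placeholders; note that we step against the gradient to descend):

```python
def sne_update(Y, Y_prev, P, eta=100.0, alpha=0.5):
    """One gradient step with momentum for the SNE map Y."""
    Q = conditional_q(Y)
    # delta C / delta y_i = 2 * sum_j (p_{j|i} - q_{j|i} + p_{i|j} - q_{i|j}) (y_i - y_j)
    K = (P - Q) + (P - Q).T
    grad = 2.0 * (K.sum(axis=1, keepdims=True) * Y - K @ Y)
    return Y - eta * grad + alpha * (Y - Y_prev)
```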

How do the mapping points move around the $y$ plane as $t$ increases?
[figure omitted]

t-Distributed SNE (t-SNE)

Differences from SNE:

  1. a symmetrized version of the SNE cost function (set $p_{i|j}=p_{j|i}$, $q_{i|j}=q_{j|i}$)
  2. a Student t-distribution rather than a Gaussian to compute the similarity ($q_{ij}$) between two points in the low-dim $y$ space

Symmetrizing the similarities:
$$p_{ij}=\frac{p_{j \mid i}+p_{i \mid j}}{2n}, \qquad q_{ij}=\frac{q_{j \mid i}+q_{i \mid j}}{2n}$$
The gradient of the cost function becomes
$$\frac{\delta C}{\delta y_i}=4 \sum_j\left(p_{ij}-q_{ij}\right)\left(y_i-y_j\right)$$
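In code, the symmetrization is a one-liner (again reusing the earlier conditional matrix):

```python
def joint_p(P_cond):
    """p_ij = (p_{j|i} + p_{i|j}) / (2n) from the conditional matrix p_{j|i}."""
    n = P_cond.shape[0]
    return (P_cond + P_cond.T) / (2.0 * n)
```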

“Crowding Problem”

For example, when $m$-dim data are mapped into 2 dims, the volume available in a dense region of the $x$ space scales as $\sim r^m$, while the corresponding area in the $y$ plane scales only as $\sim r^2$, so there is not enough room to place moderately distant points faithfully.

Solution

During the minimization, add some Gaussian noise (a form of simulated annealing), then add a slight background repulsion to all springs.

The repulsion enters through $q_{ij}$ (it acts like a spring with negative stiffness $-k$); the uniform background term ensures
$$q_{ij} > \frac{2\rho}{n(n-1)}$$
The minimization procedure is:
Start with $\rho = 0$ $\to$ introduce Gaussian noise during some intermediate iterations $\to$ turn on the background repulsion ($\rho > 0$) $\to$ gaps between the clusters appear.

Problem with directly optimizing UNI-SNE

If the repulsion is added from the very beginning, then for a pair of widely separated points $y_i, y_j$,
$$q_{ij} \sim \frac{2\rho}{n(n-1)},$$
which is essentially just the background repulsion. A small change in $\|y_i-y_j\|$ then has no influence on the cost function $C$, since $q_{ij}$ stays $\sim \frac{2\rho}{n(n-1)}$.

Student t-distribution

$$q_{ij}=\frac{\left(1+\|y_i-y_j\|^2\right)^{-1}}{\sum_{k \neq l}\left(1+\|y_k-y_l\|^2\right)^{-1}}$$
In the high-dim $x$ space we still use a Gaussian, but in the $y$ plane we use the distribution above (a Student t-distribution with one degree of freedom).

When $\|y_i-y_j\|$ is large,
$$q_{ij} \sim 1/\|y_i-y_j\|^2,$$
a long-range, $1/r^2$-like interaction.
New gradient:
$$\frac{\delta C}{\delta y_i}=4 \sum_j\left(p_{ij}-q_{ij}\right)\left(y_i-y_j\right)\left(1+\|y_i-y_j\|^2\right)^{-1}$$
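A numpy sketch of the heavy-tailed $q_{ij}$ and this gradient, continuing the earlier helpers (names are my own; P is the symmetrized joint matrix from above):

```python
def joint_q_student_t(Y):
    """q_ij with a Student-t (1 degree of freedom) kernel in the low-dim map."""
    D = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    W = 1.0 / (1.0 + D)                  # (1 + ||y_i - y_j||^2)^{-1}
    np.fill_diagonal(W, 0.0)
    return W / W.sum(), W                # normalized q_ij and the raw kernel

def tsne_gradient(P, Y):
    """delta C / delta y_i = 4 sum_j (p_ij - q_ij)(y_i - y_j)(1 + ||y_i - y_j||^2)^{-1}"""
    Q, W = joint_q_student_t(Y)
    M = (P - Q) * W
    return 4.0 * (M.sum(axis=1, keepdims=True) * Y - M @ Y)
```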

Optimization Tricks

"Early compression": force the map points to stay close together at the start of the optimization, which makes it easy for clusters to move through one another and explore the space of possible global organizations of the data.

"Early exaggeration" (used in t-SNE): multiply all of the $p_{ij}$ by a factor (for example, 4) during the initial stages of the optimization. The data then tend to form tight, widely separated clusters in the map, which makes it much easier for the clusters to move around relative to one another.
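scikit-learn's implementation exposes early exaggeration (and perplexity) directly; a minimal example, with illustrative parameter values rather than recommendations:

```python
from sklearn.manifold import TSNE

tsne = TSNE(n_components=2, perplexity=30, early_exaggeration=4.0,
            learning_rate=200.0, init="pca", random_state=0)
# Y = tsne.fit_transform(X)   # X: array of shape (n_samples, n_features)
```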

computational bottleneck

The $p_{ij}, q_{ij}$ are defined between pairs of points, so as the number of data points grows the amount of computation scales as
$$n(n-1)/2 \sim n^2$$
For $\geq 10{,}000$ data points this becomes time-consuming; one solution is to randomly sample a subset, as sketched below.
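A minimal sketch of that workaround (the stand-in data set, subset size, and seeds are arbitrary choices of mine):

```python
import numpy as np
from sklearn.manifold import TSNE

X = np.random.default_rng(0).normal(size=(20000, 50))       # stand-in for a large data set
idx = np.random.default_rng(1).choice(len(X), size=2000, replace=False)
Y_subset = TSNE(n_components=2, perplexity=30).fit_transform(X[idx])
```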

Random walk

A random-walk-based affinity measure: more efficient, faster, and it prevents the "short-cut" problem.
[figure omitted: points A, B, C, with a single noise point bridging two clusters]
A, B, C are equidistant, but A and B lie in the same cluster while C lies in another. A single noise point between two clusters will not influence the partitioning, because the similarity of two points ($x_i, x_j$) is determined not by the shortest path but by an integration over all paths.

The emission probability from $x_i \to x_j$ is
$$e^{-\|x_i-x_j\|^2}$$

Other dimensionality reduction methods

  1. PCA (Principal Component Analysis)
  2. Classical scaling, which minimizes the cost function
    $$C=\frac{1}{\sum_{ij}\|x_i-x_j\|} \sum_{i \neq j} \frac{\left(\|x_i-x_j\|-\|y_i-y_j\|\right)^2}{\|x_i-x_j\|} = \frac{1}{N} \sum_{i \neq j} \frac{\sigma_{ij}^2}{d_{ij}}$$
    where $\sigma_{ij} = \|x_i-x_j\|-\|y_i-y_j\|$ and $d_{ij} = \|x_i-x_j\|$. If $\|x_i-x_j\|$ is large, the corresponding error is shrunk by the factor $\frac{1}{\|x_i-x_j\|}$: large distances in the high-dim data are de-emphasized, short distances amplified.
  3. Sammon mapping (Sammon, 1969)
  4. t-SNE
  5. Isomap
  6. LLE, … (a quick scikit-learn comparison of several of these is sketched below)
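A minimal scikit-learn sketch comparing a few of these embeddings on the same data (the digits set is used as a stand-in; parameter values are illustrative):

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE, Isomap, LocallyLinearEmbedding

X = load_digits().data
models = {
    "PCA": PCA(n_components=2),
    "Isomap": Isomap(n_components=2, n_neighbors=10),
    "LLE": LocallyLinearEmbedding(n_components=2, n_neighbors=10),
    "t-SNE": TSNE(n_components=2, perplexity=30, random_state=0),
}
embeddings = {name: m.fit_transform(X) for name, m in models.items()}
```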

Diffusion method

Diffusion-based interpretation: the distance between mapped points, $\|y_i-y_j\|$, should match the diffusion distance.

Definition of the diffusion distance:
$$D^{(t)}\left(x_i, x_j\right)=\sqrt{\sum_k \frac{\left(p_{ik}^{(t)}-p_{jk}^{(t)}\right)^2}{\psi\left(x_k\right)^{(0)}}}$$
Cost function:
$$C=\sum_i \sum_j\left(D^{(t)}\left(x_i, x_j\right)-\|y_i-y_j\|\right)^2$$
where $p_{ij}^{(t)}$ represents the probability of a particle traveling from $x_i$ to $x_j$ in $t$ timesteps with Gaussian emission probabilities, and $\psi\left(x_k\right)^{(0)}$ is a measure of the local density of the points.

How does a particle travel from $x_i$ to $x_j$?

[figure omitted: two routes from $x_i$ to $x_j$, a multi-hop solid path and a direct dashed path]
The emission probability for a single hop $x_i \to x_j$ (an arrow in the figure) equals $e^{-\|x_i-x_j\|^2/\sigma^2}$.
Along the solid path, traveling from $x_i$ to $x_j$ takes 4 time steps, but each hop has high probability.
Along the dashed path, it takes 1 time step, but with low probability.

This assigns much higher importance to large diffusion distances than to small ones, so it is not good at retaining the local structure.

Weaknesses of t-SNE

  1. It is unclear how t-SNE performs when the target dimension of $y$ is higher than 3.
  2. It is sensitive to the intrinsic dimensionality of the data (for intrinsic dimension of $x_i > 100$ it may be less successful).
  3. The cost function is not guaranteed to converge to a global optimum.

Experiments

Comparison with other methods

Visualizations of 6,000 handwritten digits from the MNIST data set.
[figures omitted: 2-D maps of the 6,000 MNIST digits produced by the compared methods]

Classification tasks

64-pixel handwritten digits

[figures omitted]
Input shape: 1797 images, 64 pixels each.
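A script along these lines reproduces such a picture with the scikit-learn digits data (the plotting choices are illustrative):

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

digits = load_digits()                     # 1797 images, 8x8 = 64 pixels each
print(digits.data.shape)                   # (1797, 64)

Y = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(digits.data)

plt.scatter(Y[:, 0], Y[:, 1], c=digits.target, cmap="tab10", s=5)
plt.colorbar(label="digit")
plt.title("t-SNE embedding of the 64-pixel digits")
plt.show()
```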

Iris flower classification

Input features of the iris flowers:

['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']

[figure omitted]

Input shape: (150, 4).
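And the corresponding sketch for the iris data (again, parameter values are illustrative):

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.manifold import TSNE

iris = load_iris()
print(iris.data.shape)                     # (150, 4)

# Perplexity must stay below the number of samples; 20 is an arbitrary choice here.
Y = TSNE(n_components=2, perplexity=20, random_state=0).fit_transform(iris.data)

plt.scatter(Y[:, 0], Y[:, 1], c=iris.target, cmap="viridis", s=15)
plt.title("t-SNE embedding of the iris features")
plt.show()
```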
