Introduction
Based on Visualizing Data using t-SNE.
How to Use t-SNE Effectively
example code
Student t-distributed Stochastic Neighbor Embedding (t-SNE) is a visualization tool that can also be used to group (cluster) high-dimensional (high-dim) data; it performs well for $N < 100$ dimensions. It embeds the high-dim data into a 2- or 3-dim map in which similar points form clusters. It is an unsupervised machine learning (ML) technique.
Grouping data in this way is essentially a kind of dimensionality reduction. ML is used here to minimize a non-linear cost function over a large number of parameters.
Stochastic Neighbor Embedding (SNE): basic idea of SNE and its math
SNE tries to find a faithful low-dim representation of high-dim data, one that preserves the small-distance (local) structure and still reflects the large-distance structure of the high-dim data.
Conditional probability
Assume that one sample of the high-dim data is written as a vector $x_i$ and that its mapped point in the low-dim space is $y_i$. Then define two conditional probabilities $p_{j|i}$ and $q_{j|i}$:
$$
x_i \quad \text{high-dim vector, a fixed point (the real data in high-dim)} \\
y_i \quad \text{low-dim map vector, a movable point in low-dim}
$$
$$
p_{j \mid i}=\frac{\exp \left(-\left\|x_{i}-x_{j}\right\|^{2} / 2 \sigma_{i}^{2}\right)}{\sum_{k \neq i} \exp \left(-\left\|x_{i}-x_{k}\right\|^{2} / 2 \sigma_{i}^{2}\right)}, \quad p_{i \mid i}=0
$$
$$
q_{j \mid i}=\frac{\exp \left(-\left\|y_{i}-y_{j}\right\|^{2}\right)}{\sum_{k \neq i} \exp \left(-\left\|y_{i}-y_{k}\right\|^{2}\right)}, \quad \sigma=\frac{1}{\sqrt{2}}, \quad q_{i \mid i}=0
$$
where $\left\|x_{i}-x_{j}\right\|$ is the Euclidean distance. The similarity of datapoint $x_j$ to datapoint $x_i$ is the conditional probability $p_{j|i}$ that $x_i$ would pick $x_j$ as its neighbor if neighbors were picked in proportion to their probability density under a Gaussian centered at $x_i$. For example, for a large distance, $p_{j|i} \sim 0$. The bandwidth $\sigma_i$ rescales the distances around $x_i$.
If $y_i$ is faithful to $x_i$, then $q_{j \mid i} \approx p_{j \mid i}$.
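The definition of $p_{j|i}$ can be sketched in a few lines of plain Python. This is a minimal illustration, not the reference implementation: the helper names `sq_dist` and `conditional_p`, the toy data `X`, and the choice $\sigma_i = 1$ are all assumptions made for the example.

```python
import math

def sq_dist(a, b):
    """Squared Euclidean distance between two points given as sequences."""
    return sum((u - v) ** 2 for u, v in zip(a, b))

def conditional_p(points, i, sigma):
    """Return the list of p_{j|i} over all j for a fixed i (p_{i|i} = 0)."""
    weights = [
        0.0 if j == i
        else math.exp(-sq_dist(points[i], points[j]) / (2.0 * sigma ** 2))
        for j in range(len(points))
    ]
    total = sum(weights)
    return [w / total for w in weights]

# Toy data (hypothetical): point 0 has one close neighbor and two distant ones.
X = [(0.0, 0.0, 0.0), (0.5, 0.0, 0.0), (4.0, 0.0, 0.0), (0.0, 5.0, 0.0)]
p_given_0 = conditional_p(X, i=0, sigma=1.0)
print(p_given_0)
```

Almost all of the probability mass lands on the close neighbor, while the distant points get $p_{j|i} \sim 0$, as described above.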
The cost function
The cost is the Kullback-Leibler divergence, also called relative entropy, which is widely used elsewhere, for example in AdS/CFT holographic theory. In this case it equals the cross-entropy up to an additive constant: it is the cross-entropy between $P_i$ and $Q_i$ minus the Shannon entropy of $P_i$.
$$
C=\sum_{i} KL\left(P_{i} \| Q_{i}\right)=\sum_{i} \sum_{j} p_{j \mid i} \log \frac{p_{j \mid i}}{q_{j \mid i}}
$$
The Shannon entropy is
$$
S=-\sum_{j} p_{j} \log p_{j}
$$
so each term of the cost splits into a negative-entropy part and a cross-entropy part:
$$
\sum_{j} p_{j \mid i} \log \frac{p_{j \mid i}}{q_{j \mid i}}=\sum_{j} p_{j \mid i} \log p_{j \mid i}-\sum_{j} p_{j \mid i} \log q_{j \mid i}
$$
If $y_i$ is faithful to $x_i$, then $q_{j \mid i} \approx p_{j \mid i}$ and $C \approx 0$.
Why minimize the cost function?
The cost is large when nearby points $x_i$, $x_j$ are mapped to widely separated points $y_i$, $y_j$.
For simplicity, consider the following scenario:
Right panel: the high-dim $x$ plane, where $x_i$ is near $x_j$ and all other points are far away.
Left panel: the low-dim $y$ plane, where $y_i$ is near $y_k$ and all other points, including $y_j$, are far away.
In the $x$ plane, set $\sigma = 1/\sqrt{2}$; then
$$
p_{j \mid i} \approx \frac{\exp \left(-\left\|x_{i}-x_{j}\right\|^{2}\right)}{\exp \left(-\left\|x_{i}-x_{j}\right\|^{2}\right)}=1
$$
because the other points are far away, so $\exp \left(-\left\|x_{i}-x_{k}\right\|^{2}\right) \sim 0$ for $k \neq j$.
In the $y$ plane,
$$
q_{j \mid i} \approx \frac{\exp \left(-\left\|y_{i}-y_{j}\right\|^{2}\right)}{\exp \left(-\left\|y_{i}-y_{k}\right\|^{2}\right)} \approx 0
$$
because $\left\|y_{i}-y_{k}\right\|$ is small while the distances to the other points are large.
The contribution of the index pair $j, i$ to the cost function is then
$$
C_{j i}=p_{j \mid i} \log \frac{p_{j \mid i}}{q_{j \mid i}} \approx \log \frac{1}{0} \sim \infty
$$
So this configuration does not survive the minimization of the cost function. In the resulting $y$-plane map, the point nearest to $y_i$ will be $y_j$, because then $q_{j \mid i} \approx 1$ and $C_{j i} \approx 0$.
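The scenario can be checked numerically. The sketch below is only an illustration under assumed toy coordinates: `neighbor_probs`, `kl_cost_for_point`, `X`, `Y_bad` and `Y_good` are hypothetical names and values, and the same $\sigma = 1/\sqrt{2}$ is used in both spaces. It compares a map that pushes the true neighbor far away with a map that keeps it close.

```python
import math

def neighbor_probs(points, i, sigma):
    """Gaussian neighbor probabilities of all points j relative to point i."""
    w = [
        0.0 if j == i
        else math.exp(-sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
                      / (2.0 * sigma ** 2))
        for j in range(len(points))
    ]
    s = sum(w)
    return [v / s for v in w]

def kl_cost_for_point(p, q, eps=1e-300):
    """sum_j p_j log(p_j / q_j); terms with p_j = 0 are skipped, eps guards log(0)."""
    return sum(pj * math.log(pj / max(qj, eps)) for pj, qj in zip(p, q) if pj > 0.0)

sigma = 1.0 / math.sqrt(2.0)
X = [(0.0, 0.0), (1.0, 0.0), (6.0, 0.0)]   # x_0 is close to x_1, x_2 is far away
Y_bad = [(0.0,), (6.0,), (1.0,)]           # the map swaps the roles of points 1 and 2
Y_good = [(0.0,), (1.0,), (6.0,)]          # the map keeps point 1 as nearest neighbor

p = neighbor_probs(X, 0, sigma)
cost_bad = kl_cost_for_point(p, neighbor_probs(Y_bad, 0, sigma))
cost_good = kl_cost_for_point(p, neighbor_probs(Y_good, 0, sigma))
print(cost_bad, cost_good)
```

The unfaithful map pays a large cost for point 0, while the faithful map pays essentially nothing, which is why minimization drives the true neighbor back next to $y_i$.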
Entropy and perplexity
When probabilities are used to describe something, it is natural to relate them to information entropy or to Bayesian reasoning. The Shannon entropy is defined as:
$$
H\left(P_{i}\right)=-\sum_{j} p_{j \mid i} \log _{2} p_{j \mid i}
$$
The perplexity is defined as
$$
\operatorname{Perp}\left(P_{i}\right)=2^{H\left(P_{i}\right)}
$$
How complex can a thing be?
$$
\text{entropy} \approx \text{number of bits} \\
\text{perplexity} \approx \text{number of states}
$$
That is decided by how many bits are needed to describe all of its possible states. For example, a coin toss has two states, “heads” and “tails”, so its entropy is 1 bit.
Example:
Consider 4 coins (all outcomes equally probable):
$$
\text{entropy} = 4 \text{ bits}, \\
H=-16 \times \frac{1}{2^{4}} \log _{2} \frac{1}{2^{4}}=4 \\
\text{perplexity} = 2^{4} = 16 \text{ possible states}
$$
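The coin examples can be reproduced directly from the two definitions. This is a small sketch; the function names `entropy_bits` and `perplexity` and the biased-coin case are assumptions added for illustration.

```python
import math

def entropy_bits(probs):
    """Shannon entropy in bits of a discrete distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0.0)

def perplexity(probs):
    """Perplexity 2^H, the effective number of states."""
    return 2.0 ** entropy_bits(probs)

one_coin = [0.5, 0.5]            # heads or tails
four_coins = [1.0 / 16] * 16     # 2^4 equally likely outcomes
biased_coin = [0.9, 0.1]         # hypothetical unfair coin

for name, dist in [("one coin", one_coin), ("four coins", four_coins), ("biased", biased_coin)]:
    print(name, entropy_bits(dist), perplexity(dist))
```

The biased coin shows why perplexity is called an effective number of states: it still has two outcomes, but its perplexity is below 2.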
SNE performs a binary search for the value of $\sigma_i$ that produces a $P_i$ with a fixed perplexity (typically $5 \sim 50$) that is specified by the user.
In dense regions, a smaller value of $\sigma_i$ is usually more appropriate.
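The search for $\sigma_i$ can be sketched as a bisection, which works because the perplexity grows monotonically with $\sigma_i$. The code below is an assumption-laden illustration: the names `perplexity` and `find_sigma`, the search bounds, the tolerance, and the toy distances are all hypothetical, and bisecting on a logarithmic scale is a design choice made here, not something the text prescribes.

```python
import math

def perplexity(sq_dists, sigma):
    """Perplexity of the Gaussian neighbor distribution of one point.

    sq_dists holds the squared distances from x_i to every other point.
    """
    # Shift by the smallest distance for numerical stability;
    # the shift cancels in the normalization.
    d_min = min(sq_dists)
    w = [math.exp(-(d - d_min) / (2.0 * sigma ** 2)) for d in sq_dists]
    total = sum(w)
    probs = [v / total for v in w]
    h = -sum(p * math.log2(p) for p in probs if p > 0.0)
    return 2.0 ** h

def find_sigma(sq_dists, target, lo=1e-3, hi=1e3, tol=1e-6, max_iter=200):
    """Bisection on sigma, using that perplexity increases with sigma."""
    mid = math.sqrt(lo * hi)
    for _ in range(max_iter):
        mid = math.sqrt(lo * hi)          # geometric midpoint of the bracket
        perp = perplexity(sq_dists, mid)
        if abs(perp - target) < tol:
            break
        if perp < target:
            lo = mid                      # too few effective neighbors: widen
        else:
            hi = mid                      # too many effective neighbors: narrow
    return mid

# Toy case: 30 neighbors at distances 1, 2, ..., 30 from x_i.
sq_dists = [float(k * k) for k in range(1, 31)]
sigma_5 = find_sigma(sq_dists, target=5.0)
sigma_10 = find_sigma(sq_dists, target=10.0)
print(sigma_5, sigma_10)
```

A larger target perplexity yields a larger $\sigma_i$, and shrinking all distances (a denser region) would yield a smaller one.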
Further understanding of perplexity and $\sigma_i$
$$
\text{Perplexity of } x_i = \text{effective number of neighbors of } x_i \\
\text{The range of the neighborhood is set by } \sigma_i
$$
Example:
If the point $x_i$ has 16 neighbors at the same distance, then for a given $\sigma_i$ every $p_{k|i} = 1/16$ and the perplexity of $x_i$ is 16. In practice the perplexity is specified by the user and the algorithm finds the matching $\sigma_i$.
Effect of changing $\sigma_i$
This distribution has an entropy which increases as $\sigma_i$ increases.
Consider $N$ equidistant points (distance $d$) and count only the points within a distance $3\sigma_i$; assume there are $M$ such points:
$$
p_{j \mid i} \approx \frac{\exp \left(-d^{2} / 2 \sigma_{i}^{2}\right)}{(M-1) \exp \left(-d^{2} / 2 \sigma_{i}^{2}\right)} \approx \frac{1}{M} \\
H=\log _{2} M
$$
Next, increase $\sigma_i \to \sigma_i'$. The number of points within $3\sigma_i'$ also increases, $M \to M'$, so the entropy grows to $H=\log _{2} M'$.
Minimize the cost function
Using gradient descent, the gradient of the cost function is:
$$
\frac{\delta C}{\delta y_{i}}=2 \sum_{j}\left(p_{j \mid i}-q_{j \mid i}+p_{i \mid j}-q_{i \mid j}\right)\left(y_{i}-y_{j}\right)
$$
It follows directly from the definitions.
After the algorithm has found the $\sigma_i$, moving the map points $y_i$ in the direction opposite to the gradient leads toward a (local) minimum of $C$.
The dynamics of the $y_i$ points are analogous to a spring system with $N$ points, in which the spring between $y_i$ and $y_j$ has stiffness $k_{ij}$:
$$
k_{ij}=2\left(p_{j \mid i}-q_{j \mid i}+p_{i \mid j}-q_{i \mid j}\right) \\
\frac{\delta C}{\delta y_{i}}=\sum_{j} k_{ij}\left(y_{i}-y_{j}\right)
$$
If $y_i$ is faithful to $x_i$, then $q_{j \mid i} \approx p_{j \mid i}$ and $q_{i \mid j} \approx p_{i \mid j}$: the similarities are equal, so $k_{ij}=0$ and $\frac{\delta C}{\delta y_{i}}=0$. Until then, the spring system keeps pulling the points into clusters.
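The gradient formula can be sanity-checked against a finite difference of the cost. The sketch below is illustrative only: the helper names, the toy points, and the shared bandwidth $\sigma_i = 1$ for every $x_i$ are assumptions, and the nested loops favor readability over speed.

```python
import math

def cond_probs(points, two_sigma_sq=1.0):
    """Matrix M with M[i][j] = probability of j given i (Gaussian, shared variance)."""
    n = len(points)
    rows = []
    for i in range(n):
        w = [
            0.0 if j == i
            else math.exp(-sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
                          / two_sigma_sq)
            for j in range(n)
        ]
        s = sum(w)
        rows.append([v / s for v in w])
    return rows

def sne_cost(P, Y):
    """C = sum_i sum_j p_{j|i} log(p_{j|i} / q_{j|i})."""
    Q = cond_probs(Y)  # low-dim side uses sigma = 1/sqrt(2), i.e. 2*sigma^2 = 1
    n = len(Y)
    return sum(P[i][j] * math.log(P[i][j] / Q[i][j])
               for i in range(n) for j in range(n) if i != j)

def sne_gradient(P, Y):
    """dC/dy_i = 2 sum_j (p_{j|i} - q_{j|i} + p_{i|j} - q_{i|j}) (y_i - y_j)."""
    Q = cond_probs(Y)
    n, dim = len(Y), len(Y[0])
    grad = [[0.0] * dim for _ in range(n)]
    for i in range(n):
        for j in range(n):
            k_ij = 2.0 * (P[i][j] - Q[i][j] + P[j][i] - Q[j][i])  # spring stiffness
            for d in range(dim):
                grad[i][d] += k_ij * (Y[i][d] - Y[j][d])
    return grad

X = [(0.0, 0.0, 0.0), (1.0, 0.2, 0.0), (0.3, 1.1, 0.5), (1.5, 1.0, 1.0)]
Y = [[0.1, 0.0], [0.9, 0.3], [0.2, 0.8], [1.2, 1.1]]
P = cond_probs(X, two_sigma_sq=2.0)   # assumes sigma_i = 1 for every point
grad = sne_gradient(P, Y)

# Central finite difference on the first coordinate of y_0.
h = 1e-6
Y_plus = [row[:] for row in Y]
Y_plus[0][0] += h
Y_minus = [row[:] for row in Y]
Y_minus[0][0] -= h
numeric = (sne_cost(P, Y_plus) - sne_cost(P, Y_minus)) / (2.0 * h)
print(grad[0][0], numeric)
```

Because the stiffness is symmetric in $i$ and $j$, the forces cancel pairwise and the gradients over all points sum to zero, as in a closed spring system.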
Update of $y_i$
$$
\mathcal{Y}^{(t)}=\mathcal{Y}^{(t-1)}-\eta \frac{\delta C}{\delta \mathcal{Y}}+\alpha(t)\left(\mathcal{Y}^{(t-1)}-\mathcal{Y}^{(t-2)}\right)
$$
where $\mathcal{Y}^{(t)}$ indicates the solution at iteration $t$, $\eta$ indicates the learning rate (the gradient term carries a minus sign because the update moves against the gradient), and $\alpha(t)$ represents the momentum at iteration $t$.
How do the map points move in the $y$-plane with $t$?
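The update rule with momentum can be sketched as follows. To keep the example short it steps against the gradient of a stand-in quadratic cost rather than the SNE cost; `momentum_step`, `toy_gradient`, `toy_cost`, the learning rate and the momentum schedule are all illustrative assumptions.

```python
def momentum_step(y_prev, y_prev2, grad, eta, alpha):
    """One update: a step against the gradient plus a momentum term."""
    return [
        [yp - eta * g + alpha * (yp - yp2) for yp, yp2, g in zip(row_p, row_p2, row_g)]
        for row_p, row_p2, row_g in zip(y_prev, y_prev2, grad)
    ]

def toy_cost(y):
    """Stand-in cost: sum of squared coordinates (NOT the SNE cost)."""
    return sum(v * v for row in y for v in row)

def toy_gradient(y):
    """Gradient of the stand-in cost."""
    return [[2.0 * v for v in row] for row in y]

y_old = [[1.0, -2.0], [0.5, 3.0]]       # Y^(t-2)
y_cur = [row[:] for row in y_old]       # Y^(t-1), same as Y^(t-2) at the start
initial_cost = toy_cost(y_cur)

for t in range(100):
    alpha = 0.5 if t < 20 else 0.8      # smaller momentum early, larger later
    y_new = momentum_step(y_cur, y_old, toy_gradient(y_cur), eta=0.1, alpha=alpha)
    y_old, y_cur = y_cur, y_new

final_cost = toy_cost(y_cur)
print(initial_cost, final_cost)
```

Swapping `toy_gradient` for a function that returns $\delta C / \delta \mathcal{Y}$ of the embedding cost gives the optimization loop described above.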
t-Distributed SNE
Differences from SNE:
- It uses a symmetrized version of the SNE cost function (set $p_{i|j}= p_{j|i}$, $q_{i|j}= q_{j|i}$).
- It uses a Student t-distribution rather than a Gaussian to compute the similarity ($q_{ij}$) between two points in the $y_i$ space.
Symmetrizing the similarity
$$
p_{i j}=\frac{p_{j \mid i}+p_{i \mid j}}{2 n} \\
q_{i j}=\frac{q_{j \mid i}+q_{i \mid j}}{2 n}
$$
The gradient of the cost function becomes
$$
\frac{\delta C}{\delta y_{i}}=4 \sum_{j}\left(p_{i j}-q_{i j}\right)\left(y_{i}-y_{j}\right)
$$
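The symmetrization of $p_{ij}$ can be illustrated with a short sketch. The names `conditional_matrix` and `symmetrize`, the toy points and the per-point bandwidths are assumptions chosen so that the conditional probabilities are visibly asymmetric.

```python
import math

def conditional_matrix(points, sigmas):
    """cond[i][j] = p_{j|i}, with a separate bandwidth sigma_i for each point."""
    n = len(points)
    rows = []
    for i in range(n):
        w = [
            0.0 if j == i
            else math.exp(-sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
                          / (2.0 * sigmas[i] ** 2))
            for j in range(n)
        ]
        total = sum(w)
        rows.append([v / total for v in w])
    return rows

def symmetrize(cond):
    """p_ij = (p_{j|i} + p_{i|j}) / (2n)."""
    n = len(cond)
    return [[(cond[i][j] + cond[j][i]) / (2.0 * n) for j in range(n)] for i in range(n)]

X = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (5.0, 5.0)]   # the last point is an outlier
cond = conditional_matrix(X, sigmas=[0.5, 1.0, 1.5, 2.0])
P = symmetrize(cond)

print(cond[0][1], cond[1][0])          # conditional probabilities differ
print(P[0][1], P[1][0])                # joint probabilities agree
print(sum(sum(row) for row in P))      # the joint distribution sums to 1
```

A side effect of this definition is that every row of `P` sums to more than $1/(2n)$, so even the outlier contributes to the cost.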
“Crowding Problem”
For example, when $m$-dim data is mapped into 2 dimensions, in a dense region of the data the volume available in the $x$ space scales as $\sim r^m$, but the area available in the $y$-plane scales only as $\sim r^2$.
Solution (UNI-SNE)
During the minimization, add some Gaussian noise to simulate annealing, then add a slight repulsion to all springs.
The repulsion term is $q_{ij} \ (-k)$, with the lower bound $q_{ij} > \frac{2 \rho}{n(n-1)}$.
The minimization process is:
Start ($\rho=0$) $\to$ introduce Gaussian noise during some intermediate steps $t$ $\to$ introduce the background repulsion ($\rho > 0$) $\to$ gaps between clusters appear.
Problem with directly optimizing UNI-SNE
If the repulsion is added from the beginning, consider two points $y_i, y_j$ that are far apart. Their
$$
q_{ij} \sim \frac{2 \rho}{n(n-1)}
$$
is nearly the same as the background repulsion. A small change of $\left\|y_{i}-y_{j}\right\|$ then has no influence on the cost function $C$, because $q_{ij}$ is still $\sim \frac{2 \rho}{n(n-1)}$.
Student t-distribution
$$
q_{i j}=\frac{\left(1+\left\|y_{i}-y_{j}\right\|^{2}\right)^{-1}}{\sum_{k \neq l}\left(1+\left\|y_{k}-y_{l}\right\|^{2}\right)^{-1}}
$$
The high-dim space ($x$-plane) still uses a Gaussian distribution, but the $y$-plane uses the distribution above.
When $\left\|y_{i}-y_{j}\right\|$ is large,
$$
q_{ij} \sim 1/\left\|y_{i}-y_{j}\right\|^2 \sim \text{long-range force } 1/r^2
$$
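The difference between the two tails is easy to see numerically. This small sketch compares the unnormalized Gaussian kernel $e^{-r^2}$ with the unnormalized Student-t kernel $(1+r^2)^{-1}$; the function names and the sample distances are chosen only for illustration.

```python
import math

def gaussian_kernel(r):
    """Unnormalized Gaussian similarity exp(-r^2)."""
    return math.exp(-r * r)

def student_t_kernel(r):
    """Unnormalized Student-t similarity (one degree of freedom), 1 / (1 + r^2)."""
    return 1.0 / (1.0 + r * r)

for r in (0.5, 1.0, 3.0, 10.0):
    # The last column approaches 1 when the kernel behaves like 1 / r^2.
    print(r, gaussian_kernel(r), student_t_kernel(r), student_t_kernel(r) * r * r)
```

At large distances the Gaussian similarity is negligible while the Student-t similarity decays only like $1/r^2$, so moderately distant pairs can sit farther apart in the map without a large penalty.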
New gradient:
$$
\frac{\delta C}{\delta y_{i}}=4 \sum_{j}\left(p_{i j}-q_{i j}\right)\left(y_{i}-y_{j}\right)\left(1+\left\|y_{i}-y_{j}\right\|^{2}\right)^{-1}
$$
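As a check, the gradient with the Student-t kernel can be compared with a finite difference of the cost $C = \sum_i \sum_j p_{ij} \log (p_{ij}/q_{ij})$. This is a sketch under assumptions: the function names, the hand-made symmetric matrix `P` and the map coordinates `Y` are hypothetical and exist only to exercise the formula.

```python
import math

def t_joint_probs(Y):
    """Return (Q, W): W[i][j] = (1 + |y_i - y_j|^2)^-1 and Q = W / sum(W)."""
    n = len(Y)
    W = [
        [0.0 if i == j
         else 1.0 / (1.0 + sum((a - b) ** 2 for a, b in zip(Y[i], Y[j])))
         for j in range(n)]
        for i in range(n)
    ]
    total = sum(sum(row) for row in W)
    Q = [[w / total for w in row] for row in W]
    return Q, W

def tsne_cost(P, Y):
    """KL divergence between the joint distributions P and Q."""
    Q, _ = t_joint_probs(Y)
    n = len(Y)
    return sum(P[i][j] * math.log(P[i][j] / Q[i][j])
               for i in range(n) for j in range(n) if i != j)

def tsne_gradient(P, Y):
    """dC/dy_i = 4 sum_j (p_ij - q_ij) (y_i - y_j) (1 + |y_i - y_j|^2)^-1."""
    Q, W = t_joint_probs(Y)
    n, dim = len(Y), len(Y[0])
    grad = [[0.0] * dim for _ in range(n)]
    for i in range(n):
        for j in range(n):
            coeff = 4.0 * (P[i][j] - Q[i][j]) * W[i][j]
            for d in range(dim):
                grad[i][d] += coeff * (Y[i][d] - Y[j][d])
    return grad

# Hypothetical symmetric affinities: points {0, 1} and {2, 3} form two pairs.
raw = [[0, 4, 1, 1], [4, 0, 1, 1], [1, 1, 0, 4], [1, 1, 4, 0]]
total = float(sum(sum(r) for r in raw))
P = [[v / total for v in r] for r in raw]
Y = [[0.0, 0.0], [1.0, 0.5], [3.0, 0.2], [2.5, 1.5]]
grad = tsne_gradient(P, Y)

# Central finite difference on the second coordinate of y_2.
h = 1e-6
Y_plus = [row[:] for row in Y]
Y_plus[2][1] += h
Y_minus = [row[:] for row in Y]
Y_minus[2][1] -= h
numeric = (tsne_cost(P, Y_plus) - tsne_cost(P, Y_minus)) / (2.0 * h)
print(grad[2][1], numeric)
```

The extra factor $(1+\|y_i-y_j\|^2)^{-1}$ is what weakens the spring between distant map points compared with the Gaussian version.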
Optimization tricks
“Early compression”: force the map points to stay close together at the start of the optimization, which makes it easier for clusters to move through one another and explore the space of possible global organizations.
“Early exaggeration” (used in t-SNE): multiply all of the $p_{ij}$ by a factor, for example 4, in the initial stages of the optimization. The data then tend to form tight, widely separated clusters in the map, which makes it much easier for the clusters to move around.
Computational bottleneck
The $p_{ij}, q_{ij}$ are defined between pairs of points, so as the number of data points increases, the amount of computation grows as
$$
n(n-1)/2 \sim n^2
$$
For $\ge 10{,}000$ data points this becomes time-consuming; one solution is to randomly sample a subset.
Random walk
Random walk-based affinity measure. It is more efficient and faster, and it prevents the “short-cut” problem. Suppose A, B, C are equidistant, but A and B are in the same cluster while C is in another cluster. A single noise point between two different clusters will not influence the cluster partitioning, because the similarity of two points ($x_i, x_j$) is not given by the shortest path but by an integration over all paths.
The emission probability from $x_i \to x_j$ is
$$
e^{-\left\|x_{i}-x_{j}\right\|^{2}}
$$
Other dimensionality reduction methods
- PCA 1979
- classical scaling
- Sammon mapping (Sammon, 1969). Minimizes the cost function
$$
C=\frac{1}{\sum_{i j}\left\|x_{i}-x_{j}\right\|} \sum_{i \neq j} \frac{\left(\left\|x_{i}-x_{j}\right\|-\left\|y_{i}-y_{j}\right\|\right)^{2}}{\left\|x_{i}-x_{j}\right\|}=\frac{1}{N} \sum_{i \neq j} \frac{\sigma_{ij}^2}{d_{ij}}
$$
with $d_{ij}=\left\|x_{i}-x_{j}\right\|$, $\sigma_{ij}=d_{ij}-\left\|y_{i}-y_{j}\right\|$ and $N=\sum_{ij} d_{ij}$. Each squared error is weighted by the factor $\frac{1}{\left\|x_{i}-x_{j}\right\|}$, so an error on a large high-dim distance is shrunk and an error on a short high-dim distance is amplified.
- t-SNE
- Isomap
- LLE…
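The distance-weighted cost in the list above (the stress function of Sammon mapping) can be sketched as follows to show the effect of the $1/\left\|x_i - x_j\right\|$ weight. The function names and the three-point toy data are assumptions; both maps below make the same absolute error of 0.5, once on a short distance and once on a long one.

```python
import math

def dist(a, b):
    """Euclidean distance between two points given as sequences."""
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))

def weighted_stress(X, Y):
    """(1 / sum d_ij) * sum_{i != j} (d_ij - |y_i - y_j|)^2 / d_ij."""
    n = len(X)
    pairs = [(i, j) for i in range(n) for j in range(n) if i != j]
    norm = sum(dist(X[i], X[j]) for i, j in pairs)
    err = sum((dist(X[i], X[j]) - dist(Y[i], Y[j])) ** 2 / dist(X[i], X[j])
              for i, j in pairs)
    return err / norm

# High-dim points on a line: one short gap (1) and one long gap (10).
X = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (11.0, 0.0, 0.0)]
Y_exact = [(0.0,), (1.0,), (11.0,)]       # all distances preserved
Y_short_err = [(0.0,), (1.5,), (11.5,)]   # short gap stretched by 0.5
Y_long_err = [(0.0,), (1.0,), (11.5,)]    # long gap stretched by 0.5

stress_exact = weighted_stress(X, Y_exact)
stress_short = weighted_stress(X, Y_short_err)
stress_long = weighted_stress(X, Y_long_err)
print(stress_exact, stress_short, stress_long)
```

The same absolute error costs several times more on the short distance than on the long one, which is the emphasis on small distances described in the list.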
Diffusion method
Diffusion-based interpretation: the distance between map points $\left\|y_{i}-y_{j}\right\|$ should equal the diffusion distance.
Definition of the diffusion distance:
$$
D^{(t)}\left(x_{i}, x_{j}\right)=\sqrt{\sum_{k} \frac{\left(p_{i k}^{(t)}-p_{j k}^{(t)}\right)^{2}}{\psi\left(x_{k}\right)^{(0)}}}
$$
Cost function:
$$
C=\sum_{i} \sum_{j}\left(D^{(t)}\left(x_{i}, x_{j}\right)-\left\|y_{i}-y_{j}\right\|\right)^{2}
$$
where $p_{i j}^{(t)}$ represents the probability of a particle traveling from $x_i$ to $x_j$ in $t$ timesteps with Gaussian emission probabilities, and $\psi\left(x_{k}\right)^{(0)}$ is a measure of the local density of the points.
How does a particle at $x_i$ travel to $x_j$?
The emission probability of $x_i \to x_j$ (arrow line) equals $e^{-\left\|x_{i}-x_{j}\right\|^2/\sigma^2}$.
Along the solid line, $x_i \to x_j$ takes 4 time-steps, each with high probability.
Along the dashed line, $x_i \to x_j$ takes 1 time-step, but with low probability.
This method assigns much higher importance to large diffusion distances than to small ones, so it is not good at retaining the local structure.
Weaknesses of t-SNE
- It is unclear how it performs for a higher-dim $y$ (dimension of $y>3$).
- It is sensitive to the intrinsic dimension of the data (for dimension of $x_i>100$ it may be less successful).
- It is not guaranteed to converge to a global optimum of the cost function.
Experiments
Comparison with other methods
Visualizations of 6,000 handwritten digits from the MNIST data set.
Classification task
Handwritten digits with 64 pixels
Input shape: 1797 pictures, 64 pixels each.
Iris flower classification
Input features of the iris flowers:
['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
Input shape: (150, 4)