Introduction
Based on *Visualizing Data using t-SNE*, *How to Use t-SNE Effectively*, and example code.
t-distributed Stochastic Neighbor Embedding (t-SNE), named after the Student t-distribution it uses, is a visualization tool, and in effect a clustering aid, for high-dimensional (high-dim) data (it performs well for $N < 100$). It embeds high-dim data as clusters in a 2- or 3-dim plane, and it is an unsupervised Machine Learning (ML) technique.
Grouping data in this way is essentially a form of dimensionality reduction; ML here is used to minimize something (a cost function) that is non-linear and has a large number of parameters.
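As a quick sketch of how this is used in practice (the dataset choice and parameter values below are illustrative assumptions, not taken from the references above), scikit-learn's TSNE can embed 64-dim digit images into a 2-dim plane where the clusters can be plotted:

```python
# Minimal usage sketch with scikit-learn's TSNE (dataset and parameters are
# illustrative choices, not prescribed by the text).
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X, labels = load_digits(return_X_y=True)               # 64-dim samples (N < 100)
Y = TSNE(n_components=2, perplexity=30.0,
         init="pca", random_state=0).fit_transform(X)  # 2-dim mapped points

plt.scatter(Y[:, 0], Y[:, 1], c=labels, s=5, cmap="tab10")
plt.title("t-SNE embedding of 64-dim digits into 2-dim")
plt.show()
```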
Stochastic Neighbor Embedding (SNE): the basic idea of SNE and its math
SNE is expected to find a faithful low-dim representation of the high-dim data, one that preserves the small-distance (local) structure while still reflecting the large distances of the high-dim data.
Conditional probability
Assume one sample of the high-dim data can be written as a vector $x_i$, and its mapped point in low-dim is $y_i$. Then define the conditional probabilities $p_{j \mid i}$ and $q_{j \mid i}$:
$$
\begin{aligned}
x_i &: \ \text{high-dim vector, a fixed point (the real data in high-dim)} \\
y_i &: \ \text{low-dim vector, the movable mapped point in low-dim}
\end{aligned}
$$
$$
p_{j \mid i} = \frac{\exp\!\left(-\left\|x_i - x_j\right\|^2 / 2\sigma_i^2\right)}{\sum_{k \neq i} \exp\!\left(-\left\|x_i - x_k\right\|^2 / 2\sigma_i^2\right)}, \qquad p_{i \mid i} = 0
$$
$$
q_{j \mid i} = \frac{\exp\!\left(-\left\|y_i - y_j\right\|^2\right)}{\sum_{k \neq i} \exp\!\left(-\left\|y_i - y_k\right\|^2\right)}, \qquad \sigma = \frac{1}{\sqrt{2}}, \qquad q_{i \mid i} = 0
$$
where $\left\|x_i - x_j\right\|$ is the Euclidean distance. The *similarity* of datapoint $x_j$ to datapoint $x_i$ is the conditional probability $p_{j \mid i}$ that $x_i$ would pick $x_j$ as its neighbor if neighbors were picked in proportion to their probability density under a Gaussian centered at $x_i$. For example, at large distances $p_{j \mid i} \sim 0$. The $\sigma_i$ here performs a rescaling of the distances.
If the $y_i$ are faithful to the $x_i$, then we should have $q_{j \mid i} \approx p_{j \mid i}$.
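As a sketch of these two definitions (the helper name and the toy data below are assumptions for illustration; real SNE picks each $\sigma_i$ per point, while here a fixed value stands in), the row-normalized conditional probabilities can be computed directly from pairwise squared Euclidean distances:

```python
import numpy as np

def conditional_probs(points, sigmas=None):
    """Row-wise conditional probabilities.

    With per-point sigmas this gives p_{j|i}; with sigmas=None it uses
    sigma = 1/sqrt(2) (so 2*sigma^2 = 1), matching the q_{j|i} definition.
    """
    n = points.shape[0]
    # squared Euclidean distances ||z_i - z_j||^2
    sq_dist = np.sum((points[:, None, :] - points[None, :, :]) ** 2, axis=-1)
    if sigmas is None:
        sigmas = np.full(n, 1.0 / np.sqrt(2.0))
    logits = -sq_dist / (2.0 * sigmas[:, None] ** 2)
    np.fill_diagonal(logits, -np.inf)        # enforces p_{i|i} = q_{i|i} = 0
    weights = np.exp(logits)
    return weights / weights.sum(axis=1, keepdims=True)

# Toy data (illustrative): X is "high-dim", Y is a candidate low-dim mapping.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 10))
Y = rng.normal(size=(5, 2))
P = conditional_probs(X, sigmas=np.ones(5))  # fixed sigma_i = 1 as a stand-in
Q = conditional_probs(Y)                     # sigma = 1/sqrt(2)
```

When the $y_i$ are faithful, each row of `Q` approximates the corresponding row of `P`, which is exactly the condition $q_{j \mid i} \approx p_{j \mid i}$ above.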
The Cost function
It is the Kullback-Leibler (KL) divergence, also called the relative entropy (it appears, for instance, in AdS/CFT holographic theory). In this case it equals the cross-entropy up to an additive constant: the difference between them is the Shannon entropy of the $p_{j \mid i}$ distribution.
$$
C = \sum_i KL\left(P_i \,\|\, Q_i\right) = \sum_i \sum_j p_{j \mid i} \log \frac{p_{j \mid i}}{q_{j \mid i}}
$$
The Shannon entropy is $S = -\sum_{j} p_{j} \log p_{j}$, and expanding the KL divergence gives
$$
\sum_{j} p_{j \mid i} \log \frac{p_{j \mid i}}{q_{j \mid i}} = \sum_{j} p_{j \mid i} \log p_{j \mid i} - \sum_{j} p_{j \mid i} \log q_{j \mid i}
$$
The second term is the cross-entropy of $Q_i$ relative to $P_i$, and the first term is $-S(P_i)$, which does not depend on the mapped points $y_i$; this is why minimizing the KL divergence is the same as minimizing the cross-entropy up to an additive constant.
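As a small numerical check of this cost (a minimal sketch; the clipping constant is an assumption to avoid $\log 0$ on the zeroed diagonal, not part of the original formula):

```python
import numpy as np

def kl_cost(P, Q, eps=1e-12):
    """C = sum_i KL(P_i || Q_i) = sum_i sum_j p_{j|i} * log(p_{j|i} / q_{j|i})."""
    P = np.clip(P, eps, None)   # eps guards the zeroed diagonal against log(0)
    Q = np.clip(Q, eps, None)
    return float(np.sum(P * np.log(P / Q)))

# Toy 3-point example: identical distributions give zero cost.
P = np.array([[0.0, 0.7, 0.3],
              [0.4, 0.0, 0.6],
              [0.5, 0.5, 0.0]])
print(kl_cost(P, P))            # ~0.0: a perfectly faithful embedding
print(kl_cost(P, P[:, ::-1]))   # > 0: mismatched q_{j|i} raises the cost
```

In SNE this $C$ is the quantity that gets minimized by moving the low-dim points $y_i$; the entropy term above stays constant during that minimization.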