Mutual Information

NorburyL

Category: Machine Learning. Tags: machine learning, deep learning, artificial intelligence

Article link: https://blog.csdn.net/sherlocklcy/article/details/127140734

3.2 Mutual Information

References
	Self-Supervised Learning of Graph Neural Networks: A Unified Review
		http://arxiv.org/abs/2102.10757
	https://blog.csdn.net/haolexiao/article/details/70142571?spm=1001.2014.3001.5506

$I$ denotes the amount of information (self-information)
- it indicates the code length required to encode a message
- Calculation formula
  $I=\log(\frac{1}{p(x)})=-\log(p(x))$
  - $p(x)$ represents the probability of the message $x$
  - the more frequent a message is, the shorter its code, that is,
    - the smaller its amount of information.
  - ```
  p(a) = 50%, 
  p(b) = 20%, 
  
  I_a < I_b
```
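The comparison above can be checked numerically. The following is a minimal sketch; the function name `self_information` and the choice of base 2 (bits) are assumptions made for illustration, since the notes do not fix a logarithm base.

```python
import math

def self_information(p: float, base: float = 2.0) -> float:
    """Return I = -log(p) for an event of probability p (bits when base=2)."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must lie in (0, 1]")
    return -math.log(p, base)

I_a = self_information(0.5)  # the more frequent message
I_b = self_information(0.2)  # the rarer message
print(I_a, I_b, I_a < I_b)
```

The rarer message carries more information, so it needs the longer code.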
$H(\cdot)$ denotes the information entropy
- the mathematical expectation of the information $I$ under the distribution $p$
- Calculation formula
  $H(p)=\underset{x}{\sum}p(x)\log(\frac{1}{p(x)}) = -\underset{x}{\sum}p(x)\log(p(x))$
  - $p$ denotes the distribution
$H_q(p)$ denotes the cross-entropy
- the average code length when messages from the true distribution $p$ are encoded with a code built for a guessed distribution $q$,
  - in machine learning, $p$ may be understood as the labels and $q$ as the predictions
  $H_q(p) = H(p,q)= \underset{x}{\sum}p(x)\log(\frac{1}{q(x)}) = -\underset{x}{\sum}p(x)\log(q(x))$
  
  $\mathcal{L} = -\underset{i}{\sum}[y_i\log(q(x_i)) + (1-y_i)\log(1-q(x_i))] \tag{Machine Learning}$
  - Cross-entropy is often used as the final loss function in machine learning
  - Cross-entropy is essentially a measure of the difference
    - between the two encodings,
    - because it reaches its minimum, $H(p)$, only
      - when the guessed distribution equals the true distribution.
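The binary form of the loss tagged "Machine Learning" can be sketched as follows. This is only an illustration: the function name `binary_cross_entropy` and the clipping constant `eps` are assumptions added here to avoid taking the logarithm of zero; they are not part of the notes.

```python
import math

def binary_cross_entropy(labels, probs, eps=1e-12):
    """Sum over samples of -[y log q + (1 - y) log(1 - q)]."""
    total = 0.0
    for y, q in zip(labels, probs):
        q = min(max(q, eps), 1.0 - eps)  # keep q strictly inside (0, 1)
        total -= y * math.log(q) + (1 - y) * math.log(1.0 - q)
    return total

labels = [1, 0, 1]
good = binary_cross_entropy(labels, [0.9, 0.1, 0.8])  # predictions close to the labels
bad = binary_cross_entropy(labels, [0.2, 0.7, 0.3])   # predictions far from the labels
print(good, bad)
```

Predictions that are closer to the labels give the smaller loss.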
KL Divergence
- The KL divergence measures how much one distribution differs from another;
  - it is not symmetric in its two arguments, so it is not a true distance.
- The KL divergence is also called the relative entropy of $p$ with respect to $q$,
  - denoted by $D(p\|q)$:
    $D(p\|q) = H_q(p)-H(p) = \underset{x}{\sum}p(x)\log(\frac{1}{q(x)}) - \underset{x}{\sum}p(x)\log(\frac{1}{p(x)})\\ =-\underset{x}{\sum}p(x)\log\frac{q(x)}{p(x)}$
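The identity between cross-entropy, entropy and KL divergence can be verified on a small example. This is a sketch under assumptions: the two distributions `p` and `q` are made-up values, and base-2 logarithms are used.

```python
import math

def entropy(p):
    """H(p) = -sum_x p(x) log2 p(x)."""
    return -sum(px * math.log2(px) for px in p if px > 0)

def cross_entropy(p, q):
    """H_q(p) = -sum_x p(x) log2 q(x)."""
    return -sum(px * math.log2(qx) for px, qx in zip(p, q) if px > 0)

def kl_divergence(p, q):
    """KL divergence of p from q: sum_x p(x) log2(p(x) / q(x))."""
    return sum(px * math.log2(px / qx) for px, qx in zip(p, q) if px > 0)

p = [0.5, 0.25, 0.25]  # hypothetical true distribution
q = [0.25, 0.25, 0.5]  # hypothetical guessed distribution
print(entropy(p), cross_entropy(p, q), kl_divergence(p, q))
```

For these values the entropy is 1.5 bits, the cross-entropy is 1.75 bits, and their difference, 0.25 bits, is the KL divergence.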
The relationship between two variables in a joint distribution
- (that is, two variables that follow the same joint distribution and affect each other) is given by the chain rule:
  $H(X, Y) = H(X) + H(Y \mid X) = H(Y) + H(X \mid Y)$
- Joint information entropy:
  $H(X,Y)=\underset{x,y}{\sum}p(x,y)\log(\frac{1}{p(x,y)}) = -\underset{x,y}{\sum}p(x,y)\log(p(x,y))$
- Conditional information entropy:
  $H(X\mid Y)=\underset{y}{\sum}p(y)\underset{x}{\sum}p(x|y)\log(\frac{1}{p(x|y)}) \\ = \underset{x,y}{\sum}p(x,y)\log(\frac{1}{p(x|y)}) = -\underset{x,y}{\sum}p(x,y)\log(p(x|y))$
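The chain rule relating the joint and conditional entropies can be checked on a toy joint table. The table `joint` below is a hypothetical example, and the helper names are assumptions made for this sketch.

```python
import math
from collections import defaultdict

# Hypothetical joint distribution p(x, y) over two binary variables.
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}

def entropy(dist):
    """Entropy in bits of a dict mapping outcomes to probabilities."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

def marginal(joint, axis):
    """Marginalize the joint table onto x (axis=0) or y (axis=1)."""
    out = defaultdict(float)
    for pair, p in joint.items():
        out[pair[axis]] += p
    return dict(out)

def conditional_entropy_x_given_y(joint):
    """H(X|Y) = -sum_{x,y} p(x,y) log2 p(x|y), with p(x|y) = p(x,y) / p(y)."""
    p_y = marginal(joint, 1)
    return -sum(p * math.log2(p / p_y[y]) for (x, y), p in joint.items() if p > 0)

h_xy = entropy(joint)
h_y = entropy(marginal(joint, 1))
h_x_given_y = conditional_entropy_x_given_y(joint)
print(h_xy, h_y + h_x_given_y)  # the two values agree up to rounding
```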

3.2.1 Mutual Information Estimation

effect
- mutual information measures the dependence between two random variables:
  $\mathcal{I}(x,y) = H(x) - H(x|y) = H(y) - H(y|x) = H(x) + H(y) - H(x,y)\\ = D_{KL}(p(x,y)||p(x)p(y))$
  - $H(\cdot)$ denotes the information entropy,
    - which represents the amount of information in a distribution, or the average length of the code
Given a pair of random variables $(x, y)$, the mutual information $\mathcal{I}(x, y)$ measures
- the information that $x$ and $y$ share,
$\mathcal{I}(x,y) = D_{KL}(p(x,y)||p(x)p(y))\\ =\mathbb{E}_{p(x,y)}[\log\frac{p(x,y)}{p(x)p(y)}], \tag{9~10}$
- where $D_{KL}$ denotes the Kullback-Leibler (KL) divergence.
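For discrete variables the KL form of the mutual information can be evaluated directly from a joint table. The sketch below uses three made-up joint distributions; the function name `mutual_information` is an assumption for illustration.

```python
import math

def mutual_information(joint):
    """I(x, y) = sum_{x,y} p(x,y) log2( p(x,y) / (p(x) p(y)) ) for {(x, y): prob}."""
    p_x, p_y = {}, {}
    for (x, y), p in joint.items():
        p_x[x] = p_x.get(x, 0.0) + p
        p_y[y] = p_y.get(y, 0.0) + p
    return sum(p * math.log2(p / (p_x[x] * p_y[y]))
               for (x, y), p in joint.items() if p > 0)

independent = {(0, 0): 0.25, (0, 1): 0.25, (1, 0): 0.25, (1, 1): 0.25}
dependent = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
identical = {(0, 0): 0.5, (1, 1): 0.5}

for name, table in [("independent", independent),
                    ("dependent", dependent),
                    ("identical", identical)]:
    print(name, mutual_information(table))
```

Independent variables share no information, while two identical fair bits share exactly one bit.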
Contrastive learning
- target
  - seeks to maximize the mutual information
  - between two views, treated as two random variables.
- practice
  - it trains the encoders to be contrastive between
    - representations of a positive pair of views that come from the joint distribution $p(v_i, v_j)$
    - and representations of a negative pair of views that come from the product of marginals $p(v_i)p(v_j)$.
In order to computationally estimate and maximize the mutual information in contrastive learning,
- three typical lower bounds to the mutual information are used,
  - namely, the Donsker-Varadhan representation $\hat{\mathcal{I}}^{(DV)}$,
  - the Jensen-Shannon estimator $\hat{\mathcal{I}}^{(JS)}$,
  - and the noise-contrastive estimation $\hat{\mathcal{I}}^{(NCE)}$.
- Among the three lower bounds, $\hat{\mathcal{I}}^{(JS)}$ and $\hat{\mathcal{I}}^{(NCE)}$ are commonly used as objectives
  - in contrastive learning on graphs.

Discriminator

A mutual information estimate is usually computed
- based on a discriminator $\mathcal{D}:\mathbb{R}^q\times\mathbb{R}^q\to\mathbb{R}$ that
  - maps the representations of two views to an agreement score
    - between the two representations.
The discriminator $\mathcal{D}$ can be either parametric or non-parametric.
- For example, the discriminator can optionally apply a set of projection heads to the representations $h_1,\cdots,h_k$
  - before computing the pairwise similarity.
We formalize the optional projection heads as $g_1, \cdots, g_k$ such that
$z_i=g_i(h_i),i=1,...,k, \tag{11}$
- where $g_i$ can be an identity mapping, a linear projection or an MLP.
- Parameterized $g_i$ are optimized simultaneously
  - with the encoders $f_i$ in Eqn. (8), given by
    $\underset{\{f_i,g_i\}^k_{i=1}}{\max} \frac{1}{\sum_{i\neq j} \alpha_{ij}} \Big[\sum_{i\neq j}\alpha_{ij}\, \hat{\mathcal{I}}_{g_i,g_j}(h_i,h_j) \Big], \tag{12}$
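A discriminator with optional projection heads, as in Eqn. (11), can be sketched as follows. This is an illustration only: the helper names `identity_head`, `linear_head` and `discriminator` are hypothetical, the weights are made-up values, and the inner product is just one possible agreement score.

```python
def identity_head(h):
    """Projection head g(h) = h (non-parametric case)."""
    return list(h)

def linear_head(weight):
    """Build a projection head g(h) = W h from a list-of-rows matrix W."""
    def g(h):
        return [sum(w * x for w, x in zip(row, h)) for row in weight]
    return g

def discriminator(h_i, h_j, g_i=identity_head, g_j=identity_head):
    """Agreement score between two representations: g_i(h_i)^T g_j(h_j)."""
    z_i, z_j = g_i(h_i), g_j(h_j)
    return sum(a * b for a, b in zip(z_i, z_j))

h_i, h_j = [1.0, 2.0], [3.0, 4.0]
print(discriminator(h_i, h_j))                                    # plain inner product
print(discriminator(h_i, h_j, g_i=linear_head([[2, 0], [0, 2]])))  # with a linear head on h_i
```

In a trained model the weights of a parameterized head would be learned together with the encoders; here they are fixed for readability.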

3.2.2 Donsker-Varadhan Estimator

The Donsker-Varadhan (DV) estimator,
- also known as the DV representation of the KL divergence,
  - is a lower bound to the mutual information
- target
  - and hence can be applied to maximize the mutual information.
Given $h_i$ and $h_j$, the lower bound is computed as
$\hat{\mathcal{I}}^{(DV)}(h_i,h_j)=\mathbb{E}_{p(h_i,h_j)}[\mathcal{D}(h_i,h_j)]-\log\mathbb{E}_{p(h_i)p(h_j)}[e^{\mathcal{D}(h_i,h_j)}] \tag{13}$

$\mathbb{E}_{p(h_i,h_j)}[f] = \underset{h_i,h_j}{\sum}p(h_i,h_j)f(h_i,h_j)\\ \mathbb{E}_{p(h_i)p(h_j)}[f] = \underset{h_i,h_j}{\sum}p(h_i)p(h_j)f(h_i,h_j)$
- where $p(h_i, h_j)$ denotes the joint distribution of the two representations $h_i$, $h_j$
- and $p(h_i)p(h_j)$ denotes the product of marginals.
- $\mathbb{E}_{p(h_i,h_j)}$ denotes the expectation under $p(h_i,h_j)$
For simplicity and to include the graph data distribution $\mathcal{P}$,
- we assume transformations $\mathcal{T}_i$ to be deterministic
- and encoders $f_i$ to be injective,
  - and have $p(h_i,h_j)=p(h_i)p(h_j|h_i)=p(f_i(\mathcal{T}_i(A,X)))\,p(f_j(\mathcal{T}_j(A,X)) \mid (A, X))$.
  - We hence rewrite Eqn. (13) as
    $\hat{\mathcal{I}}^{(DV)}(h_i,h_j)=\mathbb{E}_{(A,X)\sim\mathcal{P}}[\mathcal{D}(h_i,h_j)]-\log\mathbb{E}_{[(A,X),(A',X')]\sim\mathcal{P}\times \mathcal{P}}[e^{\mathcal{D}(h_i,h'_j)}] \tag{14}$
    - where $h_i$ and $h_j$ in the first term
      - are computed from $(A, X)$ distributed from $\mathcal{P}$,
    - $h_i$ and $h'_j$ in the second term
      - are computed from $(A, X)$ and $(A', X')$ identically and independently distributed from $\mathcal{P}$, respectively.
    - In the following descriptions of other objectives, we use the latter version that includes $\mathcal{P}$.
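Given discriminator scores on sampled positive and negative pairs, the DV bound can be estimated by replacing both expectations with sample means. The following is a minimal sketch under that assumption; the function name `dv_lower_bound` and the example scores are hypothetical, and the max-subtraction is a standard numerical-stability trick rather than part of the formula.

```python
import math

def dv_lower_bound(pos_scores, neg_scores):
    """Sample estimate of the DV bound: mean(D on positives) - log(mean(exp(D on negatives)))."""
    pos_term = sum(pos_scores) / len(pos_scores)
    # log-mean-exp, shifted by the maximum score to avoid overflow in exp
    m = max(neg_scores)
    neg_term = m + math.log(sum(math.exp(s - m) for s in neg_scores) / len(neg_scores))
    return pos_term - neg_term

# Scores D(h_i, h_j) on positive pairs and D(h_i, h'_j) on negative pairs (made-up values).
print(dv_lower_bound([2.0, 2.0], [0.0, 0.0]))  # discriminator separates the pairs
print(dv_lower_bound([0.0, 0.0], [0.0, 0.0]))  # uninformative discriminator
```

A discriminator that scores positive pairs higher than negative pairs yields a larger estimate.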

3.2.3 Jensen-Shannon Estimator

Compared to the Donsker-Varadhan estimator,
- the Jensen-Shannon estimator enables more efficient estimation
  - and optimization of the mutual information
    - by computing the JS-divergence between
      - the joint distribution and the product of marginals.
Given two representations $h_i$ and $h_j$ computed from the random variable $(A, X)$ and a discriminator $\mathcal{D}$,
- DGI, InfoGraph, and Hassani and Khasahmadi compute the JS estimator
  $\hat{\mathcal{I}}^{(JS)}(h_i,h_j)=\mathbb{E}_{(A,X)\sim\mathcal{P}}[\log(\mathcal{D}(h_i,h_j))]+\mathbb{E}_{[(A,X),(A',X')]\sim\mathcal{P}\times \mathcal{P}}[\log(1-\mathcal{D}(h_i,h'_j))] \tag{15}$
- where $h_i, h_j$ in the first term are computed from $(A, X)$ distributed from $\mathcal{P}$,
- $h_i$ and $h'_j$ in the second term are computed from $(A, X)$ and $(A', X')$,
- identically and independently distributed
  - from the distribution $\mathcal{P}$.
Note that some studies use a softplus version of the JS estimator,
$\hat{\mathcal{I}}^{(JS-SP)}(h_i,h_j)=\mathbb{E}_{(A,X)\sim\mathcal{P}}[-sp(-\mathcal{D}'(h_i,h_j))]-\mathbb{E}_{[(A,X),(A',X')]\sim\mathcal{P}\times \mathcal{P}}[sp(\mathcal{D}'(h_i,h'_j))] \tag{16}$
- where $sp(x) = \log(1 + e^x)$.
We consider the JS estimators in Eqn. (15) and Eqn. (16) to be equivalent
- by letting $\mathcal{D}(h_i, h_j) = sigmoid(\mathcal{D}'(h_i, h_j))$,
  - since $\log(sigmoid(t)) = -sp(-t)$ and $\log(1-sigmoid(t)) = -sp(t)$.
For the negative pairs of graphs $[(A, X), (A', X')] \sim \mathcal{P} \times \mathcal{P}$ in particular,
- DGI samples one graph $(A, X)$ from the training dataset
  - and applies a stochastic corruption $C$ to obtain $(A', X') = C(A, X)$.
The other studies independently sample two graphs from the training dataset.
Discriminators in JS estimators usually compute the agreement score
- between two vectors
  - by their inner product with sigmoid, i.e.,
    - $\mathcal{D}(h_i,h_j)=sigmoid(z^T_iz_j)=sigmoid(g_i(h_i)^Tg_j(h_j))$.
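The equivalence between the sigmoid form and the softplus form of the JS estimator can be checked numerically. The sketch below assumes the objective rewards high scores on positive pairs and low scores on negative pairs, i.e. it sums $\log \mathcal{D}$ over positives and $\log(1-\mathcal{D})$ over negatives with $\mathcal{D} = sigmoid(\mathcal{D}')$. The function names and the raw scores are hypothetical.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def softplus(t):
    """sp(t) = log(1 + e^t), written in a numerically stable form."""
    return math.log1p(math.exp(-abs(t))) + max(t, 0.0)

def mean(values):
    return sum(values) / len(values)

def js_sigmoid(pos_raw, neg_raw):
    """Mean of log D on positives plus mean of log(1 - D) on negatives, D = sigmoid(D')."""
    return (mean([math.log(sigmoid(t)) for t in pos_raw])
            + mean([math.log(1.0 - sigmoid(t)) for t in neg_raw]))

def js_softplus(pos_raw, neg_raw):
    """Mean of -sp(-D') on positives minus mean of sp(D') on negatives."""
    return mean([-softplus(-t) for t in pos_raw]) - mean([softplus(t) for t in neg_raw])

pos_raw = [2.0, 0.5]   # raw scores D'(h_i, h_j) on positive pairs (made-up)
neg_raw = [-1.0, 0.3]  # raw scores D'(h_i, h'_j) on negative pairs (made-up)
print(js_sigmoid(pos_raw, neg_raw), js_softplus(pos_raw, neg_raw))
```

The softplus form is usually preferred in practice because it avoids computing the logarithm of a value that may round to zero.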

3.2.4 InfoNCE

$\hat{\mathcal{I}}^{(NCE)}$ is another lower-bound to the mutual information $\mathcal{I}$ .
Given the representations $h_i$ and $h_j$ of two views of random variable $(A, X)$ , the discriminator $\mathcal{D}$ , and the number of negative samples $N$ ,
- the InfoNCE is formalized as
  $\hat{\mathcal{I}}^{(NCE)}(h_i,h_j)=\mathbb{E}_{(A,X)\sim\mathcal{P}}\Big[\mathcal{D}(h_i,h_j)- \mathbb{E}_{K\sim\mathcal{P}^N}\Big[\log\frac{1}{N}\underset{(A',X')\in K}{\sum}e^{\mathcal{D}(h_i,h'_j)}\,\Big|\,(A,X)\Big]\Big] \\ = \mathbb{E}_{[(A,X),K]\sim\mathcal{P}\times \mathcal{P}^N} \Big[\log\frac{e^{\mathcal{D}(h_i,h_j)}}{\sum_{(A',X')\in K}e^{\mathcal{D}(h_i,h'_j)}}\Big] + \log N \tag{17}$
  - where $K$ consists of $N$ random variables identically and independently distributed from $\mathcal{P}$,
  - $h_i, h_j$ are the representations of the $i$-th and $j$-th views of $(A, X)$,
  - and $h'_j$ is the representation of the $j$-th view of $(A', X')$.
In practice, we compute the InfoNCE on mini-batches of size $N + 1$.
For each sample $x$ in a mini-batch $B$,
- we consider the set of the remaining $N$ samples as a sample of $K$.
- We then discard the constant term $\log N$ in Eqn. (17) and minimize the loss
  $\mathcal{L}_{InfoNCE}=-\frac{1}{N+1}\underset{x\in B}{\sum}\Big[\log\frac{e^{\mathcal{D}(h_i,h_j)}}{\sum_{x'\in B\setminus\{x\}}e^{\mathcal{D}(h_i,h'_j)}}\Big] \tag{18}$
  - Intuitively, the optimization of the InfoNCE loss aims to score the agreement between
    - $h_i$ and $h_j$ of views
      - from the same instance $x$ higher than
        between $h_i$ and $h'_j$ from the remaining $N$ negative samples $B\setminus\{x\}$.
Discriminators in typical InfoNCE compute the agreement score between two vectors
- by their inner product, i.e., $\mathcal{D}(h_i,h_j) = z^T_iz_j = g_i(h_i)^Tg_j(h_j)$ .
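The mini-batch loss of Eqn. (18) with an inner-product discriminator can be sketched as follows. This is an illustration under assumptions: `z_i[k]` and `z_j[k]` stand for the projected representations of the two views of the $k$-th sample, the function name `info_nce_loss` is hypothetical, and the denominator sums over the other samples of the batch only, as written in Eqn. (18).

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def info_nce_loss(z_i, z_j):
    """Eqn. (18): batch mean of -log( exp(D_pos) / sum over negatives of exp(D_neg) )."""
    n = len(z_i)  # batch size N + 1
    if n < 2:
        raise ValueError("need at least one negative sample per anchor")
    total = 0.0
    for k in range(n):
        pos = dot(z_i[k], z_j[k])
        negs = [dot(z_i[k], z_j[m]) for m in range(n) if m != k]
        mx = max(negs)  # log-sum-exp shifted by the maximum for stability
        log_denominator = mx + math.log(sum(math.exp(s - mx) for s in negs))
        total += pos - log_denominator
    return -total / n

aligned = info_nce_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = info_nce_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned, swapped)
```

When the two views of each sample agree, the loss is lower than when each view agrees with the other sample instead.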
A specific type of the InfoNCE loss, known as the NT-Xent loss,
- includes a preset temperature parameter $\tau$ in the computation of the discriminator $\mathcal{D}$
  - in the InfoNCE loss, i.e., $\mathcal{D}(h_i, h_j) = g_i(h_i)^T g_j(h_j)/\tau$.
In addition, the discriminator in You et al.
- computes the agreement score between vectors with normalizations,
  - i.e., $\mathcal{D}(h_i,h_j)=\frac{g_i(h_i)^Tg_j(h_j)/\tau}{\|g_i(h_i)\|\|g_j(h_j)\|}$,
    - where $\| \cdot \|$ denotes the $\ell_2$-norm.
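The normalized discriminator with temperature amounts to a cosine similarity divided by $\tau$. The sketch below takes already-projected vectors as input; the function name `normalized_score` and the default `tau=0.5` are assumptions for illustration, not values taken from the cited work.

```python
import math

def normalized_score(z_i, z_j, tau=0.5):
    """Agreement score (z_i^T z_j / tau) / (||z_i|| * ||z_j||) with l2 norms."""
    inner = sum(a * b for a, b in zip(z_i, z_j))
    norm_i = math.sqrt(sum(a * a for a in z_i))
    norm_j = math.sqrt(sum(b * b for b in z_j))
    return inner / (tau * norm_i * norm_j)

print(normalized_score([1.0, 0.0], [2.0, 0.0]))  # parallel vectors: 1 / tau
print(normalized_score([1.0, 0.0], [0.0, 3.0]))  # orthogonal vectors: 0
```

Because of the normalization, rescaling either vector leaves the score unchanged; only the angle between the vectors and the temperature matter.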
Moreover, the Bayesian Personalized Ranking (BPR) loss is equivalent to the InfoNCE loss
- when letting $N = 1$ and $\mathcal{D}(h_i,h_j) = h^T_ih_j$ .

3.2.5 Non-Bound Mutual Information Estimators

background
- There are other objectives
  - that have been used in some studies.
effect
- These objectives can also increase the mutual information.
- However, they are not provable lower bounds
  - to the mutual information,
  - and optimizing them does not
    - guarantee the maximization of the mutual information.
For example, Jiao et al. propose to minimize the triplet margin loss,
- which is commonly used in deep metric learning.
- Given representations $h_i, h_j$ and the discriminator $\mathcal{D}$,
- the triplet margin loss is formalized as
  $\mathcal{L}_{triplet} = \mathbb{E}_{[(A,X),(A',X')]\sim\mathcal{P}\times\mathcal{P}} [\max\{\mathcal{D}(h_i,h_j)-\mathcal{D}(h_i, h'_j)+ \epsilon, 0\}], \tag{19}$
  - where $\mathcal{D}(h_i,h_j)=sigmoid(h^T_ih_j)$
  - and $\epsilon$ is the margin value.