3.2 Mutual Information
References
Self-Supervised Learning of Graph Neural Networks: A Unified Review
http://arxiv.org/abs/2102.10757
https://blog.csdn.net/haolexiao/article/details/70142571?spm=1001.2014.3001.5506
- $I$: the amount of information (self-information)
  - Indicates the code length required to encode a message
  - Calculation formula
    $$I = \log\left(\frac{1}{p(x)}\right) = -\log(p(x))$$
    - $p(x)$ represents the probability of message $x$
  - The more frequent a message is, the shorter its code, that is, the smaller its amount of information.
    - Example: $p(a) = 50\%$, $p(b) = 20\%$, so $I_a < I_b$
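The example with $p(a)$ and $p(b)$ can be checked numerically. The sketch below is only an illustration: the function name `self_information` and the choice of base 2 (bits) are assumptions, since the notes do not fix a logarithm base.

```python
import math

def self_information(p: float, base: float = 2.0) -> float:
    """Return -log_base(p), the code length of a message with probability p."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must lie in (0, 1]")
    return -math.log(p) / math.log(base)

# The frequent message a needs a shorter code than the rarer message b.
info_a = self_information(0.5)
info_b = self_information(0.2)
print(f"I_a = {info_a:.3f} bits, I_b = {info_b:.3f} bits")
```

Changing `base` only rescales both values, so the ordering $I_a < I_b$ does not depend on it.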
- $H(\cdot)$ denotes the information entropy
  - The mathematical expectation of the information $I$ under the distribution $p$
  - Calculation formula
    $$H(p) = \sum_x p(x)\log\left(\frac{1}{p(x)}\right) = -\sum_x p(x)\log(p(x))$$
    - $p$ denotes the distribution
- $H_q(p)$: cross-entropy
  - The average code length when messages drawn from the true distribution $p$ are encoded with a code built for a guessed distribution $q$
    - The distributions may be understood as labels: $p$ as the true labels and $q$ as the predicted ones
    $$H_q(p) = H(p,q) = \sum_x p(x)\log\left(\frac{1}{q(x)}\right) = -\sum_x p(x)\log(q(x))$$
    $$\mathcal{L} = -\sum_i \left[y_i\log(q(x_i)) + (1-y_i)\log(1-q(x_i))\right] \tag{Machine Learning}$$
  - Cross-entropy is often used as the final loss function in machine learning
  - Cross-entropy essentially measures the difference between the two encodings: for a fixed $p$, its value is smallest only when the guessed distribution $q$ matches the true distribution $p$.
- KL divergence
  - The KL divergence (also called the KL distance) measures how far apart two distributions are. It is not symmetric in $p$ and $q$, so it is not a distance in the strict sense.
  - It is generally called the relative entropy of $p$ with respect to $q$ and is denoted by $D(p\|q)$
    $$D(p\|q) = H_q(p) - H(p) = \sum_x p(x)\log\left(\frac{1}{q(x)}\right) - \sum_x p(x)\log\left(\frac{1}{p(x)}\right) = -\sum_x p(x)\log\frac{q(x)}{p(x)}$$
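The relation $D(p\|q) = H_q(p) - H(p)$ can be verified on a small example. This is a minimal sketch, assuming discrete distributions given as lists, natural logarithms, and illustrative values for `p` and `q` that do not come from the notes.

```python
import math

def entropy(p):
    """H(p) = -sum_x p(x) log p(x); zero-probability terms contribute nothing."""
    return -sum(px * math.log(px) for px in p if px > 0)

def cross_entropy(p, q):
    """H_q(p) = -sum_x p(x) log q(x); assumes q(x) > 0 wherever p(x) > 0."""
    return -sum(px * math.log(qx) for px, qx in zip(p, q) if px > 0)

def kl_divergence(p, q):
    """D(p||q) = sum_x p(x) log(p(x) / q(x))."""
    return sum(px * math.log(px / qx) for px, qx in zip(p, q) if px > 0)

p = [0.5, 0.3, 0.2]  # "true" distribution (illustrative values)
q = [0.4, 0.4, 0.2]  # "guessed" distribution (illustrative values)

kl_pq = kl_divergence(p, q)
kl_qp = kl_divergence(q, p)
gap = cross_entropy(p, q) - entropy(p)
print(kl_pq, gap, kl_qp)
```

The first two printed values agree, and the third differs from them, which illustrates that swapping the arguments changes the result.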
- Joint and conditional entropy describe how two variables in a joint distribution (that is, in the same distribution) affect each other.
  $$H(X,Y) = H(X) + H(Y|X) = H(Y) + H(X|Y)$$
  - Joint information entropy:
    $$H(x,y) = \sum_{x,y} p(x,y)\log\left(\frac{1}{p(x,y)}\right) = -\sum_{x,y} p(x,y)\log(p(x,y))$$
  - Conditional information entropy:
    $$H(x|y) = \sum_y p(y)\sum_x p(x|y)\log\left(\frac{1}{p(x|y)}\right) = \sum_{x,y} p(x,y)\log\left(\frac{1}{p(x|y)}\right) = -\sum_{x,y} p(x,y)\log(p(x|y))$$
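As a quick check of the chain rule $H(X,Y) = H(Y) + H(X|Y)$, the sketch below computes the conditional entropy directly from its definition. The joint table `joint` is a made-up example, not data from the notes.

```python
import math

# Hypothetical joint distribution p(x, y) over two binary variables.
joint = {
    ("x0", "y0"): 0.4, ("x0", "y1"): 0.1,
    ("x1", "y0"): 0.2, ("x1", "y1"): 0.3,
}

# Marginal p(y), obtained by summing the joint over x.
p_y = {}
for (_, y), p in joint.items():
    p_y[y] = p_y.get(y, 0.0) + p

# Joint entropy H(X, Y) and marginal entropy H(Y).
h_xy = -sum(p * math.log(p) for p in joint.values() if p > 0)
h_y = -sum(p * math.log(p) for p in p_y.values() if p > 0)

# Conditional entropy from the definition, using p(x|y) = p(x, y) / p(y).
h_x_given_y = -sum(
    p * math.log(p / p_y[y]) for (_, y), p in joint.items() if p > 0
)

print(h_xy, h_y + h_x_given_y)
```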
3.2.1 Mutual Information Estimation
- Effect
  - Mutual information measures the correlation between two random variables:
    $$\mathcal{I}(x,y) = H(x) - H(x|y) = H(y) - H(y|x) = H(x) + H(y) - H(x,y) = D_{KL}(p(x,y)\|p(x)p(y))$$
    - $H(\cdot)$ denotes the information entropy, which represents the amount of information in a distribution, or the average length of the code
- Given a pair of random variables $(x, y)$, the mutual information $\mathcal{I}(x, y)$ measures the information that $x$ and $y$ share,
  $$\mathcal{I}(x,y) = D_{KL}(p(x,y)\|p(x)p(y)) = \mathbb{E}_{p(x,y)}\left[\log\frac{p(x,y)}{p(x)p(y)}\right], \tag{9~10}$$
  where $D_{KL}$ denotes the Kullback-Leibler (KL) divergence.
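The entropy form and the KL form of the mutual information can be compared on a small joint table. This is a sketch under stated assumptions: the helper names `entropy` and `mutual_information` and both example tables are illustrative, and natural logarithms are used.

```python
import math

def entropy(probs):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def mutual_information(joint):
    """Return (MI via entropies, MI via KL) for a joint table joint[x][y]."""
    p_x = [sum(row) for row in joint]
    p_y = [sum(col) for col in zip(*joint)]
    h_xy = entropy([p for row in joint for p in row])
    mi_entropy = entropy(p_x) + entropy(p_y) - h_xy
    mi_kl = sum(
        p * math.log(p / (p_x[i] * p_y[j]))
        for i, row in enumerate(joint)
        for j, p in enumerate(row)
        if p > 0
    )
    return mi_entropy, mi_kl

dependent = [[0.4, 0.1], [0.1, 0.4]]        # x and y tend to agree
independent = [[0.25, 0.25], [0.25, 0.25]]  # p(x, y) = p(x) p(y)

mi_dep = mutual_information(dependent)
mi_ind = mutual_information(independent)
print(mi_dep, mi_ind)
```

Both forms agree on each table, and the independent table gives a mutual information of zero because the joint distribution equals the product of marginals.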
- Contrastive learning
  - Target
    - It seeks to maximize the mutual information between two views, treated as two random variables.
  - Practice
    - It trains the encoders to be contrastive between
      - representations of a positive pair of views, which comes from the joint distribution $p(v_i, v_j)$,
      - and representations of a negative pair of views, which comes from the product of marginals $p(v_i)p(v_j)$.
- To computationally estimate and maximize the mutual information in contrastive learning, three typical lower bounds to the mutual information are used,
  - namely, the Donsker-Varadhan representation $\hat{\mathcal{I}}^{(DV)}$,
  - the Jensen-Shannon estimator $\hat{\mathcal{I}}^{(JS)}$,
  - and the noise-contrastive estimation $\hat{\mathcal{I}}^{(NCE)}$.
- Among the three lower bounds, $\hat{\mathcal{I}}^{(JS)}$ and $\hat{\mathcal{I}}^{(NCE)}$ are commonly used as objectives in contrastive learning on graphs.
- Discriminator
  - A mutual information estimate is usually computed based on a discriminator $\mathcal{D}: \mathbb{R}^q \times \mathbb{R}^q \to \mathbb{R}$ that maps the representations of two views to an agreement score between the two representations.
- The discriminator $\mathcal{D}$ can be either parametric or non-parametric.
  - For example, the discriminator can optionally apply a set of projection heads to the representations $h_1, \cdots, h_k$ before computing the pairwise similarity.
- We formalize the optional projection heads as $g_1, \cdots, g_k$ such that
  $$z_i = g_i(h_i), \quad i = 1, \ldots, k, \tag{11}$$
  where $g_i$ can be an identity mapping, a linear projection, or an MLP.
- Parameterized $g_i$ are optimized simultaneously with the encoders $f_i$ in Eqn. (8), given by
  $$\max_{\{f_i, g_i\}_{i=1}^{k}} \frac{1}{\sum_{i\neq j}\alpha_{ij}} \left[\sum_{i\neq j}\alpha_{ij}\,\hat{\mathcal{I}}_{g_i,g_j}(h_i,h_j)\right]. \tag{12}$$
3.2.2 Donsker-Varadhan Estimator
- The Donsker-Varadhan (DV) estimator, also known as the DV representation of the KL divergence, is a lower bound to the mutual information
  - Target
    - Because it is a lower bound, maximizing it can be applied to maximize the mutual information.
- Given $h_i$ and $h_j$, the lower bound is computed as
  $$\hat{\mathcal{I}}(h_i,h_j) = \mathbb{E}_{p(h_i,h_j)}[\mathcal{D}(h_i,h_j)] - \log\mathbb{E}_{p(h_i)p(h_j)}\left[e^{\mathcal{D}(h_i,h_j)}\right], \tag{13}$$
  - where $p(h_i, h_j)$ denotes the joint distribution of the two representations $h_i$, $h_j$,
  - and $p(h_i)p(h_j)$ denotes the product of marginals.
  - $\mathbb{E}_{p(h_i,h_j)}$ denotes the expectation under $p(h_i,h_j)$, and likewise for $\mathbb{E}_{p(h_i)p(h_j)}$. For discrete representations,
    $$\mathbb{E}_{p(h_i,h_j)}[\,\cdot\,] = \sum_{h_i,h_j} p(h_i,h_j)\,[\,\cdot\,], \qquad \mathbb{E}_{p(h_i)p(h_j)}[\,\cdot\,] = \sum_{h_i,h_j} p(h_i)p(h_j)\,[\,\cdot\,]$$
- For simplicity and to include the graph data distribution $\mathcal{P}$, we assume the transformations $\mathcal{T}_i$ to be deterministic and the encoders $f_i$ to be injective, and have
  $$p(h_i,h_j) = p(h_i)p(h_j|h_i) = p(f_i(\mathcal{T}_i(A,X)))\,p(f_j(\mathcal{T}_j(A,X))\,|\,(A,X)).$$
  We hence rewrite Eqn. (13) as
  $$\hat{\mathcal{I}}^{(DV)}(h_i,h_j) = \mathbb{E}_{(A,X)\sim\mathcal{P}}[\mathcal{D}(h_i,h_j)] - \log\mathbb{E}_{[(A,X),(A',X')]\sim\mathcal{P}\times\mathcal{P}}\left[e^{\mathcal{D}(h_i,h'_j)}\right], \tag{14}$$
  - where $h_i$ and $h_j$ in the first term are computed from $(A, X)$ distributed from $\mathcal{P}$,
  - and $h_i$ and $h'_j$ in the second term are computed from $(A, X)$ and $(A', X')$, respectively, which are identically and independently distributed from $\mathcal{P}$.
  - In the following descriptions of other objectives, we use the latter version that includes $\mathcal{P}$.
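A Monte Carlo version of the DV bound can be sketched on a mini-batch. Everything below is an assumption made for illustration: `scores[a][b]` is a hypothetical matrix holding the discriminator output for view $i$ of graph `a` and view $j$ of graph `b`, the diagonal stands in for samples of the joint distribution, and the off-diagonal entries stand in for samples of the product of marginals.

```python
import math

def dv_estimate(scores):
    """Donsker-Varadhan estimate: mean positive score minus log-mean-exp of negatives."""
    n = len(scores)
    if n < 2:
        raise ValueError("at least two graphs are needed to form negative pairs")
    positives = [scores[a][a] for a in range(n)]
    negatives = [scores[a][b] for a in range(n) for b in range(n) if a != b]
    first_term = sum(positives) / n
    # log E[e^D], computed with the max subtracted for numerical stability.
    m = max(negatives)
    second_term = m + math.log(sum(math.exp(s - m) for s in negatives) / len(negatives))
    return first_term - second_term

# Hypothetical scores: positives (diagonal) agree more than negatives.
scores = [[2.0, 0.0],
          [0.0, 2.0]]
estimate = dv_estimate(scores)
print(estimate)
```

With a discriminator that cannot tell positives from negatives (all scores equal), the estimate drops to zero.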
3.2.3 Jensen-Shannon Estimator
- Compared to the Donsker-Varadhan estimator, the Jensen-Shannon estimator enables more efficient estimation and optimization of the mutual information by computing the JS divergence between the joint distribution and the product of marginals.
- Given two representations $h_i$ and $h_j$ computed from the random variable $(A, X)$ and a discriminator $\mathcal{D}$, DGI, InfoGraph, and Hassani and Khasahmadi compute the JS estimator
  $$\hat{\mathcal{I}}^{(JS)}(h_i,h_j) = \mathbb{E}_{(A,X)\sim\mathcal{P}}[\log(\mathcal{D}(h_i,h_j))] + \mathbb{E}_{[(A,X),(A',X')]\sim\mathcal{P}\times\mathcal{P}}[\log(1-\mathcal{D}(h_i,h'_j))], \tag{15}$$
  - where $h_i$ and $h_j$ in the first term are computed from $(A, X)$ distributed from $\mathcal{P}$,
  - and $h_i$ and $h'_j$ in the second term are computed from $(A, X)$ and $(A', X')$, which are identically and independently distributed from the distribution $\mathcal{P}$.
- Note that some of these works present a softplus version of the JS estimator,
  $$\hat{\mathcal{I}}^{(JS\text{-}SP)}(h_i,h_j) = \mathbb{E}_{(A,X)\sim\mathcal{P}}[-sp(-\mathcal{D}'(h_i,h_j))] - \mathbb{E}_{[(A,X),(A',X')]\sim\mathcal{P}\times\mathcal{P}}[sp(\mathcal{D}'(h_i,h'_j))], \tag{16}$$
  where $sp(x) = \log(1 + e^x)$.
- We consider the JS estimators in Eqn. (15) and Eqn. (16) to be equivalent by letting $\mathcal{D}(h_i, h_j) = \mathrm{sigmoid}(\mathcal{D}'(h_i, h_j))$, since $\log(\mathrm{sigmoid}(x)) = -sp(-x)$ and $\log(1-\mathrm{sigmoid}(x)) = -sp(x)$.
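The equivalence rests on the identities $\log(\mathrm{sigmoid}(x)) = -sp(-x)$ and $\log(1-\mathrm{sigmoid}(x)) = -sp(x)$, which can be confirmed numerically. The sketch below writes the sigmoid form as $\mathbb{E}[\log \mathcal{D}] + \mathbb{E}[\log(1-\mathcal{D})]$ and the softplus form as $\mathbb{E}[-sp(-\mathcal{D}')] - \mathbb{E}[sp(\mathcal{D}')]$. The function names and the example logits are assumptions for illustration only.

```python
import math

def softplus(x):
    """Numerically stable log(1 + e^x)."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def mean(values):
    return sum(values) / len(values)

def js_sigmoid_form(pos_logits, neg_logits):
    """E[log D] + E[log(1 - D)] with D = sigmoid(D')."""
    first = mean([math.log(sigmoid(s)) for s in pos_logits])
    second = mean([math.log(1.0 - sigmoid(s)) for s in neg_logits])
    return first + second

def js_softplus_form(pos_logits, neg_logits):
    """E[-sp(-D')] - E[sp(D')] on the raw scores D'."""
    first = mean([-softplus(-s) for s in pos_logits])
    second = mean([softplus(s) for s in neg_logits])
    return first - second

pos = [2.0, 1.0]         # hypothetical raw scores D' for positive pairs
neg = [-1.0, 0.5, -2.0]  # hypothetical raw scores D' for negative pairs
a = js_sigmoid_form(pos, neg)
b = js_softplus_form(pos, neg)
print(a, b)
```

The softplus form is usually the safer one to implement, because it avoids taking the logarithm of a sigmoid that has saturated to 0 or 1.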
- For the negative pairs of graphs $[(A, X), (A', X')] \sim \mathcal{P}\times\mathcal{P}$ in particular,
  - DGI samples one graph $(A, X)$ from the training dataset and applies a stochastic corruption $\mathcal{C}$ to obtain $(A', X') = \mathcal{C}(A, X)$.
  - The other studies independently sample two graphs from the training dataset.
- Discriminators in JS estimators usually compute the agreement score between two vectors by their inner product with sigmoid, i.e., $\mathcal{D}(h_i,h_j) = \mathrm{sigmoid}(z_i^T z_j) = \mathrm{sigmoid}(g_i(h_i)^T g_j(h_j))$.
3.2.4 InfoNCE
- $\hat{\mathcal{I}}^{(NCE)}$ is another lower bound to the mutual information $\mathcal{I}$.
- Given the representations $h_i$ and $h_j$ of two views of the random variable $(A, X)$, the discriminator $\mathcal{D}$, and the number of negative samples $N$, the InfoNCE is formalized as
  $$\begin{aligned}\hat{\mathcal{I}}^{(NCE)}(h_i,h_j) &= \mathbb{E}_{(A,X)\sim\mathcal{P}}\left[\mathcal{D}(h_i,h_j) - \mathbb{E}_{K\sim\mathcal{P}^N}\left[\log\frac{1}{N}\sum_{(A',X')\in K} e^{\mathcal{D}(h_i,h'_j)} \,\Big|\, (A,X)\right]\right] \\ &= \mathbb{E}_{[(A,X),K]\sim\mathcal{P}\times\mathcal{P}^N}\left[\log\frac{e^{\mathcal{D}(h_i,h_j)}}{\sum_{(A',X')\in K} e^{\mathcal{D}(h_i,h'_j)}}\right] + \log N,\end{aligned} \tag{17}$$
  - where $K$ consists of $N$ random variables identically and independently distributed from $\mathcal{P}$,
  - $h_i$, $h_j$ are the representations of the $i$-th and $j$-th views of $(A, X)$,
  - and $h'_j$ is the representation of the $j$-th view of $(A', X')$.
- In practice, we compute the InfoNCE on mini-batches of size $N + 1$.
  - For each sample $x$ in a mini-batch $B$, we consider the set of the remaining $N$ samples as a sample of $K$.
  - We then discard the constant term $\log N$ in Eqn. (17) and minimize the loss
    $$\mathcal{L}_{InfoNCE} = -\frac{1}{N+1}\sum_{x\in B}\left[\log\frac{e^{\mathcal{D}(h_i,h_j)}}{\sum_{x'\in B\setminus\{x\}} e^{\mathcal{D}(h_i,h'_j)}}\right]. \tag{18}$$
- Intuitively, the optimization of the InfoNCE loss aims to score the agreement between $h_i$ and $h_j$ of views from the same instance $x$ higher than the agreement between $h_i$ and $h'_j$ from the remaining $N$ negative samples $B\setminus\{x\}$.
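The mini-batch loss in Eqn. (18) can be sketched directly from a matrix of agreement scores. As an assumption for illustration, `scores[a][b]` holds the discriminator output for view $i$ of sample `a` and view $j$ of sample `b`, so the diagonal holds the positive pairs and each row's off-diagonal entries are its $N$ negatives. Note that, following Eqn. (18) as written, the positive pair is not included in the denominator.

```python
import math

def info_nce_loss(scores):
    """InfoNCE loss on a batch of size N + 1, with negatives taken from B minus {x}."""
    n = len(scores)
    if n < 2:
        raise ValueError("the batch needs at least one negative sample")
    total = 0.0
    for a in range(n):
        negatives = [scores[a][b] for b in range(n) if b != a]
        # log-sum-exp over the negatives, with the max subtracted for stability
        m = max(negatives)
        log_denominator = m + math.log(sum(math.exp(s - m) for s in negatives))
        total += scores[a][a] - log_denominator
    return -total / n

# Hypothetical score matrices for a batch of three samples (N = 2).
uninformative = [[0.0] * 3 for _ in range(3)]
informative = [[3.0, 0.0, 0.0],
               [0.0, 3.0, 0.0],
               [0.0, 0.0, 3.0]]
loss_flat = info_nce_loss(uninformative)
loss_good = info_nce_loss(informative)
print(loss_flat, loss_good)
```

Raising the positive scores relative to the negatives lowers the loss, which matches the intuition stated above.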
- Discriminators in typical InfoNCE compute the agreement score between two vectors by their inner product, i.e., $\mathcal{D}(h_i,h_j) = z_i^T z_j = g_i(h_i)^T g_j(h_j)$.
- A specific type of the InfoNCE loss, known as the NT-Xent loss, includes a preset temperature parameter $\tau$ in the computation of the discriminator $\mathcal{D}$ in the InfoNCE loss, i.e., $\mathcal{D}(h_i, h_j) = g_i(h_i)^T g_j(h_j)/\tau$.
- In addition, the discriminator in You et al. computes the agreement score between vectors with normalizations, i.e.,
  $$\mathcal{D}(h_i,h_j) = \frac{g_i(h_i)^T g_j(h_j)/\tau}{\|g_i(h_i)\|\,\|g_j(h_j)\|},$$
  where $\|\cdot\|$ denotes the $\ell_2$-norm.
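The normalized, temperature-scaled discriminator amounts to a cosine similarity divided by $\tau$. The following is a minimal sketch, assuming the projected vectors $z_i = g_i(h_i)$ and $z_j = g_j(h_j)$ are given as plain lists. The function name `normalized_score` and the default `tau=0.5` are illustrative choices, not values taken from the cited works.

```python
import math

def normalized_score(z_i, z_j, tau=0.5):
    """Cosine similarity of two projected vectors, divided by the temperature tau."""
    dot = sum(a * b for a, b in zip(z_i, z_j))
    norm_i = math.sqrt(sum(a * a for a in z_i))
    norm_j = math.sqrt(sum(b * b for b in z_j))
    return dot / (tau * norm_i * norm_j)

same_direction = normalized_score([1.0, 2.0, 2.0], [2.0, 4.0, 4.0])
orthogonal = normalized_score([1.0, 0.0], [0.0, 1.0])
print(same_direction, orthogonal)
```

Because of the normalization, the score depends only on the angle between the two vectors, and a smaller $\tau$ stretches the range of scores that enter the softmax in the InfoNCE loss.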
- Moreover, the Bayesian Personalized Ranking (BPR) loss is equivalent to the InfoNCE loss when letting $N = 1$ and $\mathcal{D}(h_i,h_j) = h_i^T h_j$.
3.2.5 Non-Bound Mutual Information Estimators
- Background
  - There are other objectives that have been used in some studies,
- Effect
  - and these objectives can also increase the mutual information.
  - However, these objectives are not provable lower bounds to the mutual information, and optimizing them does not guarantee the maximization of the mutual information.
- For example, Jiao et al. propose to minimize the triplet margin loss, which is commonly used in deep metric learning.
- Given representations $h_i$, $h_j$ and the discriminator $\mathcal{D}$, the triplet margin loss is formalized as
  $$\mathcal{L}_{triplet} = \mathbb{E}_{[(A,X),(A',X')]\sim\mathcal{P}\times\mathcal{P}}\left[\max\{\mathcal{D}(h_i,h_j) - \mathcal{D}(h_i,h'_j) + \epsilon,\ 0\}\right], \tag{19}$$
  where $\mathcal{D}(h_i,h_j) = \mathrm{sigmoid}(h_i^T h_j)$ and $\epsilon$ is the margin value.