Mutual Information

3.2 Mutual Information

References
	Self-Supervised Learning of Graph Neural Networks: A Unified Review, http://arxiv.org/abs/2102.10757
	https://blog.csdn.net/haolexiao/article/details/70142571?spm=1001.2014.3001.5506
  • $I$: the amount of information

    • indicates the encoding length required for a message

    • Calculation formula
      $I = \log\big(\frac{1}{p(x)}\big) = -\log(p(x))$

      • $p(x)$ denotes the probability of message $x$

      • the more frequent a message is, the shorter its encoding, that is,

        • the smaller its amount of information.
      • Example: $p(a) = 50\%$, $p(b) = 20\%$, so $I_a < I_b$.
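
As a quick check of the formula above, here is a minimal Python sketch (NumPy assumed) that computes the self-information of the two example messages in bits (base-2 logs; the formulas in this section leave the base unspecified):

```python
import numpy as np

def self_information(p: float) -> float:
    """I(x) = -log2(p(x)): rarer messages need longer codes."""
    return -np.log2(p)

print(self_information(0.5))  # I_a = 1.00 bit
print(self_information(0.2))  # I_b ≈ 2.32 bits, hence I_a < I_b
```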
        
  • $H(\cdot)$ denotes information entropy

    • the mathematical expectation of the information $I$ under distribution $p$

    • Calculation formula
      $H(p) = \underset{x}{\sum}p(x)\log\big(\frac{1}{p(x)}\big) = -\underset{x}{\sum}p(x)\log(p(x))$

      • $p$ denotes the distribution
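
A minimal sketch of the entropy formula, assuming NumPy; base-2 logs are used, so the result is in bits:

```python
import numpy as np

def entropy(p: np.ndarray) -> float:
    """H(p) = -sum_x p(x) log2 p(x): the expected self-information under p."""
    p = p[p > 0]  # 0 * log(0) is taken as 0 by convention
    return float(-np.sum(p * np.log2(p)))

print(entropy(np.array([0.5, 0.5])))  # 1.0 bit: a fair coin
print(entropy(np.array([0.9, 0.1])))  # ≈ 0.47 bits: a biased, more predictable coin
```
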
  • $H_q(p)$: cross-entropy

    • the expected encoding length when the true distribution $p$ is encoded with a code optimized for a guessed distribution $q$

      • distributions here may be understood as labels

      $H_q(p) = H(p,q) = \underset{x}{\sum}p(x)\log\big(\frac{1}{q(x)}\big) = -\underset{x}{\sum}p(x)\log(q(x))$

      $\mathcal{L} = -\underset{i}{\sum}\big[y_i\log(q(x_i)) + (1-y_i)\log(1-q(x_i))\big] \tag{Machine Learning}$

      • Cross-entropy is often used as the final loss function in machine learning.
      • Cross-entropy is essentially a measure of the difference
        • between the two encodings,
        • because its value is small only
          • when the guessed distribution is close to the true distribution.
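
To make the "measure of difference" point concrete, a small sketch (NumPy assumed): cross-entropy equals $H(p)$ when $q = p$ and grows as $q$ drifts away from $p$:

```python
import numpy as np

def cross_entropy(p, q) -> float:
    """H_q(p) = -sum_x p(x) log2 q(x): coding p with a code designed for q."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(-np.sum(p * np.log2(q)))

p = [0.5, 0.3, 0.2]
print(cross_entropy(p, p))               # ≈ 1.485, the minimum (equals H(p))
print(cross_entropy(p, [0.2, 0.3, 0.5])) # ≈ 1.882, larger because q != p
```
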
  • KL Divergence

    • KL divergence/distance is a measure of the distance between

      • two distributions.
    • The KL distance is generally called relative entropy, denoted $D(p\|q)$:

      $D(p\|q) = H_q(p) - H(p) = \underset{x}{\sum}p(x)\log\big(\frac{1}{q(x)}\big) - \underset{x}{\sum}p(x)\log\big(\frac{1}{p(x)}\big) = -\underset{x}{\sum}p(x)\log\frac{q(x)}{p(x)}$
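
A minimal sketch of the identity above (NumPy assumed): $D(p\|q)$ is the extra coding cost $H_q(p) - H(p)$, and it vanishes only when $q = p$:

```python
import numpy as np

def kl_divergence(p, q) -> float:
    """D(p||q) = sum_x p(x) log2(p(x)/q(x)) = H_q(p) - H(p) >= 0."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0  # terms with p(x) = 0 contribute 0
    return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))

p, q = [0.5, 0.3, 0.2], [0.2, 0.3, 0.5]
print(kl_divergence(p, q))  # ≈ 0.397: exactly the gap H_q(p) - H(p)
print(kl_divergence(p, p))  # 0.0: no gap when the guess is exact
```
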
  • How two variables in a joint distribution (that is, in the same distribution) affect each other is captured by the chain rule:

      $H(X,Y) = H(X) + H(Y|X) = H(Y) + H(X|Y)$

    • Joint information entropy:
      $H(x,y) = \underset{x,y}{\sum}p(x,y)\log\big(\frac{1}{p(x,y)}\big) = -\underset{x,y}{\sum}p(x,y)\log(p(x,y))$

    • Conditional information entropy:
      $H(x|y) = \underset{y}{\sum}p(y)\underset{x}{\sum}p(x|y)\log\big(\frac{1}{p(x|y)}\big) = \underset{x,y}{\sum}p(x,y)\log\big(\frac{1}{p(x|y)}\big) = -\underset{x,y}{\sum}p(x,y)\log(p(x|y))$
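
The chain rule can be verified numerically from a small joint probability table; a sketch assuming NumPy (the 2x2 table is made up for illustration):

```python
import numpy as np

def H(p) -> float:
    """Entropy of any probability table, flattened."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

pxy = np.array([[0.3, 0.2],   # illustrative joint table p(x, y)
                [0.1, 0.4]])
px, py = pxy.sum(axis=1), pxy.sum(axis=0)  # marginals p(x), p(y)
H_y_given_x = H(pxy) - H(px)               # H(y|x) = H(x,y) - H(x)
assert np.isclose(H(pxy), H(px) + H_y_given_x)  # chain rule holds
```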

3.2.1 Mutual Information Estimation

  • effect

    • mutual information measures the correlation between two random variables:
      $\mathcal{I}(x,y) = H(x) - H(x|y) = H(y) - H(y|x) = H(x) + H(y) - H(x,y) = D_{KL}(p(x,y)\|p(x)p(y))$

      • $H(\cdot)$ denotes information entropy
        • it represents the amount of information in a distribution, i.e., the average code length
  • Given a pair of random variables $(x, y)$, the mutual information $\mathcal{I}(x, y)$ measures

    • the information that $x$ and $y$ share,

    $\mathcal{I}(x,y) = D_{KL}(p(x,y)\|p(x)p(y)) = \mathbb{E}_{p(x,y)}\Big[\log\,\frac{p(x,y)}{p(x)p(y)}\Big], \tag{9, 10}$

    • where $D_{KL}$ denotes the Kullback-Leibler (KL) divergence.
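
Both characterizations of mutual information, the entropy identity and the KL form of Eqn. (9, 10), can be checked on a toy joint table; a sketch assuming NumPy:

```python
import numpy as np

def H(p) -> float:
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

def mutual_information(pxy: np.ndarray) -> float:
    """I(x,y) = D_KL(p(x,y) || p(x)p(y)), computed from the joint table."""
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    mask = pxy > 0
    return float(np.sum(pxy[mask] * np.log2((pxy / (px * py))[mask])))

pxy = np.array([[0.3, 0.2],   # illustrative joint table
                [0.1, 0.4]])
mi = mutual_information(pxy)
# The entropy identity gives the same value: I = H(x) + H(y) - H(x,y)
assert np.isclose(mi, H(pxy.sum(axis=1)) + H(pxy.sum(axis=0)) - H(pxy))
```
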
  • Contrastive learning

    • target
      • seeks to maximize the mutual information
      • between two views, treated as two random variables.
    • practice
      • it trains the encoders to be contrastive between
        • representations of a positive pair of views that come from the joint distribution $p(v_i, v_j)$
        • and representations of a negative pair of views that come from the product of marginals $p(v_i)p(v_j)$.
  • To computationally estimate and maximize the mutual information in contrastive learning,

    • three typical lower-bounds to the mutual information are commonly adopted,

      • namely, the Donsker-Varadhan representation $\mathcal{I}^{(DV)}$,
      • the Jensen-Shannon estimator $\mathcal{I}^{(JS)}$,
      • and the noise-contrastive estimation $\hat{\mathcal{I}}^{(NCE)}$.
    • Among the three lower-bounds, $\mathcal{I}^{(JS)}$ and $\mathcal{I}^{(NCE)}$ are commonly used as objectives

      • in contrastive learning on graphs.

Discriminator

  • A mutual information estimate is usually computed

    • based on a discriminator $\mathcal{D}: \mathbb{R}^q \times \mathbb{R}^q \to \mathbb{R}$ that

      • maps the representations of two views to an agreement score
        • between the two representations.
  • The discriminator $\mathcal{D}$ can be either parametric or non-parametric.

    • For example, the discriminator can optionally apply a set of projection heads to the representations $h_1, \cdots, h_k$
      • before computing the pairwise similarity.
  • We formalize the optional projection heads as $g_1, \cdots, g_k$ such that
    $z_i = g_i(h_i),\ i = 1, \ldots, k, \tag{11}$

    • where $g_i$ can be an identity mapping, a linear projection, or an MLP.

    • Parameterized $g_i$ are optimized simultaneously

      • with the encoders $f_i$ in Eqn. (8), given by (see the sketch after this list)
        $\underset{\{f_i,g_i\}^k_{i=1}}{\max}\ \frac{1}{\sum_{i\neq j}\alpha_{ij}}\Big[\sum_{i\neq j}\alpha_{ij}\,\hat{\mathcal{I}}_{g_i,g_j}(h_i,h_j)\Big], \tag{12}$
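
A minimal PyTorch sketch of Eqn. (11) and the discriminator's role; the two-layer MLP head and the dimensions are illustrative assumptions, not prescribed by the paper:

```python
import torch
import torch.nn as nn

class ProjectionHead(nn.Module):
    """One g_i from Eqn. (11); could equally be an identity or a linear map."""
    def __init__(self, dim: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        return self.net(h)  # z_i = g_i(h_i)

def discriminator(z_i: torch.Tensor, z_j: torch.Tensor) -> torch.Tensor:
    """Non-parametric D: all pairwise inner-product agreement scores."""
    return z_i @ z_j.T  # (batch, batch); the diagonal holds positive pairs

g_i, g_j = ProjectionHead(64), ProjectionHead(64)
h_i, h_j = torch.randn(8, 64), torch.randn(8, 64)  # view representations from f_i, f_j
scores = discriminator(g_i(h_i), g_j(h_j))
```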

3.2.2 Donsker-Varadhan Estimator

  • The Donsker-Varadhan (DV) estimator,

    • also known as the DV representation of the KL divergence,

      • is a lower-bound to the mutual information
    • target

      • and hence can be applied to maximize the mutual information.
  • Given $h_i$ and $h_j$, the lower-bound is computed as
    $\hat{\mathcal{I}}(h_i, h_j) = \mathbb{E}_{p(h_i,h_j)}[\mathcal{D}(h_i,h_j)] - \log\,\mathbb{E}_{p(h_i)p(h_j)}[e^{\mathcal{D}(h_i,h_j)}] \tag{13}$

    $\mathbb{E}_{p(h_i,h_j)}[\,\cdot\,] = \underset{h_i,h_j}{\sum}p(h_i,h_j)[\,\cdot\,], \qquad \mathbb{E}_{p(h_i)p(h_j)}[\,\cdot\,] = \underset{h_i,h_j}{\sum}p(h_i)p(h_j)[\,\cdot\,]$

    • where $p(h_i, h_j)$ denotes the joint distribution of the two representations $h_i$, $h_j$

    • and $p(h_i)p(h_j)$ denotes the product of marginals.

    • $\mathbb{E}_{p(h_i,h_j)}$ denotes the expectation under $p(h_i,h_j)$

  • For simplicity and to include the graph data distribution $\mathcal{P}$,

    • we assume the transformations $\mathcal{T}_i$ to be deterministic

    • and the encoders $f_i$ to be injective,

      • and have $p(h_i,h_j) = p(h_i)\,p(h_j|h_i) = p(f_i(\mathcal{T}_i(A,X)))\,p(f_j(\mathcal{T}_j(A,X)) \mid (A,X))$. We hence re-write Eqn. (13) as (a sketch follows this list)
        $\hat{\mathcal{I}}^{(DV)}(h_i, h_j) = \mathbb{E}_{(A,X)\sim\mathcal{P}}[\mathcal{D}(h_i,h_j)] - \log\,\mathbb{E}_{[(A,X),(A',X')]\sim\mathcal{P}\times\mathcal{P}}[e^{\mathcal{D}(h_i,h'_j)}] \tag{14}$

        • where $h_i$ and $h_j$ in the first term
          • are computed from $(A, X)$ distributed from $\mathcal{P}$,
        • $h_i$ and $h'_j$ in the second term
          • are computed from $(A, X)$ and $(A', X')$ identically and independently distributed from $\mathcal{P}$, respectively.
        • In the following descriptions of other objectives, we use the latter version that includes $\mathcal{P}$.
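
A minimal PyTorch sketch of Eqn. (14), estimating both expectations with Monte-Carlo samples; the batched score tensors are illustrative assumptions:

```python
import torch

def dv_lower_bound(pos_scores: torch.Tensor, neg_scores: torch.Tensor) -> torch.Tensor:
    """Eqn. (14): E_P[D(h_i, h_j)] - log E_{PxP}[exp(D(h_i, h'_j))].

    pos_scores: D on positive pairs (two views of the same graph), shape (n,).
    neg_scores: D on negative pairs (views of independent graphs), shape (m,).
    """
    m = neg_scores.numel()
    # log of the empirical mean of exp(D), computed stably via logsumexp
    log_mean_exp = torch.logsumexp(neg_scores, dim=0) - torch.log(torch.tensor(float(m)))
    return pos_scores.mean() - log_mean_exp

pos = torch.randn(32)  # illustrative D(h_i, h_j) values
neg = torch.randn(32)  # illustrative D(h_i, h'_j) values
print(dv_lower_bound(pos, neg))
```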

3.2.3 Jensen-Shannon Estimator

  • Compared to the Donsker-Varadhan estimator,

    • the Jensen-Shannon estimator enables more efficient estimation

      • and optimization of the mutual information

        • by computing the JS-divergence between

          • the joint distribution and the product of marginals.
  • Given two representations $h_i$ and $h_j$ computed from the random variable $(A, X)$ and a discriminator $\mathcal{D}$,

    • DGI, InfoGraph, and Hassani and Khasahmadi compute the JS estimator as
      $\hat{\mathcal{I}}^{(JS)}(h_i, h_j) = \mathbb{E}_{(A,X)\sim\mathcal{P}}[\log(\mathcal{D}(h_i,h_j))] + \mathbb{E}_{[(A,X),(A',X')]\sim\mathcal{P}\times\mathcal{P}}[\log(1-\mathcal{D}(h_i,h'_j))] \tag{15}$

    • where $h_i, h_j$ in the first term are computed from $(A, X)$ distributed from $\mathcal{P}$,

    • $h_i$ and $h'_j$ in the second term are computed from $(A, X)$ and $(A', X')$

    • identically and independently distributed

      • from the distribution $\mathcal{P}$.
  • Note that some studies depict a softplus version of the JS estimator,
    $\hat{\mathcal{I}}^{(JS\text{-}SP)}(h_i, h_j) = \mathbb{E}_{(A,X)\sim\mathcal{P}}[-sp(-\mathcal{D}'(h_i,h_j))] - \mathbb{E}_{[(A,X),(A',X')]\sim\mathcal{P}\times\mathcal{P}}[sp(\mathcal{D}'(h_i,h'_j))] \tag{16}$

    • where $sp(x) = \log(1 + e^x)$.
  • We consider the JS estimators in Eqn. (15) and Eqn. (16) to be equivalent

    • by letting $\mathcal{D}(h_i, h_j) = \mathrm{sigmoid}(\mathcal{D}'(h_i, h_j))$.
  • For the negative pairs of graphs $[(A, X),(A',X')]\sim\mathcal{P}\times\mathcal{P}$ in particular,

    • DGI samples one graph $(A, X)$ from the training dataset
      • and applies a stochastic corruption $\mathcal{C}$ to obtain $(A',X') = \mathcal{C}(A,X)$.
  • The other studies independently sample two graphs from the training dataset.

  • Discriminators in JS estimators usually compute the agreement score

    • between two vectors
      • by their inner product with sigmoid, i.e.,
        • $\mathcal{D}(h_i,h_j) = \mathrm{sigmoid}(z^T_i z_j) = \mathrm{sigmoid}(g_i(h_i)^T g_j(h_j))$.
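
A minimal PyTorch sketch of the softplus form in Eqn. (16), which equals Eqn. (15) under $\mathcal{D} = \mathrm{sigmoid}(\mathcal{D}')$; the score tensors are illustrative:

```python
import torch
import torch.nn.functional as F

def js_estimator(pos_scores: torch.Tensor, neg_scores: torch.Tensor) -> torch.Tensor:
    """Eqn. (16): E_P[-sp(-D')] - E_{PxP}[sp(D')].

    Since log(sigmoid(x)) = -sp(-x) and log(1 - sigmoid(x)) = -sp(x),
    this matches Eqn. (15) with D = sigmoid(D').
    """
    return (-F.softplus(-pos_scores)).mean() - F.softplus(neg_scores).mean()

pos = torch.randn(32)        # D'(h_i, h_j) on positive pairs
neg = torch.randn(32)        # D'(h_i, h'_j) on negative pairs
loss = -js_estimator(pos, neg)  # maximize the estimator = minimize its negation
```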

3.2.4 InfoNCE

  • $\hat{\mathcal{I}}^{(NCE)}$ is another lower-bound to the mutual information $\mathcal{I}$.

  • Given the representations $h_i$ and $h_j$ of two views of the random variable $(A, X)$, the discriminator $\mathcal{D}$, and the number of negative samples $N$,

    • the InfoNCE is formalized as
      $\hat{\mathcal{I}}^{(NCE)}(h_i, h_j) = \mathbb{E}_{(A,X)\sim\mathcal{P}}\Big[\mathcal{D}(h_i,h_j) - \mathbb{E}_{K\sim\mathcal{P}^N}\big[\log\big(\tfrac{1}{N}\underset{(A',X')\in K}{\sum}e^{\mathcal{D}(h_i,h'_j)}\big) \,\big|\, (A,X)\big]\Big] \\ = \mathbb{E}_{[(A,X),K]\sim\mathcal{P}\times\mathcal{P}^N}\Big[\log\frac{e^{\mathcal{D}(h_i,h_j)}}{\sum_{(A',X')\in K}e^{\mathcal{D}(h_i,h'_j)}}\Big] + \log N \tag{17}$

      • where $K$ consists of $N$ random variables identically and independently distributed from $\mathcal{P}$,
      • $h_i, h_j$ are the representations of the $i$-th and $j$-th views of $(A, X)$,
      • and $h'_j$ is the representation of the $j$-th view of $(A', X')$.
  • In practice, we compute the InfoNCE on mini-batches of size $N + 1$.

  • For each sample $x$ in a mini-batch $B$,

    • we consider the set of the remaining $N$ samples as a sample of $K$.

    • We then discard the constant term $\log N$ in Eqn. (17) and minimize the loss
      $\mathcal{L}_{InfoNCE} = -\frac{1}{N+1}\underset{x\in B}{\sum}\Big[\log\frac{e^{\mathcal{D}(h_i,h_j)}}{\sum_{x'\in B\setminus\{x\}}e^{\mathcal{D}(h_i,h'_j)}}\Big] \tag{18}$

      • Intuitively, optimizing the InfoNCE loss pushes the agreement score between $h_i$ and $h_j$, the views of the same instance $x$, higher than the agreement between $h_i$ and $h'_j$ from the remaining $N$ negative samples $B\setminus\{x\}$, as sketched below.
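
A minimal PyTorch sketch of the mini-batch loss in Eqn. (18); for brevity it uses the common implementation variant that keeps the positive pair in the denominator instead of the $B\setminus\{x\}$ sum, and with the temperature it becomes the NT-Xent form discussed next:

```python
import torch
import torch.nn.functional as F

def info_nce_loss(z_i: torch.Tensor, z_j: torch.Tensor, tau: float = 1.0) -> torch.Tensor:
    """Eqn. (18) over a batch: row k of z_i is positive with row k of z_j.

    Note: this common variant keeps the positive pair in the denominator;
    with L2-normalized inputs and tau < 1, it is the NT-Xent loss.
    """
    scores = z_i @ z_j.T / tau              # all D(h_i, h'_j) agreement scores
    targets = torch.arange(z_i.size(0))     # positives lie on the diagonal
    return F.cross_entropy(scores, targets) # mean of the -log softmax terms

z_i = F.normalize(torch.randn(16, 64), dim=1)  # illustrative view representations
z_j = F.normalize(torch.randn(16, 64), dim=1)
loss = info_nce_loss(z_i, z_j, tau=0.5)
```
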
  • Discriminators in typical InfoNCE compute the agreement score between two vectors

    • by their inner product, i.e., $\mathcal{D}(h_i,h_j) = z^T_i z_j = g_i(h_i)^T g_j(h_j)$.
  • A specific type of the InfoNCE loss, known as the NT-Xent loss,

    • includes a preset temperature parameter $\tau$ in the computation of the discriminator $\mathcal{D}$
      • in the InfoNCE loss, i.e., $\mathcal{D}(h_i, h_j) = g_i(h_i)^T g_j(h_j)/\tau$.
  • In addition, the discriminator in You et al.

    • computes the agreement score between vectors with normalization,
      • i.e., $\mathcal{D}(h_i,h_j) = \frac{g_i(h_i)^T g_j(h_j)/\tau}{\|g_i(h_i)\|\,\|g_j(h_j)\|}$,
        • where $\|\cdot\|$ denotes the $\ell_2$-norm.
  • Moreover, the Bayesian Personalized Ranking (BPR) loss is equivalent to the InfoNCE loss

    • when letting $N=1$ and $\mathcal{D}(h_i,h_j) = h^T_i h_j$.

3.2.5 Non-Bound Mutual Information Estimators

  • background

    • There are other objectives
      • that have been used in some studies.
  • effect

    • These objectives can also increase the mutual information.

    • However, these objectives are not provable lower-bounds

      • to the mutual information,
      • and optimizing these objectives does not
        • guarantee the maximization of the mutual information.
  • For example, Jiao et al. propose to minimize the triplet margin loss,

    • which is commonly used in deep metric learning.

    • Given representations $h_i, h_j$ and the discriminator $\mathcal{D}$,

    • the triplet margin loss is formalized as
      $\mathcal{L}_{triplet} = \mathbb{E}_{[(A,X),(A',X')]\sim\mathcal{P}\times\mathcal{P}}[\max\{\mathcal{D}(h_i,h_j) - \mathcal{D}(h_i, h'_j) + \epsilon, 0\}], \tag{19}$

      • where $\mathcal{D}(h_i,h_j) = \mathrm{sigmoid}(h^T_i h_j)$
      • and $\epsilon$ is the margin value.
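
A minimal PyTorch sketch of Eqn. (19) with the two scores placed exactly as transcribed above; the batch shapes and margin value are illustrative (note that conventional triplet losses order the scores so that positive agreement is pushed above negative agreement):

```python
import torch

def triplet_margin_loss(h_i, h_j, h_j_neg, eps: float = 0.5) -> torch.Tensor:
    """Eqn. (19): E[max{D(h_i, h_j) - D(h_i, h'_j) + eps, 0}],
    with D(h_i, h_j) = sigmoid(h_i^T h_j) computed row-wise."""
    d_pos = torch.sigmoid((h_i * h_j).sum(dim=1))      # D(h_i, h_j)
    d_neg = torch.sigmoid((h_i * h_j_neg).sum(dim=1))  # D(h_i, h'_j)
    return torch.clamp(d_pos - d_neg + eps, min=0).mean()

h_i, h_j, h_j_neg = (torch.randn(8, 64) for _ in range(3))
loss = triplet_margin_loss(h_i, h_j, h_j_neg)
```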