GNN algorithms(6): GraphCL

GraphCL is a framework for self-supervised pre-training of GNNs. In graph contrastive learning, pre-training maximizes the agreement between two augmented views of the same graph via a contrastive loss in the latent space.

paper: Graph Contrastive Learning with Augmentations, NeurIPS 2020.

2.2.1 Graph data augmentation

The given graph G undergoes graph data augmentations to obtain two correlated views Gi, Gj, as a positive pair.
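As a rough illustration of how two correlated views might be produced, the sketch below implements node dropping and attribute masking on a dense adjacency matrix with NumPy. The function names, the default ratios, and the toy graph are assumptions made for this example and are not taken from the GraphCL code.

```python
import numpy as np

def drop_nodes(adj, feats, drop_ratio=0.2, rng=None):
    """Node dropping: keep a random subset of nodes and the edges among them."""
    rng = np.random.default_rng() if rng is None else rng
    n = adj.shape[0]
    n_keep = max(1, int(round(n * (1.0 - drop_ratio))))
    keep = np.sort(rng.choice(n, size=n_keep, replace=False))
    return adj[np.ix_(keep, keep)], feats[keep]

def mask_attributes(feats, mask_ratio=0.2, rng=None):
    """Attribute masking: zero out the feature vectors of a random subset of nodes."""
    rng = np.random.default_rng() if rng is None else rng
    n = feats.shape[0]
    n_mask = int(round(n * mask_ratio))
    idx = rng.choice(n, size=n_mask, replace=False)
    out = feats.copy()
    out[idx] = 0.0
    return out

# Two augmented views of the same toy graph form one positive pair.
rng = np.random.default_rng(0)
adj = np.ones((5, 5)) - np.eye(5)      # toy graph: complete graph on 5 nodes
feats = np.ones((5, 3))
adj_i, feats_i = drop_nodes(adj, feats, drop_ratio=0.4, rng=rng)   # view Gi
adj_j, feats_j = adj, mask_attributes(feats, mask_ratio=0.4, rng=rng)  # view Gj
```

Both functions return new arrays, so the original graph stays available for building further views.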

2.2.2 GNN-based encoder

A GNN-based encoder f() extracts graph-level representation vectors hi, hj for augmented graphs Gi, Gj. Graph contrastive learning does not apply any constraint on the GNN architecture.

2.2.3 Projection head

A non-linear transformation g() (i.e. one that includes an activation function) named the projection head, e.g. an MLP, maps the augmented representations hi, hj to another latent space, where the contrastive loss is calculated, to obtain zi, zj.

2.2.4 Contrastive loss function

A contrastive loss function L() is defined to maximize the consistency between the positive pair zi, zj compared with negative pairs.
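A minimal sketch of an NT-Xent-style (normalized temperature-scaled cross entropy) loss over a batch of N positive pairs is given below. It assumes cosine similarity, a temperature `tau`, and a denominator that sums over the negatives only; SimCLR-style variants also include the positive term in the denominator. The function name and the default temperature are illustrative choices.

```python
import numpy as np

def nt_xent(z_i, z_j, tau=0.5):
    """NT-Xent-style loss for N positive pairs (z_i[n], z_j[n])."""
    # Normalize so that the dot product equals the cosine similarity.
    z_i = z_i / np.linalg.norm(z_i, axis=1, keepdims=True)
    z_j = z_j / np.linalg.norm(z_j, axis=1, keepdims=True)
    sim = np.exp(z_i @ z_j.T / tau)      # (N, N) exponentiated similarities
    pos = np.diag(sim)                   # positive pairs sit on the diagonal
    neg = sim.sum(axis=1) - pos          # negatives: views of the other graphs in the batch
    return float(np.mean(-np.log(pos / neg)))

z = np.eye(4)
loss_aligned = nt_xent(z, z)                          # every positive pair agrees
loss_shuffled = nt_xent(z, np.roll(z, 1, axis=0))     # positives are mismatched
```

With aligned pairs the loss is lower than with mismatched pairs, which is the behaviour the contrastive objective is meant to reward.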

3. Summary of GCL

Data augmentation is crucial for GCL: without any data augmentation, graph contrastive learning is not helpful and is often worse than training from scratch.
Composing different augmentations yields larger gains.
Edge perturbation benefits social networks but hurts some biochemical molecules.
Attribute masking achieves better performance on denser graphs.
Node dropping and subgraph sampling are generally beneficial across datasets. For subgraph, previous works show that enforcing consistency between local information (the extracted subgraphs) and global information helps representation learning.

4. GCL Implementation

4.1 Semi-supervised implementation

4.2 Unsupervised implementation

Reference: GraphCL/unsupervised_Cora_Citeseer at master · Shen-Lab/GraphCL · GitHub

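1) Build two augmented feature matrices (feature1 and feature2) and two augmented adjacency matrices (matrix1 and matrix2).

2) Build the self-supervised labels from an all-ones matrix (torch.ones) and an all-zeros matrix (torch.zeros).

3) Given the augmented features and adjacency matrices, discriminator 1 separates the features from the shuffled features under the first augmentation, and discriminator 2 does the same under the second augmentation. The two results, ret1 and ret2, are added together to form the model output.

4) Compute the loss between the model output and the self-supervised labels, then run backpropagation and gradient descent to learn the model parameters, which are later used to generate the feature embeddings.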

for epoch in range(nb_epochs):

    model.train()
    optimiser.zero_grad()

    idx = np.random.permutation(nb_nodes)
    shuf_fts = features[:, idx, :]

    lbl_1 = torch.ones(batch_size, nb_nodes)  # labels
    lbl_2 = torch.zeros(batch_size, nb_nodes)
    lbl = torch.cat((lbl_1, lbl_2), 1)

    if torch.cuda.is_available():
        shuf_fts = shuf_fts.cuda()
        lbl = lbl.cuda()

    logits = model(features, shuf_fts, aug_features1, aug_features2,
                   sp_adj if sparse else adj,
                   sp_aug_adj1 if sparse else aug_adj1,
                   sp_aug_adj2 if sparse else aug_adj2,
                   sparse, None, None, None, aug_type=aug_type)

    loss = b_xent(logits, lbl)  # Given the augmented views, the discriminator learns to separate features from shuffled features.
    print('Loss:[{:.4f}]'.format(loss.item()))

    if loss < best:
        best = loss
        best_t = epoch
        cnt_wait = 0
        torch.save(model.state_dict(), args.save_name)
    else:
        cnt_wait += 1

    if cnt_wait == patience:
        print('Early stopping!')
        break

    loss.backward()
    optimiser.step()

4.2.1 GraphCL subgraph

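1) Graph contrastive learning strengthens the parameters of the graph embedding model so that it becomes robust to edge perturbation.

2) The program should generate the subgraph node features and adjacency matrices inside the epoch loop, rather than reusing the same two fixed subgraphs throughout.

3) The discriminator scores two graph embeddings. When the feature indices are shuffled while the adjacency matrices stay unchanged, the result is effectively a new graph whose node embeddings have changed completely, so the constructed labels are naturally (1, 1, 1, ...) and (0, 0, 0, ...).

4) The GraphCL model essentially produces an embedding; the discriminator essentially applies a linear transformation.

5) Weight this loss and integrate it into the Re-HAN model.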
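Point 2 above can be sketched as follows: the two subgraph views are resampled in every epoch instead of being built once before training. The sampler below grows a connected node set from a random seed node; it is a simplified stand-in written for this note, not the sampling routine of the GraphCL repository, and `sample_subgraph`, the ring graph, and the subgraph size are all assumptions.

```python
import numpy as np

def sample_subgraph(adj, size, rng):
    """Grow a node set from a random seed by repeatedly adding a random neighbor of the set."""
    n = adj.shape[0]
    nodes = [int(rng.integers(n))]
    while len(nodes) < size:
        # Nodes adjacent to the current set that are not yet in it.
        frontier = np.flatnonzero(adj[nodes].sum(axis=0) > 0)
        frontier = np.setdiff1d(frontier, nodes)
        if frontier.size == 0:           # the connected component is exhausted
            break
        nodes.append(int(rng.choice(frontier)))
    nodes = np.sort(np.array(nodes))
    return nodes, adj[np.ix_(nodes, nodes)]

rng = np.random.default_rng(0)
adj = np.zeros((8, 8))                   # toy graph: a ring of 8 nodes
for i in range(8):
    adj[i, (i + 1) % 8] = adj[(i + 1) % 8, i] = 1.0

for epoch in range(3):
    # Fresh views every epoch; features[nodes_k] and sub_adj_k would feed the model here.
    nodes_1, sub_adj_1 = sample_subgraph(adj, 4, rng)
    nodes_2, sub_adj_2 = sample_subgraph(adj, 4, rng)
```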

4.3 Adversarial implementation

4.4 Transfer learning implementation

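5. Designing your own loss function based on the contrastive loss idea

The losses below all build on the contrastive loss idea and can serve as starting points for designing your own loss function.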

5.1 N-pair loss

N-pair loss constructs N pairs of samples from N different classes; here it is used for self-supervised learning.

5.2 triplet loss

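Google, 2015 (FaceNet)

The query sample (data augmentation) is compared with positive samples;

the query sample is compared with negative samples.

Each triplet requires two positive examples and one negative example.

Problem: triplet loss considers too few negative samples, so it converges slowly.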
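A minimal NumPy sketch of the triplet loss follows, assuming squared Euclidean distances and a hinge with a margin. The function name and the margin value are illustrative.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge on (distance to positive) - (distance to negative) + margin, averaged over the batch."""
    d_pos = np.sum((anchor - positive) ** 2, axis=1)   # squared distance query -> positive
    d_neg = np.sum((anchor - negative) ** 2, axis=1)   # squared distance query -> negative
    return float(np.mean(np.maximum(0.0, d_pos - d_neg + margin)))

anchor = np.array([[0.0, 0.0]])
near = np.array([[0.0, 1.0]])
far = np.array([[3.0, 0.0]])
easy = triplet_loss(anchor, near, far)   # positive is closer: the hinge is inactive
hard = triplet_loss(anchor, far, near)   # positive is farther: the loss is positive
```

Each triplet sees only one negative, which is the limitation noted above.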

5.3 (N+1) tuplet loss

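An approach that considers multiple negative samples:

(N-1) negative samples
one positive sample
the original sample x

Problem: the (N+1)-tuplet loss is too expensive to compute.

N-pair loss reuses the already computed embedding vectors as negative samples.

It treats the positive samples of the other samples as the negative samples of the current sample -> 2N embedding computations.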
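The reuse of embeddings in the N-pair loss described above can be sketched as a softmax cross-entropy over an N x N score matrix: row i scores anchor i against every positive, and only the diagonal entry is the true match. This is a simplified version without the embedding regularization term, and the function name is an assumption.

```python
import numpy as np

def n_pair_loss(anchors, positives):
    """N-pair loss: the positives of the other samples act as negatives for each anchor."""
    logits = anchors @ positives.T                         # logits[i, j] = f_i . f_j^+
    logits = logits - logits.max(axis=1, keepdims=True)    # shift for numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Equals mean_i log(1 + sum_{j != i} exp(f_i . f_j^+ - f_i . f_i^+)).
    return float(-np.mean(np.diag(log_prob)))

emb = 2.0 * np.eye(3)          # three anchors that match their own positives
loss = n_pair_loss(emb, emb)
```

Only 2N embeddings (N anchors and N positives) are evaluated, yet every anchor is compared against N-1 negatives.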

5.4 Instance Discrimination

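It introduces a memory bank mechanism and genuinely applies the loss to unsupervised learning.

Every single instance is treated as its own "class". A non-parametric softmax classifier computes the probability that each sample is recognized correctly.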

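Starting from the parametric softmax $P(i \mid v) = \frac{\exp(w_i^T v)}{\sum_{j=1}^{n} \exp(w_j^T v)}$, the weight $w_j$ is replaced with $v_j$ and $\|v\| = 1$ is enforced, which with a temperature $\tau$ gives

$P(i \mid v) = \frac{\exp(v_i^T v / \tau)}{\sum_{j=1}^{n} \exp(v_j^T v / \tau)}$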

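Meanwhile, the memory bank stores the feature vectors, and NCE (noise-contrastive estimation) is used to approximate the softmax in order to reduce the computational complexity.

Finally, proximal regularization is used to stabilize the fluctuations of the training process. The similarity between instances is computed directly from the features in a non-parametric way.

NCE trains a classifier to distinguish samples of the "real" distribution from samples of an artificially generated "noise" distribution, which reduces the problem to binary classification.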

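Can different instances be distinguished by their feature representations alone?
Can purely discriminative learning reflect the similarity between samples?
If every instance is treated as its own "class", the number of classes becomes huge; how should that be handled?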
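The non-parametric softmax over a memory bank can be sketched as below. The stored vectors are L2-normalized, and the probability of instance i given a feature v is a softmax over dot products scaled by a temperature. The function names, the temperature, and the momentum update of the bank are assumptions made for this example; the full softmax shown here is the quantity that NCE is used to approximate when the number of instances is large.

```python
import numpy as np

def l2_normalize(v):
    """Scale vectors to unit length along the last axis."""
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

def instance_probs(v, memory_bank, tau=0.07):
    """Softmax over similarities between v and every stored instance vector."""
    logits = memory_bank @ v / tau
    logits = logits - logits.max()       # shift for numerical stability
    p = np.exp(logits)
    return p / p.sum()

def update_memory(memory_bank, idx, v, momentum=0.5):
    """Refresh the stored vector of instance idx with a moving average, then re-normalize."""
    memory_bank[idx] = l2_normalize(momentum * memory_bank[idx] + (1.0 - momentum) * v)

rng = np.random.default_rng(0)
bank = l2_normalize(rng.normal(size=(6, 4)))   # one unit vector per instance
query = bank[2].copy()                          # feature of instance 2
probs = instance_probs(query, bank)             # highest probability falls on instance 2
update_memory(bank, 2, query)
```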

5.5 NCE loss

https://leimao.github.io/article/Noise-Contrastive-Estimation/

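NCE turns a multi-class classification problem into a set of binary classification problems, where the binary task is to distinguish data samples from noise samples.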

$p(x) = \frac{\exp(\hat{p}(x))}{Z}$
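For reference, the NCE objective in the instance-discrimination setting can be written as follows. The notation is an assumption chosen to match the rest of this note: $m$ noise samples are drawn per data sample, $P_d$ is the data distribution, and the noise distribution $P_n$ is uniform over the $n$ instances. The posterior that a sample $(i, v)$ comes from the data rather than from the noise is $h(i, v)$.

```latex
h(i, v) = \frac{P(i \mid v)}{P(i \mid v) + m\,P_n(i)}, \qquad P_n(i) = \frac{1}{n}

J_{NCE}(\theta) = -\mathbb{E}_{P_d}\left[\log h(i, v)\right]
                  - m\,\mathbb{E}_{P_n}\left[\log\left(1 - h(i, v')\right)\right]
```

Minimizing $J_{NCE}$ pushes $h$ towards 1 on data samples and towards 0 on noise samples, so the normalizing constant $Z$ never has to be summed over all instances.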

import random

import torch
from torch import nn

class NCECriterion(nn.Module):

    def __init__(self, nce_m, eps):
        super(NCECriterion, self).__init__()
        self.nce_m = nce_m
        self.eps = eps
        self.relu = nn.ReLU()
        self.softmax = nn.Softmax(dim=0)

    def forward(self, x, labels):
        batch_size = x.size(0)
        # Uniform noise distribution over the batch, created on the same device as x
        noise_distribution = torch.full((batch_size,), 1.0 / batch_size,
                                        dtype=x.dtype, device=x.device)
        # Non-parametric softmax classifier: pairwise similarities divided by eps,
        # which plays the role of a temperature
        prob = torch.matmul(x, x.t())
        prob = torch.div(prob, self.eps)

        # Softmax turns the similarities and the labels into probabilities
        pred_prob = self.softmax(prob)
        true_prob = self.softmax(labels.to(x.dtype))

        # Pick two random vectors v and v' and compute the posterior
        # h(i, v) = Pmt / (Pmt + k * Pnt)
        v_1_idx = random.randint(0, batch_size - 1)
        v_2_idx = random.randint(0, batch_size - 1)

        # Pnt_1 and Pnt_2 hold the full denominator Pmt + nce_m / batch_size
        Pmt_1 = pred_prob[v_1_idx]
        Pnt_1 = Pmt_1.add(self.nce_m / batch_size)
        h_1 = torch.div(Pmt_1, Pnt_1)
        Pmt_2 = pred_prob[v_2_idx]
        Pnt_2 = Pmt_2.add(self.nce_m / batch_size)
        h_2 = torch.div(Pmt_2, Pnt_2)

        # Take logarithms
        h_1_log = torch.log(h_1)
        h_2_log = torch.log(1 - h_2)

        # Expectations under the data and noise distributions
        expectation_1 = torch.matmul(true_prob, h_1_log)  # (100,) * (100,) -> scalar, e.g. -4.9336
        expectation_2 = torch.matmul(noise_distribution, h_2_log)  # (100,) * (100,) -> scalar
        # NCE loss
        nce_loss = -expectation_1 - self.nce_m * expectation_2

        return nce_loss, batch_size