A Brief Walkthrough of the DGL Implementation of Unsupervised GraphSAGE

Article link: https://blog.csdn.net/u013036495/article/details/108180438

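This post walks through the unsupervised GraphSAGE example in DGL, as it stood on the master branch on 2020-08-20. The master branch changes a lot, so the code may look different later on.
Code: https://github.com/dmlc/dgl/blob/master/examples/pytorch/graphsage/train_sampling_unsupervised.py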

1. Sampling is done by edge ID, and every edge in the graph is used.

n_edges = g.number_of_edges()
train_seeds = np.arange(n_edges)

The dataloader, which produces the data actually trained on in each batch, is built as follows:

    dataloader = dgl.dataloading.EdgeDataLoader(
        g, train_seeds, sampler, exclude='reverse_id',
        # For each edge with ID e in Reddit dataset, the reverse edge is e ± |E|/2.
        reverse_eids=th.cat([
            th.arange(n_edges // 2, n_edges),
            th.arange(0, n_edges // 2)]),
        negative_sampler=NegativeSampler(g, args.num_negs),
        batch_size=args.batch_size,
        shuffle=True,
        drop_last=False,
        pin_memory=True,
        num_workers=args.num_workers)
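
To make the reverse_eids mapping concrete, here is a small pure-Python sketch that needs no DGL. It assumes, as the comment in the code above states for Reddit, that each undirected edge is stored as two directed edges whose IDs are |E|/2 apart. The helper name build_reverse_eids and the toy size of 8 edges are made up for illustration.

```python
def build_reverse_eids(n_edges):
    """Return a list r where r[e] is the ID of the reverse of edge e.

    Assumes edge e and its reverse are stored n_edges // 2 apart, mirroring
    th.cat([th.arange(n_edges // 2, n_edges), th.arange(0, n_edges // 2)]).
    """
    half = n_edges // 2
    return list(range(half, n_edges)) + list(range(0, half))


reverse_eids = build_reverse_eids(8)
print(reverse_eids)  # [4, 5, 6, 7, 0, 1, 2, 3]

# Reversing twice must lead back to the original edge.
assert all(reverse_eids[reverse_eids[e]] == e for e in range(8))
```

As far as I understand the option, exclude='reverse_id' uses this mapping to keep the edges of the current batch and their reverses out of neighbor sampling, so the model cannot read the answer off the very edge it is asked to predict.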

During training, one batch of data is obtained like this:

for step, (input_nodes, pos_graph, neg_graph, blocks) in enumerate(dataloader):

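The overall flow should be as follows:

The dataloader receives train_seeds (the IDs of all edges in the graph) and draws batch_size edge IDs at a time. For each edge ID it looks up the two endpoint nodes, src and dst, and builds the positive subgraph pos_graph from them. The negative subgraph neg_graph is built by NegativeSampler, which randomly replaces dst with another node, call it dst_neg. Note that pos_graph and neg_graph both end up containing the same nodes, namely src, dst and dst_neg, while each keeps its own edges. This is needed for the loss computation, because the computed features can then be assigned directly to both pos_graph and neg_graph. Finally, src, dst and dst_neg together serve as the seeds for GraphSAGE neighbor sampling, and the outermost nodes reached by the sampling are returned as input_nodes, which is used to fetch the features of those nodes.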
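
The flow described above can be sketched in plain Python. This is only a toy illustration, not DGL's implementation: the function build_batch is hypothetical, negatives are drawn uniformly at random (an assumption; the example's NegativeSampler may use a different distribution), and the shared node set is sorted only to keep the result deterministic.

```python
import random


def build_batch(edges, batch_eids, num_negs, num_nodes, rng):
    """Toy version of one link-prediction batch.

    edges      : list of (src, dst) pairs, indexed by edge ID
    batch_eids : edge IDs drawn for this batch
    Returns (seeds, pos_edges, neg_edges), where both edge lists are
    relabeled to index into the shared `seeds` list.
    """
    pos = [edges[e] for e in batch_eids]
    # Assumption: negatives keep src and draw a uniformly random dst.
    neg = [(src, rng.randrange(num_nodes)) for src, _ in pos for _ in range(num_negs)]

    # One shared node set (src, dst and dst_neg) for both graphs.
    seeds = sorted({n for pair in pos + neg for n in pair})
    local = {node: i for i, node in enumerate(seeds)}

    pos_edges = [(local[u], local[v]) for u, v in pos]
    neg_edges = [(local[u], local[v]) for u, v in neg]
    return seeds, pos_edges, neg_edges


edges = [(0, 1), (1, 2), (2, 3), (3, 4)]
seeds, pos_edges, neg_edges = build_batch(
    edges, [0, 2], num_negs=2, num_nodes=5, rng=random.Random(0))
print(len(seeds), len(pos_edges), len(neg_edges))
```

Because both edge lists index into the same seeds list, a single embedding matrix with one row per seed can serve both graphs, which is what the loss code in the next section relies on.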

2. Loss computation
The code is as follows:

class CrossEntropyLoss(nn.Module):
    def forward(self, block_outputs, pos_graph, neg_graph):
        with pos_graph.local_scope():
            pos_graph.ndata['h'] = block_outputs
            pos_graph.apply_edges(fn.u_dot_v('h', 'h', 'score'))
            pos_score = pos_graph.edata['score']
        with neg_graph.local_scope():
            neg_graph.ndata['h'] = block_outputs
            neg_graph.apply_edges(fn.u_dot_v('h', 'h', 'score'))
            neg_score = neg_graph.edata['score']

        score = th.cat([pos_score, neg_score])
        label = th.cat([th.ones_like(pos_score), th.zeros_like(neg_score)]).long()
        loss = F.binary_cross_entropy_with_logits(score, label.float())
        return loss

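As shown, the GraphSAGE output for each batch, block_outputs, is assigned directly to ndata['h'] of both pos_graph and neg_graph. This works because the number of nodes in pos_graph and neg_graph equals the number of rows in block_outputs: the nodes of these two graphs are exactly the seeds used for neighbor sampling.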
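
To illustrate what the scoring step computes once the embeddings are attached, here is a minimal pure-Python sketch of the per-edge dot product. The function edge_dot_scores and all numbers are made up; it only mimics what fn.u_dot_v produces for each edge and does not use DGL.

```python
def edge_dot_scores(h, edges):
    """Dot product of the endpoint embeddings for every (u, v) edge.

    h     : list of embedding vectors, one row per node of the graph
    edges : list of (u, v) pairs using the same node numbering as h
    """
    return [sum(a * b for a, b in zip(h[u], h[v])) for u, v in edges]


# Three nodes with 2-dimensional embeddings (made-up values).
block_outputs = [[1.0, 0.0], [0.5, 2.0], [0.0, 1.0]]
pos_edges = [(0, 1)]          # stands in for the edges of pos_graph
neg_edges = [(0, 2), (1, 2)]  # stands in for the edges of neg_graph

print(edge_dot_scores(block_outputs, pos_edges))  # [0.5]
print(edge_dot_scores(block_outputs, neg_edges))  # [0.0, 2.0]
```

The same block_outputs list is indexed by both edge lists, which only makes sense because both graphs number their nodes identically.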

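As for the loss itself, the code uses F.binary_cross_entropy_with_logits. This looks slightly different from the formulation in the paper, but the effect should be the same: for a positive edge (label 1) the term reduces to -log σ(score), and for a negative edge (label 0) it reduces to -log σ(-score).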
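
A quick numerical check of that claim, written in plain Python so it runs without PyTorch. The helper names and the scores are made up, and bce_with_logits here is a hand-written mean binary cross-entropy on logits, not the library function. The two forms agree term by term; any remaining difference from the paper would be in how the terms are weighted and averaged.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def bce_with_logits(scores, labels):
    """Mean binary cross-entropy computed on raw scores (logits)."""
    terms = [-(y * math.log(sigmoid(s)) + (1 - y) * math.log(1 - sigmoid(s)))
             for s, y in zip(scores, labels)]
    return sum(terms) / len(terms)


def neg_sampling_loss(pos_scores, neg_scores):
    """Mean of -log sigma(s) over positives and -log sigma(-s) over negatives."""
    terms = [-math.log(sigmoid(s)) for s in pos_scores]
    terms += [-math.log(sigmoid(-s)) for s in neg_scores]
    return sum(terms) / len(terms)


pos_scores = [2.0, 0.3]          # made-up scores of positive edges
neg_scores = [-1.0, 0.5, -0.2]   # made-up scores of negative edges

bce = bce_with_logits(pos_scores + neg_scores, [1, 1, 0, 0, 0])
ns = neg_sampling_loss(pos_scores, neg_scores)
print(math.isclose(bce, ns, rel_tol=1e-9))  # True
```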
The formula in the paper is: