Week 5.02: Label Propagation and Community Detection

oldmao_2000

Article link: https://blog.csdn.net/oldmao_2001/article/details/118520063

Contents

Label propagation
Community detection
Implementation

These notes are compiled from the 深度之眼 course 《GNN核心能力培养计划》 (GNN core skills training program).

Label propagation

Learning from Labeled and Unlabeled Data with Label Propagation
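This is a fairly old paper, so some of its conventions differ from what is common today. It is also quite short.

Abstract

The paper uses unlabeled data to assist labeled data in a classification task.
We investigate the use of unlabeled data to help labeled data in classification.
It proposes an iterative method, label propagation, which spreads labels through the high-density regions of the dataset. These regions are determined from the unlabeled data; put simply, the method ignores the labels and looks directly at the graph structure among the data points.
We propose a simple iterative algorithm, label propagation, to propagate labels through the dataset along high density areas defined by unlabeled data.
The paper analyzes the proposed algorithm and relates it to several other algorithms.
We analyze the algorithm, show its solution, and its connection to several other algorithms.
It also shows how to learn the model parameter with a minimum spanning tree heuristic and with entropy minimization, and how the algorithm can perform feature selection. The reported experimental results are promising.
We also show how to learn parameters by minimum spanning tree heuristic and entropy minimization, and the algorithm’s ability to perform feature selection. Experiment results are promising.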

We skip the Introduction and start directly from Section 2.

2.1 Problem setup

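Labeled data: $(x_1,y_1)\cdots(x_l,y_l)$
Class labels: $Y_L=\{y_1\cdots y_l\}$, with each $y_i\in\{1\cdots C\}$
$C$ is the number of classes and is known; every class already appears in the labeled data, so no new class will show up.
Unlabeled data: $(x_{l+1},y_{l+1})\cdots(x_{l+u},y_{l+u})$, where the labels $Y_U=\{y_{l+1}\cdots y_{l+u}\}$ are unobserved
There are far fewer labeled points than unlabeled ones: $l \ll u$
The labeled and unlabeled points together form the set $X=\{x_1\cdots x_{l+u}\}$, with each $x_i\in \mathbb{R}^D$
The task is to predict the labels $Y_U$ of the unlabeled points from $X$ and $Y_L$.
The underlying idea is the same as in GCN: neighboring nodes should have similar representations or labels.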
Intuitively, we want data points that are close to have similar labels.
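The data here has no graph structure and hence no notion of neighboring nodes, so the first step is to build a fully connected graph over all points. The Euclidean distance $d_{ij}$ between two nodes defines their edge: the smaller the distance, the larger the weight $w_{ij}$ ($\sigma$ is a hyperparameter):
$w_{ij}=\exp\left(-\cfrac{d^2_{ij}}{\sigma^2}\right)$
Every node then gets a soft label, a probability distribution over the $C$ classes (summing to 1), which plays the role of the initial node embedding in GCN. The soft labels propagate along the edges, and the larger the weight $w_{ij}$, the more easily a label travels; this corresponds to the convolution operation in GCN.
The probability that a label jumps from node $j$ to node $i$ can be computed from the weights, and collecting it for all pairs of nodes gives a probability transition matrix:
$T_{ij}=P(j\rightarrow i)=\cfrac{w_{ij}}{\sum_{k=1}^{l+u}w_{kj}}$
The matrix has size $(l+u)\times (l+u)$. The denominator sums the weights $w_{kj}$ over all nodes $k$, that is, over column $j$ of the weight matrix (identical to the row sum because the weights are symmetric), so it sums over the neighbors and acts as a normalization.
The paper also defines an $(l+u)\times C$ label matrix $Y$, whose $i$-th row is the soft label of node $i$.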
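
The setup above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function names `build_transition` and `init_labels`, the toy points, the 0-based class ids, and the uniform initial rows for unlabeled nodes are all choices made here, not something prescribed by the notes.

```python
import math

def build_transition(points, sigma):
    """Return (W, T) for a fully connected graph over `points`."""
    n = len(points)
    # w_ij = exp(-d_ij^2 / sigma^2), with d_ij the Euclidean distance
    W = [[math.exp(-sum((a - b) ** 2 for a, b in zip(points[i], points[j])) / sigma ** 2)
          for j in range(n)] for i in range(n)]
    # T_ij = w_ij / sum_k w_kj: every column j is divided by its column sum
    col_sums = [sum(W[k][j] for k in range(n)) for j in range(n)]
    T = [[W[i][j] / col_sums[j] for j in range(n)] for i in range(n)]
    return W, T

def init_labels(labels, num_unlabeled, num_classes):
    """One-hot rows for labeled points, uniform rows (an assumption) for the rest."""
    Y = [[1.0 if c == y else 0.0 for c in range(num_classes)] for y in labels]
    Y += [[1.0 / num_classes] * num_classes for _ in range(num_unlabeled)]
    return Y

# Toy data: the first two points are labeled (classes 0 and 1), the last two are not.
points = [(0.0, 0.0), (4.0, 0.0), (0.5, 0.0), (3.5, 0.0)]
W, T = build_transition(points, sigma=1.0)
Y = init_labels([0, 1], num_unlabeled=2, num_classes=2)
```

Nearby points get a weight close to 1 and distant points a weight close to 0, so `sigma` effectively controls how far a label can travel in one step.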

2.2 The algorithm

The label propagation algorithm is as follows:

1. All nodes propagate labels for one step: $Y \leftarrow TY$
2. Row-normalize $Y$ to maintain the class probability interpretation.
3. Clamp the labeled data. Repeat from step 2 until $Y$ converges.
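After the propagation result is obtained, the rows of the first $l$ (labeled) nodes are checked and reset to the ground truth.
Step 3 keeps the labeled data from fading away by fixing the class distribution of the labeled nodes. This is effectively a one-hot encoding: in a three-class problem, a node belonging to the second class is represented as [0,1,0].
Step 3 is critical: Instead of letting the labeled data points ‘fade away’, we clamp their class distributions to $Y_{ic}=\delta(y_i,c)$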
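
As a rough illustration of the three steps, here is a self-contained plain-Python sketch. The function name `label_propagation`, the tolerance-based stopping test, the iteration cap, the 0-based class ids, and the uniform initialization of unlabeled rows are assumptions made for this example.

```python
import math

def label_propagation(points, labels, num_classes, sigma=1.0, tol=1e-9, max_iter=1000):
    """Propagate the labels of the first len(labels) points to all other points."""
    n, l = len(points), len(labels)
    # Fully connected graph: w_ij = exp(-d_ij^2 / sigma^2)
    W = [[math.exp(-sum((a - b) ** 2 for a, b in zip(points[i], points[j])) / sigma ** 2)
          for j in range(n)] for i in range(n)]
    col_sums = [sum(W[k][j] for k in range(n)) for j in range(n)]
    T = [[W[i][j] / col_sums[j] for j in range(n)] for i in range(n)]

    # One-hot rows for labeled nodes; uniform rows for unlabeled nodes (assumed init)
    clamp = [[1.0 if c == y else 0.0 for c in range(num_classes)] for y in labels]
    Y = [row[:] for row in clamp] + [[1.0 / num_classes] * num_classes for _ in range(n - l)]

    for _ in range(max_iter):
        # Step 1: Y <- T Y
        new_Y = [[sum(T[i][k] * Y[k][c] for k in range(n)) for c in range(num_classes)]
                 for i in range(n)]
        # Step 2: row-normalize so every row stays a probability distribution
        new_Y = [[v / sum(row) for v in row] for row in new_Y]
        # Step 3: clamp the labeled rows back to their one-hot ground truth
        new_Y[:l] = [row[:] for row in clamp]
        delta = max(abs(a - b) for r_new, r_old in zip(new_Y, Y) for a, b in zip(r_new, r_old))
        Y = new_Y
        if delta < tol:
            break
    return Y

# Two labeled points (classes 0 and 1) followed by two unlabeled points
points = [(0.0, 0.0), (4.0, 0.0), (0.5, 0.0), (3.5, 0.0)]
Y = label_propagation(points, labels=[0, 1], num_classes=2)
predicted = [row.index(max(row)) for row in Y]
```

In this toy example each unlabeled point sits next to one labeled point, so it ends up with that point's class.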

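The paper then proves convergence mathematically. The rough idea is to partition the transition matrix into blocks (the approach in the GGNN paper is a useful reference as well), derive an update formula for $Y_U$, and show that the iterates converge to a limit as the number of iterations grows.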
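
The outline of that argument can be written down as follows. This is a sketch rather than the full proof; $\bar T$ denotes the row-normalized transition matrix, split after the $l$-th row and column, and $Y_U^{(0)}$ is the arbitrary initial value of the unlabeled rows (both notations are assumed here).

```latex
\begin{align}
\bar T &= \begin{bmatrix} \bar T_{ll} & \bar T_{lu} \\ \bar T_{ul} & \bar T_{uu} \end{bmatrix},
\qquad
Y_U \leftarrow \bar T_{uu}\, Y_U + \bar T_{ul}\, Y_L \\
Y_U^{(n)} &= \bar T_{uu}^{\,n}\, Y_U^{(0)}
  + \Big(\sum_{i=1}^{n} \bar T_{uu}^{\,i-1}\Big)\, \bar T_{ul}\, Y_L \\
\lim_{n\to\infty} \bar T_{uu}^{\,n} = 0
&\;\Longrightarrow\;
Y_U = (I - \bar T_{uu})^{-1}\, \bar T_{ul}\, Y_L
\end{align}
```

Because $Y_L$ is clamped, only the unlabeled block is updated. Each row of $\bar T$ sums to 1 and all weights are positive, so each row of $\bar T_{uu}$ sums to strictly less than 1; its powers therefore vanish and the limit does not depend on $Y_U^{(0)}$.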

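Community detection

This method also belongs to the label propagation family: nodes with the same label can be regarded as one community, and the paper refines the basic idea.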
Near linear time algorithm to detect community structures in large-scale networks

Abstract

The abstract opens directly with the importance and applications of the community detection task.
Community detection and analysis is an important methodology for understanding the organization of various real-world networks and has applications in problems as diverse as consensus formation in social communities or the identification of functional modules in biochemical networks.

It then states the shortcomings of existing work: the algorithms either need prior information about the communities or are computationally expensive.
Currently used algorithms that identify the community structures in large-scale real-world networks require a priori information such as the number and sizes of communities or are computationally expensive.

What method does the paper propose?
In this paper we investigate a simple label propagation algorithm that uses the network structure alone as its guide and requires neither optimization of a predefined objective function nor prior information about the communities.

How does the algorithm work?
In our algorithm every node is initialized with a unique label and at every step each node adopts the label that most of its neighbors currently have. In this iterative process densely connected groups of nodes form a consensus on a unique label to form communities.

What are the results?
We validate the algorithm by applying it to networks whose community structures are known. We also demonstrate that the algorithm takes an almost linear time and hence it is computationally less expensive than what was possible so far.

COMMUNITY DETECTION USING LABEL PROPAGATION

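The basic setup is the same as above and is not repeated. What differs is that the paper discusses two update schemes. The first is synchronous updating, the scheme used above, where a node's label at time t is aggregated from its neighbors' labels at time t-1. On bipartite and star graphs (a star graph is a special case of a bipartite graph) this scheme can oscillate; see Figure 3 of the original paper:
[Figure 3 of the original paper: label oscillation on a bipartite graph under synchronous updating]
The figure shows the labels oscillating between times t and t+1: the nodes on the left and right sides update simultaneously, so the two sides simply swap labels.
The paper therefore adopts asynchronous updating, where labels already updated in the current iteration are used immediately when computing the next node's label.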
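
The oscillation is easy to reproduce on a small example. The sketch below is illustrative only: the helper names, the toy graph (a complete bipartite graph with two nodes per side), and the deterministic tie-break (smallest label wins, chosen so the demo is repeatable) are assumptions, whereas the paper breaks ties uniformly at random.

```python
from collections import Counter

def most_common_label(neighbor_labels):
    """Most frequent label; ties go to the smallest label to keep the demo deterministic."""
    counts = Counter(neighbor_labels)
    best = max(counts.values())
    return min(lab for lab, c in counts.items() if c == best)

def sync_step(adj, labels):
    """Synchronous update: every node reads only the labels of the previous step."""
    return {v: most_common_label([labels[u] for u in adj[v]]) for v in adj}

def async_step(adj, labels, order):
    """Asynchronous update: nodes updated earlier in the sweep are visible at once."""
    labels = dict(labels)
    for v in order:
        labels[v] = most_common_label([labels[u] for u in adj[v]])
    return labels

# Complete bipartite graph: left side {0, 1}, right side {2, 3}
adj = {0: [2, 3], 1: [2, 3], 2: [0, 1], 3: [0, 1]}
start = {0: "a", 1: "a", 2: "b", 3: "b"}

s1 = sync_step(adj, start)   # the two sides swap labels
s2 = sync_step(adj, s1)      # and swap back: the labels oscillate forever
a1 = async_step(adj, start, order=[0, 1, 2, 3])  # all nodes agree after one sweep
```

With synchronous updates the state returns to `start` every two steps, while a single asynchronous sweep already produces a consensus label.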

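Label propagation algorithm

Ideally, the iteration stops when no node's label changes any more. However, a node's neighbors may have several classes tied for the highest count (for example, a node with 10 neighbors spread over 3 classes with counts 4, 4, 2), in which case the node picks one of the tied classes at random (in the example, either of the two classes with count 4). Labels can therefore keep changing from one iteration to the next, so the stopping condition is instead that every node carries a label held by the maximum number of its neighbors. Mathematically, with $d_i^{C_j}$ the number of neighbors of node $i$ that have label $C_j$:
$\text{If }i \text{ has label }C_m \text{ then }d_i^{C_m}\ge d_i^{C_j} \quad \forall j$
After the iteration terminates, nodes with the same label are grouped into one community.
The algorithm is described as follows: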
1. Initialize the labels at all nodes in the network. For a given node $x$, $C_x(0)=x$. (Each node's id serves as its initial label, which has an effect similar to one-hot encoding.)
2. Set $t = 1$.
3. Arrange the nodes in the network in a random order and set it to $X$. (The algorithm does not depend on a particular node order, so the nodes are shuffled.)
4. For each $x\in X$ chosen in that specific order, let $C_x(t)= f(C_{x_{i1}}(t), \ldots ,C_{x_{im}}(t) ,C_{x_{i(m+1)}}(t-1), \ldots ,C_{x_{ik}}(t-1))$. $f$ here returns the label occurring with the highest frequency among neighbors and ties are broken uniformly randomly. (The most frequent label among the neighbors becomes the current node's label.)
5. If every node has a label that the maximum number of their neighbors have, then stop the algorithm. Else, set $t = t + 1$ and go to 3.
Because of the randomness involved, the result is not unique across runs; in that sense the algorithm is not stable.
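
The five steps can be sketched in plain Python as below. This is a minimal illustration, not the paper's reference code: the function names, the adjacency-dict input format, the `seed` parameter, and the `max_sweeps` safety cap are assumptions introduced here.

```python
import random
from collections import Counter

def is_stable(adj, labels):
    """Step 5 test: every node holds a label that the maximum number of its neighbors have."""
    for v, nbrs in adj.items():
        if not nbrs:
            continue
        counts = Counter(labels[u] for u in nbrs)
        if counts[labels[v]] < max(counts.values()):
            return False
    return True

def label_propagation_communities(adj, seed=None, max_sweeps=1000):
    """Asynchronous label propagation on an undirected graph given as {node: [neighbors]}."""
    rng = random.Random(seed)
    labels = {v: v for v in adj}          # step 1: C_x(0) = x
    nodes = list(adj)
    for _ in range(max_sweeps):           # steps 2 and 5: advance t until stable
        rng.shuffle(nodes)                # step 3: random node order
        for v in nodes:                   # step 4: labels updated earlier are reused
            if not adj[v]:
                continue
            counts = Counter(labels[u] for u in adj[v])
            best = max(counts.values())
            tied = [lab for lab, c in counts.items() if c == best]
            labels[v] = rng.choice(tied)  # ties are broken uniformly at random
        if is_stable(adj, labels):
            break
    return labels

# Two triangles joined by the single edge 2-3
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
labels = label_propagation_communities(adj, seed=0)
communities = {}
for node, lab in labels.items():
    communities.setdefault(lab, []).append(node)
```

Running it with different seeds may give different label values, which is the non-uniqueness mentioned above; on this graph each triangle always ends up sharing a single label.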

Implementation

Custom LabelPropagation

import torch
import torch.nn as nn
import torch.nn.functional as F
import dgl.function as fn


class LabelPropagation(nn.Module):
    # Initialization
    def __init__(self, num_layers, alpha):
        super(LabelPropagation, self).__init__()
        self.num_layers = num_layers
        self.alpha = alpha

    @torch.no_grad()
    def forward(self, g, labels, mask=None, post_step=lambda y: y.clamp_(0., 1.)):
        with g.local_scope():
            # Convert class ids to one-hot encoding
            if labels.dtype == torch.long:
                labels = F.one_hot(labels.view(-1)).to(torch.float32)

            # mask is train_idx: only these labels are kept,
            # the labels of nodes used for testing are all set to 0
            y = labels
            if mask is not None:
                # zeros_like builds an all-zero matrix y with the same shape as labels,
                # so no explicit shape argument is needed
                y = torch.zeros_like(labels)
                y[mask] = labels[mask]

            # (1 - alpha) * Y
            last = (1 - self.alpha) * y
            # degs holds the diagonal of the degree matrix D
            degs = g.in_degrees().float().clamp(min=1)
            # norm = D^{-0.5}
            norm = torch.pow(degs, -0.5).to(labels.device).unsqueeze(1)

            for _ in range(self.num_layers):
                # Assume the graphs to be undirected
                # D^{-0.5} Y
                g.ndata['h'] = y * norm
                # h = A D^{-0.5} Y
                g.update_all(fn.copy_u('h', 'm'), fn.sum('m', 'h'))
                # last = (1 - alpha) Y
                # g.ndata.pop('h') = A D^{-0.5} Y
                # g.ndata.pop('h') * norm = D^{-0.5} A D^{-0.5} Y
                y = last + self.alpha * g.ndata.pop('h') * norm
                # post_step = lambda y: y.clamp_(0., 1.)
                # truncates the values to the range [0, 1]
                y = post_step(y)
                last = (1 - self.alpha) * y

            return y
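
Read as a matrix update, each loop iteration above computes $Y \leftarrow \text{clamp}\left((1-\alpha)Y + \alpha D^{-0.5} A D^{-0.5} Y,\ 0,\ 1\right)$. The plain-Python sketch below spells this out on a dense adjacency matrix so the message-passing calls are easier to follow. It is an illustrative stand-in and not DGL code: the function name `dense_propagation`, the list-of-lists representation, and the toy path graph are assumptions.

```python
def dense_propagation(adj, y0, alpha, num_layers):
    """Dense version of the update: y <- clamp((1 - alpha) * y + alpha * D^-0.5 A D^-0.5 y)."""
    n, num_classes = len(adj), len(y0[0])
    # Node degrees, clamped to at least 1 like degs.clamp(min=1) in the module above
    deg = [max(sum(row), 1) for row in adj]
    norm = [d ** -0.5 for d in deg]
    y = [row[:] for row in y0]
    for _ in range(num_layers):
        new_y = []
        for i in range(n):
            row = []
            for c in range(num_classes):
                # Sum of normalized neighbor labels: (A D^-0.5 y)[i][c]
                agg = sum(adj[i][j] * norm[j] * y[j][c] for j in range(n))
                val = (1 - alpha) * y[i][c] + alpha * norm[i] * agg
                row.append(min(max(val, 0.0), 1.0))  # same role as post_step
            new_y.append(row)
        y = new_y
    return y

# Path graph 0 - 1 - 2 with self-loops; node 0 is class 0, node 2 is class 1,
# node 1 is unlabeled (all zeros, like the masked rows in the module above)
adj = [[1, 1, 0],
       [1, 1, 1],
       [0, 1, 1]]
y0 = [[1.0, 0.0], [0.0, 0.0], [0.0, 1.0]]
y = dense_propagation(adj, y0, alpha=0.5, num_layers=10)
```

The middle node sits symmetrically between the two labeled nodes, so its two class scores stay equal, while each end node keeps a higher score for its own class.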

Main function

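import argparse
import torch
import dgl
from dgl.data import CoraGraphDataset, CiteseerGraphDataset, PubmedGraphDataset
# from model import LabelPropagation  # the custom module defined above


def main(args):
    # check cuda
    device = f'cuda:{args.gpu}' if torch.cuda.is_available() and args.gpu >= 0 else 'cpu'
    print(device)

    # load data
    if args.dataset == 'Cora':  # default dataset
        dataset = CoraGraphDataset()
    elif args.dataset == 'Citeseer':
        dataset = CiteseerGraphDataset()
    elif args.dataset == 'Pubmed':
        dataset = PubmedGraphDataset()
    else:
        raise ValueError('Dataset {} is invalid.'.format(args.dataset))

    g = dataset[0]  # take the first (and only) graph of the dataset

    # adjacency matrix A + identity matrix I: add self-loops
    g = dgl.add_self_loop(g)

    labels = g.ndata.pop('label').to(device).long()  # node labels

    # load masks for train/test; the validation mask is not used
    train_mask = g.ndata.pop('train_mask')
    test_mask = g.ndata.pop('test_mask')

    # indices of the non-zero entries of each mask
    train_idx = torch.nonzero(train_mask, as_tuple=False).squeeze().to(device)
    test_idx = torch.nonzero(test_mask, as_tuple=False).squeeze().to(device)

    g = g.to(device)

    # define the label propagation model
    # num_layers is the number of propagation steps before stopping
    # (the original paper uses a different termination condition)
    # alpha weights the neighbors' information against the node's own labels
    lp = LabelPropagation(args.num_layers, args.alpha)

    # run label propagation; logits holds the propagated label scores
    logits = lp(g, labels, mask=train_idx)

    # compute test accuracy
    test_acc = torch.sum(logits[test_idx].argmax(dim=1) == labels[test_idx]).item() / len(test_idx)
    print("Test Acc {:.4f}".format(test_acc))


if __name__ == '__main__':
    # The argument parser is missing from the original notes. The flag names follow
    # the attributes used in main(); the default values for --gpu, --num-layers and
    # --alpha are assumptions chosen for illustration.
    parser = argparse.ArgumentParser(description='Label Propagation')
    parser.add_argument('--gpu', type=int, default=0)
    parser.add_argument('--dataset', type=str, default='Cora')
    parser.add_argument('--num-layers', type=int, default=10)
    parser.add_argument('--alpha', type=float, default=0.5)
    args = parser.parse_args()
    main(args)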