LINE: Large-scale Information Network Embedding

BUPT-WT

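Why the LINE algorithm matters:

1. It applies to any type of network: directed or undirected, weighted or unweighted.

2. It has a clearly defined objective function that preserves both first-order and second-order proximity.

3. It scales to networks with millions of vertices and billions of edges, training in a few hours.

4. LINE is the most-cited paper from WWW 2015.

5. Together with DeepWalk (2014) and node2vec (2016), it is one of the representative early works on network representation learning and a classic baseline.

6. It inspired a large body of later work on representation learning from network structure.

Graph learning (hand-crafted features, feature engineering) <-------------- DeepWalk, LINE, node2vec --------------> deep learning (features and classifier learned jointly, neural networks)

Paper download: http://de.arxiv.org/pdf/1503.03578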

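The paper is organized as follows:

I. Abstract

Introduces the background and proposes the LINE model, which preserves both local and global network structure.

1. LINE applies to any type of information network: directed or undirected, weighted or unweighted.

2. Its objective function preserves both the local and the global network structure.

3. The model is validated on networks from several domains, such as word networks, social networks, and citation networks.

4. The paper emphasizes scalability: a network with millions of vertices can be trained in a few hours on a single machine.

II. Introduction

Explains why graphs matter, compares LINE with earlier methods, and points out that DeepWalk lacks a clear objective function.

1. Publication order: DeepWalk (2014), LINE (2015), node2vec (2016).

2. With p = q = 1, node2vec's biased random walk reduces to the uniform random walk used by DeepWalk.

3. DeepWalk has few tunable hyperparameters. LINE lets you choose first-order proximity, second-order proximity, or a combination of the two.

4. DeepWalk and node2vec are random-walk-based algorithms; LINE is motivated directly by the network structure.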
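The statement about p = q = 1 can be checked on a toy graph. The sketch below computes node2vec's second-order transition probabilities as defined in the node2vec paper: unnormalized weight 1/p for stepping back to the previous node, 1 for a neighbor of the previous node, and 1/q otherwise. The graph and the helper name `node2vec_transition` are made up for illustration and are not part of the post's code.

```python
def node2vec_transition(adj, prev, cur, p=1.0, q=1.0):
    """Return {next_node: probability} for a walk that just moved prev -> cur."""
    weights = {}
    for nxt in adj[cur]:
        if nxt == prev:            # distance 0 from prev: step back
            bias = 1.0 / p
        elif nxt in adj[prev]:     # distance 1 from prev: shared neighbor
            bias = 1.0
        else:                      # distance 2 from prev: move outward
            bias = 1.0 / q
        weights[nxt] = bias
    total = sum(weights.values())
    return {n: w / total for n, w in weights.items()}


# Hypothetical unweighted, undirected toy graph stored as adjacency sets.
adj = {
    0: {1, 2},
    1: {0, 2, 3},
    2: {0, 1},
    3: {1},
}

uniform = node2vec_transition(adj, prev=0, cur=1, p=1.0, q=1.0)
biased = node2vec_transition(adj, prev=0, cur=1, p=4.0, q=0.5)
print(uniform)   # every neighbor of node 1 gets the same probability
print(biased)    # large p discourages returning, small q favors moving outward
```

With p = q = 1 all three biases equal 1, so the next step is drawn uniformly from the neighbors, which is the walk DeepWalk uses on an unweighted graph.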

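III. Related Work

Reviews traditional graph-based algorithms.

IV. Edge Sampling

Covers the alias sampling technique, the time-complexity analysis, and the handling of low-degree and new vertices.

Low-degree vertices: these are hard to learn because a low degree means few neighbors, which makes second-order proximity hard to estimate. The paper supplements them with higher-order information, specifically neighbors of neighbors.

New vertices: keep the embeddings of all existing vertices fixed and compute only the embedding of the new vertex.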
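The neighbors-of-neighbors idea for low-degree vertices can be sketched as follows. The paper weights a second-order neighbor j of vertex i as w_ij = sum over k in N(i) of w_ik * w_kj / d_k, where d_k is the weighted degree of the intermediate vertex k. The function name `expand_second_order`, the undirected treatment of edges, and the toy graph are assumptions made for this illustration.

```python
from collections import defaultdict


def expand_second_order(edges, node):
    """Weights from `node` to its neighbors-of-neighbors.

    edges: iterable of (u, v, w), treated here as undirected.
    Uses w_ij = sum_k w_ik * w_kj / d_k, with d_k the weighted degree of k.
    """
    adj = defaultdict(dict)
    for u, v, w in edges:
        adj[u][v] = adj[u].get(v, 0.0) + w
        adj[v][u] = adj[v].get(u, 0.0) + w
    degree = {k: sum(nbrs.values()) for k, nbrs in adj.items()}

    expanded = defaultdict(float)
    for k, w_ik in adj[node].items():
        for j, w_kj in adj[k].items():
            # Only add vertices that are not already direct neighbors
            if j != node and j not in adj[node]:
                expanded[j] += w_ik * w_kj / degree[k]
    return dict(expanded)


# Hypothetical toy graph: vertex 1 has a single neighbor (vertex 2).
toy_edges = [(1, 2, 1.0), (2, 3, 1.0), (2, 4, 2.0)]
print(expand_second_order(toy_edges, 1))   # {3: 0.25, 4: 0.5}
```

Vertex 2 has weighted degree 4, so its weight to vertex 1 is split among vertices 3 and 4 in proportion to their edge weights.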

V. Model Optimization

Covers model optimization, negative sampling, and stochastic gradient descent (SGD).
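For reference, the negative-sampling objective that the paper optimizes for each edge (i, j) can be written as below. Here u_i is the embedding of the source vertex, u'_j the context embedding of the target, sigma the sigmoid function, K the number of negative samples (`--negsamplesize` in the code later in this post), and P_n(v) the noise distribution over vertices with exponent 3/4 (`--negativepower`). The notation is transcribed from memory of the paper, so treat it as a sketch rather than a quotation.

```latex
% Per-edge objective with K negative samples
\log \sigma\!\left(\vec{u}_j^{\prime\top} \vec{u}_i\right)
  + \sum_{n=1}^{K} \mathbb{E}_{v_n \sim P_n(v)}
    \left[ \log \sigma\!\left(-\vec{u}_n^{\prime\top} \vec{u}_i\right) \right],
\qquad P_n(v) \propto d_v^{3/4}

% Gradient of the weighted objective w.r.t. the source embedding:
% the edge weight multiplies the gradient, which is why edges are
% sampled in proportion to w_{ij} and then treated as binary edges
\frac{\partial O_2}{\partial \vec{u}_i}
  = w_{ij} \cdot \frac{\partial \log p_2(v_j \mid v_i)}{\partial \vec{u}_i}
```

Because w_ij scales the gradient directly, edge weights with high variance make a single learning rate hard to choose; sampling edges with probability proportional to their weight removes the factor from the update.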

VI. The LINE Algorithm

Defines first-order and second-order proximity and how the two are combined.
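The two objectives can be summarized as follows, with u_i the vertex embedding, u'_i the context embedding, w_ij the edge weight, d_i the weighted out-degree of vertex i, and E the edge set. This is a compact restatement of the paper's definitions, written from memory as a study aid.

```latex
% First-order proximity (defined for undirected edges)
p_1(v_i, v_j) = \frac{1}{1 + \exp\!\left(-\vec{u}_i^{\top} \vec{u}_j\right)},
\qquad
\hat{p}_1(i, j) = \frac{w_{ij}}{W}, \quad W = \sum_{(i,j) \in E} w_{ij}

O_1 = -\sum_{(i,j) \in E} w_{ij} \log p_1(v_i, v_j)

% Second-order proximity (vertices with similar neighborhoods)
p_2(v_j \mid v_i) =
  \frac{\exp\!\left(\vec{u}_j^{\prime\top} \vec{u}_i\right)}
       {\sum_{k=1}^{|V|} \exp\!\left(\vec{u}_k^{\prime\top} \vec{u}_i\right)},
\qquad
\hat{p}_2(v_j \mid v_i) = \frac{w_{ij}}{d_i}

O_2 = -\sum_{(i,j) \in E} w_{ij} \log p_2(v_j \mid v_i)
```

Both objectives come from minimizing the KL divergence between the empirical distribution and the model distribution and dropping the terms that do not depend on the embeddings.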

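VII. Effectiveness

Experiments on the model's effectiveness: baseline selection, parameter settings, datasets, and multiple tasks.

VIII. Experiment

Visualization, performance, network sparsity, parameter sensitivity, and scalability.

IX. Discussion

Summarizes the contribution: a network representation learning method whose objective function is derived from the network structure.

Key points: understanding first-order and second-order proximity in a graph, turning them into objective functions, the derivation of the formulas, the time-complexity analysis, and negative sampling and alias sampling on the graph.

Novelty: modeling directly from the graph's information, application to large-scale datasets, and extensive experiments supporting the results.

Takeaways: how understanding the graph informs network representation learning, and an algorithm design that uses KL divergence to compare the model's predicted distribution with the empirical one.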
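The link between the KL divergence and the weighted log-likelihood objective can be checked numerically. The sketch below uses made-up edge weights for a single vertex and two made-up candidate model distributions; it shows that d_i times the KL divergence and the weighted negative log-likelihood differ only by a constant that does not depend on the model.

```python
import math


def kl_divergence(p_hat, p_model):
    """KL(p_hat || p_model), skipping outcomes with zero empirical mass."""
    return sum(p * math.log(p / q) for p, q in zip(p_hat, p_model) if p > 0)


def weighted_nll(weights, p_model):
    """-sum_j w_ij * log p(v_j | v_i), the per-vertex form of the objective."""
    return -sum(w * math.log(q) for w, q in zip(weights, p_model))


# Hypothetical out-edges of one vertex: observed weights and two candidate models.
weights = [3.0, 1.0, 0.0]
degree = sum(weights)
p_hat = [w / degree for w in weights]   # empirical distribution w_ij / d_i

model_a = [0.7, 0.2, 0.1]
model_b = [0.4, 0.3, 0.3]

for name, model in [("a", model_a), ("b", model_b)]:
    print(name, kl_divergence(p_hat, model), weighted_nll(weights, model))

# The gap between the two quantities is the same for both models,
# so they rank candidate models identically.
const_a = weighted_nll(weights, model_a) - degree * kl_divergence(p_hat, model_a)
const_b = weighted_nll(weights, model_b) - degree * kl_divergence(p_hat, model_b)
print(const_a, const_b)
```

The constant is d_i times the entropy of the empirical distribution, which is fixed by the data; this is why the objective can drop it.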

X. Code

First, simulate a small dataset. Each line has the format (node1, node2, weight), and the data is stored in a file named weighted.karate.edgelist.

It looks like this:

1 32 1
1 22 1
1 20 1
1 18 1
1 14 1
1 13 1
1 12 1
1 11 1
1 9 1
1 8 1
1 7 1
1 6 1
1 5 1
1 4 1
1 3 1
1 2 1
2 31 1
2 22 1
2 20 1
2 18 1
2 14 1
2 8 1
2 4 1
2 3 1
3 14 1
3 9 1
3 10 1
3 33 1
3 29 1
3 28 1
3 8 1
3 4 1
4 14 1
4 13 1
4 8 1
5 11 1
5 7 1
6 17 1
6 11 1
6 7 1
7 17 1
9 34 1
9 33 1
9 33 1
10 34 1
14 34 1
15 34 1
15 33 1
16 34 1
16 33 1
19 34 1
19 33 1
20 34 1
21 34 1
21 33 1
23 34 1
23 33 1
24 30 1
24 34 1
24 33 1
24 28 1
24 26 1
25 32 1
25 28 1
25 26 1
26 32 1
27 34 1
27 30 1
28 34 1
29 34 1
29 32 1
30 34 1
30 33 1
31 34 1
31 33 1
32 34 1
32 33 1
33 34 1
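A minimal sketch of parsing this format, independent of the post's utils.py. The helper names `read_edgelist` and `weighted_degrees` are hypothetical, and the sample is inlined so the snippet runs without the data file. Counting each edge at both endpoints is an assumption that suits an undirected graph; `makeDist` in utils.py accumulates weight only for the first column, which treats the file as directed.

```python
import io
from collections import defaultdict


def read_edgelist(handle):
    """Parse 'node1 node2 weight' lines into (int, int, float) tuples."""
    edges = []
    for raw in handle:
        parts = raw.split()
        if len(parts) != 3:        # skip blank or malformed lines
            continue
        edges.append((int(parts[0]), int(parts[1]), float(parts[2])))
    return edges


def weighted_degrees(edges):
    """Weighted degree per node, counting each edge at both endpoints."""
    degree = defaultdict(float)
    for u, v, w in edges:
        degree[u] += w
        degree[v] += w
    return dict(degree)


# First three lines of the listing, inlined for a self-contained run.
sample = io.StringIO("1 32 1\n1 22 1\n1 20 1\n")
edges = read_edgelist(sample)
print(edges)
print(weighted_degrees(edges))
```

To read the real file, pass an open file handle instead of the `io.StringIO` object.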

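/*** Code ***/


"""
1. Set the model parameters; read the graph, store the nodes and edges, and normalize

1) Model parameters
Set the hyperparameters: 1st order or 2nd order, number of negative samples (K), embedding dimension, batch size, epochs, learning rate, etc.

2) Input and output

Input file './data/weighted.karate.edgelist'

Output file './model.pt'

"""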

/*** utils.py ***/

import random
from decimal import Decimal
import numpy as np
import collections
from tqdm import tqdm


class VoseAlias(object):
    """
    Adding a few modifs to https://github.com/asmith26/Vose-Alias-Method
    """

    def __init__(self, dist):
        """
        (VoseAlias, dict) -> NoneType
        """
        self.dist = dist
        self.alias_initialisation()

    def alias_initialisation(self):
        """
        Construct probability and alias tables for the distribution.
        """
        # Initialise variables
        n = len(self.dist)
        self.table_prob = {}   # probability table
        self.table_alias = {}  # alias table
        scaled_prob = {}       # scaled probabilities
        small = []             # stack for probabilities smaller than 1
        large = []             # stack for probabilities greater than or equal to 1

        # Construct and sort the scaled probabilities into their appropriate stacks
        print("1/2. Building and sorting scaled probabilities for alias table...")
        for o, p in tqdm(self.dist.items()):
            scaled_prob[o] = Decimal(p) * n

            if scaled_prob[o] < 1:
                small.append(o)
            else:
                large.append(o)

        print("2/2. Building alias table...")
        # Construct the probability and alias tables
        while small and large:
            s = small.pop()
            l = large.pop()

            self.table_prob[s] = scaled_prob[s]
            self.table_alias[s] = l

            scaled_prob[l] = (scaled_prob[l] + scaled_prob[s]) - Decimal(1)

            if scaled_prob[l] < 1:
                small.append(l)
            else:
                large.append(l)

        # The remaining outcomes (of one stack) must have probability 1
        while large:
            self.table_prob[large.pop()] = Decimal(1)

        while small:
            self.table_prob[small.pop()] = Decimal(1)
        self.listprobs = list(self.table_prob)

    def alias_generation(self):
        """
        Yields a random outcome from the distribution.
        """
        # Determine which column of table_prob to inspect
        col = random.choice(self.listprobs)
        # Determine which outcome to pick in that column
        if self.table_prob[col] >= random.uniform(0, 1):
            return col
        else:
            return self.table_alias[col]

    def sample_n(self, size):
        """
        Yields `size` random outcomes from the distribution, one at a time.
        """
        for i in range(size):
            yield self.alias_generation()


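def makeDist(graphpath, power=0.75):
    """
    Read a 'node1 node2 weight' edge list and build the sampling distributions.

    Returns the edge distribution (weight / total weight), the node distribution
    for negative sampling (out-degree ** power, normalized), the raw edge
    weights, the weighted out-degrees, and the largest node index.
    """
    edgedistdict = collections.defaultdict(int)
    nodedistdict = collections.defaultdict(int)

    weightsdict = collections.defaultdict(int)
    nodedegrees = collections.defaultdict(int)

    nlines = 0

    # First pass: count lines so tqdm can show progress
    with open(graphpath, "r") as graphfile:
        for l in graphfile:
            nlines += 1

    print("Reading edgelist file...")
    maxindex = 0
    with open(graphpath, "r") as graphfile:
        for l in tqdm(graphfile, total=nlines):

            line = [int(i) for i in l.split()]
            if not line:  # skip blank lines
                continue
            node1, node2, weight = line[0], line[1], line[2]

            edgedistdict[tuple([node1, node2])] = weight
            nodedistdict[node1] += weight

            weightsdict[tuple([node1, node2])] = weight
            nodedegrees[node1] += weight

            # Track the largest index over both endpoints
            maxindex = max(maxindex, node1, node2)

    # Normalize over what is actually stored so each distribution sums to 1:
    # a repeated edge overwrites its dict entry, and the negative-sampling
    # normalizer is the sum of out-degree ** power over nodes
    weightsum = sum(edgedistdict.values())
    negprobsum = sum(np.power(d, power) for d in nodedistdict.values())

    for node, outdegree in nodedistdict.items():
        nodedistdict[node] = np.power(outdegree, power) / negprobsum

    for edge, weight in edgedistdict.items():
        edgedistdict[edge] = weight / weightsum

    return edgedistdict, nodedistdict, weightsdict, nodedegrees, maxindex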


def negSampleBatch(sourcenode, targetnode, negsamplesize, weights,
                   nodedegrees, nodesaliassampler, t=10e-3):
    """
    Yield `negsamplesize` negative nodes, skipping the source and target nodes.
    """
    negsamples = 0
    while negsamples < negsamplesize:
        # Draw a single node id; sample_n() would return a generator,
        # which never compares equal to a node id
        samplednode = nodesaliassampler.alias_generation()
        if (samplednode == sourcenode) or (samplednode == targetnode):
            continue
        else:
            negsamples += 1
            yield samplednode


def makeData(samplededges, negsamplesize, weights, nodedegrees, nodesaliassampler):
    """
    For each sampled edge, yield [source, target, neg_1, ..., neg_K].
    """
    for e in samplededges:
        sourcenode, targetnode = e[0], e[1]
        negnodes = list(negSampleBatch(sourcenode, targetnode, negsamplesize,
                                       weights, nodedegrees, nodesaliassampler))
        yield [e[0], e[1]] + negnodes


/*** line.py ***/

import torch
import torch.nn as nn
import torch.nn.functional as F


class Line(nn.Module):
    def __init__(self, size, embed_dim=128, order=1):
        super(Line, self).__init__()

        assert order in [1, 2], "Order should either be int(1) or int(2)"

        self.embed_dim = embed_dim
        self.order = order
        self.nodes_embeddings = nn.Embedding(size, embed_dim)

        if order == 2:
            self.contextnodes_embeddings = nn.Embedding(size, embed_dim)
            # Initialization
            self.contextnodes_embeddings.weight.data = self.contextnodes_embeddings.weight.data.uniform_(
                -.5, .5) / embed_dim

        # Initialization
        self.nodes_embeddings.weight.data = self.nodes_embeddings.weight.data.uniform_(
            -.5, .5) / embed_dim

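    def forward(self, v_i, v_j, negsamples, device):
        # v_i, v_j: (batch,) node indices; negsamples: (batch, K) node indices

        v_i = self.nodes_embeddings(v_i).to(device)

        if self.order == 2:
            # Second order: targets and negatives use the context embeddings
            v_j = self.contextnodes_embeddings(v_j).to(device)
            negativenodes = -self.contextnodes_embeddings(negsamples).to(device)

        else:
            # First order: both endpoints share the same embedding table
            v_j = self.nodes_embeddings(v_j).to(device)
            negativenodes = -self.nodes_embeddings(negsamples).to(device)

        # log sigma(u_j . u_i) for each positive edge
        mulpositivebatch = torch.mul(v_i, v_j)
        positivebatch = F.logsigmoid(torch.sum(mulpositivebatch, dim=1))

        # Sum over the K negatives of log sigma(-u_n . u_i); the sign was flipped above
        mulnegativebatch = torch.mul(v_i.view(len(v_i), 1, self.embed_dim), negativenodes)
        negativebatch = torch.sum(
            F.logsigmoid(
                torch.sum(mulnegativebatch, dim=2)
            ),
            dim=1)
        loss = positivebatch + negativebatch
        # Negate so that minimizing the loss maximizes the objective
        return -torch.mean(loss)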



import argparse
from utils.utils import *
from utils.line import Line
from tqdm import trange
import torch
import torch.optim as optim
import sys
import pickle


# Parse command-line arguments
parser = argparse.ArgumentParser()
# Input file
parser.add_argument("-g", "--graph_path", type=str, default='./data/weighted.karate.edgelist')
# Output file for the trained model
parser.add_argument("-save", "--save_path", type=str, default='./model.pt')
# Output file for the loss values
parser.add_argument("-lossdata", "--lossdata_path", type=str, default='./loss.pkl')

# Hyperparameters
# 1st order or 2nd order, as in the paper
parser.add_argument("-order", "--order", type=int, default=2)
# Number of negative samples
parser.add_argument("-neg", "--negsamplesize", type=int, default=5)
# Embedding dimension
parser.add_argument("-dim", "--dimension", type=int, default=128)
# Batch size
parser.add_argument("-batchsize", "--batchsize", type=int, default=5)
# Number of epochs
parser.add_argument("-epochs", "--epochs", type=int, default=1)
# Learning rate
parser.add_argument("-lr", "--learning_rate", type=float,
                    default=0.025)  # As starting value in paper
# Exponent applied to the degree in negative sampling
parser.add_argument("-negpow", "--negativepower", type=float, default=0.75)
args = parser.parse_args()


### 2. Read the graph, store the nodes and edges, and normalize

edgedistdict, nodedistdict, weights, nodedegrees, maxindex = makeDist(
    args.graph_path, args.negativepower)


### 3. Build the alias tables for nodes and edges

# Build the alias tables so that each draw takes O(1) time
edgesaliassampler = VoseAlias(edgedistdict)
nodesaliassampler = VoseAlias(nodedistdict)


### 4. LINE model

# Number of batches per epoch
batchrange = int(len(edgedistdict) / args.batchsize)
print(maxindex)
# The nn.Module class from line.py
line = Line(maxindex + 1, embed_dim=args.dimension, order=args.order)
# Optimize the model with SGD
opt = optim.SGD(line.parameters(), lr=args.learning_rate,
                momentum=0.9, nesterov=True)


### 5. Train edge by edge with negative sampling

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

lossdata = {"it": [], "loss": []}
it = 0
helper = 0

print("\nTraining on {}...\n".format(device))
# Train for the requested number of epochs
for epoch in range(args.epochs):
    print("Epoch {}".format(epoch))
    # Number of batches per epoch: batchrange
    for b in trange(batchrange):
        # edgesaliassampler is the VoseAlias instance built over the edges;
        # draw batchsize edges from it
        samplededges = edgesaliassampler.sample_n(args.batchsize)
        # makeData (utils.py) draws K negative nodes for each sampled edge
        # Each row has the form (node i, node j, negative nodes...)
        batch = list(makeData(samplededges, args.negsamplesize, weights, nodedegrees,
                              nodesaliassampler))
        # Convert to a tensor
        batch = torch.LongTensor(batch)
        # Print the first batch once as a sanity check
        if helper == 0:
            print(batch)
            helper = 1
        # Column 0: source nodes
        v_i = batch[:, 0]
        # Column 1: target nodes
        v_j = batch[:, 1]
        # Columns 2 to the end: negative samples
        negsamples = batch[:, 2:]
        # Zero the gradients before backpropagation because they accumulate
        line.zero_grad()
        # Forward pass through the LINE model
        loss = line(v_i, v_j, negsamples, device)
        # Compute the gradients
        loss.backward()
        # Update the parameters from the gradients
        opt.step()

        lossdata["loss"].append(loss.item())
        lossdata["it"].append(it)
        it += 1

print("\nDone training, saving model to {}".format(args.save_path))
torch.save(line, "{}".format(args.save_path))

print("Saving loss data at {}".format(args.lossdata_path))
with open(args.lossdata_path, "wb") as ldata:
    pickle.dump(lossdata, ldata)


### 6. Results and visualization

from sklearn import cluster
import numpy as np

# Collect the learned embeddings of nodes 1..34 (index 0 is unused)
with torch.no_grad():
    node_ids = torch.arange(1, 35)
    embedding_node = line.nodes_embeddings(node_ids).cpu().numpy()
# Cluster the 34 node embeddings into 3 groups with k-means
y_pred = cluster.KMeans(n_clusters=3, random_state=9).fit_predict(embedding_node)
print(y_pred)
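The clustering step uses the embeddings of a single trained model. The outline also mentions combining the first-order and second-order models; the paper does this by concatenating the two vectors for each vertex. The sketch below shows one way to do that with NumPy. The row-wise L2 normalization before concatenation and the helper name `combine_embeddings` are assumptions for illustration, and random matrices stand in for two trained embedding tables.

```python
import numpy as np


def combine_embeddings(first_order, second_order, eps=1e-12):
    """L2-normalize each row of both matrices, then concatenate column-wise."""
    def l2_normalize(mat):
        norms = np.linalg.norm(mat, axis=1, keepdims=True)
        return mat / np.maximum(norms, eps)   # eps guards against zero rows
    return np.hstack([l2_normalize(first_order), l2_normalize(second_order)])


# Random stand-ins for two trained (num_nodes x dim) embedding matrices.
rng = np.random.default_rng(0)
emb_first = rng.normal(size=(34, 8))
emb_second = rng.normal(size=(34, 8))

combined = combine_embeddings(emb_first, emb_second)
print(combined.shape)   # (34, 16)
```

In practice the two inputs would come from one run with `--order 1` and one with `--order 2`, and `combined` could be passed to `KMeans` in place of the single-order embeddings.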
