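Significance of the LINE algorithm:
1. Applicable to any type of network: directed or undirected, weighted or unweighted
2. A clearly defined objective function that preserves both first-order and second-order proximity
3. Networks with millions of vertices and billions of edges can be trained in a few hours
4. LINE is the most cited paper of WWW 2015
5. Together with DeepWalk (2014) and node2vec (2016), it is a representative early work on network representation learning and a classic baseline
6. It inspired a large body of work on representation learning based on network structure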
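Graph learning (hand-crafted features, based on feature engineering) <-------------- DeepWalk, LINE, node2vec --------------> Deep learning (features and classification learned together, based on neural networks)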
Paper download: http://de.arxiv.org/pdf/1503.03578
The main structure of the paper is as follows:
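I. Abstract
Introduces the background and proposes the LINE model, which preserves both local and global network structure
1. LINE is applicable to arbitrary types of information networks: directed/undirected, weighted/unweighted
2. The objective function of LINE preserves both the local and the global network structure
3. The model's effectiveness is validated on networks from several domains: word networks, social networks, and citation networks
4. Emphasizes scalability: a network with millions of vertices can be trained on a single machine within a few hours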
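II. Introduction
Explains why graphs matter, compares LINE with earlier methods, and points out that DeepWalk lacks a clear objective function
1. Order of publication: DeepWalk (2014), LINE (2015), node2vec (2016)
2. node2vec with p=q=1 is equivalent to DeepWalk
3. DeepWalk has few tunable hyperparameters; LINE can use first-order proximity, second-order proximity, or a combination of the two
4. DeepWalk and node2vec are based on random walks, whereas LINE is designed directly from the network structure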
III. Related Work
Reviews traditional graph-based algorithms
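IV. Edge Sampling
Covers the alias sampling technique, the time complexity analysis, and the handling of low-degree vertices and new vertices
Low-degree vertices: hard to learn because a low degree means few neighbors, so second-order proximity is hard to estimate; higher-order information is added to compensate, here the neighbors of neighbors
New vertices: keep the embeddings of all existing vertices fixed and compute only the embedding of the new vertex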
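The low-degree case can be sketched as follows. This is a minimal sketch, assuming the expansion weight from the LINE paper, w_ij = sum over k in N(i) of w_ik * w_kj / d_k; the function name `expand_neighbors`, the adjacency-dict layout, and the toy graph are illustrative assumptions and are not part of the code in this post.

```python
from collections import defaultdict

def expand_neighbors(adj, i):
    """Return weights from vertex i to its neighbors' neighbors.

    adj maps vertex -> {neighbor: weight} (hypothetical layout).
    """
    expanded = defaultdict(float)
    for k, w_ik in adj[i].items():
        d_k = sum(adj[k].values())  # degree of the intermediate vertex k
        for j, w_kj in adj[k].items():
            if j == i or j in adj[i]:
                continue  # keep only vertices not already adjacent to i
            expanded[j] += w_ik * w_kj / d_k
    return dict(expanded)

# Toy undirected graph: vertex 1 has a single neighbor (vertex 2)
adj = {
    1: {2: 1.0},
    2: {1: 1.0, 3: 1.0, 4: 2.0},
    3: {2: 1.0},
    4: {2: 2.0},
}
second_order = expand_neighbors(adj, 1)
print(second_order)  # {3: 0.25, 4: 0.5}
```

Vertex 1 gains two weighted second-order neighbors, which gives the second-order objective more context to learn from.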
V. Model Optimization
Covers model optimization, the negative sampling technique, and the SGD algorithm
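For reference, the negative-sampling objective that replaces the full softmax for each sampled edge (i, j) can be written as below. The notation follows the LINE paper and is not defined elsewhere in these notes: \sigma is the sigmoid function, K the number of negative samples, \vec{u}_i the vertex embedding, \vec{u}'_j the context embedding, and d_v the out-degree of vertex v.

```latex
\log \sigma\!\left(\vec{u}_j^{\prime\top} \vec{u}_i\right)
+ \sum_{n=1}^{K} \mathbb{E}_{v_n \sim P_n(v)}
  \left[\log \sigma\!\left(-\vec{u}_n^{\prime\top} \vec{u}_i\right)\right],
\qquad P_n(v) \propto d_v^{3/4}
```

The exponent 3/4 corresponds to the `negativepower` default of 0.75 in the training script.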
VI. The LINE Algorithm
Introduces first-order proximity, second-order proximity, and the combination of the two
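As a minimal sketch of the two proximities, assuming the paper's definitions (first-order: p1(v_i, v_j) = sigmoid(u_i . u_j); second-order: p2(v_j | v_i) is a softmax of u_i against the context embeddings), the snippet below evaluates both on toy vectors. The embeddings and function names are made up for illustration.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def first_order_prob(u_i, u_j):
    """p1(v_i, v_j) = sigmoid(u_i . u_j): joint probability of an edge."""
    return 1.0 / (1.0 + math.exp(-dot(u_i, u_j)))

def second_order_probs(u_i, contexts):
    """p2(. | v_i): softmax of u_i against every context embedding."""
    scores = [math.exp(dot(c, u_i)) for c in contexts]
    total = sum(scores)
    return [s / total for s in scores]

# Hypothetical 2-dimensional embeddings for three vertices
vertex = [[1.0, 0.0], [0.5, 0.5], [-1.0, 0.0]]
context = [[0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]

p1 = first_order_prob(vertex[0], vertex[1])
p2 = second_order_probs(vertex[0], context)
print(p1, p2)
```

First-order proximity uses a single embedding per vertex, while second-order proximity needs a separate context embedding, which is why the `Line` module only creates `contextnodes_embeddings` when `order == 2`.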
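VII. Effectiveness
The experiments examine the model's effectiveness and describe the choice of baselines, the parameter settings, and the datasets and tasks
VIII. Experiments
Visualization, performance, network sparsity, parameter sensitivity, and scalability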
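IX. Discussion
Summary: the paper proposes a network representation learning method whose objective function is derived from the network structure
Key points: understanding first-order and second-order proximity in a graph, turning these proximities into objective functions, the derivation of the formulas, the time complexity analysis, and negative sampling on graphs with alias sampling
Innovations: modeling directly from the information in the graph, application to large-scale datasets, and extensive experiments that support the results
Takeaways: how understanding the graph helps network representation learning; the algorithm uses KL divergence to compare the predicted distribution on the graph with the true one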
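The KL-divergence idea can be illustrated with a small sketch: the empirical neighbor distribution of a vertex, w_ij / d_i, is compared with a model distribution, and a better model gives a smaller divergence. The weights, the two candidate predictions, and the helper name `kl_divergence` are made up for illustration.

```python
import math

def kl_divergence(p_true, p_model):
    """KL(p_true || p_model) for two discrete distributions given as lists."""
    return sum(p * math.log(p / q) for p, q in zip(p_true, p_model) if p > 0)

# Empirical distribution of vertex i over three neighbors: w_ij / d_i
weights = [2.0, 1.0, 1.0]
d_i = sum(weights)
p_hat = [w / d_i for w in weights]

# Two hypothetical model predictions p(. | v_i)
good_model = [0.5, 0.25, 0.25]
poor_model = [0.1, 0.1, 0.8]

kl_good = kl_divergence(p_hat, good_model)
kl_poor = kl_divergence(p_hat, poor_model)
print(kl_good, kl_poor)
```

The first prediction matches the empirical distribution exactly, so its divergence is zero; the second puts most of its mass on the wrong neighbor and is penalized.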
X. Code
First prepare a small sample dataset. Each line has the format (node1, node2, weight),
and the data is stored in a file named weighted.karate.edgelist,
as shown below:
1 32 1
1 22 1
1 20 1
1 18 1
1 14 1
1 13 1
1 12 1
1 11 1
1 9 1
1 8 1
1 7 1
1 6 1
1 5 1
1 4 1
1 3 1
1 2 1
2 31 1
2 22 1
2 20 1
2 18 1
2 14 1
2 8 1
2 4 1
2 3 1
3 14 1
3 9 1
3 10 1
3 33 1
3 29 1
3 28 1
3 8 1
3 4 1
4 14 1
4 13 1
4 8 1
5 11 1
5 7 1
6 17 1
6 11 1
6 7 1
7 17 1
9 34 1
9 33 1
9 33 1
10 34 1
14 34 1
15 34 1
15 33 1
16 34 1
16 33 1
19 34 1
19 33 1
20 34 1
21 34 1
21 33 1
23 34 1
23 33 1
24 30 1
24 34 1
24 33 1
24 28 1
24 26 1
25 32 1
25 28 1
25 26 1
26 32 1
27 34 1
27 30 1
28 34 1
29 34 1
29 32 1
30 34 1
30 33 1
31 34 1
31 33 1
32 34 1
32 33 1
33 34 1
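Each line of the file can be parsed into an integer triple. The sketch below uses an in-memory sample of three lines instead of the real file, and `parse_edgelist` is a hypothetical helper name, not a function from the code in this post.

```python
import io

def parse_edgelist(handle):
    """Parse whitespace-separated 'node1 node2 weight' lines into tuples."""
    edges = []
    for line in handle:
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        node1, node2, weight = int(parts[0]), int(parts[1]), int(parts[2])
        edges.append((node1, node2, weight))
    return edges

sample = io.StringIO("1 32 1\n1 22 1\n2 31 1\n")
edges = parse_edgelist(sample)
# The largest node id determines how many embedding rows are needed
max_index = max(max(n1, n2) for n1, n2, _ in edges)
print(edges, max_index)
```

Node ids start at 1 in this file, so an embedding table indexed by node id needs `max_index + 1` rows.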
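/*** Code ***/
"""
1. Set the model parameters; read the graph, store vertices and edges, and normalize
1) Set the model parameters
Set the hyperparameters, e.g. 1st order or 2nd order, number of negative samples (K), embedding dimension, batch size, epochs, learning rate
2) Input and output
Input file './data/weighted.karate.edgelist'
Output file './model.pt'
"""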
/*** utils.py ***/
import random
from decimal import Decimal
import numpy as np
import collections
from tqdm import tqdm


class VoseAlias(object):
    """
    Adding a few modifs to https://github.com/asmith26/Vose-Alias-Method
    """

    def __init__(self, dist):
        """
        (VoseAlias, dict) -> NoneType
        """
        self.dist = dist
        self.alias_initialisation()

    def alias_initialisation(self):
        """
        Construct probability and alias tables for the distribution.
        """
        # Initialise variables
        n = len(self.dist)
        self.table_prob = {}   # probability table
        self.table_alias = {}  # alias table
        scaled_prob = {}       # scaled probabilities
        small = []             # stack for probabilities smaller than 1
        large = []             # stack for probabilities greater than or equal to 1

        # Construct and sort the scaled probabilities into their appropriate stacks
        print("1/2. Building and sorting scaled probabilities for alias table...")
        for o, p in tqdm(self.dist.items()):
            scaled_prob[o] = Decimal(p) * n
            if scaled_prob[o] < 1:
                small.append(o)
            else:
                large.append(o)

        print("2/2. Building alias table...")
        # Construct the probability and alias tables
        while small and large:
            s = small.pop()
            l = large.pop()
            self.table_prob[s] = scaled_prob[s]
            self.table_alias[s] = l
            scaled_prob[l] = (scaled_prob[l] + scaled_prob[s]) - Decimal(1)
            if scaled_prob[l] < 1:
                small.append(l)
            else:
                large.append(l)

        # The remaining outcomes (of one stack) must have probability 1
        while large:
            self.table_prob[large.pop()] = Decimal(1)
        while small:
            self.table_prob[small.pop()] = Decimal(1)
        self.listprobs = list(self.table_prob)

    def alias_generation(self):
        """
        Returns a random outcome from the distribution.
        """
        # Determine which column of table_prob to inspect
        col = random.choice(self.listprobs)
        # Determine which outcome to pick in that column
        if self.table_prob[col] >= random.uniform(0, 1):
            return col
        else:
            return self.table_alias[col]

    def sample_n(self, size):
        """
        Yields `size` random outcomes from the distribution, one at a time.
        """
        for i in range(size):
            yield self.alias_generation()


def makeDist(graphpath, power=0.75):
    edgedistdict = collections.defaultdict(int)
    nodedistdict = collections.defaultdict(int)
    weightsdict = collections.defaultdict(int)
    nodedegrees = collections.defaultdict(int)
    weightsum = 0
    negprobsum = 0
    nlines = 0
    # First pass: count the lines so that tqdm can show progress
    with open(graphpath, "r") as graphfile:
        for l in graphfile:
            nlines += 1
    print("Reading edgelist file...")
    maxindex = 0
    with open(graphpath, "r") as graphfile:
        for l in tqdm(graphfile, total=nlines):
            line = [int(i) for i in l.replace("\n", "").split(" ")]
            node1, node2, weight = line[0], line[1], line[2]
            edgedistdict[tuple([node1, node2])] = weight
            nodedistdict[node1] += weight
            weightsdict[tuple([node1, node2])] = weight
            nodedegrees[node1] += weight
            weightsum += weight
            # Track the largest node id over both endpoints
            # (the original `elif` skipped node2 whenever node1 raised the maximum)
            maxindex = max(maxindex, node1, node2)
    # Noise distribution for negative sampling: P(v) proportional to outdegree^power.
    # Normalize by the sum of outdegree^power so the probabilities sum to 1
    # (the original summed weight^power per edge, which is a different quantity).
    for outdegree in nodedistdict.values():
        negprobsum += np.power(outdegree, power)
    for node, outdegree in nodedistdict.items():
        nodedistdict[node] = np.power(outdegree, power) / negprobsum
    # Edge distribution: P(e) proportional to the edge weight
    for edge, weight in edgedistdict.items():
        edgedistdict[edge] = weight / weightsum
    return edgedistdict, nodedistdict, weightsdict, nodedegrees, maxindex


def negSampleBatch(sourcenode, targetnode, negsamplesize, weights,
                   nodedegrees, nodesaliassampler, t=10e-3):
    """
    For generating negative samples for one edge, skipping its two endpoints.
    """
    negsamples = 0
    while negsamples < negsamplesize:
        # Draw a single node id. The original called sample_n(1), which returns
        # a generator, so the comparison below could never be true.
        samplednode = nodesaliassampler.alias_generation()
        if (samplednode == sourcenode) or (samplednode == targetnode):
            continue
        else:
            negsamples += 1
            yield samplednode


def makeData(samplededges, negsamplesize, weights, nodedegrees, nodesaliassampler):
    # Each yielded row is [source node, target node, negative node 1, ..., negative node K]
    for e in samplededges:
        sourcenode, targetnode = e[0], e[1]
        negnodes = []
        for negsample in negSampleBatch(sourcenode, targetnode, negsamplesize,
                                        weights, nodedegrees, nodesaliassampler):
            negnodes.append(negsample)
        yield [e[0], e[1]] + negnodes
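To see why an alias table reproduces the input distribution, the sketch below builds a list-based table with exact fractions and recomputes each outcome's probability from the table. It is an independent, simplified version written for illustration (the names `build_alias` and `recovered_probs` are hypothetical), not the `VoseAlias` class from utils.py.

```python
from fractions import Fraction

def build_alias(probs):
    """Build probability and alias tables (Vose's method) for a list of probabilities."""
    n = len(probs)
    scaled = [Fraction(p) * n for p in probs]
    prob = [Fraction(1)] * n
    alias = list(range(n))
    small = [i for i, s in enumerate(scaled) if s < 1]
    large = [i for i, s in enumerate(scaled) if s >= 1]
    while small and large:
        s, l = small.pop(), large.pop()
        prob[s], alias[s] = scaled[s], l
        scaled[l] = scaled[l] + scaled[s] - 1  # l donates mass to fill column s
        (small if scaled[l] < 1 else large).append(l)
    return prob, alias

def recovered_probs(prob, alias):
    """Probability of each outcome when a column is picked uniformly at random."""
    n = len(prob)
    out = [Fraction(0)] * n
    for col in range(n):
        out[col] += prob[col] / n               # keep the column's own outcome
        out[alias[col]] += (1 - prob[col]) / n  # otherwise fall through to its alias
    return out

dist = [Fraction(1, 2), Fraction(1, 3), Fraction(1, 6)]
prob, alias = build_alias(dist)
print(recovered_probs(prob, alias) == dist)  # True
```

Sampling then costs one uniform column choice plus one comparison, independent of the number of outcomes, which is what makes edge sampling and negative sampling O(1) per draw.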
/*** line.py ***/
import torch
import torch.nn as nn
import torch.nn.functional as F


class Line(nn.Module):
    def __init__(self, size, embed_dim=128, order=1):
        super(Line, self).__init__()
        assert order in [1, 2], "Order should either be int(1) or int(2)"
        self.embed_dim = embed_dim
        self.order = order
        self.nodes_embeddings = nn.Embedding(size, embed_dim)
        if order == 2:
            # Second-order proximity needs a separate context embedding per node
            self.contextnodes_embeddings = nn.Embedding(size, embed_dim)
            # Initialization
            self.contextnodes_embeddings.weight.data = self.contextnodes_embeddings.weight.data.uniform_(
                -.5, .5) / embed_dim
        # Initialization
        self.nodes_embeddings.weight.data = self.nodes_embeddings.weight.data.uniform_(
            -.5, .5) / embed_dim

    def forward(self, v_i, v_j, negsamples, device):
        v_i = self.nodes_embeddings(v_i).to(device)
        if self.order == 2:
            v_j = self.contextnodes_embeddings(v_j).to(device)
            negativenodes = -self.contextnodes_embeddings(negsamples).to(device)
        else:
            v_j = self.nodes_embeddings(v_j).to(device)
            negativenodes = -self.nodes_embeddings(negsamples).to(device)
        # Positive term: log sigmoid(u_i . u_j) for each sampled edge
        mulpositivebatch = torch.mul(v_i, v_j)
        positivebatch = F.logsigmoid(torch.sum(mulpositivebatch, dim=1))
        # Negative term: sum over the K negative nodes of log sigmoid(-u_i . u_n)
        mulnegativebatch = torch.mul(v_i.view(len(v_i), 1, self.embed_dim), negativenodes)
        negativebatch = torch.sum(
            F.logsigmoid(
                torch.sum(mulnegativebatch, dim=2)
            ),
            dim=1)
        loss = positivebatch + negativebatch
        return -torch.mean(loss)
# Main training script (imports utils/utils.py and utils/line.py)
import argparse
from utils.utils import *
from utils.line import Line
from tqdm import trange
import torch
import torch.optim as optim
import sys
import pickle

# Parse the command-line arguments
parser = argparse.ArgumentParser()
# Input file
parser.add_argument("-g", "--graph_path", type=str, default='./data/weighted.karate.edgelist')
# Output file for the trained model
parser.add_argument("-save", "--save_path", type=str, default='./model.pt')
# Output file for the loss values
parser.add_argument("-lossdata", "--lossdata_path", type=str, default='./loss.pkl')
# Hyperparams.
# 1st order or 2nd order, as in the paper
parser.add_argument("-order", "--order", type=int, default=2)
# Number of negative samples
parser.add_argument("-neg", "--negsamplesize", type=int, default=5)
# Embedding dimension
parser.add_argument("-dim", "--dimension", type=int, default=128)
# Batch size
parser.add_argument("-batchsize", "--batchsize", type=int, default=5)
# Number of epochs
parser.add_argument("-epochs", "--epochs", type=int, default=1)
# Learning rate
parser.add_argument("-lr", "--learning_rate", type=float,
                    default=0.025)  # As starting value in paper
# Exponent of the negative sampling distribution
parser.add_argument("-negpow", "--negativepower", type=float, default=0.75)
args = parser.parse_args()

### 2. Read the graph, store vertices and edges, and normalize
edgedistdict, nodedistdict, weights, nodedegrees, maxindex = makeDist(
    args.graph_path, args.negativepower)

### 3. Build the alias tables for vertices and edges
# An alias table gives O(1) sampling
edgesaliassampler = VoseAlias(edgedistdict)
nodesaliassampler = VoseAlias(nodedistdict)

### 4. LINE model
# Number of batches per epoch, given the batch size
batchrange = int(len(edgedistdict) / args.batchsize)
print(maxindex)
# The nn.Module class from line.py
line = Line(maxindex + 1, embed_dim=args.dimension, order=args.order)
# Optimize the model with SGD
opt = optim.SGD(line.parameters(), lr=args.learning_rate,
                momentum=0.9, nesterov=True)

### 5. Train on sampled edges with negative sampling
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
lossdata = {"it": [], "loss": []}
it = 0
helper = 0
print("\nTraining on {}...\n".format(device))
# Train for the given number of epochs
for epoch in range(args.epochs):
    print("Epoch {}".format(epoch))
    # Number of batches per epoch: batchrange
    for b in trange(batchrange):
        # edgesaliassampler is the VoseAlias instance built above; sample batchsize edges
        samplededges = edgesaliassampler.sample_n(args.batchsize)
        # makeData is defined in utils.py and samples K negative nodes for each edge
        # Each row has the format (node i, node j, negative nodes...)
        batch = list(makeData(samplededges, args.negsamplesize, weights, nodedegrees,
                              nodesaliassampler))
        # Convert to a tensor
        batch = torch.LongTensor(batch)
        # Print the first batch once for inspection
        if helper == 0:
            print(batch)
            helper = 1
        # Column 0
        v_i = batch[:, 0]
        # Column 1
        v_j = batch[:, 1]
        # Column 2 to the last column
        negsamples = batch[:, 2:]
        # Zero the gradients before backpropagation because they accumulate
        line.zero_grad()
        # Forward pass of the LINE model
        loss = line(v_i, v_j, negsamples, device)
        # Compute the gradients
        loss.backward()
        # Update the parameters from the gradients
        opt.step()
        lossdata["loss"].append(loss.item())
        lossdata["it"].append(it)
        it += 1

print("\nDone training, saving model to {}".format(args.save_path))
torch.save(line, "{}".format(args.save_path))
print("Saving loss data at {}".format(args.lossdata_path))
with open(args.lossdata_path, "wb") as ldata:
    pickle.dump(lossdata, ldata)

### 6. Results and visualization
from sklearn import cluster
import numpy as np

embedding_node = []
# Node ids in the karate edge list run from 1 to 34
for i in range(1, 35):
    node_id = torch.LongTensor([i])
    t = line.nodes_embeddings(node_id)
    embedding_node.append(t.tolist()[0])
embedding_node = np.array(embedding_node).reshape((34, -1))
# Cluster the 34 node embeddings into 3 groups with KMeans
y_pred = cluster.KMeans(n_clusters=3, random_state=9).fit_predict(embedding_node)
print(y_pred)