【Paper Notes】LINE: Large-scale Information Network Embedding

[paper] https://dl.acm.org/doi/pdf/10.1145/2736277.2741093

[code] C++ / Python (TensorFlow)

  • abstract

In this paper, we propose a novel network embedding method called the “LINE,” which is suitable for arbitrary types of information networks: undirected, directed, and/or weighted.

The paper proposes a graph embedding (GE) method for information networks with very large numbers of nodes, with an emphasis on efficiency.

  • introduction

The model optimizes an objective which preserves both the local and global network structures.

The objective the model optimizes preserves both the local and the global network structure.

  • We propose a novel network embedding model called the “LINE,” which suits arbitrary types of information networks and easily scales to millions of nodes. It has a carefully designed objective function that preserves both the first-order and second-order proximities.

  • We propose an edge-sampling algorithm for optimizing the objective. The algorithm tackles the limitation of the classical stochastic gradient descent and improves the effectiveness and efficiency of the inference.

  • We conduct extensive experiments on real world information networks. Experimental results prove the effectiveness and efficiency of the proposed LINE model.

Three main contributions: 1) LINE suits arbitrary types of information networks and scales to millions of nodes, with an objective function carefully designed to preserve both first-order and second-order proximity; 2) an edge-sampling algorithm for optimizing that objective, which tackles the limitations of classical stochastic gradient descent on weighted edges and improves the efficiency of inference; 3) LINE performs well in experiments on large real-world information networks.

  • problem definition

The paper defines not only vertices and edges but also edge weights.
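
For reference, the paper's definition of an information network (paraphrased):

$$G = (V, E), \qquad e = (u, v) \in E \text{ with weight } w_{uv} > 0,$$

where $V$ is the vertex set and each edge carries a weight indicating the strength of the relation (all weights are 1 in an unweighted graph).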

  • first-order proximity

First-order proximity applies only to undirected graphs; minimizing eqn. 3 yields a vector representation for every vertex.
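
For reference, the joint probability between vertices $v_i$ and $v_j$ (eqn. 1) and the first-order objective (eqn. 3) from the paper:

$$p_1(v_i, v_j) = \frac{1}{1 + \exp(-\vec{u}_i^{\top} \vec{u}_j)}, \qquad O_1 = -\sum_{(i,j) \in E} w_{ij} \log p_1(v_i, v_j)$$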

  • second-order proximity

The optimization objective is $O_2$, where $\lambda_i$ is a factor controlling the importance of vertex $v_i$; it can be estimated by the vertex degree or with methods such as PageRank.

Second-order proximity applies to both directed and undirected graphs; for an undirected graph, each undirected edge is treated as two directed edges in opposite directions with equal weights.
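
For reference, the context distribution defined by the embedding (eqn. 4) and, after setting $\lambda_i$ to the degree $d_i$ and measuring distance with KL-divergence, the simplified second-order objective (eqn. 6):

$$p_2(v_j \mid v_i) = \frac{\exp(\vec{u}_j'^{\top} \vec{u}_i)}{\sum_{k=1}^{|V|} \exp(\vec{u}_k'^{\top} \vec{u}_i)}, \qquad O_2 = -\sum_{(i,j) \in E} w_{ij} \log p_2(v_j \mid v_i)$$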

  • combine first-order and second-order

The representation vectors are obtained by optimizing eqn. 3 and eqn. 6; in practice the paper trains the two objectives separately and concatenates the resulting embeddings for each vertex, as sketched below.
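
A minimal sketch of that concatenation (the helper name combine_embeddings is hypothetical; emb_first and emb_second stand for per-node embedding dicts such as those produced by the code later in this post):

import numpy as np

def combine_embeddings(emb_first, emb_second):
    # Concatenate each vertex's first-order and second-order embedding
    return {v: np.concatenate([emb_first[v], emb_second[v]]) for v in emb_first}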

  • Model Optimization

    • negative sampling: when optimizing second-order proximity, the softmax denominator in eqn. 4 sums over all vertices, which is very inefficient, so the paper adopts the negative sampling trick. For each edge $(i, j)$, the objective becomes (eqn. 7)

$$\log \sigma(\vec{u}_j'^{\top} \vec{u}_i) + \sum_{n=1}^{K} \mathbb{E}_{v_n \sim P_n(v)} \left[ \log \sigma(-\vec{u}_{v_n}'^{\top} \vec{u}_i) \right],$$

where $\sigma(x) = 1/(1 + e^{-x})$ and the noise distribution is $P_n(v) \propto d_v^{3/4}$, with $d_v$ the out-degree of vertex $v$.

Source code【code】

  • python tensorflow

  • overall structure

  • notes

The code targets networkx 1.9.0 and will not run under 2.x. (It has since been modified, as shown below, and now runs under 2.x.)

It targets tensorflow 1.3.0 and will not run under 2.x.

  • utils.py (main utilities: reading the graph data file to build the graph, plus the alias sampling method)

import networkx as nx
import numpy as np


class DBLPDataLoader:
    def __init__(self, graph_file):
        # Load the graph from a pickle file
        self.g = nx.read_gpickle(graph_file)
        self.num_of_nodes = self.g.number_of_nodes()
        self.num_of_edges = self.g.number_of_edges()
        """
        	读取nodes、edges数据。
          格式为edges:(a,b,dict)\nodes:(a,dict)
          其中dict为属性集
        """
        self.edges_raw = self.g.edges(data=True)
        self.nodes_raw = self.g.nodes(data=True)
        
        # Edge distribution, proportional to edge weights
        self.edge_distribution = np.array([attr['weight'] for _, _, attr in self.edges_raw], dtype=np.float32)
        self.edge_distribution /= np.sum(self.edge_distribution)
        
        # Alias sampler over edges
        self.edge_sampling = AliasSampling(prob=self.edge_distribution)
        # Noise distribution for node negative sampling: (weighted degree)^0.75
        self.node_negative_distribution = np.power(
            np.array([self.g.degree(node, weight='weight') for node, _ in self.nodes_raw], dtype=np.float32), 0.75)
        self.node_negative_distribution /= np.sum(self.node_negative_distribution)
        # Alias sampler over nodes
        self.node_sampling = AliasSampling(prob=self.node_negative_distribution)

        # Map each node to an integer index, and back
        self.node_index = {}
        self.node_index_reversed = {}
        for index, (node, _) in enumerate(self.nodes_raw):
            self.node_index[node] = index
            self.node_index_reversed[index] = node
        self.edges = [(self.node_index[u], self.node_index[v]) for u, v, _ in self.edges_raw]

    def fetch_batch(self, batch_size=16, K=10, edge_sampling='atlas', node_sampling='atlas'):
        # Edge sampling ('atlas' is this repo's spelling of the alias method)
        if edge_sampling == 'numpy':
            # Sample batch_size edges according to the edge weight distribution
            edge_batch_index = np.random.choice(self.num_of_edges, size=batch_size, p=self.edge_distribution)
        elif edge_sampling == 'atlas':
            # Sample batch_size edges with the alias method
            edge_batch_index = self.edge_sampling.sampling(batch_size)
        elif edge_sampling == 'uniform':
            # Sample batch_size edges uniformly at random
            edge_batch_index = np.random.randint(0, self.num_of_edges, size=batch_size)
        u_i = []
        u_j = []
        label = []
        for edge_index in edge_batch_index:
            # Fetch the sampled edge (u, v)
            edge = self.edges[edge_index]
            # For an undirected graph, assign a random direction to the edge
            if self.g.__class__ == nx.Graph:
                if np.random.rand() > 0.5:      # important: second-order proximity is for directed edge
                    edge = (edge[1], edge[0])
            u_i.append(edge[0])
            u_j.append(edge[1])
            label.append(1)
            "对节点的采样"
            for i in range(K):
              	# 循环执行这个对节点的负采样,直到找到和刚才存入edge[0]存在边
                while True:
                    if node_sampling == 'numpy':
                        negative_node = np.random.choice(self.num_of_nodes, p=self.node_negative_distribution)
                    elif node_sampling == 'atlas':
                        negative_node = self.node_sampling.sampling()
                    elif node_sampling == 'uniform':
                        negative_node = np.random.randint(0, self.num_of_nodes)
                    if not self.g.has_edge(self.node_index_reversed[negative_node], self.node_index_reversed[edge[0]]):
                        break
                u_i.append(edge[0])
                u_j.append(negative_node)
                # Negative samples get label -1
                label.append(-1)
        return u_i, u_j, label

    def embedding_mapping(self, embedding):
        return {node: embedding[self.node_index[node]] for node, _ in self.nodes_raw}


class AliasSampling:

    # Reference: https://en.wikipedia.org/wiki/Alias_method

    def __init__(self, prob):
        self.n = len(prob)
        self.U = np.array(prob) * self.n
        self.K = [i for i in range(len(prob))]
        overfull, underfull = [], []
        for i, U_i in enumerate(self.U):
            if U_i > 1:
                overfull.append(i)
            elif U_i < 1:
                underfull.append(i)
        while len(overfull) and len(underfull):
            i, j = overfull.pop(), underfull.pop()
            self.K[j] = i
            self.U[i] = self.U[i] - (1 - self.U[j])
            if self.U[i] > 1:
                overfull.append(i)
            elif self.U[i] < 1:
                underfull.append(i)

    # Draw samples with the alias method
    def sampling(self, n=1):
        # n uniform random numbers in [0, 1)
        x = np.random.rand(n)
        i = np.floor(self.n * x)
        y = self.n * x - i
        i = i.astype(np.int32)
        # For each draw, keep column i[k] itself or its alias K[i[k]]
        res = [i[k] if y[k] < self.U[i[k]] else self.K[i[k]] for k in range(n)]
        if n == 1:
            return res[0]
        else:
            return res
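
A quick sanity check for AliasSampling (illustrative, not part of the repo; assumes utils.py is importable): after the O(n) table construction, each draw costs O(1), and the empirical frequencies should approach the input distribution.

import numpy as np
from utils import AliasSampling

prob = np.array([0.1, 0.2, 0.3, 0.4])
sampler = AliasSampling(prob=prob)
draws = sampler.sampling(100000)
# Empirical frequencies; expected to be close to [0.1, 0.2, 0.3, 0.4]
print(np.bincount(draws, minlength=len(prob)) / 100000)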

The version below changes the input to an edge-list .txt file and, for unweighted graphs, defaults the edge weight to 1.

import networkx as nx
import numpy as np
import pickle


class DBLPDataLoader:
    def __init__(self, graph_file):
        # Read a directed graph from an edge-list .txt file ("u v weight" per line);
        # attr.get('weight', 1) below defaults a missing weight attribute to 1
        self.g = nx.read_edgelist(graph_file, create_using=nx.DiGraph(), nodetype=None, data=[('weight', int)])
        self.num_of_nodes = self.g.number_of_nodes()
        self.num_of_edges = self.g.number_of_edges()

        self.edges_raw = self.g.edges(data=True)
        self.nodes_raw = self.g.nodes(data=True)

        self.edge_distribution = np.array([attr.get('weight', 1) for _, _, attr in self.edges_raw], dtype=np.float32)
        self.edge_distribution /= np.sum(self.edge_distribution)
        self.edge_sampling = AliasSampling(prob=self.edge_distribution)
        self.node_negative_distribution = np.power(
            np.array([self.g.degree(node, weight='weight') for node, _ in self.nodes_raw], dtype=np.float32), 0.75)
        self.node_negative_distribution /= np.sum(self.node_negative_distribution)
        self.node_sampling = AliasSampling(prob=self.node_negative_distribution)

        self.node_index = {}
        self.node_index_reversed = {}
        for index, (node, _) in enumerate(self.nodes_raw):
            self.node_index[node] = index
            self.node_index_reversed[index] = node
        self.edges = [(self.node_index[u], self.node_index[v]) for u, v, _ in self.edges_raw]

    def fetch_batch(self, batch_size=16, K=10, edge_sampling='atlas', node_sampling='atlas'):
        if edge_sampling == 'numpy':
            edge_batch_index = np.random.choice(self.num_of_edges, size=batch_size, p=self.edge_distribution)
        elif edge_sampling == 'atlas':
            edge_batch_index = self.edge_sampling.sampling(batch_size)
        elif edge_sampling == 'uniform':
            edge_batch_index = np.random.randint(0, self.num_of_edges, size=batch_size)
        u_i = []
        u_j = []
        label = []
        for edge_index in edge_batch_index:
            edge = self.edges[edge_index]
            if self.g.__class__ == nx.Graph:
                if np.random.rand() > 0.5:      # important: second-order proximity is for directed edge
                    edge = (edge[1], edge[0])
            u_i.append(edge[0])
            u_j.append(edge[1])
            label.append(1)
            for i in range(K):
                while True:
                    if node_sampling == 'numpy':
                        negative_node = np.random.choice(self.num_of_nodes, p=self.node_negative_distribution)
                    elif node_sampling == 'atlas':
                        negative_node = self.node_sampling.sampling()
                    elif node_sampling == 'uniform':
                        negative_node = np.random.randint(0, self.num_of_nodes)
                    if not self.g.has_edge(self.node_index_reversed[negative_node], self.node_index_reversed[edge[0]]):
                        break
                u_i.append(edge[0])
                u_j.append(negative_node)
                label.append(-1)
        return u_i, u_j, label

    def embedding_mapping(self, embedding):
        return {node: embedding[self.node_index[node]] for node, _ in self.nodes_raw}


class AliasSampling:

    # Reference: https://en.wikipedia.org/wiki/Alias_method

    def __init__(self, prob):
        self.n = len(prob)
        self.U = np.array(prob) * self.n
        self.K = [i for i in range(len(prob))]
        overfull, underfull = [], []
        for i, U_i in enumerate(self.U):
            if U_i > 1:
                overfull.append(i)
            elif U_i < 1:
                underfull.append(i)
        while len(overfull) and len(underfull):
            i, j = overfull.pop(), underfull.pop()
            self.K[j] = i
            self.U[i] = self.U[i] - (1 - self.U[j])
            if self.U[i] > 1:
                overfull.append(i)
            elif self.U[i] < 1:
                underfull.append(i)

    def sampling(self, n=1):
        x = np.random.rand(n)
        i = np.floor(self.n * x)
        y = self.n * x - i
        i = i.astype(np.int32)
        res = [i[k] if y[k] < self.U[i[k]] else self.K[i[k]] for k in range(n)]
        if n == 1:
            return res[0]
        else:
            return res
  • model.py

import tensorflow as tf


class LINEModel:
    def __init__(self, args):
        # Each batch holds batch_size positive edges plus K negative samples per edge
        self.u_i = tf.placeholder(name='u_i', dtype=tf.int32, shape=[args.batch_size * (args.K + 1)])
        self.u_j = tf.placeholder(name='u_j', dtype=tf.int32, shape=[args.batch_size * (args.K + 1)])
        self.label = tf.placeholder(name='label', dtype=tf.float32, shape=[args.batch_size * (args.K + 1)])
        self.embedding = tf.get_variable('target_embedding', [args.num_of_nodes, args.embedding_dim],
                                         initializer=tf.random_uniform_initializer(minval=-1., maxval=1.))
        self.u_i_embedding = tf.matmul(tf.one_hot(self.u_i, depth=args.num_of_nodes), self.embedding)
        if args.proximity == 'first-order':
            self.u_j_embedding = tf.matmul(tf.one_hot(self.u_j, depth=args.num_of_nodes), self.embedding)
        elif args.proximity == 'second-order':
            self.context_embedding = tf.get_variable('context_embedding', [args.num_of_nodes, args.embedding_dim],
                                                     initializer=tf.random_uniform_initializer(minval=-1., maxval=1.))
            self.u_j_embedding = tf.matmul(tf.one_hot(self.u_j, depth=args.num_of_nodes), self.context_embedding)

        # Inner product of the u_i and u_j embeddings
        self.inner_product = tf.reduce_sum(self.u_i_embedding * self.u_j_embedding, axis=1)
        self.loss = -tf.reduce_mean(tf.log_sigmoid(self.label * self.inner_product))
        self.learning_rate = tf.placeholder(name='learning_rate', dtype=tf.float32)
        # self.optimizer = tf.train.GradientDescentOptimizer(learning_rate=self.learning_rate)
        self.optimizer = tf.train.RMSPropOptimizer(learning_rate=self.learning_rate)
        self.train_op = self.optimizer.minimize(self.loss)
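
The label trick deserves a note: with labels in {+1, -1}, the single expression -log_sigmoid(label * inner_product) covers both terms of the negative-sampling objective (eqn. 7). A toy NumPy check (illustrative only):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

u_i = np.random.randn(3, 4)        # three toy embeddings of dimension 4
u_j = np.random.randn(3, 4)
score = np.sum(u_i * u_j, axis=1)  # inner products, as in model.py

pos_term = -np.log(sigmoid(+1 * score))  # positive edge:    -log sigma( u_j' . u_i)
neg_term = -np.log(sigmoid(-1 * score))  # negative sample:  -log sigma(-u_j' . u_i)
# model.py averages exactly these terms over the mixed batch of positives and negatives
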
  • line.py

import tensorflow as tf
import numpy as np
import argparse
from model import LINEModel
from utils import DBLPDataLoader
import pickle
import time


def main():
    # Parse command-line arguments
    parser = argparse.ArgumentParser()
    # Embedding dimension
    parser.add_argument('--embedding_dim', default=128)
    # Batch size
    parser.add_argument('--batch_size', default=128)
    # Number of negative samples per edge
    parser.add_argument('--K', default=5)
    # Proximity order
    parser.add_argument('--proximity', default='second-order', help='first-order or second-order')
    # Initial learning rate
    parser.add_argument('--learning_rate', default=0.025)
    # Mode: train or test
    parser.add_argument('--mode', default='train')
    parser.add_argument('--num_batches', default=300000)
    parser.add_argument('--total_graph', default=True)
    parser.add_argument('--graph_file', default='data/co-authorship_graph.pkl')
    args = parser.parse_args()
    if args.mode == 'train':
        train(args)
    elif args.mode == 'test':
        test(args)


def train(args):
    # Load the graph file
    data_loader = DBLPDataLoader(graph_file=args.graph_file)
    # Suffix for the output file: first-order or second-order
    suffix = args.proximity
    # Number of nodes, needed for the embedding matrix
    args.num_of_nodes = data_loader.num_of_nodes
    # Build the model
    model = LINEModel(args)
    
    with tf.Session() as sess:
        print(args)
        print('batches\tloss\tsampling time\ttraining_time\tdatetime')
        tf.global_variables_initializer().run()
        initial_embedding = sess.run(model.embedding)
        learning_rate = args.learning_rate
        sampling_time, training_time = 0, 0
        for b in range(args.num_batches):
            t1 = time.time()
            u_i, u_j, label = data_loader.fetch_batch(batch_size=args.batch_size, K=args.K)
            feed_dict = {model.u_i: u_i, model.u_j: u_j, model.label: label, model.learning_rate: learning_rate}
            t2 = time.time()
            sampling_time += t2 - t1
            if b % 100 != 0:
                sess.run(model.train_op, feed_dict=feed_dict)
                training_time += time.time() - t2
                if learning_rate > args.learning_rate * 0.0001:
                    learning_rate = args.learning_rate * (1 - b / args.num_batches)
                else:
                    learning_rate = args.learning_rate * 0.0001
            else:
                loss = sess.run(model.loss, feed_dict=feed_dict)
                print('%d\t%f\t%0.2f\t%0.2f\t%s' % (b, loss, sampling_time, training_time,
                                                    time.strftime("%Y-%m-%d %H:%M:%S", time.localtime())))
                sampling_time, training_time = 0, 0
            if b % 1000 == 0 or b == (args.num_batches - 1):
                embedding = sess.run(model.embedding)
                normalized_embedding = embedding / np.linalg.norm(embedding, axis=1, keepdims=True)
                pickle.dump(data_loader.embedding_mapping(normalized_embedding),
                            open('data/embedding_%s.pkl' % suffix, 'wb'))


def test(args):
    pass

if __name__ == '__main__':
    main()
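
A hedged usage sketch (the path follows the argparse defaults above, and the file name comes from the pickle.dump call in train):

# After training, e.g. `python line.py --proximity second-order`,
# the normalized embeddings are pickled to data/embedding_second-order.pkl:
import pickle

emb = pickle.load(open('data/embedding_second-order.pkl', 'rb'))  # node -> vector
node = next(iter(emb))
print(node, emb[node].shape)  # (128,) with the default embedding_dim

To follow the paper's combination strategy, train once per proximity and concatenate the two vectors per node (see the combine_embeddings sketch earlier).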