GraphSage Code Reading Notes (TensorFlow Version)

Article link: https://blog.csdn.net/LIYUO94/article/details/105764536

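This article walks through the TensorFlow version of GraphSage, covering the paper and code links, file structure, data analysis, environment installation and configuration, and core code such as unsupervised_train.py and utils.py. Using the toy-ppi dataset, it shows how GraphSage handles the node classification task, and it gives guidance on setting up the environment and running the code.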



I. Paper and Code Links

Link to the paper: Inductive Representation Learning on Large Graphs

Link to the GraphSage code

II. File Structure

1. Directory layout
├── eval_scripts // evaluation scripts
├── example_data // ppi dataset
└── graphsage // model structure definitions, layer definitions, ...

2. eval_scripts // contents of the evaluation scripts directory
├── citation_eval.py
├── ppi_eval.py
└── reddit_eval.py

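3. example_data // ppi dataset
├── toy-ppi-class_map.json // maps graph node ids to classes
├── toy-ppi-feats.npy // pre-computed node features
├── toy-ppi-G.json // graph information
├── toy-ppi-walks.txt // random walks from each node to its neighbors, 198 per node
└── toy-ppi-id_map.json // one-to-one mapping between node ids and indices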

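4. graphsage // model structure definitions
├── __init__.py // module imports
├── aggregators.py // aggregator function definitions
├── inits.py // shared initialization helpers
├── layers.py // layer definitions
├── metrics.py // evaluation metric computation
├── minibatch.py // minibatch iterator definitions
├── models.py // definitions of the various model structures
├── neigh_samplers.py // samplers that sample from a node's neighbors
├── prediction.py //
├── supervised_models.py
├── supervised_train.py
├── unsupervised_train.py
└── utils.py // utility function definitions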

III. Data Analysis

3.1 Dataset information tables:

Table 1: [image not preserved]
Table 2: [image not preserved]

Note: Table 1 is taken from https://blog.csdn.net/yyl424525/article/details/102966617

3.2 Data file analysis

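1. toy-ppi-G.json // graph information
The data contains a single graph, which is used for the node classification task.
The graph is undirected and consists of a nodes collection and a links collection. Each collection is a list, and every node or link in it is stored as a dictionary.
Data format: [image not preserved]

Note: the figure above was taken from https://blog.csdn.net/yyl424525/article/details/102966617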

2. toy-ppi-class_map.json // maps graph node ids to classes. The format is: {"0": [1, 0, 0,…],…,"14754": [1, 1, 0, 0,…]}
3. toy-ppi-id_map.json // one-to-one mapping between node ids and indices
The data format is: {"0": 0, "1": 1,…, "14754": 14754}

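4. toy-ppi-feats.npy // pre-computed node features

A NumPy-stored array of node features, ordered as given by id_map.json. It can be omitted, in which case only identity features are used.

5. toy-ppi-walks.txt // random walks from each node to its neighbors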
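
The JSON files described above can be explored with a few lines of Python. The sketch below is only an illustration under stated assumptions: it writes a tiny three-node dataset to a temporary directory and reads it back with the standard json module. The helpers write_toy_files and load_toy_files, the "tiny" prefix, and the data are all hypothetical. The node key "id" and the link keys "source" and "target" are assumed rather than taken from the original figure. The repository's own loader is load_data in utils.py, and the features file is skipped here.

```python
import json
import os
import tempfile


def write_toy_files(prefix):
    """Write a tiny, made-up dataset that mimics the three JSON files."""
    graph = {
        # Assumed key names: "id" for nodes, "source"/"target" for links.
        "nodes": [{"id": 0}, {"id": 1}, {"id": 2}],
        "links": [{"source": 0, "target": 1}, {"source": 1, "target": 2}],
    }
    class_map = {"0": [1, 0], "1": [0, 1], "2": [1, 1]}
    id_map = {"0": 0, "1": 1, "2": 2}
    files = [("-G.json", graph), ("-class_map.json", class_map), ("-id_map.json", id_map)]
    for suffix, obj in files:
        with open(prefix + suffix, "w") as f:
            json.dump(obj, f)


def load_toy_files(prefix):
    """Read the three JSON files back and return (graph, class_map, id_map)."""
    loaded = []
    for suffix in ("-G.json", "-class_map.json", "-id_map.json"):
        with open(prefix + suffix) as f:
            loaded.append(json.load(f))
    return tuple(loaded)


tmp_dir = tempfile.mkdtemp()
prefix = os.path.join(tmp_dir, "tiny")
write_toy_files(prefix)
graph, class_map, id_map = load_toy_files(prefix)

# JSON object keys are always strings, so node ids are looked up as "0", "1", ...
print(len(graph["nodes"]), len(graph["links"]))
print(class_map["2"], id_map["2"])
```

Because JSON keys are strings, any code that indexes class_map or id_map with integer node ids needs to convert the keys first.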

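IV. Running the Code: Environment Installation and Configuration

4.1 Environment installation

1. The following package versions can be installed with pip install -r requirements.txt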

absl-py==0.2.2
astor==0.6.2
backports.weakref==1.0.post1
bleach==1.5.0
decorator==4.3.0
enum34==1.1.6
funcsigs==1.0.2
futures==3.1.0
gast==0.2.0
grpcio==1.12.1
html5lib==0.9999999
Markdown==2.6.11
mock==2.0.0
networkx==1.11
numpy==1.14.5
pbr==4.0.4
protobuf==3.6.0
scikit-learn==0.19.1
scipy==1.1.0
six==1.11.0
sklearn==0.0
tensorboard==1.8.0
tensorflow==1.8.0
termcolor==1.1.0
Werkzeug==0.14.1

Note:

The networkx version must be less than or equal to 1.11, otherwise an error is raised.
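
A version constraint like this can be checked before running the code. The following is a minimal sketch, assuming plain dotted numeric version strings. The helper names parse_version and version_at_most are hypothetical, and the strings passed in are examples only. In practice the installed version would come from networkx.__version__.

```python
def parse_version(text):
    """Turn a dotted version string such as '1.11' into a tuple of ints."""
    return tuple(int(part) for part in text.split(".") if part.isdigit())


def version_at_most(installed, limit):
    """Return True if `installed` is less than or equal to `limit`."""
    a, b = parse_version(installed), parse_version(limit)
    # Pad the shorter tuple with zeros so that '1.11.0' equals '1.11'.
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a <= b


# Example strings only; they are not read from an installed package.
print(version_at_most("1.11", "1.11"))
print(version_at_most("1.9", "1.11"))
print(version_at_most("2.4", "1.11"))
```

Comparing integer tuples rather than raw strings matters here, since a plain string comparison would rank "1.9" above "1.11".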

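Installing with Docker:
If Docker is not installed, install it first.
GraphSage can also be run inside a Docker (https://docs.docker.com/) image. After cloning the project, build and run the image as follows: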
$ docker build -t graphsage .
$ docker run -it graphsage bash
A GPU image can also be run using nvidia-docker (https://github.com/NVIDIA/nvidia-docker):
$ docker build -t graphsage:gpu -f Dockerfile.gpu .
$ nvidia-docker run -it graphsage:gpu bash

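4.2 Running the code

1. Run unsupervised_train.py from the command line

unsupervised_train.py trains with a loss based on nodes and their adjacency information. After training it can output node embeddings. It uses EdgeMinibatchIterator.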

python -m graphsage.unsupervised_train --train_prefix ./example_data/toy-ppi --model graphsage_mean --max_total_steps 1000 --validate_iter 10
# Reference: https://blog.csdn.net/yyl424525/article/details/102966617

Possible values for model:

* graphsage_mean -- GraphSage with mean-based aggregator
* graphsage_seq -- GraphSage with LSTM-based aggregator
* graphsage_maxpool -- GraphSage with max-pooling aggregator (as described in the NIPS 2017 paper)
* graphsage_meanpool -- GraphSage with mean-pooling aggregator (a variant of the pooling aggregator, where the element-wise mean replaces the element-wise max).
* gcn -- GraphSage with GCN-based aggregator
* n2v -- an implementation of [DeepWalk](https://arxiv.org/abs/1403.6652) (called n2v for short in the code.)
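
The difference between the element-wise max and the element-wise mean mentioned in the list above can be shown on a handful of neighbor vectors. This is a simplified sketch, not the repository's aggregator code: it leaves out the learned transformations and nonlinearities that the real aggregators apply. The function names and sample vectors are hypothetical.

```python
def elementwise_mean(vectors):
    """Average a list of equal-length vectors, one dimension at a time."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def elementwise_max(vectors):
    """Take the maximum over a list of equal-length vectors, per dimension."""
    return [max(column) for column in zip(*vectors)]


# Three made-up neighbor feature vectors with two dimensions each.
neighbor_feats = [[1.0, 4.0], [3.0, 0.0], [2.0, 2.0]]

mean_result = elementwise_mean(neighbor_feats)
max_result = elementwise_max(neighbor_feats)
print(mean_result)  # [2.0, 2.0]
print(max_result)   # [3.0, 4.0]
```

Both operations return a single vector whatever the order of the neighbors, which is what lets them act on an unordered neighbor set.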

2. Run supervised_train.py. Note that the value of the train_prefix argument also needs to be changed: …/example_data/toy-ppi

python -m graphsage.supervised_train --train_prefix ./example_data/toy-ppi --model graphsage_mean --sigmoid


V. Code File Analysis

5.1 __init__.py

# Even on Python 2.x, print must be called with parentheses, as in Python 3.x.
from __future__ import print_function

# Enables true division, a language feature from later Python versions.
# Without this import, "/" performs truncating division on integers.
# With it, "/" performs true division and "//" performs truncating division.
from __future__ import division
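
The effect of the division import can be checked with a small example. This sketch assumes Python 3, where this behavior is the default. On Python 2 the two __future__ imports above would have to appear at the top of the file to get the same results.

```python
true_quotient = 7 / 2      # true division keeps the fractional part
floor_quotient = 7 // 2    # "//" discards the fractional part
negative_floor = -7 // 2   # "//" rounds toward negative infinity, not toward zero

print(true_quotient, floor_quotient, negative_floor)  # 3.5 3 -4
```

The last line shows that "//" is floor division, so for negative operands it differs from truncation toward zero.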

5.2 unsupervised_train.py code

# -*- coding: utf-8 -*-
# supervised_train.py trains with a loss on node classification labels; it cannot output node embeddings and uses NodeMinibatchIterator.

# unsupervised_train.py trains with a loss on nodes and their adjacency information; after training it can output node embeddings and uses EdgeMinibatchIterator.

# Enables true division: "/" performs true division and "//" performs truncating division.
# Without this import, "/" performs truncating division on integers in Python 2.x.
from __future__ import division

# Even on Python 2.x, print must be called with parentheses, as in Python 3.x.
from __future__ import print_function

import os  # operating system interface
import time  # timing utilities
import tensorflow as tf  # TensorFlow
import numpy as np  # NumPy

from graphsage.models import SampleAndAggregate, SAGEInfo, Node2VecModel
from graphsage.minibatch import EdgeMinibatchIterator
from graphsage.neigh_samplers import UniformNeighborSampler
from graphsage.utils import load_data
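# If the server has several GPUs, TensorFlow uses all of them by default.
# To use only some of them, set their visibility through CUDA_VISIBLE_DEVICES.

os.environ["CUDA_DEVICE_ORDER"]="PCI_BUS_ID"  # number GPU devices from 0 in PCI bus ID order
# Which GPU is used is set by the gpu flag below; with a single GPU, change its value from 1 to 0.

# Set random seed. With the same seed, the same random numbers are generated on every run;
# without a seed, they differ from run to run.
seed = 123
np.random.seed(seed)  # seed NumPy's random number generator
tf.set_random_seed(seed)  # seed TensorFlow's graph-level random operations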

# Settings
flags = tf.app.flags
FLAGS = flags.FLAGS  # FLAGS is a command-line parser, so parameters can be passed in from outside, e.g. python train.py --model gcn

tf.app.flags.DEFINE_boolean('log_device_placement', False,
                            """Whether to log device placement.""")  # defines a boolean flag
# core params: flags whose values are parsed from the command line
flags.DEFINE_string('model', 'graphsage', 'model names. See README for possible values.')  # model name
flags.DEFINE_float('learning_rate', 0.00001, 'initial learning rate.')
flags.DEFINE_string("model_size", "small", "Can be big or small; model specific def'ns")
flags.DEFINE_string('train_prefix', '', 'name of the object file that stores the training data. must be specified.')

# left to default values in main experiments
flags.DEFINE_integer('epochs', 1, 'number of epochs to train.')  # number of training epochs
flags.DEFINE_float('dropout', 0.0, 'dropout rate (1 - keep probability).')  # dropout rate; randomly drops a fraction of units to reduce overfitting
# How weight decay (L2 regularization) enters the loss: self.loss += FLAGS.weight_decay * tf.nn.l2_loss(var)
flags.DEFINE_float('weight_decay', 0.0, 'weight for l2 loss on embedding matrix.')  # weight decay pushes weights toward smaller values, which reduces overfitting to some extent
flags.DEFINE_integer('max_degree', 100, 'maximum node degree.')  # maximum node degree

flags.DEFINE_integer('samples_1', 25, 'number of samples in layer 1')  # neighbors sampled in layer 1: k = 1, s = 25
flags.DEFINE_integer('samples_2', 10, 'number of users samples in layer 2')  # neighbors sampled in layer 2: k = 2, s = 10

# with concat, the output dimension doubles
flags.DEFINE_integer('dim_1', 128, 'Size of output dim (final is 2x this, if using concat)')
flags.DEFINE_integer('dim_2', 128, 'Size of output dim (final is 2x this, if using concat)')
flags.DEFINE_boolean('random_context', True, 'Whether to use random context or direct edges')
flags.DEFINE_integer('neg_sample_size', 20, 'number of negative samples')  # number of negative samples
flags.DEFINE_integer('batch_size', 512, 'minibatch size.')  # batch size
flags.DEFINE_integer('n2v_test_epochs', 1, 'Number of new SGD epochs for n2v.')  # number of SGD epochs for n2v
flags.DEFINE_integer('identity_dim', 0, 'Set to positive value to use identity embedding features of that dimension. Default 0.')  # set to a positive value to use identity embedding features of that dimension; default 0

#logging, saving, validation settings etc.
flags.DEFINE_boolean('save_embeddings', True, 'whether to save embeddings for all nodes after training')  # whether to save the embeddings
flags.DEFINE_string('base_log_dir', '.', 'base directory for logging and saving embeddings')  # base directory for logs and saved embeddings
flags.DEFINE_integer('validate_iter', 5000, "how often to run a validation minibatch.")  # how often, in iterations, to run validation
flags.DEFINE_integer('validate_batch_size', 256, "how many nodes per validation sample.")  # validation batch size
flags.DEFINE_integer('gpu', 1, "which gpu to use.")  # which GPU to use; with only one GPU, change this to 0
flags.DEFINE_integer('print_every', 50, "How often to print training info.")  # how often to print training information
flags.DEFINE_integer