[Paper Close Reading] Inductive Representation Learning on Large Graphs

Original paper: Inductive representation learning on large graphs | Proceedings of the 31st International Conference on Neural Information Processing Systems (acm.org)

The English here is typed entirely by hand, summarizing and paraphrasing the original paper. Some spelling and grammar mistakes are hard to avoid; if you spot any, corrections in the comments are welcome! This post leans toward personal notes, so read with caution!

Contents

1. TL;DR

1.1. Takeaways

1.2. Paper framework diagram

2. Section-by-section close reading

2.1. Abstract

2.2. Introduction

2.3. Related work

2.4. Proposed method: GraphSAGE

2.4.1. Embedding generation (i.e., forward propagation) algorithm

2.4.2. Learning the parameters of GraphSAGE

2.4.3. Aggregator Architectures

2.5. Experiments

2.5.1. Inductive learning on evolving graphs: Citation and Reddit data

2.5.2. Generalizing across graphs: Protein-protein interactions

2.5.3. Runtime and parameter sensitivity

2.5.4. Summary comparison between the different aggregator architectures

2.6. Theoretical analysis

2.7. Conclusion

3. Supplementary knowledge

3.1. Hyperparameter hacking

4. Reference List


1. TL;DR

1.1. Takeaways

(1)Help, why don't I have any takeaways? I must have already ascended!

1.2. Paper framework diagram

2. Section-by-section close reading

2.1. Abstract

        ①Previous works do not handle unseen nodes; they only focus on nodes that are already present (or labeled) during training

        ②They propose an inductive framework that generates embeddings by sampling and aggregating features from a node's local neighborhood

2.2. Introduction

        ①Node embedding aims to distill the high-dimensional information about a node's neighborhood into a dense embedding vector.

        ②Predicting on new, unseen nodes requires inductive ability from ML/DL models.

        ③They generalize the GCN approach to trainable aggregation functions and to an inductive, unsupervised setting

        ④Rich node features, such as degree, text attributes, and node profile information, make it possible to generalize to unseen nodes

        ⑤Each node aggregates information from neighbors at increasing hops, or search depths, away from it.

        ⑥The authors test their model on three classification tasks to demonstrate its predictive ability; their model also achieves higher accuracy than the baselines.

        ⑦Their main idea (Figure 1 of the paper: sample the neighborhood, aggregate feature information from neighbors, then predict with the aggregated information):

2.3. Related work

(1)Factorization-based embedding approaches

        ①Node embedding methods include random walk statistics, matrix factorization, etc.

        ②However, these embeddings are tied to one fixed graph: the learned embedding space has no generalization that transfers to other graphs, and it may be rebuilt differently in other training sessions, because the training objective is invariant to orthogonal transformations of the embedding matrix
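A one-line check makes the invariance concrete, using a generic matrix-factorization objective as an illustrative stand-in: for any orthogonal matrix Q (so QQ^{\top}=I), replacing the embedding matrix Z by ZQ leaves the loss unchanged,

\left \| A-(ZQ)(ZQ)^{\top} \right \|^{2}=\left \| A-ZQQ^{\top}Z^{\top} \right \|^{2}=\left \| A-ZZ^{\top} \right \|^{2}

so the embedding space is only determined up to rotation and cannot be aligned across graphs or training runs.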

(2)Supervised learning over graphs

        Previous supervised learning over graphs mainly focuses on classifying entire graphs rather than individual nodes

(3)Graph convolutional networks

        ①Previous GCNs cannot generalize to unseen nodes or scale to very large graphs

        ②The original GCN is semi-supervised and transductive, and requires researchers to know the entire graph Laplacian during training

2.4. Proposed method: GraphSAGE

2.4.1. Embedding generation (i.e., forward propagation) algorithm

        ①Assume there are K aggregator functions

        ②Each node aggregates information from its neighbors using the aggregator functions:

\mathrm{AGGREGATE}_{k},\forall k\in\{1,...,K\}

        ③Information is propagated between layers, or "search depths", using the weight matrices:

\mathbf{W}^{k},\forall k\in\{1,...,K\}

        ④Forward propagation pseudocode:

where k denotes the step and h^{k} is a node's representation at the k-th step. Line 4 says that each neighborhood vector \mathbf{h}_{\mathcal{N}(v)}^{k} comes from an aggregation of the representations of the neighboring nodes. Then, in line 5, the current node's representation is concatenated with its neighborhood vector and fed through a fully connected layer with nonlinear activation function \sigma. Lastly, line 3 does not actually require iterating over all nodes: the authors note that only the nodes needed at the current depth have to be kept, and the complete minibatch pseudocode is provided in the appendix.
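To make the loop structure concrete, here is a minimal NumPy sketch of Algorithm 1 with a mean-style aggregator and fixed-size neighbor sampling inlined; the names (graphsage_forward, sample_neighbors, etc.) are my own illustration, not the authors' reference implementation.

```python
import numpy as np

def sample_neighbors(neighbors, size, rng):
    # Fixed-size uniform sample of a node's neighbors, with replacement
    return rng.choice(neighbors, size=size, replace=True)

def graphsage_forward(features, adj, W, sample_sizes, rng):
    """features: {node: 1-D np.ndarray}; adj: {node: list of neighbor ids};
    W: list of K weight matrices, W[k] of shape (d_out, 2 * d_in);
    sample_sizes: list of K ints (e.g. [25, 10])."""
    h = dict(features)  # h^0 := input features x_v
    for k, Wk in enumerate(W):
        h_next = {}
        for v, neigh in adj.items():
            sampled = sample_neighbors(neigh, sample_sizes[k], rng)
            # line 4: aggregate sampled neighbors' previous representations
            h_Nv = np.mean([h[u] for u in sampled], axis=0)
            # line 5: concatenate self and neighborhood, dense layer + ReLU
            h_next[v] = np.maximum(Wk @ np.concatenate([h[v], h_Nv]), 0.0)
        # line 7: normalize each representation to unit length
        h = {v: x / (np.linalg.norm(x) + 1e-12) for v, x in h_next.items()}
    return h  # z_v := h^K_v
```

With K=2 and sample sizes [25, 10], this matches the experimental setup described later.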

        ⑤The traditional Weisfeiler-Lehman isomorphism test uses a hash function as its aggregator: if two subgraphs produce identical outputs, they are declared isomorphic. The authors replace the hash function with trainable neural networks so the model can learn to represent topological structure

        ⑥They draw a fixed-size uniform sample of neighbors, re-sampled at each iteration, which keeps the per-batch computational footprint fixed and speeds up training

placeholder  n. something that fills the place of a missing part; a dummy word that occupies a syntactic slot without real meaning, as in "It's a pity she left"

2.4.2. Learning the parameters of GraphSAGE

        ①The parameters are adjusted by stochastic gradient descent; here is the loss function:

J_{\mathcal{G}}(\mathbf{z}_{u})=-\log\left(\sigma(\mathbf{z}_{u}^{\top}\mathbf{z}_{v})\right)-Q\cdot\mathbb{E}_{v_{n}\sim P_{n}(v)}\log\left(\sigma(-\mathbf{z}_{u}^{\top}\mathbf{z}_{v_{n}})\right)

where v is a node that co-occurs near u on a fixed-length random walk, \sigma is the sigmoid function, P_{n} denotes a negative sampling distribution, and Q denotes the number of negative samples
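As a sanity check on the formula, here is a small NumPy sketch of this loss for one positive pair and Q negative samples; the function and variable names are assumed for illustration only:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def unsupervised_loss(z_u, z_v, z_neg):
    """z_u, z_v: 1-D embeddings of u and a positive (co-occurring) node v;
    z_neg: (Q, d) matrix of embeddings of nodes sampled from P_n."""
    pos = -np.log(sigmoid(z_u @ z_v))
    # the Q * E_{v_n ~ P_n}[...] term, estimated from the Q drawn samples
    neg = -np.sum(np.log(sigmoid(-(z_neg @ z_u))))
    return pos + neg
```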

2.4.3. Aggregator Architectures

        ①An aggregator should be symmetric, i.e., invariant to the ordering of its inputs, so that its output is not influenced by the order in which neighbors are presented.

        ②The first candidate aggregator is the mean aggregator:

\mathbf{h}_{v}^{k}\leftarrow \sigma\left(\mathbf{W}\cdot\mathrm{MEAN}\left(\{\mathbf{h}_{v}^{k-1}\}\cup\{\mathbf{h}_{u}^{k-1},\forall u\in\mathcal{N}(v)\}\right)\right)

(the formula in the original paper itself appears to contain a bracket/subscript typo) This can replace lines 4 and 5 of Algorithm 1. Note that this aggregator omits the concatenation operation; it is that concatenation which acts as a skip connection between different layers or search depths.

        ③The second candidate is the LSTM aggregator, which has greater expressive capability but is not symmetric; the authors adapt it to unordered neighbor sets by applying the LSTM to a random permutation of the neighbors.

        ④The third candidate is the pooling aggregator:

\text{AGGREGATE}_k^\text{pool}=\max(\left\{\sigma\left(\mathbf{W}_{\text{pool}}\mathbf{h}_{u_i}^k+\mathbf{b}\right),\forall u_i\in\mathcal{N}(v)\right\})

which is both symmetric and trainable. Any number of MLP layers can be applied before the max-pooling function. Moreover, the authors found no significant difference between the mean and max pooling operations.
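For concreteness, here are minimal NumPy sketches of the mean and pooling aggregators described above; W, W_pool, and b stand in for the trainable parameters:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def mean_aggregator(h_v, h_neighbors, W):
    # h_v^k <- sigma(W . MEAN({h_v^{k-1}} U {h_u^{k-1} : u in N(v)}))
    stacked = np.vstack([h_v] + list(h_neighbors))
    return relu(W @ stacked.mean(axis=0))

def pool_aggregator(h_neighbors, W_pool, b):
    # element-wise max over sigma(W_pool h_u + b) for all sampled u in N(v)
    transformed = relu(np.vstack(h_neighbors) @ W_pool.T + b)
    return transformed.max(axis=0)
```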

lattice  n. a framework of crossed wooden or metal strips (used, e.g., for fences); a structure of crossed slats; a lattice pattern

2.5. Experiments

(1)They used 3 tasks to test the performance of GraphSAGE:

        ①Subject classification of academic papers using the Web of Science citation dataset

        ②Community classification of Reddit posts

        ③Function classification of protein-protein interaction (PPI)

(2)Experimental set-up

        ①They compare four baselines: a random classifier, a logistic-regression feature-based classifier, DeepWalk, and DeepWalk combined with raw features.

        ②Furthermore, the authors add four variants of GraphSAGE with different aggregators to the comparison.

        ③Activation: ReLU

        ④Search depth: K=2

        ⑤Neighborhood sample sizes: S_{1}=25, S_{2}=10

        ⑥Optimizer: Adam

        ⑦They choose the same hyperparameters for all GraphSAGE variants (see the config sketch below)

        ⑧Comparison results (Table 1 in the paper):
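For quick reference, the shared setup above can be written as a small configuration sketch; the key names are my own and not from the paper or any particular library:

```python
# Shared GraphSAGE experimental setup; values are from the paper,
# key names are assumed for illustration.
config = {
    "aggregator": "mean",      # one of the four variants compared
    "activation": "relu",
    "K": 2,                    # search depth (number of layers)
    "sample_sizes": [25, 10],  # S_1 = 25, S_2 = 10
    "optimizer": "adam",
}
```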

2.5.1. Inductive learning on evolving graphs: Citation and Reddit data

(1)Citation data:

        ①Dataset: Thomson Reuters Web of Science Core Collection

        ②Samples: all papers in 6 biology-related fields from 2000 to 2005

        ③Number of nodes: 302,424

        ④Average degree: 9.15

        ⑤Training set: 2000-2004

        ⑥Test set: 70% of the 2005 papers

        ⑦Validation set: 30% of the 2005 papers

(2)Reddit data:

        ①Dataset: a post graph built from Reddit data (posts made in September 2014)

        ②Samples: 50 communities, 232,965 posts

        ③Average degree: 492

        ④Two post nodes are connected whenever the same user comments on both posts

(3)⭐Their model is applied to the unseen test nodes without any fine-tuning

2.5.2. Generalizing across graphs: Protein-protein interactions

        ①Graphs: protein-protein interaction graphs, one per human tissue

        ②Features: positional gene sets, motif gene sets, immunological signatures

        ③Labels: 121 gene ontology sets

        ④Average number of nodes in one graph: 2372

        ⑤Average degree: 28.8

        ⑥Training set (graphs): 20

        ⑦Validation set: 2 graphs; testing set: 2 graphs

2.5.3. Runtime and parameter sensitivity

        ①Training-time experiments on the Reddit data:

        ②The figure shows how the neighborhood sample size influences performance:

with K=2 and S_{1}=S_{2}. The depth K=2 was chosen because it is generally where the accuracy curve has its last steep slope: accuracy grows quickly up to K=2 and only slowly beyond it. Likewise, the returns diminish as the neighborhood sample size increases.

2.5.4. Summary comparison between the different aggregator architectures

        ①Experiment settings: 6 in total, i.e., the 3 datasets each run in an unsupervised and a supervised setting

        ②Test: the non-parametric Wilcoxon signed-rank test (a usage sketch follows)
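As an illustration of how such a comparison could be run, here is a sketch using scipy.stats.wilcoxon on paired F1 scores across the 6 settings; the numbers are invented for illustration, not taken from the paper:

```python
from scipy.stats import wilcoxon

# Hypothetical paired F1 scores for two aggregator variants
f1_variant_a = [0.598, 0.950, 0.486, 0.500, 0.898, 0.600]
f1_variant_b = [0.609, 0.954, 0.502, 0.512, 0.948, 0.612]

stat, p = wilcoxon(f1_variant_a, f1_variant_b)
print(f"W={stat:.1f}, p={p:.3f}")  # small p => the gap is significant
```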

2.6. Theoretical analysis

        ①They want to test whether GraphSAGE can learn to predict the clustering coefficient of a node, i.e., the proportion of closed triangles within the node's 1-hop neighborhood.

        ②Each node has a feature x_{v}\in U,\forall v\in \mathcal{V}. Assume there exists a positive constant C such that \left \| x_{v}-x_{v'} \right \|_{2}> C for all pairs of nodes v and v'. Then for all \epsilon > 0 there exists a parameter setting for Algorithm 1 such that, after K=4 iterations:

\left | z_{v}-c_{v} \right |<\epsilon ,\forall v\in \mathcal{V}

where z_{v} is the final output of Algorithm 1 for node v and c_{v} is the clustering coefficient of that node
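For reference, the quantity c_{v} the theorem refers to can be computed directly; below is a minimal sketch over an adjacency-set representation (function and variable names are my own):

```python
def clustering_coefficient(adj, v):
    """Local clustering coefficient of node v; adj: {node: set of neighbors}."""
    neigh = list(adj[v])
    d = len(neigh)
    if d < 2:
        return 0.0
    # count edges among v's 1-hop neighbors (triangles closed through v)
    links = sum(1 for i in range(d) for j in range(i + 1, d)
                if neigh[j] in adj[neigh[i]])
    # fraction of the d*(d-1)/2 possible neighbor pairs that are connected
    return 2.0 * links / (d * (d - 1))
```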

2.7. Conclusion

        By sampling node neighborhoods, GraphSAGE effectively trades off performance against training and testing time while still achieving high accuracy. Furthermore, the authors see directed graphs, multi-modal graphs, and non-uniform neighborhood sampling functions as promising directions for future work

3. Supplementary knowledge

3.1. Hyperparameter hacking

(1)This seems to be a concept the authors coined themselves; I could not find an explanation of it online

4. Reference List

Hamilton, W., Ying, R. & Leskovec, J. (2017) 'Inductive representation learning on large graphs', 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA, pp. 1025-1035. doi: https://doi.org/10.48550/arXiv.1706.02216
