CS224W: Machine Learning with Graphs
Stanford / Winter 2021
04-pagerank
- Investigate graph analysis and learning from a matrix perspective
PageRank (Google Algorithm)
Background: the web contains a huge number of pages, and each page has hyperlinks pointing to other pages, so all pages together with these links form a graph. PageRank addresses the problem of ranking pages by importance based on this graph structure.
PageRank: The "Flow" Model
- A "vote" from an important page is worth more.
- Each link's vote is proportional to the importance of its source page.
- If page $i$ with importance $r_i$ has $d_i$ out-links, each link gets $r_i / d_i$ votes.
- Page $j$'s own importance $r_j$ is the sum of the votes on its in-links.
- A page is important if it is pointed to by other important pages.
- Define the "rank" $r_j$ for node $j$:
$$r_{j}=\sum_{i \rightarrow j} \frac{r_{i}}{d_{i}}$$
where $d_i$ is the out-degree of node $i$.
Example and Matrix Formulation
- Solving this system of equations by elimination is not efficient.
- Stochastic adjacency matrix $M$:
  - Let page $j$ have $d_j$ out-links.
  - If $j \rightarrow i$, then $M_{ij} = \frac{1}{d_j}$.
  - $M$ is a column stochastic matrix: its columns sum to 1.
- Rank vector $r$: one entry per page.
  - $r_i$ is the importance score of page $i$, with $\sum_{i} r_{i}=1$.
- The flow equation can then be written as
$$\boldsymbol{r}=\boldsymbol{M} \cdot \boldsymbol{r}$$
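As a concrete illustration, the column-stochastic matrix $M$ can be built from an edge list. This is a minimal sketch; the 3-page graph is a hypothetical example, not from the lecture.

```python
import numpy as np

# Hypothetical 3-page web graph: page 0 links to 1 and 2,
# page 1 links to 2, page 2 links back to 0.
edges = [(0, 1), (0, 2), (1, 2), (2, 0)]
N = 3

# Out-degree d_j of each page.
out_deg = np.zeros(N)
for src, _ in edges:
    out_deg[src] += 1

# Stochastic adjacency matrix: M[i, j] = 1/d_j if there is a link j -> i.
M = np.zeros((N, N))
for src, dst in edges:
    M[dst, src] = 1.0 / out_deg[src]

# M is column stochastic: every column sums to 1.
print(M.sum(axis=0))  # -> [1. 1. 1.]
```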
Connection to Random Walk
- The process:
  - At time $t$, a surfer is on some page $i$.
  - At time $t+1$, the surfer picks one of page $i$'s out-links uniformly at random and moves to the next page.
  - The walk arrives at some page $j$.
  - Repeat the process indefinitely.
- Define $p(t)$: its $i$-th component is the probability that the surfer is on page $i$ at time $t$ (a probability distribution over pages):
$$p(t+1)=M \cdot p(t)$$
- When the following condition holds, $p(t)$ is the stationary distribution of the random walk:
$$p(t+1)=M \cdot p(t)=p(t)$$
Solve PageRank: Power Iteration
- Eigenvalue view:
$$1 \cdot \boldsymbol{r}=\boldsymbol{M} \cdot \boldsymbol{r}$$
  - $r$ is the eigenvector of the stochastic adjacency matrix $M$ associated with eigenvalue 1.
  - Solving for this eigenvector directly is very slow for large-scale graphs (already slow beyond about a thousand nodes).
- Algorithm:
  - Initialize: $\boldsymbol{r}^{0}=[1/N, \ldots, 1/N]^{T}$
  - Iterate: $\boldsymbol{r}^{(t+1)}=\boldsymbol{M} \cdot \boldsymbol{r}^{t}$
  - Stop when $\left|\boldsymbol{r}^{(t+1)}-\boldsymbol{r}^{t}\right|_{1}<\varepsilon$ (the $L_2$ norm or any other reasonable norm also works)
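The iteration above can be sketched directly in NumPy. The function name `power_iteration` and the 3-page example graph are my own, for illustration only.

```python
import numpy as np

def power_iteration(M, eps=1e-10, max_iter=1000):
    """Iterate r <- M r from the uniform vector until the L1 change is below eps."""
    N = M.shape[0]
    r = np.full(N, 1.0 / N)
    for _ in range(max_iter):
        r_next = M @ r
        if np.abs(r_next - r).sum() < eps:
            return r_next
        r = r_next
    return r

# Toy graph: 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0 (column-stochastic M).
M = np.array([[0.0, 0.0, 1.0],
              [0.5, 0.0, 0.0],
              [0.5, 1.0, 0.0]])
r = power_iteration(M)
print(r)  # -> approximately [0.4, 0.2, 0.4]
```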
PageRank: Problems
- Two problems:
  - Dead ends: some pages have no out-links, causing importance to "leak out"; the iteration fails to converge to a proper probability distribution.
  - Spider traps: all out-links stay within one group of nodes, so that group eventually absorbs all the importance.
Solution to Dead Ends
- Teleport: when the walk reaches a dead-end node, at the next time step it jumps to any node in the graph with uniform probability.
- The core algorithm need not change: during preprocessing, replace each all-zero column of the stochastic adjacency matrix (a page with no out-links) with the uniform distribution.
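The preprocessing step can be sketched as follows; `fix_dead_ends` is an illustrative name, not from the lecture.

```python
import numpy as np

def fix_dead_ends(M):
    """Replace every all-zero column (a page with no out-links)
    with a uniform distribution over all pages."""
    M = M.copy()
    N = M.shape[0]
    dead = M.sum(axis=0) == 0
    M[:, dead] = 1.0 / N
    return M

# Page 2 is a dead end: its column in M is all zeros.
M = np.array([[0.0, 0.0, 0.0],
              [1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0]])
M_fixed = fix_dead_ends(M)
# Column 2 is now the uniform distribution [1/3, 1/3, 1/3],
# so M_fixed is column stochastic again.
```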
Solution to Spider Traps
- Teleport: at each time step, the random walker has two options:
  - With probability $\beta$, follow one of the current page's out-links uniformly at random, as in the original algorithm.
  - With probability $1-\beta$, jump to a random page ($\beta$ is typically 0.8 to 0.9).
- This lets the walker escape a spider trap within a few time steps.
PageRank: Final Form
- The dead-end problem is handled by preprocessing the stochastic adjacency matrix, with no change to the algorithm; the formulation below assumes dead ends have already been eliminated.
- Equation form:
$$r_{j}=\sum_{i \rightarrow j} \beta \frac{r_{i}}{d_{i}}+(1-\beta) \frac{1}{N}$$
where $d_i$ is the out-degree of node $i$.
- The Google Matrix $G$ form:
$$G=\beta M+(1-\beta)\left[\frac{1}{N}\right]_{N \times N}$$
$$\boldsymbol{r}=\boldsymbol{G} \cdot \boldsymbol{r}$$
Power iteration still applies.
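Putting the pieces together, the Google matrix plus power iteration can be sketched as below. The spider-trap graph is a hypothetical example: without teleport, pages 1 and 2 would absorb all importance.

```python
import numpy as np

def google_matrix(M, beta=0.85):
    """G = beta * M + (1 - beta) * [1/N]_{NxN}; assumes dead ends already fixed."""
    N = M.shape[0]
    return beta * M + (1 - beta) / N * np.ones((N, N))

# Spider trap: pages 1 and 2 link only to each other.
M = np.array([[0.0, 0.0, 0.0],
              [0.5, 0.0, 1.0],
              [0.5, 1.0, 0.0]])
G = google_matrix(M, beta=0.8)

r = np.full(3, 1.0 / 3.0)
for _ in range(200):
    r = G @ r
# Teleport keeps page 0's score strictly positive instead of draining it to 0.
```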
Personalized PageRank (PPR)
- Restricts PageRank's teleport: with probability $1-\beta$, the walker teleports to a node in a chosen subset $S$ of nodes rather than to any node in the graph.
Random Walks with Restarts
- The special case where the teleport set $S$ contains a single node.
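Random walks with restarts can be estimated by direct simulation. This is a sketch with a hypothetical adjacency-list graph; the function name and parameters are my own.

```python
import random

def random_walk_with_restart(adj, start, beta=0.85, steps=200000, seed=0):
    """With probability beta follow a random out-link; otherwise restart
    at the single seed node `start`. Visit frequencies estimate RWR scores."""
    rng = random.Random(seed)
    visits = {v: 0 for v in adj}
    cur = start
    for _ in range(steps):
        if adj[cur] and rng.random() < beta:
            cur = rng.choice(adj[cur])
        else:
            cur = start
        visits[cur] += 1
    return {v: c / steps for v, c in visits.items()}

# Node 3 points into the graph but is unreachable from the seed node 0,
# so its score relative to the seed is zero.
adj = {0: [1, 2], 1: [2], 2: [0], 3: [0]}
scores = random_walk_with_restart(adj, start=0)
```

Scores are highest for nodes close to the seed, which is why RWR is often used as a proximity measure to a query node.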
Matrix Factorization and Node Embeddings
Paper: Network Embedding as Matrix Factorization: Unifying DeepWalk, LINE, PTE, and node2vec
- Both node embeddings and random walks can be expressed in matrix form.
- If we regard two connected nodes as similar, each entry of the adjacency matrix encodes the similarity of a node pair, so
$$\mathbf{z}_{v}^{\mathrm{T}} \mathbf{z}_{u}=A_{u, v}$$
- The adjacency matrix can therefore be factorized to produce the node embedding matrix:
$$\boldsymbol{Z}^{T} \boldsymbol{Z}=A$$
- An exact factorization is generally impossible when the embedding dimension is smaller than the number of nodes, so this is posed as an optimization problem:
$$\min _{\boldsymbol{Z}}\left\|A-\boldsymbol{Z}^{T} \boldsymbol{Z}\right\|_{2}$$
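A sketch of this objective minimized by plain gradient descent. The 4-node graph, learning rate, and iteration count are illustrative assumptions; with embedding dimension $d < N$ an exact fit is generally unreachable, but the loss should decrease.

```python
import numpy as np

# Adjacency matrix of a small undirected graph (triangle 0-1-2 plus edge 2-3).
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)

d, N = 2, A.shape[0]                     # embedding dimension d < N
rng = np.random.default_rng(0)
Z = rng.normal(scale=0.1, size=(d, N))   # columns of Z are node embeddings z_v

def loss(Z):
    return np.sum((A - Z.T @ Z) ** 2)    # ||A - Z^T Z||_F^2

loss_before = loss(Z)
lr = 0.005
for _ in range(2000):
    R = A - Z.T @ Z                      # residual
    Z += lr * 4 * (Z @ R)                # gradient step (A and Z^T Z are symmetric)
loss_after = loss(Z)
```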
- The random-walk framework can also be written in matrix form; per the paper above, DeepWalk implicitly factorizes
$$\log \left(\operatorname{vol}(G)\left(\frac{1}{T} \sum_{r=1}^{T}\left(D^{-1} A\right)^{r}\right) D^{-1}\right)-\log b$$
where $\operatorname{vol}(G)$ is the sum of node degrees, $T$ is the context window size, $b$ is the number of negative samples, and $D$ is the diagonal degree matrix.
Limitation of Node Embeddings via Matrix Factorization and Random Walks
- Cannot obtain embeddings for nodes not in the training set; embeddings must be recomputed for new nodes at test time.
- Cannot capture structural similarity:
  - Node 1 and node 11 (in the lecture's example graph) are structurally similar, yet their embeddings are very different, because a single random walk is unlikely to travel from 1 to 11 or from 11 to 1.
  - DeepWalk and node2vec do not capture structural similarity.
- Cannot utilize node, edge, and graph features.
- Solution to these limitations: deep representation learning and Graph Neural Networks.