CS224W: Machine Learning with Graphs
Stanford / Winter 2021
04-pagerank
- Investigate graph analysis and learning from a matrix perspective
PageRank (Google Algorithm)
Background: the web contains a huge number of pages, and each page has hyperlinks pointing to other pages, so all pages together with these links form a graph. PageRank addresses the problem of ranking pages by importance based on this graph structure.
PageRank: The "Flow" Model
- A "vote" from an important page is worth more.
- Each link's vote is proportional to the importance of its source page.
- If page $i$ with importance $r_i$ has $d_i$ out-links, each link gets $r_i / d_i$ votes.
- Page $j$'s own importance $r_j$ is the sum of the votes on its in-links.
- A page is important if it is pointed to by other important pages.
- Define the "rank" $r_j$ for node $j$:
$$r_{j}=\sum_{i \rightarrow j} \frac{r_{i}}{d_{i}}$$
where $d_i$ is the out-degree of node $i$.
Example and Matrix Formulation
- Solving this system of equations by elimination is not efficient.
- Stochastic adjacency matrix $M$:
  - Let page $j$ have $d_j$ out-links.
  - If $j \rightarrow i$, then $M_{ij} = \frac{1}{d_j}$.
  - $M$ is a column stochastic matrix: its columns sum to 1.
- Rank vector $r$: one entry per page.
  - $r_i$ is the importance score of page $i$, with $\sum_{i} r_{i}=1$.
- The flow equation can then be written as
$$\boldsymbol{r}=\boldsymbol{M} \cdot \boldsymbol{r}$$
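As a concrete illustration, the column-stochastic matrix $M$ can be built from an edge list. This is a minimal sketch; the 3-page graph is a hypothetical example, not from the lecture.

```python
import numpy as np

# Hypothetical 3-page web graph: page 0 links to 1 and 2,
# page 1 links to 2, page 2 links back to 0.
edges = [(0, 1), (0, 2), (1, 2), (2, 0)]
N = 3

# Out-degree d_j of each page.
out_deg = np.zeros(N)
for src, _ in edges:
    out_deg[src] += 1

# Stochastic adjacency matrix: M[i, j] = 1/d_j if there is a link j -> i.
M = np.zeros((N, N))
for src, dst in edges:
    M[dst, src] = 1.0 / out_deg[src]

# M is column stochastic: every column sums to 1.
print(M.sum(axis=0))  # -> [1. 1. 1.]
```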
Connection to Random Walk
- The process:
  - At time $t$, a surfer is on some page $i$.
  - At time $t+1$, the surfer picks one of page $i$'s out-links uniformly at random and moves to the next page.
  - The walk arrives at some page $j$.
  - Repeat the process indefinitely.
- Define $p(t)$: its $i$-th component is the probability that the surfer is on page $i$ at time $t$ (a probability distribution over pages):
$$p(t+1)=M \cdot p(t)$$
- When the following condition holds, $p(t)$ is the stationary distribution of the random walk:
$$p(t+1)=M \cdot p(t)=p(t)$$
Solve PageRank: Power Iteration
- Eigenvalue view:
$$1 \cdot \boldsymbol{r}=\boldsymbol{M} \cdot \boldsymbol{r}$$
  - $r$ is the eigenvector of the stochastic adjacency matrix $M$ associated with eigenvalue 1.
  - Solving for this eigenvector directly is very slow for large-scale graphs (already slow beyond about a thousand nodes).
- Algorithm:
  - Initialize: $\boldsymbol{r}^{0}=[1/N, \ldots, 1/N]^{T}$
  - Iterate: $\boldsymbol{r}^{(t+1)}=\boldsymbol{M} \cdot \boldsymbol{r}^{t}$
  - Stop when $\left|\boldsymbol{r}^{(t+1)}-\boldsymbol{r}^{t}\right|_{1}<\varepsilon$ (the $L_2$ norm or any other reasonable norm also works)
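The iteration above can be sketched directly in NumPy. The function name `power_iteration` and the 3-page example graph are my own, for illustration only.

```python
import numpy as np

def power_iteration(M, eps=1e-10, max_iter=1000):
    """Iterate r <- M r from the uniform vector until the L1 change is below eps."""
    N = M.shape[0]
    r = np.full(N, 1.0 / N)
    for _ in range(max_iter):
        r_next = M @ r
        if np.abs(r_next - r).sum() < eps:
            return r_next
        r = r_next
    return r

# Toy graph: 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0 (column-stochastic M).
M = np.array([[0.0, 0.0, 1.0],
              [0.5, 0.0, 0.0],
              [0.5, 1.0, 0.0]])
r = power_iteration(M)
print(r)  # -> approximately [0.4, 0.2, 0.4]
```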
PageRank: Problems
- Two problems:
  - Dead ends: some pages have no out-links, causing importance to "leak out"; the iteration fails to converge to a proper probability distribution.
  - Spider traps: all out-links stay within one group of nodes, so that group eventually absorbs all the importance.
Solution to Dead Ends
- Teleport: when the walk reaches a dead-end node, at the next time step it jumps to any node in the graph with uniform probability.
- The core algorithm need not change: during preprocessing, replace each all-zero column of the stochastic adjacency matrix (a page with no out-links) with the uniform distribution.
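The preprocessing step can be sketched as follows; `fix_dead_ends` is an illustrative name, not from the lecture.

```python
import numpy as np

def fix_dead_ends(M):
    """Replace every all-zero column (a page with no out-links)
    with a uniform distribution over all pages."""
    M = M.copy()
    N = M.shape[0]
    dead = M.sum(axis=0) == 0
    M[:, dead] = 1.0 / N
    return M

# Page 2 is a dead end: its column in M is all zeros.
M = np.array([[0.0, 0.0, 0.0],
              [1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0]])
M_fixed = fix_dead_ends(M)
# Column 2 is now the uniform distribution [1/3, 1/3, 1/3],
# so M_fixed is column stochastic again.
```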
Solution to Spider Traps
- Teleport: at each time step, the random walker has two options:
  - With probability $\beta$, follow one of the current page's out-links uniformly at random, as in the original algorithm.
  - With probability $1-\beta$, jump to a random page ($\beta$ is typically 0.8 to 0.9).
- This lets the walker escape a spider trap within a few time steps.
PageRank: Final Form
- The dead-end problem is handled by preprocessing the stochastic adjacency matrix, with no change to the algorithm; the formulation below assumes dead ends have already been eliminated.
- Equation form:
$$r_{j}=\sum_{i \rightarrow j} \beta \frac{r_{i}}{d_{i}}+(1-\beta) \frac{1}{N}$$
where $d_i$ is the out-degree of node $i$.
- The Google Matrix $G$ form:
$$G=\beta M+(1-\beta)\left[\frac{1}{N}\right]_{N \times N}$$
$$\boldsymbol{r}=\boldsymbol{G} \cdot \boldsymbol{r}$$
Power iteration still applies.
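Putting the pieces together, the Google matrix plus power iteration can be sketched as below. The spider-trap graph is a hypothetical example: without teleport, pages 1 and 2 would absorb all importance.

```python
import numpy as np

def google_matrix(M, beta=0.85):
    """G = beta * M + (1 - beta) * [1/N]_{NxN}; assumes dead ends already fixed."""
    N = M.shape[0]
    return beta * M + (1 - beta) / N * np.ones((N, N))

# Spider trap: pages 1 and 2 link only to each other.
M = np.array([[0.0, 0.0, 0.0],
              [0.5, 0.0, 1.0],
              [0.5, 1.0, 0.0]])
G = google_matrix(M, beta=0.8)

r = np.full(3, 1.0 / 3.0)
for _ in range(200):
    r = G @ r
# Teleport keeps page 0's score strictly positive instead of draining it to 0.
```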
Personalized PageRank (PPR)
- Restricts PageRank's teleport: with probability $1-\beta$, the walker teleports to a node in a chosen subset $S$ of nodes rather than to any node in the graph.
Random Walks with Restarts
- The special case where the teleport set $S$ contains a single node.
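Random walks with restarts can be estimated by direct simulation. This is a sketch with a hypothetical adjacency-list graph; the function name and parameters are my own.

```python
import random

def random_walk_with_restart(adj, start, beta=0.85, steps=200000, seed=0):
    """With probability beta follow a random out-link; otherwise restart
    at the single seed node `start`. Visit frequencies estimate RWR scores."""
    rng = random.Random(seed)
    visits = {v: 0 for v in adj}
    cur = start
    for _ in range(steps):
        if adj[cur] and rng.random() < beta:
            cur = rng.choice(adj[cur])
        else:
            cur = start
        visits[cur] += 1
    return {v: c / steps for v, c in visits.items()}

# Node 3 points into the graph but is unreachable from the seed node 0,
# so its score relative to the seed is zero.
adj = {0: [1, 2], 1: [2], 2: [0], 3: [0]}
scores = random_walk_with_restart(adj, start=0)
```

Scores are highest for nodes close to the seed, which is why RWR is often used as a proximity measure to a query node.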
Matrix Factorization and Node Embeddings
Paper: Network Embedding as Matrix Factorization: Unifying DeepWalk, LINE, PTE, and node2vec
- Both node embeddings and random walks can be expressed in matrix form.
- If we regard two connected nodes as similar, each entry of the adjacency matrix encodes the similarity of a node pair, so
$$\mathbf{z}_{v}^{\mathrm{T}} \mathbf{z}_{u}=A_{u, v}$$
- The adjacency matrix can therefore be factorized to produce the node embedding matrix:
$$\boldsymbol{Z}^{T} \boldsymbol{Z}=A$$
- An exact factorization is generally impossible when the embedding dimension is smaller than the number of nodes, so this is posed as an optimization problem:
$$\min _{\boldsymbol{Z}}\left\|A-\boldsymbol{Z}^{T} \boldsymbol{Z}\right\|_{2}$$
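A sketch of this objective minimized by plain gradient descent. The 4-node graph, learning rate, and iteration count are illustrative assumptions; with embedding dimension $d < N$ an exact fit is generally unreachable, but the loss should decrease.

```python
import numpy as np

# Adjacency matrix of a small undirected graph (triangle 0-1-2 plus edge 2-3).
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)

d, N = 2, A.shape[0]                     # embedding dimension d < N
rng = np.random.default_rng(0)
Z = rng.normal(scale=0.1, size=(d, N))   # columns of Z are node embeddings z_v

def loss(Z):
    return np.sum((A - Z.T @ Z) ** 2)    # ||A - Z^T Z||_F^2

loss_before = loss(Z)
lr = 0.005
for _ in range(2000):
    R = A - Z.T @ Z                      # residual
    Z += lr * 4 * (Z @ R)                # gradient step (A and Z^T Z are symmetric)
loss_after = loss(Z)
```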
- The random-walk framework can also be written in matrix form; per the paper above, DeepWalk implicitly factorizes
$$\log \left(\operatorname{vol}(G)\left(\frac{1}{T} \sum_{r=1}^{T}\left(D^{-1} A\right)^{r}\right) D^{-1}\right)-\log b$$
where $\operatorname{vol}(G)$ is the sum of node degrees, $T$ is the context window size, $b$ is the number of negative samples, and $D$ is the diagonal degree matrix.
Limitation of Node Embeddings via Matrix Factorization and Random Walks
- Cannot obtain embeddings for nodes not in the training set; embeddings must be recomputed for new nodes at test time.
- Cannot capture structural similarity:
  - Node 1 and node 11 (in the lecture's example graph) are structurally similar, yet their embeddings are very different, because a single random walk is unlikely to travel from 1 to 11 or from 11 to 1.
  - DeepWalk and node2vec do not capture structural similarity.
- Cannot utilize node, edge, and graph features.
- Solution to these limitations: deep representation learning and Graph Neural Networks.