Node Measures and Computation in Social Media Analysis

Node Measures and Computation

1. Node Centrality

1. Geometric Centrality Measures

  • (In)Degree Centrality


The number of incoming links:
C_{deg}(x) = d_{in}(x)

Equivalent to the number of nodes at distance one, or to majority voting.

  • Closeness Centrality


Nodes that are more central have smaller distances:
C_{clos}(x) = \frac{1}{\sum_y d(y,x)}

$d(y,x)$ is the length of the shortest path from y to x.

Nodes that are more central have smaller distances to other nodes and higher centrality.
image

The graph must be (strongly) connected!

  • In the mathematical theory of directed graphs, a graph is said to be strongly connected if every vertex is reachable from every other vertex. The strongly connected components of an arbitrary directed graph form a partition into subgraphs that are themselves strongly connected. It is possible to test the strong connectivity of a graph, or to find its strongly connected components, in linear time (that is, Θ(V+E)).
  • In graph theory, a component, sometimes called a connected component, of an undirected graph is a subgraph in which any two vertices are connected to each other by paths, and which is connected to no additional vertices in the supergraph.
  • Harmonic Centrality

C_{har}(x) = \sum_{y \neq x} \frac{1}{d(y,x)} = \sum_{d(y,x) < \infty,\; y \neq x} \frac{1}{d(y,x)}

  • Strongly correlated to closeness centrality
  • Naturally also accounts for nodes y that cannot reach x
  • Can be applied to graphs that are not strongly connected

Harmonic centrality can be normalized by dividing by n − 1, where n is the number of nodes in the graph:
C_{har}(x) = \frac{1}{n-1} \sum_{y \neq x} \frac{1}{d(y,x)} = \frac{1}{n-1} \sum_{d(y,x) < \infty,\; y \neq x} \frac{1}{d(y,x)}
image
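To make these geometric measures concrete, here is a minimal Python sketch that computes in-degree, closeness, and harmonic centrality with a plain BFS; the adjacency-list graph and helper names are illustrative, not from the original notes:

```python
from collections import deque

def in_degree(graph, x):
    """C_deg(x) = d_in(x): the number of incoming links."""
    return sum(1 for y in graph if x in graph[y])

def distances_to(graph, x):
    """BFS over reversed edges: returns d(y, x) for every y that can reach x."""
    dist = {x: 0}
    queue = deque([x])
    while queue:
        u = queue.popleft()
        for y in graph:                     # scan for edges y -> u
            if u in graph[y] and y not in dist:
                dist[y] = dist[u] + 1
                queue.append(y)
    return dist

def closeness(graph, x):
    """C_clos(x) = 1 / sum_y d(y, x); requires a (strongly) connected graph."""
    dist = distances_to(graph, x)
    return 1.0 / sum(d for y, d in dist.items() if y != x)

def harmonic(graph, x):
    """C_har(x) = sum_{y != x} 1 / d(y, x); unreachable y contribute 0."""
    dist = distances_to(graph, x)
    return sum(1.0 / d for y, d in dist.items() if y != x)

# Toy directed graph as an adjacency list (illustrative):
G = {"a": {"b"}, "b": {"c"}, "c": {"a", "d"}, "d": {"a"}}
print(in_degree(G, "a"), closeness(G, "a"), harmonic(G, "a"))
```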

2. Spectral Centrality Measures

  • Eigenvector Centrality

In degree centrality, we consider nodes with more connections to be more important. However, in real-world scenarios, having more friends does not by itself guarantee that someone is important: having more important friends provides a stronger signal.


Eigenvector centrality tries to generalize degree centrality by incorporating the importance of the neighbors (or incoming neighbors in directed graphs). It is defined for both directed and undirected graphs. To keep track of neighbors, we can use the adjacency matrix A of a graph.

  • We assume the eigenvector centrality of a node is $C_e(V_i)$.
  • We would like $C_e(V_i)$ to be higher when important neighbors point to it.

    - For incoming neighbors, $A_{j,i} = 1$.
  • Each node starts with the same score; each node then gives its score away to its successors, and the scores are normalized:
    C_e(V_i) = \frac{1}{\lambda} \sum_{j=1}^{n} A_{j,i} C_e(V_j)

$\lambda$ is the norm of the centrality vector of all nodes.

Let $C_e = (C_e(V_1), C_e(V_2), \dots, C_e(V_n))^T$. Then $\lambda C_e = A^T C_e$.

This means that $C_e$ is an eigenvector of the adjacency matrix $A^T$ and $\lambda$ is the corresponding eigenvalue.


We should choose the largest eigenvalue and the corresponding eigenvector. The largest entry of this eigenvector then identifies the most central node.


For example, here’s an undirected graph:
image

So this is the adjacency matrix:
image

Let’s calculate the eigenvalues and eigenvectors with MATLAB:
image

We find that 2.6855 is the largest eigenvalue, so we normalize the corresponding eigenvector and see that its second entry is the largest. So node $V_2$ is the most central node.
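The same computation takes a few lines of NumPy. Since the example graph above is only shown as an image, the adjacency matrix below is an illustrative stand-in; the procedure (take the eigenvector of the largest eigenvalue, normalize it, pick the largest entry) is the same:

```python
import numpy as np

# Hypothetical 4-node undirected graph (the example above is an image,
# so this adjacency matrix is an illustrative stand-in):
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 1],
              [1, 1, 0, 1],
              [0, 1, 1, 0]], dtype=float)

vals, vecs = np.linalg.eig(A)          # for undirected graphs A = A^T
k = np.argmax(vals.real)               # index of the largest eigenvalue
c = np.abs(vecs[:, k].real)            # dominant eigenvector, made non-negative
c /= np.linalg.norm(c)                 # normalize to unit length
print("lambda =", vals.real[k])
print("centrality =", np.round(c, 4))
print("most central node: V%d" % (np.argmax(c) + 1))
```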

  • Katz Centrality

A major problem with eigenvector centrality arises when it deals with directed graphs: nodes without incoming edges have centrality values of zero. Eigenvector centrality only considers the effect of the network topology and cannot capture external knowledge. To resolve this problem, we add a bias term $\beta$ to the centrality values of all nodes:
C_{Katz}(V_i) = \alpha \sum_{j=1}^n A_{j,i} C_{Katz}(V_j) + \beta
In matrix form, Katz centrality is:
C_{Katz} = \beta (I - \alpha A^T)^{-1} \cdot \mathbf{1}
$\alpha < \frac{1}{\lambda}$ is selected so that the matrix is invertible.

Let’s take this graph for example:
image
The adjacency matrix is

A = 
\begin{bmatrix}
 0 &1  &1  &1  &0 \\ 
 1 &0  &1  &1  &1 \\ 
 1 &1  &0  &1  &1 \\ 
 1 &1  &1  &0  &0 \\ 
 0 &1  &1  &0  &0 
\end{bmatrix} = A^T

The eigenvalues are -1.68, -1.0, -1.0, 0.35, and 3.32.
We assume $\alpha < 1/3.32$ and $\beta = 0.2$.

C_{Katz} = \beta (I-\alpha A^T)^{-1} \cdot \mathbf{1} = 
\begin{bmatrix}
1.14\\ 
1.31\\ 
1.31\\ 
1.14\\ 
0.85
\end{bmatrix}

So nodes $V_2$ and $V_3$ are the most important nodes.
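A quick NumPy check of this closed form, using the adjacency matrix from the example. The notes only require $\alpha < 1/3.32$, so the value $\alpha = 0.25$ below is an assumption; it happens to reproduce the vector above up to rounding:

```python
import numpy as np

# Adjacency matrix from the Katz example above (A = A^T).
A = np.array([[0, 1, 1, 1, 0],
              [1, 0, 1, 1, 1],
              [1, 1, 0, 1, 1],
              [1, 1, 1, 0, 0],
              [0, 1, 1, 0, 0]], dtype=float)

alpha, beta = 0.25, 0.2        # alpha is assumed; any alpha < 1/3.32 works
n = A.shape[0]
C_katz = beta * np.linalg.inv(np.eye(n) - alpha * A.T) @ np.ones(n)
print(np.round(C_katz, 2))     # ~[1.14, 1.31, 1.31, 1.14, 0.86]
```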

  • PageRank

In directed graphs, once a node becomes an authority (high centrality), it passes all its centrality along all of its out-links. This is less desirable, since not everyone known by a well-known person is well known.

So we can divide the value of passed centrality by the number of outgoing links. Each connected neighbor gets a fraction of the source node’s centrality.

C_p = \beta (I - \alpha A^TD^{-1})^{-1} \cdot \mathbf{1}

This is similar to Katz centrality: $\alpha < 1/\lambda$, where $\lambda$ is the largest eigenvalue of $A^T D^{-1}$. In undirected graphs, the largest eigenvalue of $A^T D^{-1}$ is $\lambda = 1$; therefore $\alpha < 1$. $\beta$ is often set to $1 - \alpha$.

And $D$ is the degree matrix: $D = [d_{ii}]$, where $d_{ii}$ is the degree of node $i$. For example, in this graph:
image
The degree matrix is

\begin{bmatrix}
3 & 0 & 0 & 0 & 0 & 0\\
0 & 2 & 0 & 0 & 0 & 0\\
0 & 0 & 3 & 0 & 0 & 0\\
0 & 0 & 0 & 3 & 0 & 0\\
0 & 0 & 0 & 0 & 0 & 0\\
0 & 0 & 0 & 0 & 0 & 2
\end{bmatrix}

The inverse of matrix $D$ is also a diagonal n x n matrix, of the following form:

\begin{bmatrix}
\frac{1}{d_1} & 0 & \dots & 0\\
0 & \frac{1}{d_2} & \dots & 0\\
\vdots & \vdots & \ddots & \vdots\\
0 & 0 & \dots & \frac{1}{d_n}
\end{bmatrix}

When a node $j$ has out-degree zero, $A_{j,i} = 0$ for all $i$, which makes the term inside the summation $\frac{0}{0}$. We can fix this problem by setting $d_j^{out} = 1$, since such a node will not contribute any centrality to any other node anyway.

Here is a PageRank example:
image

A = 
\begin{bmatrix}
 0 &1  &0  &1  &1 \\ 
 1 &0  &1  &0  &1 \\ 
 0 &1  &0  &1  &1 \\ 
 1 &0  &1  &0  &0 \\ 
 1 &1  &1  &0  &0 
\end{bmatrix}
D = 
\begin{bmatrix}
 3 &0  &0  &0  &0\\ 
 0 &3  &0  &0  &0\\ 
 0 &0  &3  &0  &0\\ 
 0 &0  &0  &2  &0 \\ 
 0 &0  &0  &0  &3
\end{bmatrix}
C_p = \beta (I - \alpha A^TD^{-1})^{-1} ·1 = 
\begin{bmatrix}
2.14\\
2.13\\
2.14\\
1.45\\
2.13
\end{bmatrix}

So nodes $V_1$ and $V_3$ are the most important.
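The same closed form in NumPy, with $A$ and $D$ taken from this example. The notes do not state which $\alpha$ and $\beta$ were used; the values below are assumptions that reproduce the printed vector up to rounding:

```python
import numpy as np

# Adjacency matrix from the PageRank example above.
A = np.array([[0, 1, 0, 1, 1],
              [1, 0, 1, 0, 1],
              [0, 1, 0, 1, 1],
              [1, 0, 1, 0, 0],
              [1, 1, 1, 0, 0]], dtype=float)
d = A.sum(axis=1)                 # out-degrees: 3, 3, 3, 2, 3
D_inv = np.diag(1.0 / d)          # no zero out-degrees in this graph

alpha, beta = 0.95, 0.1           # assumed; alpha < 1 suffices for undirected graphs
n = A.shape[0]
C_p = beta * np.linalg.inv(np.eye(n) - alpha * A.T @ D_inv) @ np.ones(n)
print(np.round(C_p, 2))           # ~[2.14, 2.13, 2.14, 1.46, 2.13]
```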

  • HITS Centrality

In a directed graph, a node is more important if it has more links.

Each node has 2 scores:

  • Quality as an expert (hub): total sum of the votes (authority scores) of the nodes that it points to
  • Quality as a content provider (authority): total sum of the votes (hub scores) from the nodes that point to it

Authorities are nodes containing useful information, for instance a newspaper home page or a search engine home page.

Hubs are nodes that link to authorities, for example Hao123.com.


When counting in-links for authority, each hub page starts with a hub score of 1. Authorities collect these votes, hubs then collect the authority scores, authorities collect hub scores again, and so on.

  • A good hub links to many good authorities.
  • A good authority is linked from many good hubs.
    We can use these two properties and iterate based on the mutual-reinforcement method. Each iteration updates the two values, until they stop changing:
C_{aut}(x) = \sum _{y\rightarrow x} C_{hub}(y)
C_{hub}(x) = \sum _{x\rightarrow y} C_{aut}(y)

++HITS algorithm:++

Each page i has 2 scores:

  • Authority score: $a_i$
  • Hub score: $h_i$
    Convergence criteria:
\sum _{i}(h_i^{(t)}-h_i^{(t+1)})^2 < \epsilon
\sum _{i}(a_i^{(t)}-a_i^{(t+1)})^2 < \epsilon
  • Initialize: $a^{(0)}_j = 1/\sqrt n$, $h_j^{(0)} = 1/\sqrt n$
  • Then keep iterating until convergence:
\forall i: Authority: a_i^{(t+1)} = \sum _{j\rightarrow i} h_j^{(t)}
\forall i: Hub: h_i^{(t+1)} = \sum _{i\rightarrow j} a_j^{(t)}
\forall i: Normalize: a_i^{(t+1)} = a_i^{(t+1)}/\sqrt{\sum_i (a_i ^{(t+1)})^2}
h_i^{(t+1)} = h_i^{(t+1)}/\sqrt{\sum_i (h_i ^{(t+1)})^2}
  • HITS in vector notation:
  • Vectors $a = (a_1, \dots , a_n)$, $h = (h_1, \dots, h_n)$
  • Adjacency matrix A (n x n): $A_{ij} = 1$ if $i \rightarrow j$
  • Can rewrite $h_i = \sum_{i \rightarrow j} a_j$ as $h_i = \sum_{j} A_{ij} \cdot a_j$
  • So: $h = A \cdot a$ and similarly: $a = A^T \cdot h$
  • Repeat until convergence:
  • $h^{(t+1)} = A \cdot a^{(t)}$
  • $a^{(t+1)} = A^T \cdot h^{(t)}$
  • Normalize $a^{(t+1)}$ and $h^{(t+1)}$
The steady state (HITS has converged) is:
a = A^T \cdot (A \cdot a)/c'
A^T A a = c' \cdot a
A A^T h = c'' \cdot h
So, authority a is the eigenvector of $A^T A$ (associated with the largest eigenvalue). Similarly, hub h is the eigenvector of $A A^T$.
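Here is a minimal Python sketch of the HITS iteration just described; the initialization, updates, normalization, and convergence test follow the algorithm above, while the toy graph at the bottom is illustrative:

```python
import numpy as np

def hits(A, eps=1e-10, max_iter=1000):
    """HITS: a = A^T h, h = A a, normalized after every round."""
    n = A.shape[0]
    a = np.ones(n) / np.sqrt(n)       # a_j^(0) = h_j^(0) = 1/sqrt(n)
    h = np.ones(n) / np.sqrt(n)
    for _ in range(max_iter):
        a_new = A.T @ h               # authorities collect hub scores
        h_new = A @ a                 # hubs collect authority scores
        a_new /= np.linalg.norm(a_new)
        h_new /= np.linalg.norm(h_new)
        converged = (np.sum((a - a_new) ** 2) < eps and
                     np.sum((h - h_new) ** 2) < eps)
        a, h = a_new, h_new
        if converged:
            break
    return a, h

# Toy directed graph: A[i, j] = 1 means an edge i -> j (illustrative).
A = np.array([[0, 1, 1],
              [0, 0, 1],
              [1, 0, 0]], dtype=float)
a, h = hits(A)
print("authority:", np.round(a, 3), "hub:", np.round(h, 3))
```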

Shortcomings of HITS

  1. Low computational efficiency

    Because HITS is query-dependent, it must be computed in real time after a user query is received, and the algorithm itself needs many rounds of iteration to reach its final result. This makes it computationally inefficient, which must be weighed carefully in practical applications.
  2. Topic drift

    If the expanded page set contains pages unrelated to the query topic, and those pages link to each other heavily, HITS is likely to give these irrelevant pages high rankings, causing the search results to drift off topic. This phenomenon is known as the "Tightly-Knit Community Effect".
  3. Easily manipulated by spammers

    HITS is mechanically easy to game. For example, a spammer can build a page whose content links to many high-quality pages or famous sites, making it a good hub page, and then link from that page to spam pages, thereby raising the spam pages' authority scores.
  4. Structural instability

    Structural instability means that, within the original expanded page set, adding or deleting a few pages or changing a few links can drastically change the HITS ranking.
  • The differences between HITS and PageRank

HITS and PageRank are arguably the two most fundamental and important link-analysis algorithms in search engines. As the introductions above show, the two differ greatly in their basic conceptual models, computational approaches, and implementation details. The differences are explained one by one below.

1. HITS is closely tied to the user's query, whereas PageRank is query-independent. HITS can therefore serve on its own as a relevance measure, while PageRank must be combined with content-similarity computation to evaluate page relevance.

2. Because HITS is tied to the user's query, it must be computed in real time after the query is received, which is inefficient; PageRank can be computed offline after crawling and its results used directly online, which is efficient.

3. HITS processes a small number of objects, computing only the link relations among the pages in the expanded set; PageRank is a global algorithm that processes every page node on the web.

4. Comparing computational efficiency and the size of the processed set, PageRank is better suited to server-side deployment, while HITS is better suited to the client side.

5. HITS suffers from topic drift, so it is better suited to specific user queries; PageRank has the advantage when handling broad queries.

6. HITS computes two scores for each page, while PageRank computes only one. In the search-engine domain, the authority score computed by HITS is valued more, but in many other fields where HITS is applied, the hub score also plays an important role.

7. From the perspective of link anti-spam, PageRank is mechanically superior to HITS, and HITS is more susceptible to link spam.

8. HITS is structurally unstable: small changes to the link relations within the expanded page set can greatly affect the final ranking. PageRank is comparatively stable, fundamentally because of the "random teleportation" in its computation.

  • Power Iteration Method

There is also a power iteration method to compute eigen-centrality, Katz centrality, and PageRank.

  • Eigen-centrality:
  • Set $C^{(0)} \leftarrow 1, k \leftarrow 1$

  • 1. $C^{(k)} \leftarrow A^T C^{(k-1)}$

  • 2. $C^{(k)} \leftarrow C^{(k)}/\left \| C^{(k)} \right \|_2$ (the norm converges to $\lambda$)

  • 3. If $\left \| C^{(k)} - C^{(k-1)} \right \| > \epsilon$:

  • 4. $k \leftarrow k+1$, go to step 1
    image

  • Katz Centrality
  • Set $C^{(0)} \leftarrow 1, k \leftarrow 1$

  • 1. $C^{(k)} \leftarrow \alpha A^T C^{(k-1)} + \beta \mathbf{1}$

  • 2. If $\left \| C^{(k)} - C^{(k-1)} \right \| > \epsilon$:

  • 3. $k \leftarrow k+1$, go to step 1

  • PageRank

Refer to the power iteration algorithm for Katz centrality, and simply replace $A^T$ with $A^T D^{-1}$.
image
image
image
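A compact Python sketch of the three power-iteration variants above; the function names and tolerance are illustrative:

```python
import numpy as np

def eigen_centrality(A, eps=1e-8, max_iter=1000):
    """Power iteration: C <- A^T C, normalized; the norm tends to lambda."""
    C = np.ones(A.shape[0])
    for _ in range(max_iter):
        C_new = A.T @ C
        C_new /= np.linalg.norm(C_new)
        if np.linalg.norm(C_new - C) < eps:
            break
        C = C_new
    return C_new

def katz_centrality(A, alpha, beta, eps=1e-8, max_iter=1000):
    """Power iteration: C <- alpha A^T C + beta 1, for alpha < 1/lambda."""
    n = A.shape[0]
    C = np.ones(n)
    for _ in range(max_iter):
        C_new = alpha * (A.T @ C) + beta * np.ones(n)
        if np.linalg.norm(C_new - C) < eps:
            break
        C = C_new
    return C_new

def pagerank(A, alpha, beta, **kw):
    """Katz iteration with A^T replaced by A^T D^{-1} (sinks get d_out = 1)."""
    d = np.where(A.sum(axis=1) > 0, A.sum(axis=1), 1.0)
    return katz_centrality(A / d[:, None], alpha, beta, **kw)

# e.g. katz_centrality(A, 0.25, 0.2) on the Katz example reproduces its vector.
```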

3. Path-based Measures of Centrality

  • Edge Betweenness

Number of shortest paths passing over the edge

Steps to compute edge betweenness:

  • Forward step: count the number of shortest paths $\sigma_{ai}$ from A to all other nodes i of the network.
  • Backward step: compute betweenness; if there are multiple shortest paths, count them fractionally.
  • Node Betweenness Centrality

Another way of looking at centrality is by considering how important nodes are in connecting other nodes.

C_b(V_i) = \sum _{s \neq t\neq V_i} \frac{\sigma_{st} (V_i)}{\sigma_{st}}

$\sigma_{st}$: the number of shortest paths from vertex s to t (a.k.a. information pathways)

$\sigma_{st}(V_i)$: the number of ++shortest paths++ from s to t that pass through $V_i$

  • betweenness centrality example1:
    image
  • betweenness centrality example2:
    image
  • Brandes Algorithm

A better way to compute node betweenness:

C_b(V_i) = \sum _{s \neq t\neq V_i} \frac{\sigma_{st} (V_i)}{\sigma_{st}} = \sum_{s \neq V_i} \delta_s (V_i)
\delta_s(V_i) = \sum_{t \neq V_i} \frac{\sigma_{st}(V_i)}{\sigma_{st}}

$\delta_s(V_i)$ - the dependency of s on $V_i$
There exists a recurrence relation that can help us determine $\delta_s(V_i)$:

\delta_s(V_i) = \sum_{w : V_i \in pred(s,w)} \frac{\sigma_{sV_i}}{\sigma_{sw}} (1+\delta_s(w))

pred(s,w) is the set of predecessors of w in the shortest paths from s to w.

- $V_i$ is one of w's parent nodes.

- If w is not a child node of $V_i$, it can be ignored.
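For reference, here is a minimal Python implementation of Brandes' algorithm for unweighted graphs, following the forward and backward steps above. The adjacency-list graph at the bottom is illustrative, and the scores count ordered (s, t) pairs, so for undirected graphs the conventional value is half of each score:

```python
from collections import deque

def brandes_betweenness(graph):
    """Betweenness via Brandes: a forward BFS counts shortest paths sigma,
    a backward pass accumulates dependencies delta_s(v) via the recurrence."""
    C_b = {v: 0.0 for v in graph}
    for s in graph:
        # Forward step: BFS from s, counting shortest paths sigma.
        sigma = {v: 0 for v in graph}; sigma[s] = 1
        dist = {v: -1 for v in graph}; dist[s] = 0
        pred = {v: [] for v in graph}
        stack, queue = [], deque([s])
        while queue:
            u = queue.popleft()
            stack.append(u)
            for w in graph[u]:
                if dist[w] < 0:               # w reached for the first time
                    dist[w] = dist[u] + 1
                    queue.append(w)
                if dist[w] == dist[u] + 1:    # u is in pred(s, w)
                    sigma[w] += sigma[u]
                    pred[w].append(u)
        # Backward step: visit nodes in non-increasing distance from s.
        delta = {v: 0.0 for v in graph}
        while stack:
            w = stack.pop()
            for v in pred[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                C_b[w] += delta[w]
    return C_b

# Illustrative undirected path graph a - b - c (edges stored both ways):
G = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
print(brandes_betweenness(G))   # b lies on every a <-> c shortest path
```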

2. Node Similarity Computation

  • Structural equivalence


- We look at the neighborhood shared by two nodes

- The size of this shared neighborhood defines how similar two nodes are.

  • Vertex Similarity
\sigma(v_i,v_j) = |N(v_i) \cap N(v_j)|
  • Jaccard Similarity
\sigma_{jaccard}(v_i,v_j) = \frac{|N(v_i) \cap N(v_j)|}{|N(v_i) \cup N(v_j)|}
  • Cosine Similarity
\sigma_{Cosine}(v_i,v_j) = \frac{|N(v_i) \cap N(v_j)|}{\sqrt{|N(v_i)| \cdot |N(v_j)|}}

And here is an example of computing the similarity:
image

However, we often exclude the node itself from its neighborhood. In this situation, some connected nodes that share no neighbor will be assigned zero similarity. For example, consider two nodes a and b that are connected but have no other neighbors: if we exclude the nodes themselves, their neighborhoods are empty, and their similarity values will be zero.

So, in the above case, when we include the nodes themselves in the calculation, the result becomes:

\sigma_{Jaccard}(v_2,v_5) = \frac{|\{v_1,v_2,v_3,v_4\} \cap \{v_3,v_5,v_6\}|}{|\{v_1,v_2,v_3,v_4,v_5,v_6\}|} = \frac{1}{6}
\sigma_{Cosine}(v_2,v_5) = \frac{|\{v_1,v_2,v_3,v_4\} \cap \{v_3,v_5,v_6\}|}{\sqrt{|\{v_1,v_2,v_3,v_4\}| \cdot |\{v_3,v_5,v_6\}|}} = \frac{1}{\sqrt{12}} \approx 0.29
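These set-overlap similarities are one-liners in Python. The neighborhoods below are the ones from this worked example, with each node included in its own neighborhood:

```python
import math

def jaccard(N_i, N_j):
    """|N_i intersect N_j| / |N_i union N_j|"""
    return len(N_i & N_j) / len(N_i | N_j)

def cosine(N_i, N_j):
    """|N_i intersect N_j| / sqrt(|N_i| * |N_j|)"""
    return len(N_i & N_j) / math.sqrt(len(N_i) * len(N_j))

# Neighborhoods of v2 and v5 from the example, including the nodes themselves:
N2 = {"v1", "v2", "v3", "v4"}
N5 = {"v3", "v5", "v6"}
print(jaccard(N2, N5))   # 1/6 = 0.167
print(cosine(N2, N5))    # 1/sqrt(12) = 0.289
```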