Web Mining复习笔记4-Web Structure Mining

大白要努力啊

已于 2024-06-08 13:47:31 修改

阅读量326

点赞数 14

分类专栏：笔记文章标签：笔记 web Web Mining data mining 数据分析

于 2024-06-04 03:51:23 首次发布

本文链接：https://blog.csdn.net/weixin_45012798/article/details/139428236

版权

笔记专栏收录该内容

40 篇文章 13 订阅

订阅专栏

4. Web Structure Mining

Definition: Discovery and interpretation of patterns in the hyperlink structure of the Web; the social ties among actors that interact on the Web.

Typical sources of web graphs
(1) web crawls including HTML pages and hyperlinks
A hyperlink graph is a collection of hyperlinks between web pages which belong to web sites.

(2) social networks representing relations between actors
A social network is a set of relations (e.g. friendship, interest, data exchange) between social entities, i.e. members of a social system (actors).

(3) knowledge graphs that have been extracted from the Web
A knowledge graph is a set of relations having different types (e.g. located in, painted, is interested in, is a) between entities (Mona Lisa, Louvre, Da Vinci) belonging to classes (e.g. persons, paintings, museums, places, dates).

(4) other types of community data (discussion forums, email conversations, navigation paths …)

Web structure mining focuses on the structure, but is also often combined with content or usage mining techniques

4.1 Describing Graphs

4.1.1 Terminology and Metrics

4.1.1.1 Graph

Graph

graph

Directed and Undirected Graphs

4.1.1.2 Adjacency Matrix

Adjacency Matrix

Adjacency Matrices for Directed and Undirected Graphs

Weighted and Unweighted Graphs

4.1.1.3 Bipartite Graphs

Bipartite graph (or bigraph) is a graph whose vertices can be divided into two disjoint sets U and V such that every line connects a vertex in U to one in V; that is, U and V are independent sets.
Examples: movie/actor network, disease/symptom network, photo/tag network on Flickr, customer/product recommendations

Bipartite graph

4.1.1.4 Vertex, Arc and Edge Attributes

Vertex, Arc and Edge can have Attributes
Vertices are described by attributes in many real-world networks
Combining these attributes with measures such as degree often reveals interesting insights

4.1.1.5 Degree

Degree

Example: Degrees of Undirected and Directed Graphs

Degree Distribution

Power-Law Distributions

Maximal out-degrees are much smaller than maximal in-degrees.

Average Degree

4.1.1.6 Graph Density

Graph Density

Example: Graph Density

4.1.1.7 Clustering Coefficient

Clustering Coefficient

Example: Clustering Coefficient

4.1.1.8 Paths

Paths

4.1.1.8.1 Shortest Path/Distance

Shortest Path/Distance

4.1.1.8.2 Network Diameter and Average Distance

Network Diameter and Average Distance

Example: Diameter

Small-World Phenomenon: High average clustering coefficient & Small average shortest path length

Exercise

4.2 Prominence

Who are the “most important” actors in a social network?
Centrality
A central actor is one involved in many edges.
The edge direction is not considered.

Prestige
A prestigious actor is one who is the target of many arcs.
The direction of arcs is considered.

Possible interpretations
Centrality: „social power“ of an actor
Prestige: „reputation“ of an actor

4.2.1 Centrality

Which nodes are most ‘central’? - Calculated for undirected graph

Definition of ‘central’ varies by context / purpose:

Local measure: Degree centrality
Relative to rest of network:
Closeness centrality: how close one to others
Betweenness centrality: how important connect community through this guy

How evenly is centrality distributed among nodes?

Centralization
Graph-level view

4.2.1.1 Degree centrality

Measure centrality as the number of edges to other vertices in the graph.
Degree centrality

Example: Degree Centrality

4.2.1.3 Centralization

衡量网络节点中心性不平衡程度，即节点的中心性是否集中在少数几个节点上
Centralization

Example

4.2.1.4 Betweenness Centrality

In what contexts may degree be insufficient to describe centrality?

Ability to broker between groups
Likelihood that information originating from anywhere in the network reaches you

These use cases require measures that are relative to the rest of the network.

Intuition: How many pairs of individuals would have to go through you in order to reach one another in the minimum number of hops?

Assumptions:
Interactions between two non-adjacent actors might depend on the other actors in the set of actors, especially the actors who lie on the paths between the two nodes.

“Actor in the middle” between the others has some control over paths in the network – “interpersonal influence”.

Betweenness Centrality

Example: Betweenness Centrality on Toy Networks

Betweenness Centrality on Toy Networks

4.2.1.5 Closeness Centrality

The measure focuses on how close an actor is to all the other actors in the network. For instance to spread information or interact with others or to be reached by information that spreads through the network.

Closeness centrality is based on the length of the average shortest path between a vertex and all vertices in the graph.

Closeness Centrality

Example

4.2.1.6 Correlation of Centrality Metrics

Generally different centrality metrics will be positively correlated.
When they are not, there is likely something interesting about the vertex.

4.2.2 Prestige

Prestige refers to a class of prominence metrics which take the direction of arcs into account. Translates to: choices received

Examples where direction matters: votes in an election, hyperlinks on the WWW, likes on TikTok, citations of scientific papers

Examples when ‘prestige’ may not be the right word: dislikes, distrusts

4.2.2.1 Degree Prestige / Popularity

The simplest vertex-level measure of prestige: in-degree
The idea is that actors who are prestigious tend to receive many nominations or choices

Local measure as only the neighbors are taken into account.

Degree Prestige

4.2.2.2 Input Domain

Degree prestige only counts actors who are directly adjacent to actor ni, but we might also want to take indirect choices into account. The input domain of a vertex in a directed network is the number or percentage of all other vertices that are connected by a path to this vertex. Also called influence domain.

4.2.2.3 Proximity Prestige

Prestige measure based on distances in the input domain.

Direct nominations (choices) should count more than indirect ones
Nominations from second degree neighbors should count more than third degree ones

Proximity Prestige

Example

Example2

4.2.2.4 Rank Prestige and Page Rank

Rank Prestige and Page Rank

4.3 Community Detection

A community is a set of actors between which interactions are (relatively) frequent. Communities are also called groups, cohesive subgroups, clusters, or modules in different contexts. Finding a community in a social network is to identify a set of nodes such that they interact with each other more frequently than with those nodes outside the group

Why analyze communities?
Analyzing communities helps to better understand users
-> patters might be clearer on group level

Groups provide a clear global view of user interactions
-> network compression
-> visualization of huge networks

Comparing individual behavior and behavior expected by the group might lead to interesting insights
-> some republican can agree with some democrats, but their parties disagree

The perception of community is task dependent.

4.3.1 Connected Components and K-Cores

Connected graph: Any two vertices of an undirected graph are joined by a path. A disconnected graph is made up by two or more connected components.
Bridge: if we remove it, the graph becomes disconnected (red line).
Cut vertex: if we remove it, the graph becomes disconnected (F and A).

Graph Connectivity and Components

If the largest component encompasses a significant fraction of the graph, it is called the giant component.
The other smaller components are called isolates.

Connectivity of Directed Graphs
Strongly connected directed graph: has a path from each vertex to every other vertex and vice versa (e.g. AB path and BA path).
Weakly connected directed graph: is connected if we disregard the arc directions.

Connectivity of Directed Graphs

4.3.1.1 Connected Components of Directed Graphs

Connected Components of Directed Graphs

Example

Bow-Tie

4.3.1.2 K-Cores

K-Cores

K-Cores and Communities

Example

4.3.1.3 Cliques

A clique is the maximal complete subgraph containing three vertices or more.
clique
Cliques are the strongest form of community as all vertices need to be directly adjacent to each other.

Problems:
The definition is often too restrictive to detect communities in real-world networks (result: large number of very small cliques)
Finding cliques in larger graphs is computationally expensive

4.3.1.4 Overlapping Complete Subgraphs

Alternative less restrictive approach: Consider overlapping complete subgraphs (triads) as cohesive group.
Triads overlap if they share one or more vertices.

Overlapping Complete Subgraphs

Procedure for the Analysis of Cohesive Subgroup
Idea: Successively increase the minimal density required.
After each step: Check if discovered groups make sense based on your knowledge about the application domain.

Procedure for the Analysis of Cohesive Subgroup

4.3.2 Clustering-based Techniques

4.3.2.1 Hierarchical Communities

Previous methods consider communities at a single level

Communities may form hierarchies (Each community can have sub/super communities)
Hierarchical clustering deals with this scenario and generates community hierarchies

Initially n members are considered as either 1 or n
communities in hierarchical clustering.

These communities are gradually merged (agglomerative hierarchical clustering) or split (divisive hierarchical clustering)

Divisive Hierarchical Clustering
Goal: Build a hierarchical structure of communities based on the network topology.
Allows the analysis of a network at different resolutions

Approach: Recursively remove the “weakest” edge

Find the edge with the least strength
Remove (cut) the edge and update the corresponding strength of each edge
Recursively apply the above two steps until a network is decomposed into desired number of components.
Consider each component to form a community.

The strength of an edge can be measured by edge betweenness(number of shortest paths that pass through the edge).

Divisive Clustering based on Edge Betweenness

4.3.2.2 Islands

Islands

4.3.2.3 Communities based on Vertex Attributes

Sometimes it makes sense to form communities based on member attributes and not based on frequent interaction.

Communities based on Vertex Attributes

Summary: Community Detection
The different approaches use different features for clustering vertexes:

Components: Vertex connectivity
K-Cores: Vertex degree
Cliques: Maximal density
Overlapping complete subgraphs: high density and bridges
Divisive hierarchical clustering: Betweenness on shortest paths
Islands: Edge weights
Vertex Attributes

Which feature/approach fits depends on the concrete use case

4.4 Machine Learning with Graphs

Goal: Learn a model from a graph for predicting missing edges or graph or node properties.

Machine learning tasks on graphs

4.4.1 Link Prediction and Node Classification

4.4.1.1 Link Prediction

The task is to predict new links based on existing links

Applications:
friend suggestions on Facebook
knowledge graph completion
product recommendation

4.4.1.2 Node Prediction

The task is to predict a vertex attribute value based on its neighborhood

Applications:
type prediction in knowledge graphs
molecule property prediction: cancer beating?
bot detection in social networks
topic detection in citation graphs

4.4.1.3 Feature Engineering

Feature Engineering

Graph Representation Learning

4.4.2 Node Embeddings

Learning Node Embeddings

Example of Node2Vec Embeddings

4.4.3 Graph Neural Networks

Intuition: Nodes aggregate information from their neighbors using neural networks

Graph Neural Networks
Supervised Training of a GNN

Variants of Graph Neural Networks

Graph Convolutional Networks (GCN)
use simple permutation invariant aggregation functions, e.g. sum(), mean() or max()
sum result with embedding of node itself
GraphSAGE
subsamples neighbors to speed up computation
offers more expressive aggregations, e.g. LSTM
Graph Attention Networks (GAT)
aggregation uses multi-headed attention to focus on messages from relevant neighbors
current state-of-the-art architecture

Implementation: PyG (PyTorch Geometric) - Python library that implements various GNNs

Graph Data Repositories and Benchmarks

Stanford Large Network Dataset Collection
social networks, e-commerce networks, citation networks
https://snap.stanford.edu/data/
Common Crawl
regularly publishes host- and domain-level web graphs
host graph May 2021: 515M nodes and 2.82B edges
https://commoncrawl.org/connect/blog/
Scientific Network Data Repository
networks from 30+ categories / disciplines
https://networkrepository.com/
Benchmarks
Node Classification: https://paperswithcode.com/task/node-classification
Link Prediction: https://paperswithcode.com/task/link-prediction