[paper]https://arxiv.org/pdf/1710.10321.pdf
[code]https://github.com/benedekrozemberczki/GraphWaveMachine
-
abstract
In this paper, we develop GraphWave, a method that represents each node’s network neighborhood via a low-dimensional embedding by leveraging heat wavelet diffusion patterns. Instead of training on hand-selected features, GraphWave learns these embeddings in an unsupervised way.
GraphWave represents each node's network neighborhood through heat wavelet diffusion patterns. It is an unsupervised method.
-
present work
our approach learns a multidimensional structural embedding for each node based on the diffusion of a spectral graph wavelet centered at the node. Intuitively, each node propagates a unit of energy over the graph and characterizes its neighboring topology based on the response of the network to this probe.
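In other words, the multidimensional structural embedding of each node is learned from the diffusion of a spectral graph wavelet centered at that node: every node sends one unit of energy out into its surroundings.
Main contributions: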
-
Fully unsupervised; no prior knowledge is required.
-
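A complete mathematical proof. Earlier methods were heuristic; the authors devote a large part of the paper to proving that, under GraphWave, structurally equivalent/similar nodes receive nearly identical/similar embeddings.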
-
Background
[reference article]https://zhuanlan.zhihu.com/p/50212921
-
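1. Spectral graph wavelets
[reference article]https://blog.csdn.net/sxf1061926959/article/details/53538105
The heat kernel signature (HKS) is a feature descriptor for the analysis of deformable 3D shapes and belongs to the family of spectral methods. For every point on a 3D shape, HKS defines a feature vector that captures the point's local and global properties. It is widely used in 3D segmentation, classification, structure discovery, shape matching, and shape retrieval.
Put simply, the heat kernel signature measures how much heat remains at each point of the 3D model's surface as time passes. Suppose every point starts with the same amount of heat. Because the surroundings of each point differ, the heat diffuses at a different rate, so the heat remaining at a point traces a decreasing curve over time. Discretizing that curve gives the heat kernel signature of the point. Repeating this for every point gives the signature of the whole 3D model, which can be stored as one large matrix.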
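The description above is about 3D surfaces, but the same idea carries over to graphs if the graph Laplacian stands in for the surface Laplacian. The following is a minimal sketch under that assumption; the function name `heat_kernel_signature`, the star graph, and the time grid are made up for illustration and do not come from the referenced article.

```python
import numpy as np


def heat_kernel_signature(adjacency, times):
    """Return an (N, len(times)) matrix; row i is the heat left at node i over time."""
    adjacency = np.asarray(adjacency, dtype=float)
    laplacian = np.diag(adjacency.sum(axis=1)) - adjacency   # L = D - A
    eigenvalues, eigenvectors = np.linalg.eigh(laplacian)    # L is symmetric
    # Diagonal of exp(-t L): sum_l exp(-t * lambda_l) * U[i, l] ** 2
    decay = np.exp(-np.outer(eigenvalues, times))            # shape (N, T)
    return (eigenvectors ** 2) @ decay


# Toy graph (assumed for illustration): a star with centre 0 and leaves 1, 2, 3.
star = np.array([[0, 1, 1, 1],
                 [1, 0, 0, 0],
                 [1, 0, 0, 0],
                 [1, 0, 0, 0]])
times = np.array([0.0, 0.5, 1.0, 2.0])
hks = heat_kernel_signature(star, times)
```

Each row of `hks` is the discretized cooling curve of one node. The three leaves share the same curve, while the centre, which has more neighbors to leak heat to, cools faster.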
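2. Characteristic function
For a random variable X, the characteristic function is defined as $\varphi_X(t) = \mathbb{E}[e^{itX}]$. It is completely determined by the random variable and in turn fully characterizes it, so it encodes all of the variable's moments. The characteristic function therefore offers another way to study a random variable. In some situations the distribution function is inconvenient: for example, finding the distribution of a sum of several independent random variables requires repeated convolutions, which is very hard, whereas working with characteristic functions (that is, Fourier transforms) is comparatively simple because the sum turns into a product.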
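To make the convolution-versus-product remark concrete, here is a small sketch using only the standard library. The two distributions (a fair die and a biased coin) and the helper names are assumptions chosen for illustration.

```python
import cmath


def characteristic_function(pmf, t):
    """phi_X(t) = E[exp(i t X)] for a discrete X given as {value: probability}."""
    return sum(p * cmath.exp(1j * t * x) for x, p in pmf.items())


def sum_distribution(pmf_x, pmf_y):
    """Distribution of X + Y for independent X and Y (a discrete convolution)."""
    result = {}
    for x, px in pmf_x.items():
        for y, py in pmf_y.items():
            result[x + y] = result.get(x + y, 0.0) + px * py
    return result


# Hypothetical example: a fair die and a biased coin.
die = {k: 1 / 6 for k in range(1, 7)}
coin = {0: 0.3, 1: 0.7}

t = 0.8
via_convolution = characteristic_function(sum_distribution(die, coin), t)
via_product = characteristic_function(die, t) * characteristic_function(coin, t)
```

Both routes give the same complex number up to rounding, but the second one never has to build the distribution of the sum.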
-
算法过程
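The idea of the paper: given the Laplacian of a graph G, the heat kernel matrix can be obtained from its eigendecomposition; the paper calls the result spectral graph wavelets. The authors treat each node's spectral graph wavelet as a probability distribution. Since a characteristic function fully characterizes a probability distribution, it can be used to represent the wavelet. Two distributions are identical when their characteristic functions agree at every t, so sampling the characteristic function at a set of points t yields the graph embedding (GE).
For an undirected graph, the Laplacian is $L = D - A = U \Lambda U^T$, where D is the degree matrix, A is the adjacency matrix, U holds the eigenvectors, and $\Lambda = \mathrm{diag}(\lambda_1, \dots, \lambda_N)$ holds the eigenvalues. The corresponding spectral graph wavelets are $\Psi = U \, \mathrm{diag}(g_s(\lambda_1), \dots, g_s(\lambda_N)) \, U^T$ with the heat kernel $g_s(\lambda) = e^{-\lambda s}$.
The spectral graph wavelet of a single node a is $\Psi_a = U \, \mathrm{diag}(g_s(\lambda_1), \dots, g_s(\lambda_N)) \, U^T \delta_a$, where $\delta_a$ is the one-hot vector of node a.
$\Psi_{ma}$, the m-th entry of $\Psi_a$, is the amount of energy that node m receives from node a; because $\Psi$ is symmetric, this equals the energy a receives from m.
If a and b are structurally similar, their energy distributions should be similar too. Treat the entries of $\Psi_a$ as samples of a random variable and take its empirical characteristic function $\varphi_a(t) = \frac{1}{N} \sum_{m=1}^{N} e^{i t \Psi_{ma}}$. Finally, sample it at d points $t_1, \dots, t_d$:
$\chi_a = [\mathrm{Re}(\varphi_a(t_i)), \mathrm{Im}(\varphi_a(t_i))]_{t_1, \dots, t_d}$
Re denotes the real part and Im the imaginary part, so node a ends up with a 2d-dimensional embedding vector. This is still not ideal, because there is only one parameter s. s controls how far the energy spreads: a small s captures structural similarity over a small neighborhood, while a larger s captures it at a larger scale. The paper therefore uses J values of s to obtain J representations and concatenates them, giving a final representation of dimension 2*d*J.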
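The whole procedure fits in a few lines of NumPy. This is a rough sketch based on my reading of the steps above, not the authors' reference code: the function name `graphwave_embedding`, the path graph, the two scales, and the sample points are all assumptions for illustration, and it uses the unnormalized Laplacian with a dense eigendecomposition, so it only suits small graphs.

```python
import numpy as np


def graphwave_embedding(adjacency, scales, sample_points):
    """Return an (N, 2 * d * J) embedding for d sample points and J scales."""
    adjacency = np.asarray(adjacency, dtype=float)
    sample_points = np.asarray(sample_points, dtype=float)
    laplacian = np.diag(adjacency.sum(axis=1)) - adjacency     # L = D - A
    eigenvalues, eigenvectors = np.linalg.eigh(laplacian)      # L = U Lambda U^T
    blocks = []
    for s in scales:
        # Psi = U diag(exp(-s * lambda)) U^T; column a is the wavelet of node a.
        psi = (eigenvectors * np.exp(-s * eigenvalues)) @ eigenvectors.T
        # phi_a(t) = mean over m of exp(i * t * Psi[m, a]); result has shape (N, d).
        phi = np.exp(1j * psi.T[:, :, None] * sample_points).mean(axis=1)
        blocks.extend([phi.real, phi.imag])
    return np.concatenate(blocks, axis=1)


# Toy graph (assumed): the path 0 - 1 - 2 - 3 - 4.
n = 5
path = np.zeros((n, n))
for i in range(n - 1):
    path[i, i + 1] = path[i + 1, i] = 1.0

embedding = graphwave_embedding(path, scales=[0.5, 2.0],
                                sample_points=np.linspace(0.0, 10.0, 4))
```

With d = 4 and J = 2 every node gets a 16-dimensional vector. The two endpoints of the path play the same structural role and receive the same embedding, as do nodes 1 and 3, even though these pairs are far apart in the graph.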
-
The mathematical proofs are fairly involved and I could not follow them, so let's go straight to the code.
Source code [code]
-
Overall structure
-
main.py (entry point; I changed how weighted graphs are read)
"""Running the GraphWave machine."""
import pandas as pd
import networkx as nx
from param_parser import parameter_parser
from spectral_machinery import WaveletMachine
from texttable import Texttable

def tab_printer(args):
    """
    Function to print the logs in a nice tabular format.
    :param args: Parameters used for the model.
    """
    # Print the parameters in use
    args = vars(args)
    keys = sorted(args.keys())
    tab = Texttable()
    tab.add_rows([["Parameter", "Value"]])
    tab.add_rows([[k.replace("_", " ").capitalize(), args[k]] for k in keys])
    print(tab.draw())

def read_graph(settings):
    """
    Reading the edge list from the path and returning the networkx graph object.
    :param settings: argparse object; settings.input is the path to the edge list.
    :return graph: Graph from edge list.
    """
    if settings.edgelist_input:
        graph = nx.read_edgelist(settings.input)
    else:
        # Each row of the edge list is: node_a node_b (weight)
        edge_list = pd.read_csv(settings.input, header=None, sep=' ').values.tolist()
        # Three columns means the graph is weighted
        if len(edge_list[0]) == 3:
            graph = nx.read_weighted_edgelist(settings.input)
        else:
            graph = nx.from_edgelist(edge_list)
        # Remove self-loops
        graph.remove_edges_from(nx.selfloop_edges(graph))
    return graph

if __name__ == "__main__":
    # Parse the command line parameters
    settings = parameter_parser()
    # Print the parameters
    tab_printer(settings)
    # Read the graph
    G = read_graph(settings)
    # Build the GraphWave machine and run it
    machine = WaveletMachine(G, settings)
    machine.create_embedding()
    machine.transform_and_save_embedding()
-
param_parser.py (command line parameters)
"""Parsing up the command line parameters."""
import argparse

def parameter_parser():
    """
    A method to parse up command line parameters.
    """
    parser = argparse.ArgumentParser(description="Run GraphWave.")
    # How the eigenvalues are computed
    parser.add_argument("--mechanism",
                        nargs="?",  # zero or one value
                        default="exact",
                        help="Eigenvalue calculation method. Default is exact.")
    # Path of the input file
    parser.add_argument("--input",
                        nargs="?",
                        default="../data/food_edges.csv",
                        help="Path to the graph edges. Default is food_edges.csv.")
    # Path of the output file
    parser.add_argument("--output",
                        nargs="?",
                        default="../output/embedding.csv",
                        help="Path to the structural embedding. Default is embedding.csv.")
    # Heat kernel parameter (the scale s)
    parser.add_argument("--heat-coefficient",
                        type=float,
                        default=1000.0,
                        help="Heat kernel exponent. Default is 1000.0.")
    # Number of sample points d (the final embedding has 2d dimensions)
    parser.add_argument("--sample-number",
                        type=int,
                        default=50,
                        help="Number of characteristic function sample points. Default is 50.")
    # Order of the Chebyshev polynomial used to approximate the heat kernel matrix
    parser.add_argument("--approximation",
                        type=int,
                        default=100,
                        help="Number of Chebyshev approximation. Default is 100.")
    # Step size: spacing between consecutive sample points
    parser.add_argument("--step-size",
                        type=int,
                        default=20,
                        help="Number of steps. Default is 20.")
    # Node count above which the approximate mechanism is used (see WaveletMachine.__init__)
    parser.add_argument("--switch",
                        type=int,
                        default=100,
                        help="Number of dimensions. Default is 100.")
    parser.add_argument("--node-label-type",
                        type=str,
                        default="int",
                        help="Used for sorting index of output embedding. One of 'int', 'string', or 'float'. Default is 'int'")
    parser.add_argument("--edgelist-input",
                        action='store_true',
                        help="Use NetworkX's format for input instead of CSV. Default is False")
    return parser.parse_args()
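One detail that is easy to miss when reading the rest of the code: the flags are written with dashes, but the parsed object exposes them with underscores. The snippet below is a reduced stand-in for `parameter_parser()` that keeps only two of the options. Passing an explicit argument list to `parse_args` is my own choice for the demonstration, since the original reads `sys.argv`.

```python
import argparse

# A reduced stand-in for parameter_parser(); only two of the options are kept.
parser = argparse.ArgumentParser(description="Run GraphWave.")
parser.add_argument("--heat-coefficient", type=float, default=1000.0)
parser.add_argument("--sample-number", type=int, default=50)

# Passing a list bypasses sys.argv, which is convenient in notebooks and tests.
settings = parser.parse_args(["--heat-coefficient", "0.5"])

print(settings.heat_coefficient)  # dashes in the flag become underscores
print(settings.sample_number)     # untouched options keep their defaults
```

This is why `spectral_machinery.py` can refer to `settings.heat_coefficient`, `settings.sample_number`, and `settings.step_size` directly.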
-
spectral_machinery.py (the core of the algorithm)
"""GraphWave class implementation."""
import pygsp
import random
import numpy as np
import pandas as pd
from tqdm import tqdm
import networkx as nx
from pydoc import locate

class WaveletMachine:
    """
    An implementation of "Learning Structural Node Embeddings Via Diffusion Wavelets".
    """
    def __init__(self, G, settings):
        """
        Initialization.
        :param G: Input networkx graph object.
        :param settings: argparse object with settings.
        """
        # Node labels
        self.index = G.nodes()
        # PyGSP graph built from the adjacency matrix
        self.G = pygsp.graphs.Graph(nx.adjacency_matrix(G))
        # Number of nodes
        self.number_of_nodes = len(nx.nodes(G))
        # Settings
        self.settings = settings
        # With too many nodes, switch to the approximate mechanism to save time
        if self.number_of_nodes > self.settings.switch:
            self.settings.mechanism = "approximate"
        # Points t at which the characteristic function is sampled
        self.steps = [x*self.settings.step_size for x in range(self.settings.sample_number)]

    def single_wavelet_generator(self, node):
        """
        Calculating the wavelet coefficients for a given node, using the eigendecomposition.
        :param node: Node that is being embedded.
        """
        impulse = np.zeros((self.number_of_nodes))
        impulse[node] = 1.0
        # Compute the heat kernel: U diag(exp(-s * lambda)) U^T
        diags = np.diag(np.exp(-self.settings.heat_coefficient*self.eigen_values))
        eigen_diag = np.dot(self.eigen_vectors, diags)
        waves = np.dot(eigen_diag, np.transpose(self.eigen_vectors))
        wavelet_coefficients = np.dot(waves, impulse)
        return wavelet_coefficients

    def exact_wavelet_calculator(self):
        """
        Calculates the structural role embedding using the exact eigenvalue decomposition.
        """
        # Complex-valued embedding; split into real and imaginary parts when saving
        self.real_and_imaginary = []
        for node in tqdm(range(self.number_of_nodes)):
            # Wavelet coefficients of the current node
            wave = self.single_wavelet_generator(node)
            # Multiply by 1j to form the complex exponent, then
            # sample the characteristic function at every step
            wavelet_coefficients = [np.mean(np.exp(wave*1.0*step*1j)) for step in self.steps]
            self.real_and_imaginary.append(wavelet_coefficients)
        self.real_and_imaginary = np.array(self.real_and_imaginary)

    def exact_structural_wavelet_embedding(self):
        """
        Calculates the eigenvectors, eigenvalues and an exact embedding is created.
        """
        # Eigendecomposition of the graph Laplacian
        self.G.compute_fourier_basis()
        # G.e holds the Laplacian eigenvalues (rescaled here by the largest one)
        self.eigen_values = self.G.e / max(self.G.e)
        # G.U holds the Laplacian eigenvectors
        self.eigen_vectors = self.G.U
        self.exact_wavelet_calculator()

    def approximate_wavelet_calculator(self):
        """
        Given the Chebyshev polynomial and the graph, the approximate embedding is calculated.
        """
        self.real_and_imaginary = []
        for node in tqdm(range(self.number_of_nodes)):
            impulse = np.zeros((self.number_of_nodes))
            impulse[node] = 1
            wave_coeffs = pygsp.filters.approximations.cheby_op(self.G, self.chebyshev, impulse)
            real_imag = [np.mean(np.exp(wave_coeffs*1*step*1j)) for step in self.steps]
            self.real_and_imaginary.append(real_imag)
        self.real_and_imaginary = np.array(self.real_and_imaginary)

    def approximate_structural_wavelet_embedding(self):
        """
        Estimating the largest eigenvalue.
        Setting up the heat filter and the Chebyshev polynomial.
        Using the approximate wavelet calculator method.
        """
        # Estimate the largest Laplacian eigenvalue; the result is cached in G.lmax
        self.G.estimate_lmax()
        # Heat kernel filter
        # tau is the scaling parameter: it controls how far the energy spreads;
        # the larger tau is, the farther the energy travels
        self.heat_filter = pygsp.filters.Heat(self.G, tau=[self.settings.heat_coefficient])
        self.chebyshev = pygsp.filters.approximations.compute_cheby_coeff(self.heat_filter,
                                                                          m=self.settings.approximation)
        self.approximate_wavelet_calculator()

    def create_embedding(self):
        """
        Depending on the mechanism setting, create an exact or approximate embedding.
        """
        if self.settings.mechanism == "exact":
            self.exact_structural_wavelet_embedding()
        else:
            self.approximate_structural_wavelet_embedding()

    def transform_and_save_embedding(self):
        """
        Transforming the numpy array with real and imaginary values.
        Creating a pandas dataframe and saving it as a csv.
        """
        print("\nSaving the embedding.")
        features = [self.real_and_imaginary.real, self.real_and_imaginary.imag]
        self.real_and_imaginary = np.concatenate(features, axis=1)
        columns_1 = ["reals_"+str(x) for x in range(self.settings.sample_number)]
        columns_2 = ["imags_"+str(x) for x in range(self.settings.sample_number)]
        columns = columns_1 + columns_2
        self.real_and_imaginary = pd.DataFrame(self.real_and_imaginary, columns=columns)
        self.real_and_imaginary.index = self.index
        self.real_and_imaginary.index = self.real_and_imaginary.index.astype(locate(self.settings.node_label_type))
        self.real_and_imaginary = self.real_and_imaginary.sort_index()
        self.real_and_imaginary.to_csv(self.settings.output)
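The "approximate" mechanism is the least obvious part, so here is a rough NumPy-only sketch of the idea behind it: replace the eigendecomposition with a Chebyshev polynomial in the Laplacian, which needs only matrix-vector products. It does not call PyGSP; I swapped in `numpy.polynomial.chebyshev.chebinterpolate` to get the coefficients and a Gershgorin bound instead of `estimate_lmax`, so the numbers will not match PyGSP's internals exactly. The function names, the 8-node cycle, and s = 1.0 are assumptions for illustration.

```python
import numpy as np
from numpy.polynomial import chebyshev


def heat_wavelet_exact(laplacian, s, node):
    """Column `node` of exp(-s L), via a full eigendecomposition."""
    eigenvalues, eigenvectors = np.linalg.eigh(laplacian)
    return (eigenvectors * np.exp(-s * eigenvalues)) @ eigenvectors[node]


def heat_wavelet_chebyshev(laplacian, s, node, order=30):
    """Same column, using only matrix-vector products with L."""
    n = laplacian.shape[0]
    lmax = 2.0 * laplacian.diagonal().max()   # Gershgorin upper bound on the spectrum
    # Map the spectrum [0, lmax] to [-1, 1], where Chebyshev polynomials live.
    shifted = (2.0 / lmax) * laplacian - np.eye(n)
    coeffs = chebyshev.chebinterpolate(
        lambda x: np.exp(-s * lmax * (x + 1.0) / 2.0), order)
    impulse = np.zeros(n)
    impulse[node] = 1.0
    # Recurrence: T_0 v = v, T_1 v = L~ v, T_k v = 2 L~ T_{k-1} v - T_{k-2} v.
    previous, current = impulse, shifted @ impulse
    result = coeffs[0] * previous + coeffs[1] * current
    for c in coeffs[2:]:
        previous, current = current, 2.0 * shifted @ current - previous
        result += c * current
    return result


# Toy graph (assumed): a cycle with 8 nodes.
adjacency = np.roll(np.eye(8), 1, axis=1) + np.roll(np.eye(8), -1, axis=1)
laplacian = np.diag(adjacency.sum(axis=1)) - adjacency

exact = heat_wavelet_exact(laplacian, s=1.0, node=0)
approx = heat_wavelet_chebyshev(laplacian, s=1.0, node=0)
max_error = np.max(np.abs(exact - approx))
```

On this small graph the two results agree to well below 1e-10. The dense matrices here are only for readability; the payoff of the polynomial route comes with a sparse Laplacian on a large graph, where a full eigendecomposition is too expensive.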