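1.0 Motivation for DeepWalk
Word2Vec learns embeddings from sequences. As the relationships between entities grow more complex and take the form of a network, sequence embedding needs to be extended to Graph Embedding;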
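1.1 Basic concepts:
Node degree: in graph theory, the number of edges incident to a node. For a directed graph in particular, the number of edges entering the node is its in-degree, and the number of edges leaving the node is its out-degree;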
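As a small illustration of these definitions, the sketch below counts in-degree, out-degree, and total degree from a directed edge list. The edge list and the node names (`a`, `b`, `c`) are made up for the example and are not taken from the demo data.

```python
from collections import Counter

# Hypothetical directed edges (source, target); not taken from the demo data set.
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "a")]

out_degree = Counter(src for src, _ in edges)  # edges leaving each node
in_degree = Counter(dst for _, dst in edges)   # edges entering each node

# Total number of incident edges per node (each endpoint of an edge counts once).
degree = Counter()
for src, dst in edges:
    degree[src] += 1
    degree[dst] += 1

for node in sorted(degree):
    print(node, "in:", in_degree[node], "out:", out_degree[node], "degree:", degree[node])
```

With networkx, the same quantities are available on a directed graph through `DiGraph.in_degree` and `DiGraph.out_degree`.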
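1.2 Steps of DeepWalk
- Build random walk sequences starting from every node. Each walk is truncated in one of two ways: the walk reaches the given length, or the current node has no more neighbor nodes to move to;
- Train the word2vec algorithm (skip-gram) on these sequences to generate an Embedding for each node;
- DeepWalk = RandomWalk + SkipGram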
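The first step can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the reference implementation: the adjacency list `graph`, the function name `random_walk`, and the walk length of 5 are all hypothetical. It shows the two truncation conditions listed above.

```python
import random

# Hypothetical undirected graph as an adjacency list; node names are illustrative only.
graph = {
    "a": ["b", "c"],
    "b": ["a", "c"],
    "c": ["a", "b", "d"],
    "d": ["c"],
    "e": [],  # isolated node: a walk from here stops immediately
}

def random_walk(graph, start, walk_length, rng):
    """Return one walk of at most `walk_length` nodes starting at `start`."""
    walk = [start]
    # Truncation 1: stop once the walk reaches the given length.
    while len(walk) < walk_length:
        neighbors = graph[walk[-1]]
        # Truncation 2: stop when the current node has no neighbor to move to.
        if not neighbors:
            break
        walk.append(rng.choice(neighbors))
    return walk

rng = random.Random(0)  # fixed seed so repeated runs give the same walks
# Two walks per start node, over every node in the graph.
walks = [random_walk(graph, node, 5, rng) for node in graph for _ in range(2)]
for walk in walks:
    print(walk)
```

This sketch lets a walk revisit nodes. The `get_randomwalk` function in the demo of section 1.4.2 instead removes already visited nodes from the candidate set, so its walks can end earlier.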
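1.3 Characteristics of DeepWalk
It still performs relatively well when the data is sparse;
It can be parallelized, which reduces running time;
It supports large-scale online prediction;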
1.4 DeepWalk implementation code
1.4.1 Algorithm flow
Random walk
SkipGram
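SkipGram treats each walk like a sentence and learns to predict the nodes that surround a given node. The sketch below shows one way the (center, context) training pairs could be extracted from a single walk, assuming a fixed symmetric window. The function name `skipgram_pairs` and the example walk are hypothetical, and real word2vec implementations add details such as negative sampling that are left out here.

```python
def skipgram_pairs(walk, window):
    """Return (center, context) pairs for every position in `walk`."""
    pairs = []
    for i, center in enumerate(walk):
        # Context positions lie within `window` steps on either side of i.
        lo = max(0, i - window)
        hi = min(len(walk), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, walk[j]))
    return pairs

walk = ["a", "c", "d", "b"]  # hypothetical walk, not from the demo data
pairs = skipgram_pairs(walk, window=1)
for center, context in pairs:
    print(center, "->", context)
```

A larger window makes nodes that are further apart on a walk count as context for each other, which is the role of the `window` parameter passed to `Word2Vec` in the demo.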
1.4.2 Demo
import networkx as nx
import pandas as pd
import numpy as np
import random
from tqdm import tqdm
from sklearn.decomposition import PCA
from gensim.models import Word2Vec
import warnings
warnings.filterwarnings('ignore')
import matplotlib.pyplot as plt
%matplotlib inline
df = pd.read_csv("space_data.tsv", sep = "\t")
# construct an undirected graph
G = nx.from_pandas_edgelist(df, "source", "target",
                            edge_attr=True, create_using=nx.Graph())
# function to generate a random walk sequence of nodes starting from `node`
def get_randomwalk(node, path_length):
    random_walk = [node]
    for i in range(path_length - 1):
        # candidate next steps: neighbors that the walk has not visited yet
        temp = set(G.neighbors(node)) - set(random_walk)
        if len(temp) == 0:
            break
        # this is where the randomness of the random walk comes in
        random_node = random.choice(list(temp))
        random_walk.append(random_node)
        node = random_node
    return random_walk
get_randomwalk('astronaut', 10)
all_nodes = list(G.nodes())
random_walks = []
# generate 5 walks of at most 10 nodes from every node
for n in tqdm(all_nodes):
    for i in range(5):
        random_walks.append(get_randomwalk(n, 10))
# train word2vec model
model = Word2Vec(window=4, sg=1, hs=0,
                 negative=10,  # for negative sampling
                 alpha=0.03, min_alpha=0.0007,
                 seed=14)
model.build_vocab(random_walks, progress_per=2)
model.train(random_walks, total_examples = model.corpus_count, epochs=20, report_delay=1)
# look up the most similar nodes through the trained vectors in model.wv
model.wv.similar_by_word('astronaut training')
Related paper link:
https://github.com/wzhe06/Reco-papers/blob/master/Embedding/%5BGraph%20Embedding%5D%20DeepWalk-%20Online%20Learning%20of%20Social%20Representations%20%28SBU%202014%29.pdf