Running the code
Generate node embeddings for a given graph dataset:
- First, run random walks over the nodes of the graph
- Then feed the walk paths into Word2Vec as sentences to produce the node embeddings
The code is run as follows:
python __main__.py --input ../example_graphs/karate.adjlist --output ../example_graphs/karate.embeddings
python __main__.py --format mat --input ../example_graphs/blogcatalog.mat --number-walks 80 --representation-size 128 --walk-length 40 --window-size 10 --workers 1 --output ../example_graphs/blogcatalog.embeddings
python scoring.py --emb blogcatalog.embeddings --network blogcatalog.mat --num-shuffle 10 --all
Below is a brief analysis of each file.
__main__.py
# __main__.py
import random
import graph
import walks as serialized_walks
from gensim.models import Word2Vec
from skipgram import Skipgram
The graph module implements a random walk, and so does walks.
The main difference is that the former targets small datasets and the latter large ones: the paths sampled by graph are kept in memory and never serialized, while walks serializes them to files on local disk. The random walk used by walks, however, is still the method defined in graph, so it ultimately calls into graph.
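The choice between these two paths is made in __main__.py from an estimate of the corpus size; a sketch of that estimate (reconstructed, not quoted from the script, so treat the exact formula as an assumption):
num_walks = len(G.nodes()) * args.number_walks   # one walk per node per pass
data_size = num_walks * args.walk_length         # rough token count of the walk corpus
# data_size is then compared against args.max_memory_data_size in the if/else shown below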
parser.add_argument('--format', default='adjlist',
                    help='File format of input file')
The program accepts three input formats ('adjlist', 'edgelist', 'mat'); the default is 'adjlist'.
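The format flag selects one of the loaders defined in graph.py. A sketch of that dispatch (the loader names load_adjacencylist / load_edgelist / load_matfile come from graph.py; the exact keyword arguments here are assumptions):
if args.format == "adjlist":
    G = graph.load_adjacencylist(args.input)
elif args.format == "edgelist":
    G = graph.load_edgelist(args.input)
elif args.format == "mat":
    G = graph.load_matfile(args.input)
else:
    raise Exception("Unknown file format: '%s'" % args.format)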
if data_size < args.max_memory_data_size:
    print("Walking...")
    walks = graph.build_deepwalk_corpus(G, num_paths=args.number_walks,
                                        path_length=args.walk_length, alpha=0,
                                        rand=random.Random(args.seed))
    print("Training...")
    model = Word2Vec(walks, size=args.representation_size, window=args.window_size,
                     min_count=0, sg=1, hs=1, workers=args.workers)
Here build_deepwalk_corpus is called to run several random walks starting from every node.
num_paths: how many walks are started from each node
path_length: the (maximum) length of a single random walk
alpha: with probability 1-alpha the walk continues from the current node, and with probability alpha it jumps back to the start node (a restart)
Training then feeds the sampled walks into the Word2Vec model, which yields an embedding for each node.
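As a concrete illustration, a minimal sketch of this in-memory path on the karate graph (hyper-parameters shrunk for illustration; the loader call and its defaults are assumptions, and gensim < 4.0 expects size= while newer releases renamed it to vector_size=):
import random
from gensim.models import Word2Vec
import graph   # run from the deepwalk source directory, as with the commands above

G = graph.load_adjacencylist("../example_graphs/karate.adjlist")
walks = graph.build_deepwalk_corpus(G, num_paths=10, path_length=40,
                                    alpha=0, rand=random.Random(0))
# 34 nodes * 10 walks per node = 340 "sentences" of node-id strings
model = Word2Vec(walks, size=64, window=5, min_count=0, sg=1, hs=1, workers=1)
print(model.wv["1"])   # the 64-dimensional embedding of node 1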
When the estimated data size exceeds the limit, the else branch streams the walks through disk instead:
else:
    print("Data size {} is larger than limit (max-memory-data-size: {}). Dumping walks to disk.".format(data_size, args.max_memory_data_size))
    print("Walking...")

    walks_filebase = args.output + ".walks"
    walk_files = serialized_walks.write_walks_to_disk(G, walks_filebase, num_paths=args.number_walks,
                                                      path_length=args.walk_length, alpha=0,
                                                      rand=random.Random(args.seed),
                                                      num_workers=args.workers)

    print("Counting vertex frequency...")
    if not args.vertex_freq_degree:
        vertex_counts = serialized_walks.count_textfiles(walk_files, args.workers)
    else:
        # use degree distribution for frequency in tree
        # note: G.iterkeys() is Python 2 syntax; on Python 3 this would be iter(G.keys())
        vertex_counts = G.degree(nodes=G.iterkeys())

    print("Training...")
    walks_corpus = serialized_walks.WalksCorpus(walk_files)
    model = Skipgram(sentences=walks_corpus, vocabulary_counts=vertex_counts,
                     size=args.representation_size,
                     window=args.window_size, min_count=0, trim_rule=None, workers=args.workers)
When the configured memory limit cannot hold the walk results, the walk paths are written to a series of files output.walks.x. After the program finishes there are two kinds of files: the embedding file file_path, plus the walk files file_path.walks.0, file_path.walks.1, ..., file_path.walks.x.
file_path stores the embedding of each node;
each output.walks.x stores sampled walk paths, with x indicating which worker process wrote the file.
In the end, the key point is that serialized_walks.write_walks_to_disk is essentially still calling the random walk defined in graph; it just wraps it with parallelization and the code that writes the walks to disk.
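For completeness, the corpus wrapper that streams those files back into the trainer can be as simple as the following sketch (the WalksCorpus class in walks.py works along these lines):
class WalksCorpus(object):
    # Streams walks back from disk one line at a time, so gensim never needs
    # the whole corpus in memory.
    def __init__(self, file_list):
        self.file_list = file_list

    def __iter__(self):
        for file_ in self.file_list:
            with open(file_) as f:
                for line in f:
                    yield line.split()   # one walk = one whitespace-separated line of node ids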
graph.py
class Graph(defaultdict):
    def __init__(self):
        super(Graph, self).__init__(list)
        # super().__init__(list)  # Python 3.x syntax
Here the graph is a dictionary: each key is a node and its value is the list of that node's neighbors.
If you build a Graph instance and print it, you get something like:
defaultdict(
<class 'list'>,
{
1: [2, 3, 4, 5, 6, 7, 8, 9, 11, 12, 13, 14, 18, 20, 22, 32],
2: [1, 3, 4, 8, 14, 18, 20, 22, 31],
……
34: [9, 10, 14, 15, 16, 19, 20, 21, 23, 24, 27, 28, 29, 30, 31, 32, 33]
}
)
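Because Graph subclasses defaultdict(list), merely looking up a node that has not been seen yet creates an empty adjacency list for it. A toy illustration (node ids made up):
import graph   # from the deepwalk source directory

g = graph.Graph()
g[1].append(2)     # the neighbor list of node 1 is created on first access
g[2].append(1)
print(g.nodes())   # the node ids, i.e. the dictionary keys
print(g[3])        # []  -- just reading node 3 silently adds it with no neighbors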
def make_undirected(self):
    t0 = time()
    # iterate over a snapshot of the keys: self[other] below may insert new keys,
    # which would break iteration over a live keys() view on Python 3
    for v in list(self):
        for other in self[v]:
            if v != other:
                self[other].append(v)
    t1 = time()
    logger.info('make_directed: added missing edges {}s'.format(t1 - t0))
    self.make_consistent()
    return self
def make_consistent(self):
    t0 = time()
    for k in iterkeys(self):
        self[k] = list(sorted(set(self[k])))
    t1 = time()
    logger.info('make_consistent: made consistent in {}s'.format(t1 - t0))
    self.remove_self_loops()
    return self
def remove_self_loops(self):
    removed = 0
    t0 = time()
    for x in self:
        if x in self[x]:
            self[x].remove(x)
            removed += 1
    t1 = time()
    logger.info('remove_self_loops: removed {} loops in {}s'.format(removed, (t1 - t0)))
    return self
def check_self_loops(self):
    for x in self:
        for y in self[x]:
            if x == y:
                return True
    return False
def random_walk(self, path_length, alpha=0, rand=random.Random(), start=None):
    """ Returns a truncated random walk.
        path_length: Length of the random walk.
        alpha: probability of restarts.
        start: the start node of the random walk.
    """
    G = self
    if start:
        path = [start]
    else:
        # Sampling is uniform w.r.t V, and not w.r.t E
        path = [rand.choice(list(G.keys()))]

    while len(path) < path_length:
        cur = path[-1]
        if len(G[cur]) > 0:
            if rand.random() >= alpha:
                # taken with probability 1-alpha: keep walking from the current node;
                # otherwise (probability alpha) restart from the start node
                path.append(rand.choice(G[cur]))
            else:
                path.append(path[0])
        else:
            break
    return [str(node) for node in path]
The helper methods above keep the graph itself consistent rather than the walks: make_undirected adds missing reverse edges, make_consistent deduplicates and sorts each adjacency list, and remove_self_loops / check_self_loops handle self-loops.
random_walk generates one truncated random walk starting from the given start node (or from a uniformly sampled node if no start is given).
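A minimal sketch tying these together on a hand-built two-node graph (the node ids are made up for illustration):
import random
import graph   # from the deepwalk source directory

g = graph.Graph()
g[1] = [2, 2, 1]          # a duplicated edge plus a self-loop
g[2] = []
g.make_undirected()       # adds the reverse edge 2 -> 1, then calls make_consistent()
print(dict(g))            # {1: [2], 2: [1]}  -- duplicates and the self-loop are gone

walk = g.random_walk(path_length=5, alpha=0, rand=random.Random(0), start=1)
print(walk)               # ['1', '2', '1', '2', '1'] -- each node has one neighbor, so the walk alternates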
def build_deepwalk_corpus(G, num_paths, path_length, alpha=0,
                          rand=random.Random(0)):
    walks = []

    nodes = list(G.nodes())
    print(nodes)

    for cnt in range(num_paths):
        rand.shuffle(nodes)
        for node in nodes:
            walks.append(G.random_walk(path_length, rand=rand, alpha=alpha, start=node))

    return walks
This function builds the corpus for a graph: num_paths random walks starting from every node, with the node order reshuffled before each pass.
def build_deepwalk_corpus_iter(G, num_paths, path_length, alpha=0,
                               rand=random.Random(0)):
    walks = []

    nodes = list(G.nodes())

    for cnt in range(num_paths):
        rand.shuffle(nodes)
        for node in nodes:
            yield G.random_walk(path_length, rand=rand, alpha=alpha, start=node)
This function builds the same corpus, but as a generator that yields walks one at a time. It is meant for the case where the configured memory limit cannot hold all the paths.
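A quick way to see the difference, reusing the G loaded in the sketch above (illustrative only):
corpus_iter = graph.build_deepwalk_corpus_iter(G, num_paths=10, path_length=40,
                                               alpha=0, rand=random.Random(0))
first_walk = next(corpus_iter)   # one walk generated on demand; the rest do not exist yet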
walks.py
def count_words(file):
    """ Counts the word frequences in a list of sentences.

        Note:
          This is a helper function for parallel execution of `Vocabulary.from_text`
          method.
    """
    c = Counter()
    with open(file, 'r') as f:
        for l in f:
            words = l.strip().split()
            c.update(words)
    return c
def count_textfiles(files, workers=1):
    c = Counter()
    with ProcessPoolExecutor(max_workers=workers) as executor:
        for c_ in executor.map(count_words, files):
            c.update(c_)
    return c
These are the two word-frequency counters.
count_words takes a file in which every line is one walk and returns a Counter with the number of occurrences of each node id in that file.
count_textfiles fans the per-file counting out over multiple worker processes with ProcessPoolExecutor and merges the resulting counters.
For background, see an introduction to parallel programming in Python (the author links a Chinese translation of "Python Parallel Programming").
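A small self-contained demo of the two counters (the file name and its contents are made up; run it from the deepwalk source directory):
from walks import count_words, count_textfiles

with open("toy.walks.0", "w") as f:                # one walk per line, node ids separated by spaces
    f.write("1 2 3 2\n")
    f.write("3 1 2\n")

print(count_words("toy.walks.0"))                   # Counter({'2': 3, '1': 2, '3': 2})
print(count_textfiles(["toy.walks.0"], workers=1))  # same counts, computed via the process pool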
def write_walks_to_disk(G, filebase, num_paths, path_length, alpha=0, rand=random.Random(0),
                        num_workers=cpu_count(), always_rebuild=True):
    global __current_graph
    __current_graph = G
    files_list = ["{}.{}".format(filebase, str(x)) for x in list(range(num_paths))]
    expected_size = len(G)
    args_list = []
    files = []

    if num_paths <= num_workers:
        paths_per_worker = [1 for x in range(num_paths)]
    else:
        paths_per_worker = [len(list(filter(lambda z: z != None, [y for y in x])))
                            for x in graph.grouper(int(num_paths / num_workers) + 1, range(1, num_paths + 1))]

    with ProcessPoolExecutor(max_workers=num_workers) as executor:
        for size, file_, ppw in zip(executor.map(count_lines, files_list), files_list, paths_per_worker):
            if always_rebuild or size != (ppw * expected_size):
                args_list.append((ppw, path_length, alpha, random.Random(rand.randint(0, 2**31)), file_))
            else:
                files.append(file_)
This is where the walks are written to disk: the requested num_paths is split across the worker processes, and each worker writes its share of walks into its own file filebase.x (one walk per line). A file that already exists with the expected number of lines (ppw times the number of nodes) is reused instead of regenerated, unless always_rebuild is set.
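To make the paths_per_worker computation concrete, a worked example (not program output): with num_paths=10 and num_workers=4,
chunk = int(10 / 4) + 1                       # 3 walks-per-node per worker
# graph.grouper(3, range(1, 11)) yields (1, 2, 3), (4, 5, 6), (7, 8, 9), (10, None, None);
# dropping the None padding gives
paths_per_worker = [3, 3, 3, 1]
# worker i then writes paths_per_worker[i] full passes over the nodes into "<filebase>.<i>",
# i.e. paths_per_worker[i] * len(G) lines in that file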