References:
GitHub - csiesheep/hin2vec: Heterogeneous information network representation learning
Micro-F1和Macro-F1详解 (Troye Jcan, CSDN blog)
-
Run
Example: run HIN2Vec on Zachary's karate club network:
python main.py res/karate_club_edges.txt node_vectors.txt metapath_vectors.txt -l 1000 -d 2 -w 2
%prog [options] <graph_fname> <node_vec_fname> <path_vec_fname>
graph_fname: the graph file
It can be a file with one edge per line (e.g., res/karate_club_edges.txt)
or a pickled graph file.
node_vec_fname: the output file for nodes' vectors
path_vec_fname: the output file for meta-paths' vectors
-l: '--walk-length', the length of each random walk (default: 100)
-d: '--dim', dimensionality of the embeddings (default: 100)
-w: '--window', max window length (default: 3)
-k: '--walk-num', the number of random walks starting from each node (default: 10)
-n: '--negative', number of negative examples (>0) for negative sampling, 0 for hierarchical softmax (default: 5)
-a: '--alpha', starting learning rate (default: 0.025)
-p: '--num_processes', number of processes (default: 1)
-c: '--allow-circle', allow circles in relationships between nodes (default: not allowed)
-r: '--sigmoid_regularization', use the sigmoid function for regularization of meta-path vectors (default: binary-step function)
-
karate_club_edges.txt
#source_node source_class dest_node dest_class edge_class
1 U 11 U U-U
1 U 12 U U-U
1 U 13 U U-U
1 U 14 U U-U
1 U 18 U U-U
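The edge-file format above can be parsed in a few lines. This is an illustrative sketch, not the repo's actual loader (the function name parse_edge_line is mine):

```python
# Hypothetical helper (not from the repo): parse one line of the
# "<source_node> <source_class> <dest_node> <dest_class> <edge_class>" format.
def parse_edge_line(line):
    src, src_class, dst, dst_class, edge_class = line.split()
    return int(src), src_class, int(dst), dst_class, edge_class

print(parse_edge_line("1 U 11 U U-U"))  # (1, 'U', 11, 'U', 'U-U')
```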
-
network.py
init
self.graph = {} #{from_id: {to_id: {edge_class_id: weight}}} the graph itself
#TODO correct the name
self.class_nodes = {} #{node_class: set([node_id])} node classes and their node ids
self.edge_class2id = {} #{edge_class: edge_class_id} edge classes and their ids
self.node2id = {} #{node: node_id} nodes and their ids
self.k_hop_neighbors = {} #{k: {id_: set(to_ids)}}
self.edge_class_id_available_node_class = {} # {edge_class_id:
# (from_node_class, to_node_class)} edge class id -> (head, tail) node classes
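A minimal sketch of how those nested dicts fit together (class and method names are mine, not the repo's; network.py's real class carries more state):

```python
class MiniHIN:
    """Toy heterogeneous network mirroring the dict layout noted above."""
    def __init__(self):
        self.graph = {}          # {from_id: {to_id: {edge_class_id: weight}}}
        self.class_nodes = {}    # {node_class: set([node_id])}
        self.edge_class2id = {}  # {edge_class: edge_class_id}
        self.node2id = {}        # {node: node_id}

    def _node_id(self, node, node_class):
        # Assign ids in order of first appearance.
        if node not in self.node2id:
            self.node2id[node] = len(self.node2id)
        nid = self.node2id[node]
        self.class_nodes.setdefault(node_class, set()).add(nid)
        return nid

    def add_edge(self, src, src_class, dst, dst_class, edge_class, weight=1.0):
        eid = self.edge_class2id.setdefault(edge_class, len(self.edge_class2id))
        s = self._node_id(src, src_class)
        d = self._node_id(dst, dst_class)
        self.graph.setdefault(s, {}).setdefault(d, {})[eid] = weight
```

For example, after `add_edge(1, 'U', 11, 'U', 'U-U')` the structure holds `graph == {0: {1: {0: 1.0}}}` with node 1 mapped to id 0 and node 11 to id 1.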
random_walks
Generate random walks starting from each node
input:
count: the # of random walks starting from each node (here: 10)
length: the maximum length of a random walk (here: 1000)
output:
[random_walk]
random_walk: [<node_id>, <edge_class_id>, <node_id>, ...], e.g. 0 0 10 0 20 0 9 0 30 0 9 0 32 0 30 0 32
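A sketch of the walk generation under the assumptions above (uniform neighbor and edge choice; I treat `length` as the number of hops, so a walk holds at most `2*length + 1` ids — the repo's exact accounting may differ):

```python
import random

def random_walks(graph, count, length):
    """graph: {from_id: {to_id: {edge_class_id: weight}}} as in network.py.
    Returns walks shaped [node_id, edge_class_id, node_id, ...]."""
    walks = []
    for start in graph:
        for _ in range(count):
            walk, node = [start], start
            for _ in range(length):
                neighbors = graph.get(node)
                if not neighbors:            # dead end: stop this walk early
                    break
                to_id = random.choice(sorted(neighbors))
                edge_id = random.choice(sorted(neighbors[to_id]))
                walk += [edge_id, to_id]
                node = to_id
            walks.append(walk)
    return walks
```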
statement
statement = ("model_c/bin/hin2vec -size %d -train %s -alpha %f "
"-output %s -output_mp %s -window %d -negative %d "
"-threads %d -no_circle %d -sigmoid_reg %d "
"" % (options.dim, #dimensionality of the embeddings
tmp_walk_fname, #training file containing the random walks
options.alpha, #learning rate
tmp_node_vec_fname, #output file for the node vectors
tmp_path_vec_fname, #output file for the meta-path vectors
options.window, #window size
options.neg, #number of negative samples
options.num_processes, #number of processes
1-(options.allow_circle * 1), #disallow circles
options.sigmoid_reg * 1)) #regularization for the meta-path vectors
-
The script then runs the compiled hin2vec binary via os.system(statement):
model_c/bin/hin2vec -size 2 -train /tmp/tmphTXGnA -alpha 0.025000 -output /tmp/tmp1pwao9 -output_mp /tmp/tmpNfXWnF -window 2 -negative 5 -threads 1 -no_circle 1 -sigmoid_reg 0
Output:
Starting training using file /tmp/tmpxYLn1w #TrainModel() starts training
Node size: 34 #number of distinct node ids. LearnVocabFromTrainFile(): read nodes -> hash each node -> add to vocab -> sort
Nodes in train file: 340000 #number of node occurrences in the file
0 meta-path:0 339660 #meta-path id, meta-path type U-U, and its count. LearnMpVocabFromTrainFile(); SearchMpVocab returns the meta-path's position in the vocabulary
1 meta-path:00 339320 #meta-path type U-U,U-U
Meta-path size: 2 #number of meta-path types
Meta-paths in train file: 678980 #total number of meta-path occurrences in the file
Alpha: 0.001544 Progress(330000/340000): 97.06% Words/thread/sec: 631.54k #TrainModelThread()
save node vectors
save mp vectors
----------------------------------------
pickle's dump() and load() functions
dump (serialization): converting an object into a byte stream.
load (deserialization): restoring an object from a byte stream.
pickle vs cPickle
pickle is implemented in pure Python, while cPickle is implemented in C and is many times faster, so prefer cPickle when it is available. Either one turns an object into a format that can be stored or transmitted.
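A minimal dump/load round trip (note: in Python 3 cPickle is gone; the standard pickle module uses the C implementation, `_pickle`, automatically):

```python
import pickle

graph = {0: {1: {0: 1.0}}}        # e.g. a network.py-style graph dict

data = pickle.dumps(graph)        # serialize: object -> bytes
restored = pickle.loads(data)     # deserialize: bytes -> object
assert restored == graph and isinstance(data, bytes)
```

pickle.dump(obj, f) / pickle.load(f) do the same against a file opened in binary mode, which is how a pickled graph file would be produced and read.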
Check the learned vectors
-
exp_classification.py
'''\
%prog [options] <node_vec_fname> <groundtruth_fname>
groundtruth_fname example: res/karate_club_groups.txt
'''
The members of Zachary's karate club naturally split into two groups. Treat predicting a member's group as binary classification:
python tools/exp_classification.py node_vectors.txt res/karate_club_groups.txt
(tests the classification accuracy of the node vectors)
Output:
1 16 34 #class 1: 16 of 34 members
0.8266666666666665
2 18 34 #class 2: 18 of 34 members
0.8914285714285715
[0.8266666666666665, 0.8914285714285715] #list of per-class F1 scores
macro f1: 0.8590476190476191
micro f1: 0.860952380952
---------------------------------------------
-
F1-score
micro-F1: first pool the TP, FP, and FN counts over all classes, then compute a single F1 from the totals
range: [0, 1];
usage: multi-class, imbalanced data; a heavily skewed class distribution dominates (and can distort) the score;
macro-F1: compute F1 for each class separately, then take the unweighted average (every class gets the same weight)
range: [0, 1];
usage: multi-class problems; not dominated by class frequency, but easily pulled up by well-recognized classes (high recall and high precision);
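The two averaging schemes can be spelled out in a few lines (plain-Python sketch; `sklearn.metrics.f1_score` with `average='micro'`/`'macro'` computes the same quantities):

```python
def f1(tp, fp, fn):
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def counts(y_true, y_pred, c):
    # Per-class true positives, false positives, false negatives.
    tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
    fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
    fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
    return tp, fp, fn

def macro_micro_f1(y_true, y_pred, classes):
    # macro: per-class F1, then unweighted mean (each class counts equally)
    macro = sum(f1(*counts(y_true, y_pred, c)) for c in classes) / len(classes)
    # micro: pool TP/FP/FN over all classes, then one global F1
    tp, fp, fn = (sum(x) for x in zip(*(counts(y_true, y_pred, c) for c in classes)))
    return macro, f1(tp, fp, fn)
```

For y_true=[0, 0, 1, 1], y_pred=[0, 1, 1, 1]: class 0 has F1 = 2/3 and class 1 has F1 = 4/5, so macro = 11/15 ≈ 0.733, while the pooled tp=3, fp=1, fn=1 give micro = 0.75.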