References:
GitHub - csiesheep/hin2vec: Heterogeneous information network representation learning
Micro-F1和Macro-F1详解 (Troye Jcan, CSDN blog)
-
Run
Example: run HIN2Vec on Zachary's karate club network:
python main.py res/karate_club_edges.txt node_vectors.txt metapath_vectors.txt -l 1000 -d 2 -w 2
%prog [options] <graph_fname> <node_vec_fname> <path_vec_fname>
graph_fname: the graph file
It can be a file with one edge per line (e.g., res/karate_club_edges.txt)
or a pickled graph file.
node_vec_fname: the output file for nodes' vectors
path_vec_fname: the output file for meta-paths' vectors
-l: '--walk-length', the length of each random walk (default: 100)
-d: '--dim', dimensionality of the embeddings (default: 100)
-w: '--window', max window length (default: 3)
-k: '--walk-num', the number of random walks starting from each node (default: 10)
-n: '--negative', number of negative examples (>0) for negative sampling, 0 for hierarchical softmax (default: 5)
-a: '--alpha', starting learning rate (default: 0.025)
-p: '--num_processes', number of processes (default: 1)
-c: '--allow-circle', allow circles in relationships between nodes (default: not allowed)
-r: '--sigmoid_regularization', use the sigmoid function for regularization of meta-path vectors (default: binary-step function)
-
karate_club_edges.txt
#source_node source_class dest_node dest_class edge_class
1 U 11 U U-U
1 U 12 U U-U
1 U 13 U U-U
1 U 14 U U-U
1 U 18 U U-U
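The edge-file format above can be parsed in a few lines. This is an illustrative sketch, not the repo's actual loader (the function name parse_edge_line is mine):

```python
# Hypothetical helper (not from the repo): parse one line of the
# "<source_node> <source_class> <dest_node> <dest_class> <edge_class>" format.
def parse_edge_line(line):
    src, src_class, dst, dst_class, edge_class = line.split()
    return int(src), src_class, int(dst), dst_class, edge_class

print(parse_edge_line("1 U 11 U U-U"))  # (1, 'U', 11, 'U', 'U-U')
```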
-
network.py
init
self.graph = {} #{from_id: {to_id: {edge_class_id: weight}}} the graph itself
#TODO correct the name
self.class_nodes = {} #{node_class: set([node_id])} node classes and their node ids
self.edge_class2id = {} #{edge_class: edge_class_id} edge classes and their ids
self.node2id = {} #{node: node_id} nodes and their ids
self.k_hop_neighbors = {} #{k: {id_: set(to_ids)}}
self.edge_class_id_available_node_class = {} # {edge_class_id:
# (from_node_class, to_node_class)} edge class id -> (head, tail) node classes
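A minimal sketch of how those nested dicts fit together (class and method names are mine, not the repo's; network.py's real class carries more state):

```python
class MiniHIN:
    """Toy heterogeneous network mirroring the dict layout noted above."""
    def __init__(self):
        self.graph = {}          # {from_id: {to_id: {edge_class_id: weight}}}
        self.class_nodes = {}    # {node_class: set([node_id])}
        self.edge_class2id = {}  # {edge_class: edge_class_id}
        self.node2id = {}        # {node: node_id}

    def _node_id(self, node, node_class):
        # Assign ids in order of first appearance.
        if node not in self.node2id:
            self.node2id[node] = len(self.node2id)
        nid = self.node2id[node]
        self.class_nodes.setdefault(node_class, set()).add(nid)
        return nid

    def add_edge(self, src, src_class, dst, dst_class, edge_class, weight=1.0):
        eid = self.edge_class2id.setdefault(edge_class, len(self.edge_class2id))
        s = self._node_id(src, src_class)
        d = self._node_id(dst, dst_class)
        self.graph.setdefault(s, {}).setdefault(d, {})[eid] = weight
```

For example, after `add_edge(1, 'U', 11, 'U', 'U-U')` the structure holds `graph == {0: {1: {0: 1.0}}}` with node 1 mapped to id 0 and node 11 to id 1.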
random_walks
Generate random walks starting from each node
input:
count: the # of random walks starting from each node (here: 10)
length: the maximum length of a random walk (here: 1000)
output:
[random_walk]
random_walk: [<node_id>, <edge_class_id>, <node_id>, ...], e.g. 0 0 10 0 20 0 9 0 30 0 9 0 32 0 30 0 32
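A sketch of the walk generation under the assumptions above (uniform neighbor and edge choice; I treat `length` as the number of hops, so a walk holds at most `2*length + 1` ids — the repo's exact accounting may differ):

```python
import random

def random_walks(graph, count, length):
    """graph: {from_id: {to_id: {edge_class_id: weight}}} as in network.py.
    Returns walks shaped [node_id, edge_class_id, node_id, ...]."""
    walks = []
    for start in graph:
        for _ in range(count):
            walk, node = [start], start
            for _ in range(length):
                neighbors = graph.get(node)
                if not neighbors:            # dead end: stop this walk early
                    break
                to_id = random.choice(sorted(neighbors))
                edge_id = random.choice(sorted(neighbors[to_id]))
                walk += [edge_id, to_id]
                node = to_id
            walks.append(walk)
    return walks
```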
statement
statement = ("model_c/bin/hin2vec -size %d -train %s -alpha %f "
"-output %s -output_mp %s -window %d -negative %d "
"-threads %d -no_circle %d -sigmoid_reg %d "
"" % (options.dim, #dimensionality of the embeddings
tmp_walk_fname, #training file containing the random walks
options.alpha, #learning rate
tmp_node_vec_fname, #output file for the node vectors
tmp_path_vec_fname, #output file for the meta-path vectors
options.window, #window size
options.neg, #number of negative samples
options.num_processes, #number of processes
1-(options.allow_circle * 1), #disallow circles
options.sigmoid_reg * 1)) #regularization for the meta-path vectors
-
The script then runs the compiled hin2vec binary via os.system(statement):
model_c/bin/hin2vec -size 2 -train /tmp/tmphTXGnA -alpha 0.025000 -output /tmp/tmp1pwao9 -output_mp /tmp/tmpNfXWnF -window 2 -negative 5 -threads 1 -no_circle 1 -sigmoid_reg 0
Output:
Starting training using file /tmp/tmpxYLn1w #TrainModel() starts training
Node size: 34 #number of distinct node ids. LearnVocabFromTrainFile(): read nodes -> hash each node -> add to vocab -> sort
Nodes in train file: 340000 #number of node occurrences in the file
0 meta-path:0 339660 #meta-path id, meta-path type U-U, and its count. LearnMpVocabFromTrainFile(); SearchMpVocab returns the meta-path's position in the vocabulary
1 meta-path:00 339320 #meta-path type U-U,U-U
Meta-path size: 2 #number of meta-path types
Meta-paths in train file: 678980 #total number of meta-path occurrences in the file
Alpha: 0.001544 Progress(330000/340000): 97.06% Words/thread/sec: 631.54k #TrainModelThread()
save node vectors
save mp vectors
----------------------------------------
pickle's dump() and load() functions
dump (serialization): converting an object into a byte stream.
load (deserialization): restoring an object from a byte stream.
pickle vs cPickle
pickle is implemented in pure Python, while cPickle is implemented in C and is many times faster, so prefer cPickle when it is available. Either one turns an object into a format that can be stored or transmitted.
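A minimal dump/load round trip (note: in Python 3 cPickle is gone; the standard pickle module uses the C implementation, `_pickle`, automatically):

```python
import pickle

graph = {0: {1: {0: 1.0}}}        # e.g. a network.py-style graph dict

data = pickle.dumps(graph)        # serialize: object -> bytes
restored = pickle.loads(data)     # deserialize: bytes -> object
assert restored == graph and isinstance(data, bytes)
```

pickle.dump(obj, f) / pickle.load(f) do the same against a file opened in binary mode, which is how a pickled graph file would be produced and read.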
Check the learned vectors
-
exp_classification.py
'''\
%prog [options] <node_vec_fname> <groundtruth_fname>
groundtruth_fname example: res/karate_club_groups.txt
'''
The members of Zachary's karate club naturally split into two groups. Treat predicting a member's group as binary classification:
python tools/exp_classification.py node_vectors.txt res/karate_club_groups.txt
(tests the classification accuracy of the node vectors)
Output:
1 16 34 #class 1: 16 of 34 members
0.8266666666666665
2 18 34 #class 2: 18 of 34 members
0.8914285714285715
[0.8266666666666665, 0.8914285714285715] #list of per-class F1 scores
macro f1: 0.8590476190476191
micro f1: 0.860952380952
---------------------------------------------
-
F1-score
micro-F1: first pool the TP, FP, and FN counts over all classes, then compute a single F1 from the totals
range: [0, 1];
usage: multi-class, imbalanced data; a heavily skewed class distribution dominates (and can distort) the score;
macro-F1: compute F1 for each class separately, then take the unweighted average (every class gets the same weight)
range: [0, 1];
usage: multi-class problems; not dominated by class frequency, but easily pulled up by well-recognized classes (high recall and high precision);
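The two averaging schemes can be spelled out in a few lines (plain-Python sketch; `sklearn.metrics.f1_score` with `average='micro'`/`'macro'` computes the same quantities):

```python
def f1(tp, fp, fn):
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def counts(y_true, y_pred, c):
    # Per-class true positives, false positives, false negatives.
    tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
    fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
    fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
    return tp, fp, fn

def macro_micro_f1(y_true, y_pred, classes):
    # macro: per-class F1, then unweighted mean (each class counts equally)
    macro = sum(f1(*counts(y_true, y_pred, c)) for c in classes) / len(classes)
    # micro: pool TP/FP/FN over all classes, then one global F1
    tp, fp, fn = (sum(x) for x in zip(*(counts(y_true, y_pred, c) for c in classes)))
    return macro, f1(tp, fp, fn)
```

For y_true=[0, 0, 1, 1], y_pred=[0, 1, 1, 1]: class 0 has F1 = 2/3 and class 1 has F1 = 4/5, so macro = 11/15 ≈ 0.733, while the pooled tp=3, fp=1, fn=1 give micro = 0.75.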