Similarity of two trees or graphs stored as networkx graphs (Zhang-Shasha tree edit distance)

The general solution is the Zhang-Shasha tree edit distance. The algorithm has been open-sourced on GitHub: https://github.com/timtadh/zhang-shasha. What follows is a usage demo with networkx graphs.

I want to compute the Zhang-Shasha tree edit distance between two trees (zss library). However, my trees are stored as networkx graphs (they actually represent HTML DOM trees). The example in the zss documentation shows how to create a tree by hand:

import zss
from zss import Node

# Build the tree by chaining addkid() calls; each call returns the parent node.
A = (
    Node("f")
        .addkid(Node("a")
            .addkid(Node("h"))
            .addkid(Node("c")
                .addkid(Node("l"))))
        .addkid(Node("e"))
    )
zss.simple_distance(A, A)  # 0.0

Which would be the same tree as:

import networkx as nx
G=nx.DiGraph()
G.add_edges_from([('f', 'a'), ('a', 'h'), ('a', 'c'), ('c', 'l'), ('f', 'e')])

So I would like to convert a networkx tree object into a zss Node object, and then compute the edit distance between two such trees.

Thanks

Using dfs_tree can definitely help:

import zss
import networkx as nx

G=nx.DiGraph()
G.add_edges_from([('f', 'a'), ('a', 'h'), ('a', 'c'), ('c', 'l'), ('f', 'e')])
T = nx.dfs_tree(G, source='f')
nodes_dict = {}
for edge in T.edges():
    if edge[0] not in nodes_dict:
        nodes_dict[edge[0]] = zss.Node(edge[0])
    if edge[1] not in nodes_dict:
        nodes_dict[edge[1]] = zss.Node(edge[1])
    nodes_dict[edge[0]].addkid(nodes_dict[edge[1]])

print(zss.simple_distance(nodes_dict['f'], nodes_dict['f'])) # 0.0

In case we don't know which node is G's root node, but know we have a valid tree, we can get the source node by calling:

source = [n for (n, d) in G.in_degree() if d == 0][0]
T = nx.dfs_tree(G, source=source)

Since the root is the only node with no incoming edges, that should work.

  • thanks, you supply the source argument as the top node, i guess? How would I automatize this in a function, i mean could I initialize it as T = nx.dfs_tree(G, source=list(G.nodes)[0]), right? – agenis Jan 20 '19 at 22:29
  • That would work, as well as simply omitting the source argument and calling nx.dfs_tree(G). However, it assumes G has a specific node ordering, which is true in the above example, but depends on the way we build it. Added an edit to handle this step in the answer. – zohar.kom Jan 21 '19 at 7:52

I also found that networkx has built-in similarity measures, as shown below (networkx tree distance or similarity).

Similarity Measures

Functions measuring similarity using graph edit distance.

The graph edit distance is the number of edge/node changes needed to make two graphs isomorphic.

The default algorithm/implementation is sub-optimal for some graphs. The problem of finding the exact Graph Edit Distance (GED) is NP-hard so it is often slow. If the simple interface graph_edit_distance takes too long for your graph, try optimize_graph_edit_distance and/or optimize_edit_paths.

At the same time, I encourage capable people to investigate alternative GED algorithms, in order to improve the choices available.

graph_edit_distance(G1, G2[, node_match, …])

Returns GED (graph edit distance) between graphs G1 and G2.

optimal_edit_paths(G1, G2[, node_match, …])

Returns all minimum-cost edit paths transforming G1 to G2.

optimize_graph_edit_distance(G1, G2[, …])

Returns consecutive approximations of GED (graph edit distance) between graphs G1 and G2.

optimize_edit_paths(G1, G2[, node_match, …])

GED (graph edit distance) calculation: advanced interface.

simrank_similarity(G[, source, target, …])

Returns the SimRank similarity of nodes in the graph G.

simrank_similarity_numpy(G[, source, …])

Calculate SimRank of nodes in G using matrices with n
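To make the listing above concrete, here is a small sketch that runs graph_edit_distance and optimize_graph_edit_distance on the same two trees used in the zss example. The helper labeled_tree and the "label" attribute name are assumptions made for this illustration: networkx only compares node attributes (through node_match), not node identifiers, so each node's name is copied into an attribute first.

```python
import networkx as nx


def labeled_tree(edges):
    """Build a DiGraph whose nodes carry their own name as a 'label' attribute."""
    G = nx.DiGraph(edges)
    nx.set_node_attributes(G, {n: n for n in G}, "label")
    return G


def same_label(attrs1, attrs2):
    # Called with the attribute dictionaries of one node from each graph.
    return attrs1["label"] == attrs2["label"]


G1 = labeled_tree([('f', 'a'), ('a', 'h'), ('a', 'c'), ('c', 'l'), ('f', 'e')])
G2 = labeled_tree([('f', 'a'), ('a', 'h'), ('a', 'c'), ('f', 'e')])  # leaf 'l' removed

# Exact GED; fine for tiny graphs, potentially very slow for large ones.
ged = nx.graph_edit_distance(G1, G2, node_match=same_label)
print(ged)  # 2: one node deletion plus one edge deletion

# Anytime variant: a generator of successively better approximations.
best = None
for best in nx.optimize_graph_edit_distance(G1, G2, node_match=same_label):
    print("best cost so far:", best)
```

Unlike the tree edit distance from zss, GED charges for edges as well as nodes, so removing a single leaf costs 2 here rather than 1.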

Here is a demo of the networkx built-in functions recommended by the official documentation, for comparison with zss:

networkx.algorithms.similarity.optimal_edit_paths

optimal_edit_paths(G1, G2, node_match=None, edge_match=None, node_subst_cost=None, node_del_cost=None, node_ins_cost=None, edge_subst_cost=None, edge_del_cost=None, edge_ins_cost=None, upper_bound=None)

Returns all minimum-cost edit paths transforming G1 to G2.

Graph edit path is a sequence of node and edge edit operations transforming graph G1 to graph isomorphic to G2. Edit operations include substitutions, deletions, and insertions.

Parameters

  • G1, G2 (graphs) – The two graphs G1 and G2 must be of the same type.

  • node_match (callable) – A function that returns True if node n1 in G1 and n2 in G2 should be considered equal during matching.

    The function will be called like

    node_match(G1.nodes[n1], G2.nodes[n2]).

    That is, the function will receive the node attribute dictionaries for n1 and n2 as inputs.

    Ignored if node_subst_cost is specified. If neither node_match nor node_subst_cost are specified then node attributes are not considered.

  • edge_match (callable) – A function that returns True if the edge attribute dictionaries for the pair of nodes (u1, v1) in G1 and (u2, v2) in G2 should be considered equal during matching.

    The function will be called like

    edge_match(G1[u1][v1], G2[u2][v2]).

    That is, the function will receive the edge attribute dictionaries of the edges under consideration.

    Ignored if edge_subst_cost is specified. If neither edge_match nor edge_subst_cost are specified then edge attributes are not considered.

  • node_subst_cost, node_del_cost, node_ins_cost (callable) – Functions that return the costs of node substitution, node deletion, and node insertion, respectively.

    The functions will be called like

    node_subst_cost(G1.nodes[n1], G2.nodes[n2]), node_del_cost(G1.nodes[n1]), node_ins_cost(G2.nodes[n2]).

    That is, the functions will receive the node attribute dictionaries as inputs. The functions are expected to return positive numeric values.

    Function node_subst_cost overrides node_match if specified. If neither node_match nor node_subst_cost are specified then default node substitution cost of 0 is used (node attributes are not considered during matching).

    If node_del_cost is not specified then default node deletion cost of 1 is used. If node_ins_cost is not specified then default node insertion cost of 1 is used.

  • edge_subst_cost, edge_del_cost, edge_ins_cost (callable) – Functions that return the costs of edge substitution, edge deletion, and edge insertion, respectively.

    The functions will be called like

    edge_subst_cost(G1[u1][v1], G2[u2][v2]), edge_del_cost(G1[u1][v1]), edge_ins_cost(G2[u2][v2]).

    That is, the functions will receive the edge attribute dictionaries as inputs. The functions are expected to return positive numeric values.

    Function edge_subst_cost overrides edge_match if specified. If neither edge_match nor edge_subst_cost are specified then default edge substitution cost of 0 is used (edge attributes are not considered during matching).

    If edge_del_cost is not specified then default edge deletion cost of 1 is used. If edge_ins_cost is not specified then default edge insertion cost of 1 is used.

  • upper_bound (numeric) – Maximum edit distance to consider.

Returns

  • edit_paths (list of tuples (node_edit_path, edge_edit_path)) – node_edit_path: list of tuples (u, v); edge_edit_path: list of tuples ((u1, v1), (u2, v2)).

  • cost (numeric) – Optimal edit path cost (graph edit distance).

Examples

>>> import networkx as nx
>>> G1 = nx.cycle_graph(4)
>>> G2 = nx.wheel_graph(5)
>>> paths, cost = nx.optimal_edit_paths(G1, G2)
>>> len(paths)
40
>>> cost
5.0
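Since the trees in question represent DOM trees, the cost callables described under Parameters are probably the most relevant part. The following is a minimal sketch under assumed data: the two three-node trees, the "tag" attribute, and the function name tag_subst_cost are all invented for illustration. It charges 1 for substituting nodes with different tags and keeps the default cost of 1 for insertions and deletions.

```python
import networkx as nx


def dom_tree(tags, edges):
    """Build a small DiGraph with a 'tag' attribute on every node."""
    G = nx.DiGraph()
    for node, tag in tags.items():
        G.add_node(node, tag=tag)
    G.add_edges_from(edges)
    return G


# <div><p/><span/></div>  versus  <div><p/><a/></div>
G1 = dom_tree({0: "div", 1: "p", 2: "span"}, [(0, 1), (0, 2)])
G2 = dom_tree({0: "div", 1: "p", 2: "a"}, [(0, 1), (0, 2)])


def tag_subst_cost(attrs1, attrs2):
    # Receives the node attribute dictionaries; free if the tags agree.
    return 0 if attrs1["tag"] == attrs2["tag"] else 1


paths, cost = nx.optimal_edit_paths(G1, G2, node_subst_cost=tag_subst_cost)
print(cost)  # 1: only the span -> a substitution is charged

node_path, edge_path = paths[0]
print(sorted(node_path))  # each (u, v) pair maps node u of G1 to node v of G2
```

Because node_subst_cost is given, node_match would be ignored, as the parameter description above states.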

See also

graph_edit_distance(), optimize_edit_paths()
