The general solution is the Zhang-Shasha tree edit distance. The algorithm is open-sourced on GitHub at https://github.com/timtadh/zhang-shasha. The following walkthrough is a usage demo with networkx graphs.
I want to compute the Zhang-Shasha tree edit distance between two trees (using the zss library). However, my trees are networkx graphs (they actually represent HTML DOM trees). The example in the zss documentation shows how to create a tree by hand:
from zss import simple_distance, Node
A = (
    Node("f")
    .addkid(Node("a")
        .addkid(Node("h"))
        .addkid(Node("c")
            .addkid(Node("l"))))
    .addkid(Node("e"))
)
simple_distance(A, A)  # 0.0: a tree is at distance zero from itself
Which would be the same tree as:
import networkx as nx
G=nx.DiGraph()
G.add_edges_from([('f', 'a'), ('a', 'h'), ('a', 'c'), ('c', 'l'), ('f', 'e')])
So I would like to convert a networkx tree into a zss Node object, then compute the edit distance between two such trees.
Thanks
Using dfs_tree can definitely help:
import zss
import networkx as nx
G=nx.DiGraph()
G.add_edges_from([('f', 'a'), ('a', 'h'), ('a', 'c'), ('c', 'l'), ('f', 'e')])
T = nx.dfs_tree(G, source='f')
nodes_dict = {}
for edge in T.edges():
    # Create a zss node for each endpoint the first time it is seen
    if edge[0] not in nodes_dict:
        nodes_dict[edge[0]] = zss.Node(edge[0])
    if edge[1] not in nodes_dict:
        nodes_dict[edge[1]] = zss.Node(edge[1])
    # Attach the child to its parent
    nodes_dict[edge[0]].addkid(nodes_dict[edge[1]])
print(zss.simple_distance(nodes_dict['f'], nodes_dict['f'])) # 0.0
In case we don't know which node is G's root node, but know we have a valid tree, we can get the source node by calling:
source = [n for (n, d) in G.in_degree() if d == 0][0]
T = nx.dfs_tree(G, source=source)
Since the root is the only node with no incoming edges, that should work.
- Thanks. You supply the source argument as the top node, I guess? How would I automate this in a function? I mean, could I initialize it as T = nx.dfs_tree(G, source=list(G.nodes)[0]), right? – agenis Jan 20 '19 at 22:29
- That would work, as well as simply omitting the source argument and calling nx.dfs_tree(G). However, it assumes G has a specific node ordering, which is true in the above example, but depends on the way we build it. Added an edit to handle this step in the answer. – zohar.kom Jan 21 '19 at 7:52
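To make the node-ordering caveat from the comments concrete, here is a small sketch. The two-edge graph is an assumption chosen so that the root 'f' is not the first node inserted; it is not taken from the original question.

```python
import networkx as nx

G = nx.DiGraph()
# The edge ('a', 'h') is added first, so 'a' (not the root 'f') is the first node in G.
G.add_edges_from([('a', 'h'), ('f', 'a')])

first_node = list(G.nodes)[0]
root = next(n for n, d in G.in_degree() if d == 0)

# Starting from the first inserted node only reaches the subtree below it.
T_first = nx.dfs_tree(G, source=first_node)
# Omitting source keeps every node but drops the edge into the already visited 'a'.
T_default = nx.dfs_tree(G)
# Starting from the in-degree-0 node recovers the whole tree.
T_root = nx.dfs_tree(G, source=root)

print(first_node, root)
print(sorted(T_first.edges()), sorted(T_default.edges()), sorted(T_root.edges()))
```

Only the traversal that starts from the node with in-degree 0 contains both edges, which is why looking up the root explicitly is the safer choice.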
I also found that networkx provides its own similarity measures, as shown below (networkx tree distance or similarity).
Similarity Measures
Functions measuring similarity using graph edit distance.
The graph edit distance is the number of edge/node changes needed to make two graphs isomorphic.
The default algorithm/implementation is sub-optimal for some graphs. The problem of finding the exact Graph Edit Distance (GED) is NP-hard so it is often slow. If the simple interface graph_edit_distance takes too long for your graph, try optimize_graph_edit_distance and/or optimize_edit_paths.
At the same time, I encourage capable people to investigate alternative GED algorithms, in order to improve the choices available.
| graph_edit_distance | Returns GED (graph edit distance) between graphs G1 and G2. |
| optimal_edit_paths | Returns all minimum-cost edit paths transforming G1 to G2. |
| optimize_graph_edit_distance | Returns consecutive approximations of GED (graph edit distance) between graphs G1 and G2. |
| optimize_edit_paths | GED (graph edit distance) calculation: advanced interface. |
| simrank_similarity | Returns the SimRank similarity of nodes in the graph G. |
| simrank_similarity_numpy | Calculate SimRank of nodes in G using matrices with numpy. |
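As a rough illustration of the simple interface and the anytime variant described above, the sketch below compares the example tree with a copy that lacks the leaf 'l'. The second graph is an assumption made for the example. With default arguments, node names and attributes are not considered, so only the structure counts.

```python
import networkx as nx

G1 = nx.DiGraph([('f', 'a'), ('a', 'h'), ('a', 'c'), ('c', 'l'), ('f', 'e')])
# Same tree without the leaf 'l' and its incoming edge.
G2 = nx.DiGraph([('f', 'a'), ('a', 'h'), ('a', 'c'), ('f', 'e')])

# Exact GED: one node deletion plus one edge deletion, each with default cost 1.
ged = nx.graph_edit_distance(G1, G2)

# Anytime variant: a generator of successively better upper bounds on the GED.
approximations = list(nx.optimize_graph_edit_distance(G1, G2))

print(ged)
print(approximations)
```

On larger graphs one would typically stop consuming the generator early and keep the latest value instead of exhausting it as done here.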
Here is the documentation of one of the networkx built-in functions recommended on the official site, for comparison with zss:
networkx.algorithms.similarity.optimal_edit_paths
optimal_edit_paths(G1, G2, node_match=None, edge_match=None, node_subst_cost=None, node_del_cost=None, node_ins_cost=None, edge_subst_cost=None, edge_del_cost=None, edge_ins_cost=None, upper_bound=None)
Returns all minimum-cost edit paths transforming G1 to G2.
Graph edit path is a sequence of node and edge edit operations transforming graph G1 to graph isomorphic to G2. Edit operations include substitutions, deletions, and insertions.
Parameters
- G1, G2 (graphs) – The two graphs G1 and G2 must be of the same type.
- node_match (callable) – A function that returns True if node n1 in G1 and n2 in G2 should be considered equal during matching.
  The function will be called like
  node_match(G1.nodes[n1], G2.nodes[n2]).
  That is, the function will receive the node attribute dictionaries for n1 and n2 as inputs.
  Ignored if node_subst_cost is specified. If neither node_match nor node_subst_cost are specified then node attributes are not considered.
- edge_match (callable) – A function that returns True if the edge attribute dictionaries for the pair of nodes (u1, v1) in G1 and (u2, v2) in G2 should be considered equal during matching.
  The function will be called like
  edge_match(G1[u1][v1], G2[u2][v2]).
  That is, the function will receive the edge attribute dictionaries of the edges under consideration.
  Ignored if edge_subst_cost is specified. If neither edge_match nor edge_subst_cost are specified then edge attributes are not considered.
- node_subst_cost, node_del_cost, node_ins_cost (callable) – Functions that return the costs of node substitution, node deletion, and node insertion, respectively.
  The functions will be called like
  node_subst_cost(G1.nodes[n1], G2.nodes[n2]), node_del_cost(G1.nodes[n1]), node_ins_cost(G2.nodes[n2]).
  That is, the functions will receive the node attribute dictionaries as inputs. The functions are expected to return positive numeric values.
  Function node_subst_cost overrides node_match if specified. If neither node_match nor node_subst_cost are specified then default node substitution cost of 0 is used (node attributes are not considered during matching).
  If node_del_cost is not specified then default node deletion cost of 1 is used. If node_ins_cost is not specified then default node insertion cost of 1 is used.
- edge_subst_cost, edge_del_cost, edge_ins_cost (callable) – Functions that return the costs of edge substitution, edge deletion, and edge insertion, respectively.
  The functions will be called like
  edge_subst_cost(G1[u1][v1], G2[u2][v2]), edge_del_cost(G1[u1][v1]), edge_ins_cost(G2[u2][v2]).
  That is, the functions will receive the edge attribute dictionaries as inputs. The functions are expected to return positive numeric values.
  Function edge_subst_cost overrides edge_match if specified. If neither edge_match nor edge_subst_cost are specified then default edge substitution cost of 0 is used (edge attributes are not considered during matching).
  If edge_del_cost is not specified then default edge deletion cost of 1 is used. If edge_ins_cost is not specified then default edge insertion cost of 1 is used.
- upper_bound (numeric) – Maximum edit distance to consider.
Returns
- edit_paths (list of tuples (node_edit_path, edge_edit_path)) – node_edit_path: list of tuples (u, v); edge_edit_path: list of tuples ((u1, v1), (u2, v2))
- cost (numeric) – Optimal edit path cost (graph edit distance).
Examples
>>> G1 = nx.cycle_graph(4)
>>> G2 = nx.wheel_graph(5)
>>> paths, cost = nx.optimal_edit_paths(G1, G2)
>>> len(paths)
40
>>> cost
5.0
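Because node attributes are ignored unless node_match or node_subst_cost is given, a comparison that is closer in spirit to zss needs the labels stored as node attributes. The sketch below assumes a hypothetical helper labeled_tree and an attribute name 'label'; neither comes from networkx itself.

```python
import networkx as nx


def labeled_tree(edges):
    """Build a DiGraph whose nodes carry their own name in a 'label' attribute.

    Hypothetical helper for this example only.
    """
    G = nx.DiGraph(edges)
    for n in G.nodes:
        G.nodes[n]['label'] = n
    return G


def same_label(attrs1, attrs2):
    # Receives the node attribute dictionaries, as described under node_match.
    return attrs1['label'] == attrs2['label']


G1 = labeled_tree([('f', 'a'), ('f', 'e')])
G2 = labeled_tree([('f', 'a'), ('f', 'x')])  # same shape, one leaf relabeled

# Without node_match only the structure is compared.
structural = nx.graph_edit_distance(G1, G2)
# With node_match, substituting nodes with different labels costs 1.
label_aware = nx.graph_edit_distance(G1, G2, node_match=same_label)
paths, cost = nx.optimal_edit_paths(G1, G2, node_match=same_label)

print(structural, label_aware, cost)
```

Unlike the Zhang-Shasha distance, GED treats the inputs as general graphs, so sibling order plays no role and the result is not guaranteed to match what zss reports for the same pair of trees.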
See also