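I am building a NetworkX .MultiDiGraph() object from a total of 82,927 directed email records. At this stage, I am trying to get the largest strongly connected component of the .MultiDiGraph() object, along with its corresponding subgraph.
The text data can be accessed here.
Here is my working code: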
import networkx as nx
import pandas as pd
import matplotlib.pyplot as plt
email_df = pd.read_csv('email_network.txt', delimiter = '->')
edge_groups = email_df.groupby(["#Sender", "Recipient"], as_index=False).count().rename(columns={"time":"weight"})
email = nx.from_pandas_dataframe(edge_groups, '#Sender', 'Recipient', edge_attr = 'weight')
G = nx.MultiDiGraph()
G.add_edges_from(email.edges(data=True))
# G is a .MultiDiGraph object
# using .strongly_connected_components() to get the part of G that has the most nodes
# using list comprehension
number_of_nodes = [len(n) for n in sorted(nx.strongly_connected_components(G))]
number_of_nodes
# 'number_of_nodes' returns a list [1, 1, 1, ..., 1] of length 167 (which is exactly the number of nodes in the network)
# using the recommended method in networkx documentation
largest = max(nx.strongly_connected_components(G), key=len)
largest
# 'largest' returns {92}, not sure what this means...
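As noted in the code block above, the list comprehension returns a list [1, 1, 1, …, 1] of length 167 (the total number of nodes in my data), while max(nx.strongly_connected_components(G), key=len) returns {92}, and I am not sure what that means.
There seems to be something wrong with my code, and I may have missed a few key steps in processing the data. Could someone take a look and enlighten me?
Thanks.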
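For context on what these return values mean: nx.strongly_connected_components yields one set of node labels per component, so {92} is a component made up of only the node labeled 92, and a list of 167 ones says that no two nodes can reach each other in both directions. A minimal sketch on a made-up three-node graph (the nodes and edges are hypothetical, not taken from the email data):

```python
import networkx as nx

# Hypothetical toy graph: 1 and 2 email each other, 2 also emails 3
G = nx.MultiDiGraph()
G.add_edges_from([(1, 2), (2, 1), (2, 3)])

# Each component is a set of node labels
components = list(nx.strongly_connected_components(G))
sizes = sorted(len(c) for c in components)
largest = max(components, key=len)

print(sizes)    # component sizes, smallest first
print(largest)  # the set of nodes in the biggest component
```

Here nodes 1 and 2 form one component because each can reach the other, while node 3 sits alone in its own one-element set.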
Note: revised code (credit to Eric and Joel)
import networkx as nx
import pandas as pd
import matplotlib.pyplot as plt
email_df = pd.read_csv('email_network.txt', delimiter='\t')  # assumed tab-delimited; the delimiter in the post is not legible
edge_groups = email_df.groupby(["#Sender", "Recipient"], as_index=False).count().rename(columns={"time":"weight"})
# per @Joel's comment, adding 'create_using = nx.DiGraph()'
email = nx.from_pandas_dataframe(edge_groups, '#Sender', 'Recipient', edge_attr = 'weight', create_using = nx.DiGraph())
# adding this 'directed' edge list to .MultiDiGraph() object
G = nx.MultiDiGraph()
G.add_edges_from(email.edges(data=True))
Now we check the largest strongly connected component in this network (in terms of number of nodes).
In [1]: largest = max(nx.strongly_connected_components(G), key=len)
In [2]: len(largest)
Out [2]: 126
The largest strongly connected component contains 126 nodes.
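A likely reason the first attempt produced only single-node components: without create_using, the intermediate graph is undirected, and an undirected graph reports each edge once, oriented by node order rather than by sender and recipient. Copying those edges into a MultiDiGraph then gives a graph with no directed cycles. The sketch below reproduces this on a hypothetical three-node cycle; it uses plain nx.Graph and nx.DiGraph instead of the pandas loader, so it is an illustration of the effect, not the exact code path above.

```python
import networkx as nx

# Hypothetical toy data: a directed 3-cycle a -> b -> c -> a
directed_edges = [("a", "b"), ("b", "c"), ("c", "a")]

# Undirected intermediate graph, as when create_using is omitted
undirected = nx.Graph()
undirected.add_edges_from(directed_edges)

# Direction is lost: edges come back oriented by node order
lost = nx.MultiDiGraph()
lost.add_edges_from(undirected.edges(data=True))

# Directed intermediate graph, as with create_using=nx.DiGraph()
kept = nx.MultiDiGraph()
kept.add_edges_from(nx.DiGraph(directed_edges).edges(data=True))

lost_sizes = sorted(len(c) for c in nx.strongly_connected_components(lost))
kept_sizes = sorted(len(c) for c in nx.strongly_connected_components(kept))
print(lost_sizes)
print(kept_sizes)
```

With the undirected detour every component has size 1, while the directed version keeps the cycle as a single component of three nodes.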
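[Update]
After further trial and error, I found that you need to pass create_using=nx.MultiDiGraph() (rather than nx.DiGraph()) when loading the data into networkx. Otherwise, even if you get the correct number of nodes for the MultiDiGraph and its weakly/strongly connected subgraphs, you may still get the wrong number of edges! This will show up in your .strongly_connected_component_subgraphs() output.
For my case, I would recommend this one-liner to others: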
import networkx as nx
import pandas as pd
import matplotlib.pyplot as plt
G = nx.read_edgelist(path="email_network.txt", data=[('time', int)], create_using=nx.MultiDiGraph(), nodetype=str)
We can then run .strongly_connected_components(G) and .strongly_connected_component_subgraphs(G) to verify.
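To see what the edge-list reader does with repeated sender/recipient pairs, here is a small sketch that uses nx.parse_edgelist, which applies the same parsing to an in-memory list of lines. The three data rows are invented, and the tab-separated layout with a '#'-prefixed header is an assumption about the file format.

```python
import networkx as nx

# Hypothetical lines in the assumed file layout; the header starts with '#',
# which the parser treats as a comment and skips by default
lines = [
    "#Sender\tRecipient\ttime",
    "1\t2\t100",
    "1\t2\t200",
    "2\t1\t300",
]

G = nx.parse_edgelist(
    lines,
    data=[("time", int)],            # third column becomes an int 'time' attribute
    create_using=nx.MultiDiGraph(),  # keep direction and parallel edges
    nodetype=str,
)

print(G.number_of_nodes())
print(G.number_of_edges())
# All 'time' values recorded on the parallel edges from '1' to '2'
times_1_to_2 = sorted(d["time"] for d in G["1"]["2"].values())
print(times_1_to_2)
```

Both emails from 1 to 2 survive as separate edges, each with its own time attribute, which is what the MultiDiGraph is for.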
If you use the networkx output G from the first code block, max(nx.strongly_connected_components(G), key=len) gives a result with 126 nodes and 52xx edges, but if you apply the one-liner listed above, you get:
In [1]: largest = max(nx.strongly_connected_components(G), key=len)
In [2]: G_sc = max(nx.strongly_connected_component_subgraphs(G), key=len)
In [3]: nx.number_of_nodes(G_sc)
Out [3]: 126
In [4]: nx.number_of_edges(G_sc)
Out [4]: 82130
Both approaches give the same number of nodes, but the number of edges differs because the different networkx graph classes count edges differently.
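The difference in edge counts can be sketched on a toy email log (the records below are hypothetical). A DiGraph stores at most one edge per ordered pair, while a MultiDiGraph stores one edge per email. The sketch builds the subgraph with G.subgraph(...) on the largest component, which is the pattern the networkx documentation suggests for newer releases where strongly_connected_component_subgraphs is no longer available; largest_scc_subgraph is a helper name introduced here for illustration.

```python
import networkx as nx

# Hypothetical toy email log: (sender, recipient, time)
emails = [
    ("a", "b", 1), ("a", "b", 2), ("b", "a", 3),
    ("b", "c", 4), ("c", "a", 5), ("c", "d", 6),
]

multi = nx.MultiDiGraph()   # one edge per email
simple = nx.DiGraph()       # one edge per (sender, recipient) pair
for sender, recipient, time in emails:
    multi.add_edge(sender, recipient, time=time)
    simple.add_edge(sender, recipient)

def largest_scc_subgraph(G):
    """Return the subgraph induced by the largest strongly connected component."""
    largest = max(nx.strongly_connected_components(G), key=len)
    return G.subgraph(largest)

multi_sc = largest_scc_subgraph(multi)
simple_sc = largest_scc_subgraph(simple)

print(multi_sc.number_of_nodes(), multi_sc.number_of_edges())
print(simple_sc.number_of_nodes(), simple_sc.number_of_edges())
```

Both subgraphs contain the same three nodes, but the MultiDiGraph version has one more edge because the two emails from a to b are counted separately.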