Programming Assignment 3: Packages and Modules for Network Analysis
I. GraphStat
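Overall package structure
The graph-processing package is GraphStat. It contains two subpackages, NetworkBuilder and Visualization.
The NetworkBuilder subpackage has three modules, graph, node, and stat, which create the node and graph structures and provide basic statistics on them.
The Visualization subpackage has two modules, plotgraph and plotnodes, which plot statistics based on the graph and node structures built by NetworkBuilder.
The package's file layout is as follows: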
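As a minimal sketch of that layout, reconstructed from the description above rather than taken from the author's own listing, the snippet below writes placeholder modules into a temporary directory and imports each of them. The `__init__.py` files and the `NAME` marker in each module are assumptions made for illustration.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Assumed layout: one placeholder source string per file (hypothetical content).
LAYOUT = {
    "GraphStat/__init__.py": "",
    "GraphStat/NetworkBuilder/__init__.py": "",
    "GraphStat/NetworkBuilder/node.py": "NAME = 'node'\n",
    "GraphStat/NetworkBuilder/graph.py": "NAME = 'graph'\n",
    "GraphStat/NetworkBuilder/stat.py": "NAME = 'stat'\n",
    "GraphStat/Visualization/__init__.py": "",
    "GraphStat/Visualization/plotgraph.py": "NAME = 'plotgraph'\n",
    "GraphStat/Visualization/plotnodes.py": "NAME = 'plotnodes'\n",
}

MODULES = (
    ("NetworkBuilder", ("node", "graph", "stat")),
    ("Visualization", ("plotgraph", "plotnodes")),
)


def build_skeleton(root):
    """Write every placeholder file of the assumed layout under `root`."""
    for rel_path, source in LAYOUT.items():
        target = Path(root) / rel_path
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(source, encoding="utf-8")


def import_all(root):
    """Import each module of the skeleton and collect its NAME marker."""
    sys.path.insert(0, str(root))
    importlib.invalidate_caches()  # the files were created after startup
    try:
        names = []
        for subpackage, modules in MODULES:
            for module_name in modules:
                module = importlib.import_module(
                    f"GraphStat.{subpackage}.{module_name}")
                names.append(module.NAME)
        return names
    finally:
        sys.path.remove(str(root))


with tempfile.TemporaryDirectory() as tmp:
    build_skeleton(tmp)
    imported = import_all(tmp)
print(imported)
```

Because `stat` lives inside the package, importing it as `GraphStat.NetworkBuilder.stat` does not clash with the standard-library module of the same name.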
II. The NetworkBuilder package
1. The node module
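This module provides three functions:
init_node loads the edge list and the node attributes from CSV files;
get_degree computes the degree of every node;
print_node prints the attributes of a given node.
The code is as follows: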
import csv
from collections import defaultdict


def init_node(file_path_edge, file_path_feature):
    """Load the edge list and node features from CSV files, dropping the header rows."""
    edge_info = []
    feature_info = []
    with open(file_path_edge, encoding='utf-8', newline='') as e:
        reader = csv.reader(e)
        for each_row in reader:
            edge_info.append(each_row)
    with open(file_path_feature, encoding='utf-8', newline='') as f:
        reader = csv.reader(f)
        for each_row in reader:
            feature_info.append(each_row)
    return edge_info[1:], feature_info[1:]


def get_degree(edge_info):
    """Count the degree of every node from the undirected edge list."""
    degree_dict = defaultdict(int)
    for edge in edge_info:
        degree_dict[int(edge[0])] += 1
        degree_dict[int(edge[1])] += 1
    return degree_dict


def print_node(node_id, feature_info, degree_dict):
    """Print the attributes and the degree of one node."""
    # Assumes that row i of feature_info describes the node whose numeric_id is i.
    node = feature_info[node_id]
    print('views={}\tmature={}\tlife_time={}\tcreated_at={}\tupdated_at={}\t'
          'numeric_id={}\tdead_account={}\tlanguage={}\taffiliate={}\tdegree={}'
          .format(node[0], node[1], node[2], node[3], node[4], node[5], node[6],
                  node[7], node[8], degree_dict[node_id]))
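To make the degree count concrete, here is a small self-contained sketch on an in-memory edge list. The function name `count_degrees` and the toy data are hypothetical; the rows are strings because that is the form `csv.reader` yields.

```python
from collections import Counter


def count_degrees(edges):
    """Return {node_id: degree} for an undirected edge list of (u, v) pairs."""
    degrees = Counter()
    for u, v in edges:
        # Each edge contributes one to the degree of both endpoints.
        degrees[int(u)] += 1
        degrees[int(v)] += 1
    return degrees


# Hypothetical toy edge list: a triangle 0-1-2 plus a pendant node 3.
toy_edges = [["0", "1"], ["0", "2"], ["1", "2"], ["2", "3"]]
toy_degrees = count_degrees(toy_edges)
print(dict(toy_degrees))
```

Since every edge is counted at both endpoints, the degrees always sum to twice the number of edges, which is a cheap sanity check on real data.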
The main function in the __main__ file calls the functions of the node module. Part of the code is as follows:
file_path_edge = '程设作业3/dataset/large_twitch_edges.csv'
file_path_feature = '程设作业3/dataset/large_twitch_features.csv'
edge_info, feature_info = node.init_node(file_path_edge, file_path_feature)
degree_dict = node.get_degree(edge_info)
print(degree_dict)
node.print_node(2, feature_info, degree_dict)
Calling node.get_degree computes the degree of every node, which is then printed. Part of the output is as follows:
The output of node.print_node is as follows:
2. The graph module
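This module provides three functions:
init_graph initializes the graph and builds the network;
save_graph serializes the graph and stores it in a file;
load_graph loads a serialized graph file.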
import networkx as nx
import pickle


def init_graph(edge_info, feature_info):
    """Build an undirected graph from the edge list and the node features."""
    # Column 5 of each feature row is numeric_id. Node ids stay strings,
    # which matches the string ids in the edge list.
    nodes = [row[5] for row in feature_info]
    graph = nx.Graph()
    graph.add_nodes_from(nodes)
    graph.add_edges_from(edge_info)
    return graph


def save_graph(graph):
    """Serialize the graph with pickle and return the path of the file."""
    filepath = 'D:/Documents/Python_datasave/程设作业3/dataset/graph_file.pkl'
    with open(filepath, mode='wb') as f:
        pickle.dump(graph, f, pickle.HIGHEST_PROTOCOL)
    return filepath


def load_graph(filepath):
    """Load a graph that was serialized with save_graph."""
    with open(filepath, mode='rb') as f:
        graph = pickle.load(f)
    return graph
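The save and load functions form a pickle round trip. The sketch below illustrates the same pattern on a plain adjacency dictionary, used here only as a dependency-free stand-in for a networkx graph. The names `save_object` and `load_object` and the toy data are hypothetical.

```python
import os
import pickle
import tempfile


def save_object(obj, filepath):
    """Serialize `obj` to `filepath` in binary mode and return the path."""
    with open(filepath, "wb") as f:
        pickle.dump(obj, f, pickle.HIGHEST_PROTOCOL)
    return filepath


def load_object(filepath):
    """Read back an object written by save_object."""
    with open(filepath, "rb") as f:
        return pickle.load(f)


# Stand-in graph: node id -> set of neighbor ids (hypothetical toy data).
toy_graph = {"0": {"1", "2"}, "1": {"0"}, "2": {"0"}}

with tempfile.TemporaryDirectory() as tmp:
    path = save_object(toy_graph, os.path.join(tmp, "graph_file.pkl"))
    restored = load_object(path)

# Equal content, but a distinct object rebuilt from the file.
print(restored == toy_graph, restored is toy_graph)
```

Comparing the restored object with the original by equality is a stricter check than comparing printed summaries. Note also that pickle.load can execute arbitrary code, so it should only be used on files from a trusted source.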
The main function in the __main__ file calls the three functions of this module. The code is as follows:
G = graph.init_graph(edge_info, feature_info)
filepath = graph.save_graph(G)
G2 = graph.load_graph(filepath)
print(G)
print(G2)
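The output is as follows. The network G is unchanged after being serialized and read back, so the serialization succeeded.
The serialized result is stored as a binary file. Opening the file shows the following: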
3. The stat module
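This module provides five functions:
get_node_number computes the number of nodes;
get_edge_number computes the number of edges;
cal_average_dgree computes the average degree of the network;
cal_degree_distribution computes the degree distribution of the network;
cal_views_distribution computes the distribution of the views attribute.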
from collections import defaultdict
import matplotlib.pyplot as plt


def get_node_number(feature_info):  # number of nodes
    return len(feature_info)


def get_edge_number(edge_info):  # number of edges
    return len(edge_info)


def cal_average_dgree(feature_info, edge_info):  # average degree of the network
    node_number = get_node_number(feature_info)
    edge_number = get_edge_number(edge_info)
    # Every edge contributes to the degree of two nodes.
    return edge_number * 2 / node_number


def cal_degree_distribution(degree_dict):  # degree distribution of the network
    degree_distribution = defaultdict(int)
    degree_lis = []
    for degree in degree_dict.values():
        # Key by degree value, so the result maps degree -> number of nodes.
        degree_distribution[degree] += 1
        degree_lis.append(degree)
    num_bins = list(range(10, 5000, 10))
    plt.hist(degree_lis, num_bins, histtype='bar', rwidth=5)
    plt.show()
    return degree_distribution


def cal_views_distribution(feature_info):  # distribution of the views attribute
    views_distribution = defaultdict(int)
    views_lis = []
    for node in feature_info:
        views = int(node[0])  # column 0 of each feature row is views
        views_distribution[views] += 1
        views_lis.append(views)
    num_bins = list(range(1000, 300000, 1000))
    plt.hist(views_lis, num_bins, histtype='bar', rwidth=5)
    plt.show()
    return views_distribution
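A degree distribution maps each degree value to the number of nodes that have it, and the average degree equals twice the edge count divided by the node count. The following sketch checks both on toy data without any plotting. The function names and the numbers are hypothetical.

```python
from collections import Counter


def degree_distribution(degree_by_node):
    """Map each degree value to the number of nodes that have that degree."""
    return Counter(degree_by_node.values())


def average_degree(node_count, edge_count):
    """Average degree of an undirected graph: each edge touches two nodes."""
    return 2 * edge_count / node_count


# Hypothetical toy graph: a triangle 0-1-2 plus a pendant node 3 (4 edges).
toy_degrees = {0: 2, 1: 2, 2: 3, 3: 1}
toy_edge_count = 4

dist = degree_distribution(toy_degrees)
avg = average_degree(len(toy_degrees), toy_edge_count)
print(sorted(dist.items()), avg)
```

The average computed from the edge count should agree with the mean of the per-node degrees, which gives a simple consistency check between the two inputs.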
The main function in the __main__ file calls the functions above. The code is as follows:
node_number = stat.get_node_number(feature_info)
edge_number = stat.get_edge_number(edge_info)  # count rows of the edge list, not the feature table
average_dgree = stat.cal_average_dgree(feature_info, edge_info)
stat.cal_degree_distribution(degree_dict)
stat.cal_views_distribution(feature_info)
print('node_number={},edge_number={},average_dgree={}'.format(node_number, edge_number, average_dgree))
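The node count, edge count, and average degree are printed as follows:
The plotted degree distribution is as follows:
The plotted distribution of node views is as follows: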
III. The Visualization package
1. The plotgraph module
import networkx as nx
import matplotlib.pyplot as plt


def plot_ego(graph):
    """Draw the ego network of the first node whose degree is 20."""
    selected_node = '1'  # fallback if no node has the target degree
    for node, degree in graph.degree:
        if degree == 20:
            selected_node = node
            break
    # The ego node together with its direct neighbors.
    selected_nodes = set(nx.all_neighbors(graph, selected_node))
    selected_nodes.add(selected_node)
    # Keep only the edges whose two endpoints both lie in that node set.
    selected_edge = []
    for node in selected_nodes:
        for edge in nx.edges(graph, node):
            if edge[0] in selected_nodes and edge[1] in selected_nodes:
                selected_edge.append(edge)
    G2 = nx.Graph()
    G2.add_edges_from(selected_edge)
    print(G2)
    pos1 = nx.spring_layout(G2)
    nx.draw(G2, pos1, with_labels=True, node_size=50, font_weight='bold')
    plt.show()


def plotdgree_distribution(graph):
    """Plot a histogram of the node degrees."""
    y = [degree for _, degree in graph.degree]
    num_bins = list(range(0, 1000, 5))
    plt.title('degree_hist')
    plt.hist(y, num_bins, histtype='bar', rwidth=5)
    plt.show()
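An ego network consists of one center node, its direct neighbors, and every edge among them. As a dependency-free sketch of that extraction step, the code below works on a plain adjacency dictionary. The helper names and the toy edges are hypothetical.

```python
def build_adjacency(edges):
    """Build {node: set of neighbors} from an undirected edge list."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    return adjacency


def ego_edges(adjacency, center):
    """Return the edges among `center` and its direct neighbors."""
    members = adjacency[center] | {center}
    edges = set()
    for u in members:
        for v in adjacency[u]:
            if v in members:
                # Sort the endpoints so each undirected edge is stored once.
                edges.add(tuple(sorted((u, v))))
    return edges


# Hypothetical toy graph: triangle a-b-c with a tail c-d-e.
toy_edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d"), ("d", "e")]
adjacency = build_adjacency(toy_edges)
ego = ego_edges(adjacency, "a")
print(sorted(ego))
```

The edge c-d is dropped because d is not a neighbor of the center. With networkx available, `nx.ego_graph(graph, node)` offers a built-in way to obtain such a subgraph.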
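Calling plot_ego draws the subgraph. The result is as follows (a node of degree 10 was selected, whereas the listing above tests for degree 20):
Calling plotdgree_distribution draws the distribution of node degrees. The result is as follows: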
2. The plotnodes module
The function plot_nodes_attr plots the distributions of two node attributes: the number of views (views_lis) and the lifetime of the node (life_time_lis).
import matplotlib.pyplot as plt


def plot_nodes_attr(graph, feature_info):
    """Plot histograms of the views and life_time attributes side by side."""
    views_lis = []
    life_time_lis = []
    for node in feature_info:
        views_lis.append(int(node[0]))      # column 0: views
        life_time_lis.append(int(node[2]))  # column 2: life_time
    view_bins = list(range(1000, 200000, 1000))
    life_time_bins = list(range(10, 4000, 10))
    plt.subplot(121)
    plt.title('view_hist')
    plt.hist(views_lis, view_bins, histtype='bar', rwidth=5)
    plt.subplot(122)
    plt.title('life_time_hist')
    plt.hist(life_time_lis, life_time_bins, histtype='bar', rwidth=5)
    plt.show()
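When a histogram is given explicit bin edges, values outside the first and last edge are not counted, so with view_bins starting at 1000 any node with fewer than 1000 views is absent from the plot. The sketch below reproduces that binning in plain Python. It assumes the numpy.histogram convention that bins are half-open except the last, which also includes its right edge. The function name and data are hypothetical.

```python
from bisect import bisect_right


def bin_counts(values, edges):
    """Count values per bin and report how many fall outside all bins."""
    counts = [0] * (len(edges) - 1)
    outside = 0
    for value in values:
        if value < edges[0] or value > edges[-1]:
            outside += 1
            continue
        # bisect_right locates the bin; clamp so that a value equal to the
        # right-most edge lands in the last bin.
        index = min(bisect_right(edges, value) - 1, len(counts) - 1)
        counts[index] += 1
    return counts, outside


# Hypothetical toy data with two bins: [1000, 2000) and [2000, 3000].
toy_views = [500, 1000, 1500, 2999, 3000, 250000]
toy_bin_edges = list(range(1000, 4000, 1000))
counts, outside = bin_counts(toy_views, toy_bin_edges)
print(counts, outside)
```

Reporting the number of values outside the bins alongside the plot makes it clear how much of the data the histogram actually shows.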
The resulting plots are as follows:
IV. Discussion
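The statistics show that most nodes have a low degree, while a small number of nodes have a high degree. These high-degree nodes are more closely connected to the rest of the network and therefore have greater influence and reach. When running marketing campaigns or product trials, choosing these high-degree nodes can yield representative feedback and wide propagation, and thus a higher return.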
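Selecting such high-degree nodes can be sketched as a top-k query over the degree dictionary. The function name `top_k_by_degree`, the value of k, and the toy degrees are hypothetical, and degree is used here as the only criterion, as in the discussion above.

```python
import heapq


def top_k_by_degree(degree_by_node, k):
    """Return the k (node, degree) pairs with the highest degree, largest first."""
    return heapq.nlargest(k, degree_by_node.items(), key=lambda item: item[1])


# Hypothetical toy degrees: node 4 is the hub.
toy_degrees = {0: 2, 1: 2, 2: 3, 3: 1, 4: 5}
seeds = top_k_by_degree(toy_degrees, 2)
print(seeds)
```

Using heapq.nlargest avoids sorting the whole dictionary, which matters little on toy data but helps when the network has many nodes and k is small.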