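These are study notes recording all of the assignment code for Course Five, Applied Social Network Analysis in Python, from the Coursera specialization Applied Data Science with Python offered by the University of Michigan. Every assignment passed the autograder with a score of 100/100.
Contents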
Module 1: Why Study Networks and Basics on NetworkX
Assignment 1 - Creating and Manipulating Graphs
Module 2: Network Connectivity
Assignment 2 - Network Connectivity
Module 3: Influence Measures and Network Centralization
Part 1 - Random Graph Identification
Part 2B - New Connections Prediction
Module 1: Why Study Networks and Basics on NetworkX
Assignment 1 - Creating and Manipulating Graphs
Eight employees at a small company were asked to choose 3 movies that they would most enjoy watching for the upcoming company movie night. These choices are stored in the file assets/Employee_Movie_Choices.txt.
A second file, assets/Employee_Relationships.txt, has data on the relationships between different coworkers.
The relationship score has value of -100 (Enemies) to +100 (Best Friends). A value of zero means the two employees haven't interacted or are indifferent.
Both files are tab delimited.
import networkx as nx
import pandas as pd
import numpy as np
# This is the set of employees
employees = set(['Pablo',
'Lee',
'Georgia',
'Vincent',
'Andy',
'Frida',
'Joan',
'Claude'])
# This is the set of movies
movies = set(['The Shawshank Redemption',
'Forrest Gump',
'The Matrix',
'Anaconda',
'The Social Network',
'The Godfather',
'Monty Python and the Holy Grail',
'Snakes on a Plane',
'Kung Fu Panda',
'The Dark Knight',
'Mean Girls'])
# you can use the following function to plot graphs
# make sure to comment it out before submitting to the autograder
def plot_graph(G, weight_name=None):
    '''
    G: a networkx G
    weight_name: name of the attribute for plotting edge weights (if G is weighted)
    '''
    #%matplotlib notebook
    import matplotlib.pyplot as plt
    plt.figure()
    pos = nx.spring_layout(G)
    edges = G.edges()
    weights = None
    if weight_name:
        weights = [int(G[u][v][weight_name]) for u,v in edges]
        labels = nx.get_edge_attributes(G,weight_name)
        nx.draw_networkx_edge_labels(G,pos,edge_labels=labels)
        nx.draw_networkx(G, pos, width=weights);
    else:
        nx.draw_networkx(G, pos,);
Question 1
Using NetworkX, load in the bipartite graph from assets/Employee_Movie_Choices.txt and return that graph.
This function should return a bipartite networkx graph with 19 nodes and 24 edges
def answer_one():
    from networkx.algorithms import bipartite
    # YOUR CODE HERE
    G = bipartite.read_edgelist('assets/Employee_Movie_Choices.txt', delimiter="\t")
    return G
    # raise NotImplementedError()
Question 2
Using the graph from the previous question, add nodes attributes named 'type' where movies have the value 'movie' and employees have the value 'employee' and return that graph.
This function should return a bipartite networkx graph with node attributes {'type': 'movie'} or {'type': 'employee'}
def answer_two():
    # YOUR CODE HERE
    G = answer_one()
    for node in G.nodes():
        # add_node on an existing node only updates its attributes
        if node in movies:
            G.add_node(node, type='movie')
        elif node in employees:
            G.add_node(node, type='employee')
    return G
    # raise NotImplementedError()
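As a side note, the same labelling can be done in one call with nx.set_node_attributes. The sketch below uses a toy graph with made-up names (not the assignment data) and is only meant to show the shape of the call.

```python
import networkx as nx

# Toy graph with hypothetical names (not the assignment data).
toy = nx.Graph([('Ann', 'M1'), ('Bob', 'M1')])
kinds = {'Ann': 'employee', 'Bob': 'employee', 'M1': 'movie'}

# Attach the mapping to the nodes under the attribute name 'type'.
nx.set_node_attributes(toy, kinds, name='type')

# Read the attribute back as a plain {node: value} dict.
node_types = nx.get_node_attributes(toy, 'type')
```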
Question 3
Find a weighted projection of the graph from answer_two which tells us how many movies different pairs of employees have in common.
This function should return a weighted projected graph.
def answer_three():
    # YOUR CODE HERE
    from networkx.algorithms import bipartite
    G = answer_two()
    G = bipartite.weighted_projected_graph(G, employees)
    return G
    # raise NotImplementedError()
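To make the meaning of the projection weights concrete, here is a minimal sketch on a toy bipartite graph. The people and movie names are invented for the example; only bipartite.weighted_projected_graph comes from the solution above.

```python
import networkx as nx
from networkx.algorithms import bipartite

# Toy bipartite graph: hypothetical people on one side, movies on the other.
B = nx.Graph()
people = ['Ann', 'Bob', 'Cy']
B.add_nodes_from(people, bipartite=0)
B.add_nodes_from(['M1', 'M2', 'M3'], bipartite=1)
B.add_edges_from([('Ann', 'M1'), ('Ann', 'M2'),
                  ('Bob', 'M1'), ('Bob', 'M2'),
                  ('Cy', 'M2'), ('Cy', 'M3')])

# Project onto the people; the 'weight' of an edge counts shared movies.
P = bipartite.weighted_projected_graph(B, people)

# Key by frozenset so the lookup does not depend on edge orientation.
shared = {frozenset((u, v)): d['weight'] for u, v, d in P.edges(data=True)}
```

Here Ann and Bob share two movies, while Cy shares only M2 with each of them, so the projected edge weights are 2, 1 and 1.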
Question 4
Suppose you'd like to find out if people that have a high relationship score also like the same types of movies.
Find the pearson correlation between employee relationship scores and the number of movies they have in common. If two employees have no movies in common it should be treated as a 0, not a missing value, and should be included in the correlation calculation.
This function should return a float.
def answer_four():
    # YOUR CODE HERE
    df = pd.read_csv('assets/Employee_Relationships.txt', header=None, delimiter="\t")
    edges = answer_three().edges(data=True)
    # Key each pair by its sorted names so lookups do not depend on edge orientation
    edges_dict = {}
    for employee1, employee2, attr in edges:
        edges_dict[tuple(sorted((employee1, employee2)))] = attr['weight']
    employee_pairs = [tuple(sorted(pair)) for pair in zip(df[0], df[1])]
    # Pairs with no movie in common are absent from the projection and count as 0
    num_movies = [edges_dict.get(pair, 0) for pair in employee_pairs]
    relation_score = df[2]
    result = np.corrcoef(relation_score, num_movies)[0][1]
    return result
    # raise NotImplementedError()
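The key detail in this question is that a pair with no shared movie contributes a 0 rather than being dropped. A minimal sketch of that idea, with invented scores and counts, could look like this:

```python
import numpy as np

# Hypothetical relationship scores for four employee pairs.
scores = {('A', 'B'): 50, ('A', 'C'): -20, ('B', 'C'): 0, ('C', 'D'): 80}
# Shared-movie counts exist only for pairs that appear in the projection.
shared_movies = {('A', 'B'): 2, ('C', 'D'): 3}

pairs = sorted(scores)
x = [scores[p] for p in pairs]
# dict.get(..., 0) turns a missing pair into a zero instead of dropping it.
y = [shared_movies.get(p, 0) for p in pairs]

# Pearson correlation is the off-diagonal entry of the 2x2 matrix.
r = np.corrcoef(x, y)[0, 1]
```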
Module 2: Network Connectivity
Assignment 2 - Network Connectivity
In this assignment you will go through the process of importing and analyzing an internal email communication network between employees of a mid-sized manufacturing company. Each node represents an employee and each directed edge between two nodes represents an individual email. The left node represents the sender and the right node represents the recipient. We will also store the timestamp of each email.
import networkx as nx
#!head assets/email_network.txt
Question 1
Using networkx, load up the directed multigraph from assets/email_network.txt. Make sure the node names are strings.
This function should return a directed multigraph networkx graph.
def answer_one():
    # YOUR CODE HERE
    G = nx.read_edgelist("assets/email_network.txt",delimiter="\t", create_using=nx.MultiDiGraph, nodetype=str, data=(("timestamp", int),))
    return G
    # raise NotImplementedError()
Question 2
How many employees are represented in the network?
How many sender->recipient pairs of employees are there in the network such that sender sent at least one email to recipient? Note that even if a sender sent multiple messages to a recipient, they should only be counted once. You should not exclude cases where an employee sent emails to themselves from this [email] count.
This function should return a tuple with two integers (#employees, # sender->recipient pairs).
def answer_two():
    # YOUR CODE HERE
    G = answer_one()
    num_employees = G.number_of_nodes()
    num_sender_to_recipient = len(set(G.edges()))
    return num_employees, num_sender_to_recipient
    # raise NotImplementedError()
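The difference between counting emails and counting sender->recipient pairs can be seen on a tiny multigraph. The node names and emails below are invented for illustration.

```python
import networkx as nx

M = nx.MultiDiGraph()
# Hypothetical emails: '1' writes to '2' twice, '2' replies once, '3' emails itself.
M.add_edges_from([('1', '2'), ('1', '2'), ('2', '1'), ('3', '3')])

n_emails = M.number_of_edges()   # every parallel edge counts
n_pairs = len(set(M.edges()))    # each (sender, recipient) pair counts once
```

The self-loop ('3', '3') stays in the pair count, which matches the instruction not to exclude employees who emailed themselves.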
Question 3
- Part 1. Assume that information in this company can only be exchanged through email.
When an employee sends an email to another employee, a communication channel has been created, allowing the sender to provide information to the receiver, but not vice versa.
Based on the emails sent in the data, is it possible for information to go from every employee to every other employee?
- Part 2. Now assume that a communication channel established by an email allows information to be exchanged both ways.
Based on the emails sent in the data, is it possible for information to go from every employee to every other employee?
This function should return a tuple of bools (part1, part2).
def answer_three():
    # YOUR CODE HERE
    G = answer_one()
    # One-way channels: every node must reach every other along edge directions
    part1 = nx.is_strongly_connected(G)
    # Two-way channels: connectivity once edge directions are ignored
    part2 = nx.is_weakly_connected(G)
    return part1, part2
    # raise NotImplementedError()
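A small directed chain shows why the two parts can give different answers. This is a toy graph, not the email data.

```python
import networkx as nx

# A one-way chain 'a' -> 'b' -> 'c': nothing can flow back to 'a'.
D = nx.DiGraph([('a', 'b'), ('b', 'c')])
one_way = nx.is_strongly_connected(D)    # directions respected
two_way = nx.is_weakly_connected(D)      # directions ignored

# Closing the loop makes every node reachable from every other node.
D.add_edge('c', 'a')
closed_loop = nx.is_strongly_connected(D)
```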
Question 4
How many nodes are in the largest weakly connected component of the graph?
This function should return an int.
def answer_four():
    # YOUR CODE HERE
    G = answer_one()
    # max with key=len picks the largest component without sorting all of them
    return len(max(nx.weakly_connected_components(G), key=len))
    # raise NotImplementedError()
Question 5
How many nodes are in the largest strongly connected component?
This function should return an int
def answer_five():
    # YOUR CODE HERE
    G = answer_one()
    return len(max(nx.strongly_connected_components(G), key=len))
    # raise NotImplementedError()
Question 6
Using the NetworkX functions strongly_connected_components and subgraph, find the subgraph of nodes in the largest strongly connected component. Call this graph G_sc.
This function should return a networkx MultiDiGraph named G_sc.
def answer_six():
    # YOUR CODE HERE
    G = answer_one()
    nodes_sets = nx.strongly_connected_components(G)
    # Keep the node set of the largest component seen so far
    num_max_nodes = 0
    for nodes_set in nodes_sets:
        num_nodes = len(nodes_set)
        if num_nodes > num_max_nodes:
            num_max_nodes = num_nodes
            nodes = nodes_set
    G_sc = nx.subgraph(G, nodes)
    return G_sc
    # raise NotImplementedError()
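The loop above can also be written with max(..., key=len). The sketch below does this on an invented four-node multigraph to show what the resulting subgraph contains.

```python
import networkx as nx

D = nx.MultiDiGraph()
# Hypothetical data: a directed 3-cycle plus node '4', which only receives.
D.add_edges_from([('1', '2'), ('2', '3'), ('3', '1'), ('3', '4')])

# Each component is a set of nodes; pick the one with the most members.
largest = max(nx.strongly_connected_components(D), key=len)

# subgraph returns a view that keeps the graph class; .copy() makes it editable.
D_sc = D.subgraph(largest)
```

Node '4' is left out because it cannot send anything back into the cycle.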
Question 7
What is the average distance between nodes in G_sc?
This function should return a float.
def answer_seven():
    # YOUR CODE HERE
    G_sc = answer_six()
    return nx.average_shortest_path_length(G_sc)
    # raise NotImplementedError()
Question 8
What is the largest possible distance between two employees in G_sc?
This function should return an int.
def answer_eight():
    # YOUR CODE HERE
    G_sc = answer_six()
    return nx.diameter(G_sc)
    # raise NotImplementedError()
Question 9
What is the set of nodes in G_sc with eccentricity equal to the diameter?
This function should return a set of the node(s).
def answer_nine():
    # YOUR CODE HERE
    G_sc = answer_six()
    return set(nx.periphery(G_sc))
    # raise NotImplementedError()
Question 10
What is the set of node(s) in G_sc with eccentricity equal to the radius?
This function should return a set of the node(s).
def answer_ten():
    # YOUR CODE HERE
    G_sc = answer_six()
    return set(nx.center(G_sc))
    # raise NotImplementedError()
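Questions 8 to 10 all derive from node eccentricity. As a rough illustration on a five-node path graph (chosen only because the values are easy to check by hand), the diameter, radius, periphery and center can be rebuilt from nx.eccentricity:

```python
import networkx as nx

P5 = nx.path_graph(5)        # 0 - 1 - 2 - 3 - 4

# Eccentricity of a node: distance to the node farthest from it.
ecc = nx.eccentricity(P5)

diameter = max(ecc.values())   # largest eccentricity
radius = min(ecc.values())     # smallest eccentricity
periphery = {n for n, e in ecc.items() if e == diameter}
center = {n for n, e in ecc.items() if e == radius}
```

The two end nodes form the periphery and the middle node is the center, which agrees with nx.periphery and nx.center.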
Question 11
Which node in G_sc has the most shortest paths to other nodes whose length equals the diameter of G_sc?
For the node with the most such shortest paths, how many of these paths are there?
This function should return a tuple (name of node, number of paths).
def answer_eleven():
    # YOUR CODE HERE
    G_sc = answer_six()
    diameter = answer_eight()
    # Only periphery nodes have a shortest path as long as the diameter
    nodes = answer_nine()
    max_shortest_paths = 0
    node_with_max_shortest_paths = None
    for node in nodes:
        # Count the targets that lie exactly `diameter` steps away from this node
        shortest_paths = 0
        for target_node in G_sc.nodes():
            if node != target_node:
                shortest_path_length = nx.shortest_path_length(G_sc, node, target_node)
                if shortest_path_length == diameter:
                    shortest_paths += 1
        if shortest_paths > max_shortest_paths:
            max_shortest_paths = shortest_paths
            node_with_max_shortest_paths = node
    return node_with_max_shortest_paths, max_shortest_paths
    # raise NotImplementedError()
Question 12
Suppose you want to prevent communication flow from the node that you found in question 11 to node 10. What is the smallest number of nodes you would need to remove from the graph (you're not allowed to remove the node from the previous question or 10)?
This function should return an integer.
def answer_twelve():
    # YOUR CODE HERE
    G_sc = answer_six()
    node = answer_eleven()[0]
    return len(nx.minimum_node_cut(G_sc, node, '10'))
    # raise NotImplementedError()
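Question 12 is a minimum node cut between a source and a target. On a toy directed graph with two independent routes from 's' to 't' (names invented for the example), the cut has to contain one node from each route:

```python
import networkx as nx

# Two disjoint routes: s -> a -> t and s -> b -> t.
D = nx.DiGraph([('s', 'a'), ('a', 't'), ('s', 'b'), ('b', 't')])

# Smallest set of intermediate nodes whose removal disconnects t from s.
cut = nx.minimum_node_cut(D, 's', 't')

# Local node connectivity gives the size of that cut directly.
k = nx.node_connectivity(D, 's', 't')
```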
Question 13
Convert the graph G_sc into an undirected graph by removing the direction of the edges of G_sc. Call the new graph G_un.
This function should return a networkx Graph.
def answer_thirteen():
    # YOUR CODE HERE
    # Wrapping in nx.Graph collapses parallel edges left over from the multigraph
    G_un = nx.Graph(answer_six().to_undirected())
    return G_un
    # raise NotImplementedError()
Question 14
What is the transitivity and average clustering coefficient of graph G_un?
This function should return a tuple (transitivity, avg clustering).
Note: DO NOT round up your answer.
def answer_fourteen():
    # YOUR CODE HERE
    G_un = answer_thirteen()
    transitivity = nx.transitivity(G_un)
    avg_clustering = nx.average_clustering(G_un)
    return transitivity, avg_clustering
    # raise NotImplementedError()
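Transitivity and average clustering both measure triangle density but weight nodes differently. A small hand-checkable example, a triangle with one pendant node, shows the two values diverging:

```python
import networkx as nx

# A triangle (0, 1, 2) with a pendant node 3 attached to node 2.
T = nx.Graph([(0, 1), (0, 2), (1, 2), (2, 3)])

local = nx.clustering(T)                    # per-node clustering coefficients
avg_clustering = nx.average_clustering(T)   # plain mean of the local values
transitivity = nx.transitivity(T)           # 3 * triangles / connected triples
```

The local values are 1, 1, 1/3 and 0, so the average is 7/12, while the graph has one triangle and five connected triples, giving a transitivity of 3/5.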
Module 3: Influence Measures and Network Centralization
Assignment 3
In this assignment you will explore measures of centrality on two networks, a friendship network in Part 1, and a blog network in Part 2.
Part 1
Answer questions 1-4 using the network G1, a network of friendships at a university department. Each node corresponds to a person, and an edge indicates friendship.
The network has been loaded as networkx graph object G1.
import networkx as nx
G1 = nx.read_gml('assets/friendships.gml')
Question 1
Find the degree centrality, closeness centrality, and betweenness centrality of node 100.
This function should return a tuple of floats (degree_centrality, closeness_centrality, betweenness_centrality).
def answer_one():
    # YOUR CODE HERE
    degree_centrality = nx.degree_centrality(G1)[100]
    closeness_centrality = nx.closeness_centrality(G1)[100]
    betweenness_centrality = nx.betweenness_centrality(G1)[100]
    return degree_centrality, closeness_centrality, betweenness_centrality
    # raise NotImplementedError()
Use centrality measures to answer questions 2-4
Question 2
Suppose you are employed by an online shopping website and are tasked with selecting one user in network G1 to send an online shopping voucher to. We expect that the user who receives the voucher will send it to their friends in the network. You want the voucher to reach as many nodes as possible. The voucher can be forwarded to multiple users at the same time, but the travel distance of the voucher is limited to one step, which means if the voucher travels more than one step in this network, it is no longer valid. Apply your knowledge in network centrality to select the best candidate for the voucher.
This function should return an integer, the chosen node.
def answer_two():
    # YOUR CODE HERE
    # One-step reach depends only on the number of direct friends
    degree_centrality = nx.degree_centrality(G1)
    return sorted(degree_centrality, key=lambda x: degree_centrality[x], reverse=True)[0]
    # raise NotImplementedError()
Question 3
Now the limit of the voucher’s travel distance has been removed. Because the network is connected, regardless of who you pick, every node in the network will eventually receive the voucher. However, we now want to ensure that the voucher reaches nodes as quickly as possible (i.e. in the fewest number of hops). How will you change your selection strategy? Write a function to tell us who is the best candidate in the network under this condition.
This function should return an integer, the chosen node.
def answer_three():
    # YOUR CODE HERE
    # Fewest hops on average corresponds to the highest closeness centrality
    closeness_centrality = nx.closeness_centrality(G1)
    return sorted(closeness_centrality, key=lambda x: closeness_centrality[x], reverse=True)[0]
    # raise NotImplementedError()
Question 4
Assume the restriction on the voucher’s travel distance is still removed, but now a competitor has developed a strategy to remove a person from the network in order to disrupt the distribution of your company’s voucher. Your competitor plans to remove people who act as bridges in the network. Identify the best possible person to be removed by your competitor.
This function should return an integer, the chosen node.
def answer_four():
    # YOUR CODE HERE
    # Bridge-like nodes sit on many shortest paths, i.e. have high betweenness
    betweenness_centrality = nx.betweenness_centrality(G1)
    return sorted(betweenness_centrality, key=lambda x: betweenness_centrality[x], reverse=True)[0]
    # raise NotImplementedError()
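To see why questions 2 to 4 call for different measures, consider a toy graph (not G1) made of two triangles joined through a single bridge node. The best-connected nodes and the bridge are not the same:

```python
import networkx as nx

# Two triangles, {0, 1, 2} and {4, 5, 6}, joined through bridge node 3.
toy = nx.Graph([(0, 1), (0, 2), (1, 2),
                (2, 3), (3, 4),
                (4, 5), (4, 6), (5, 6)])

deg = nx.degree_centrality(toy)
clo = nx.closeness_centrality(toy)
btw = nx.betweenness_centrality(toy)

# Degree can tie, so collect every node that reaches the maximum.
top_degree = {n for n, c in deg.items() if c == max(deg.values())}
best_closeness = max(clo, key=clo.get)
best_betweenness = max(btw, key=btw.get)
```

Nodes 2 and 4 have the most direct neighbours, but node 3 is closest to everyone on average and lies on every shortest path between the two triangles.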
Part 2
G2 is a directed network of political blogs, where nodes correspond to a blog and edges correspond to links between blogs. Use your knowledge of PageRank and HITS to answer Questions 5-9.
G2 = nx.read_gml('assets/blogs.gml')
Question 5
Apply the Scaled Page Rank Algorithm to this network. Find the Page Rank of node 'realclearpolitics.com' with damping value 0.85.
This function should return a float.
def answer_five():
    # YOUR CODE HERE
    return nx.pagerank(G2, alpha=.85)['realclearpolitics.com']
    # raise NotImplementedError()
Question 6
Apply the Scaled Page Rank Algorithm to this network with damping value 0.85. Find the 5 nodes with highest Page Rank.
This function should return a list of the top 5 blogs in descending order of Page Rank.
def answer_six():
    # YOUR CODE HERE
    pagerank = nx.pagerank(G2, alpha=.85)
    return sorted(pagerank, key=lambda x: pagerank[x], reverse=True)[:5]
    # raise NotImplementedError()
Question 7
Apply the HITS Algorithm to the network to find the hub and authority scores of node 'realclearpolitics.com'.
Your result should return a tuple of floats (hub_score, authority_score).
def answer_seven():
    # YOUR CODE HERE
    hub_scores, authority_scores = nx.hits(G2, normalized=True)
    return hub_scores['realclearpolitics.com'], authority_scores['realclearpolitics.com']
    # raise NotImplementedError()
Question 8
Apply the HITS Algorithm to this network to find the 5 nodes with highest hub scores.
This function should return a list of the top 5 blogs in descending order of hub scores.
def answer_eight():
    # YOUR CODE HERE
    hub_scores, authority_scores = nx.hits(G2, normalized=True)
    return sorted(hub_scores, key=lambda x: hub_scores[x], reverse=True)[:5]
    # raise NotImplementedError()
Question 9
Apply the HITS Algorithm to this network to find the 5 nodes with highest authority scores.
This function should return a list of the top 5 blogs in descending order of authority scores.
def answer_nine():
    # YOUR CODE HERE
    hub_scores, authority_scores = nx.hits(G2, normalized=True)
    return sorted(authority_scores, key=lambda x: authority_scores[x], reverse=True)[:5]
    # raise NotImplementedError()
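For intuition about the damping value used in questions 5 and 6, here is a minimal pure-Python sketch of scaled PageRank by power iteration. The function name scaled_pagerank and the three-page link structure are invented for the example, and the sketch assumes every node has at least one out-link. It illustrates the update rule and is not a replacement for nx.pagerank.

```python
def scaled_pagerank(out_links, alpha=0.85, iterations=100):
    """Power iteration; assumes every node has at least one out-link."""
    nodes = list(out_links)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        # Every node first receives the (1 - alpha) / n "random jump" share ...
        new_rank = {v: (1.0 - alpha) / n for v in nodes}
        # ... plus a damped, equal split of the rank of each node linking to it.
        for u, targets in out_links.items():
            share = alpha * rank[u] / len(targets)
            for v in targets:
                new_rank[v] += share
        rank = new_rank
    return rank

# Hypothetical link structure: A -> B, A -> C, B -> C, C -> A.
toy_links = {'A': ['B', 'C'], 'B': ['C'], 'C': ['A']}
toy_rank = scaled_pagerank(toy_links)
```

Page C collects links from both A and B and ends up with the highest score, A receives all of C's rank and comes second, and B trails because it only gets half of A's.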
Module 4: Applications
Assignment 4
import networkx as nx
import pandas as pd
import numpy as np
import pickle
Part 1 - Random Graph Identification
For the first part of this assignment you will analyze randomly generated graphs and determine which algorithm created them.
G1 = nx.read_gpickle("assets/A4_P1_G1")
G2 = nx.read_gpickle("assets/A4_P1_G2")
G3 = nx.read_gpickle("assets/A4_P1_G3")
G4 = nx.read_gpickle("assets/A4_P1_G4")
G5 = nx.read_gpickle("assets/A4_P1_G5")
P1_Graphs = [G1, G2, G3, G4, G5]
P1_Graphs is a list containing 5 networkx graphs. Each of these graphs was generated by one of three possible algorithms:
- Preferential Attachment ('PA')
- Small World with low probability of rewiring ('SW_L')
- Small World with high probability of rewiring ('SW_H')
Analyze each of the 5 graphs using any methodology and determine which of the three algorithms generated each graph.
The graph_identification function should return a list of length 5 where each element in the list is either 'PA', 'SW_L', or 'SW_H'.
def graph_identification():
    # YOUR CODE HERE
    '''import matplotlib.pyplot as plt
    for graph in P1_Graphs:
        degrees = dict(graph.degree())
        degree_values = sorted(set(degrees.values()))
        histogram = [list(degrees.values()).count(i)/float(nx.number_of_nodes(graph)) for i in degree_values]
        plt.bar(degree_values, histogram)
        plt.xlabel('Degree')
        plt.ylabel('Fraction of Nodes')
        plt.show()'''
    algorithms = []
    for graph in P1_Graphs:
        if nx.average_clustering(graph) <= .2:
            algorithms.append('PA')
        else:
            avg_path = nx.average_shortest_path_length(graph)
            if avg_path >= 30:
                algorithms.append('SW_L')
            else:
                algorithms.append('SW_H')
    return algorithms
    # raise NotImplementedError()
graph_identification()
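Note: this question can also be answered by plotting degree histograms, as in the commented-out code, and judging for example whether the degree distribution shows a power-law trend or looks roughly normal. The thresholds used here (0.2 and 30) only fit this particular set of graphs and are not a general solution.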
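To get a feel for how the three generators differ, one can build one graph of each kind and compare a few summary statistics. The sizes, parameters and seeds below are arbitrary choices for this sketch, not the settings behind the assignment's graphs, so the absolute numbers will differ from those of P1_Graphs.

```python
import networkx as nx

n = 500  # size chosen for the sketch, not the assignment's value
graphs = {
    'PA':   nx.barabasi_albert_graph(n, 2, seed=0),
    'SW_L': nx.connected_watts_strogatz_graph(n, 4, 0.01, seed=0),
    'SW_H': nx.connected_watts_strogatz_graph(n, 4, 0.5, seed=0),
}

summary = {}
for name, g in graphs.items():
    summary[name] = {
        # Preferential attachment tends to produce a few very high-degree hubs.
        'max_degree': max(d for _, d in g.degree()),
        # Little rewiring keeps most of the ring lattice's triangles.
        'clustering': nx.average_clustering(g),
        # More rewiring adds shortcuts and shortens typical paths.
        'avg_path': nx.average_shortest_path_length(g),
    }
```

In this setup the preferential attachment graph stands out by its maximum degree, while the two small-world graphs are separated by clustering and average path length.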
Part 2 - Company Emails
For the second part of this assignment you will be working with a company's email network where each node corresponds to a person at the company, and each edge indicates that at least one email has been sent between two people.
The network also contains the node attributes Department and ManagementSalary.
Department indicates the department in the company which the person belongs to, and ManagementSalary indicates whether that person is receiving a management position salary.
G = pickle.load(open('assets/email_prediction_NEW.txt', 'rb'))
print(f"Graph with {len(nx.nodes(G))} nodes and {len(nx.edges(G))} edges")
Part 2A - Salary Prediction
Using network G, identify the people in the network with missing values for the node attribute ManagementSalary and predict whether or not these individuals are receiving a management position salary.
To accomplish this, you will need to create a matrix of node features of your choice using networkx, train a sklearn classifier on nodes that have ManagementSalary data, and predict a probability of the node receiving a management salary for nodes where ManagementSalary is missing.
Your predictions will need to be given as the probability that the corresponding employee is receiving a management position salary.
The evaluation metric for this assignment is the Area Under the ROC Curve (AUC).
Your grade will be based on the AUC score computed for your classifier. A model with an AUC of 0.75 or higher will receive full points.
Using your trained classifier, return a Pandas series of length 252 with the data being the probability of receiving management salary, and the index being the node id.
Example:
1 1.0
2 0.0
5 0.8
8 1.0
...
996 0.7
1000 0.5
1001 0.0
Length: 252, dtype: float64
list(G.nodes(data=True))[:5] # print the first 5 nodes
def salary_predictions():
    from sklearn.ensemble import RandomForestClassifier
    # YOUR CODE HERE
    df = pd.DataFrame(index=G.nodes())
    df['ManagementSalary'] = pd.Series(nx.get_node_attributes(G, 'ManagementSalary'))
    df['Department'] = pd.Series(nx.get_node_attributes(G, 'Department'))
    df['Clustering'] = pd.Series(nx.clustering(G))
    df['DegreeCentrality'] = pd.Series(nx.degree_centrality(G))
    df['BetweennessCentrality'] = pd.Series(nx.betweenness_centrality(G))
    df['ClosenessCentrality'] = pd.Series(nx.closeness_centrality(G))
    df['PageRank'] = pd.Series(nx.pagerank(G))
    # nx.hits returns a (hubs, authorities) pair of dicts keyed by node
    hubs, authorities = nx.hits(G)
    df['Hub'] = pd.Series(hubs)
    df['Authority'] = pd.Series(authorities)
    # The first column is the label; the remaining columns are the features
    X, y = df.iloc[:, 1:], df.iloc[:, 0]
    # Train on nodes with a known label, predict for nodes where it is missing
    X_train, y_train = X.loc[~np.isnan(y), :], y[~np.isnan(y)]
    X_test = X.loc[np.isnan(y), :]
    rf = RandomForestClassifier()
    rf.fit(X_train, y_train)
    predict_proba = pd.Series(rf.predict_proba(X_test)[:, 1], index=X_test.index)
    return predict_proba
    # raise NotImplementedError()
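The part of this solution that is easiest to get wrong is the split between labelled and unlabelled nodes. The sketch below reproduces only that step on NetworkX's built-in karate club graph. The label (whether a member sided with the 'Officer' club) and the choice to blank out every fifth node are stand-ins invented for the example, and the classifier step is left out.

```python
import networkx as nx
import numpy as np
import pandas as pd

K = nx.karate_club_graph()

# Stand-in label: 1.0 if the member sided with the 'Officer' club, else 0.0.
label = pd.Series({n: float(K.nodes[n]['club'] == 'Officer') for n in K})
# Blank out every fifth node to imitate the missing values.
label.iloc[::5] = np.nan

# One row per node, one column per network measure.
features = pd.DataFrame({
    'degree': pd.Series(nx.degree_centrality(K)),
    'closeness': pd.Series(nx.closeness_centrality(K)),
    'clustering': pd.Series(nx.clustering(K)),
})

known = label.notna()
X_train, y_train = features[known], label[known]
X_test = features[~known]   # rows a classifier would score with predict_proba
```

Because the features and the label share the node ids as their index, the boolean mask lines up row by row, and the predictions can later be returned with X_test.index as the node ids.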
Part 2B - New Connections Prediction
For the last part of this assignment, you will predict future connections between employees of the network. The future connections information has been loaded into the variable future_connections. The index is a tuple indicating a pair of nodes that currently do not have a connection, and the Future Connection column indicates if an edge between those two nodes will exist in the future, where a value of 1.0 indicates a future connection.
future_connections = pd.read_csv('assets/Future_Connections.csv', index_col=0, converters={0: eval})
future_connections.head(10)
Using network G and future_connections, identify the edges in future_connections with missing values and predict whether or not these edges will have a future connection.
To accomplish this, you will need to:
- Create a matrix of features of your choice for the edges found in future_connections using Networkx
- Train a sklearn classifier on those edges in future_connections that have Future Connection data
- Predict a probability of the edge being a future connection for those edges in future_connections where Future Connection is missing.
Your predictions will need to be given as the probability of the corresponding edge being a future connection.
The evaluation metric for this assignment is the Area Under the ROC Curve (AUC).
Your grade will be based on the AUC score computed for your classifier. A model with an AUC of 0.75 or higher will receive full points.
Using your trained classifier, return a series of length 122112 with the data being the probability of the edge being a future connection, and the index being the edge as represented by a tuple of nodes.
Example:
(107, 348) 0.35
(542, 751) 0.40
(20, 426) 0.55
(50, 989) 0.35
...
(939, 940) 0.15
(555, 905) 0.35
(75, 101) 0.65
Length: 122112, dtype: float64
def new_connections_predictions():
    from sklearn.ensemble import RandomForestClassifier
    # YOUR CODE HERE
    df = future_connections.copy()
    H = G.copy()
    # The community-based measures read each node's 'community' attribute
    for node in H.nodes():
        H.nodes[node]['community'] = H.nodes[node]['Department']
    # Turn (u, v, score) triples into a Series indexed by the (u, v) pair
    index_to_Series = lambda data: pd.Series({(x, y): z for x, y, z in data})
    df['Common Neighbors'] = index_to_Series([(e[0], e[1], len(list(nx.common_neighbors(H, e[0], e[1])))) for e in nx.non_edges(H)])
    df['Jaccard Coefficient'] = index_to_Series(nx.jaccard_coefficient(H))
    df['Resource Allocation'] = index_to_Series(nx.resource_allocation_index(H))
    df['Preferential Attachment'] = index_to_Series(nx.preferential_attachment(H))
    df['Community Common Neighbors'] = index_to_Series(nx.cn_soundarajan_hopcroft(H))
    df['Community Resource Allocation'] = index_to_Series(nx.ra_index_soundarajan_hopcroft(H))
    # The first column is the label; the remaining columns are the features
    X, y = df.iloc[:, 1:], df.iloc[:, 0]
    X_train, y_train = X.loc[~np.isnan(y), :], y[~np.isnan(y)]
    X_test = X.loc[np.isnan(y), :]
    # Tree ensembles do not depend on feature scale, so the features are used unscaled
    rf = RandomForestClassifier()
    rf.fit(X_train, y_train)
    predict_proba = pd.Series(rf.predict_proba(X_test)[:, 1], index=X_test.index)
    return predict_proba
    # raise NotImplementedError()
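The link prediction functions used above all yield (u, v, score) triples for the non-edges of a graph. A small hand-checkable sketch on an invented five-node graph shows what two of the scores mean. Keying the results by frozenset is a choice made for this example so that lookups do not depend on whether a pair comes out as (u, v) or (v, u).

```python
import networkx as nx

T = nx.Graph([(0, 1), (0, 2), (1, 2), (1, 3), (3, 4)])

# With no pair list given, each function scores every non-edge of the graph.
jaccard = {frozenset((u, v)): s for u, v, s in nx.jaccard_coefficient(T)}
pref_attach = {frozenset((u, v)): s for u, v, s in nx.preferential_attachment(T)}

# Number of common neighbours for the same pairs.
common = {pair: len(list(nx.common_neighbors(T, *pair))) for pair in jaccard}
```

For the pair (0, 3) the only common neighbour is node 1 out of three distinct neighbours overall, so the Jaccard coefficient is 1/3. For (1, 4) preferential attachment is the product of the degrees, 3 * 1 = 3.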