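Big Data Analysis Lab 3
A very capable classmate wrote a version based on the newer library releases: https://github.com/WuZhenqing/HIT-BigDataAnalysisLab3
What follows is my version, based on the older library releases.
Author: lzq
Contents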
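- Big Data Analysis Lab 3
- Notes
- Environment
- Basic neo4j configuration
- File descriptions (the name after the equals sign is the file's name after renaming)
- Importing node data (including the circle property)
- Importing edge data
- Importing the center node
- Complete data import code
- Queries
- 1. Retrieve all Persons whose gender attribute is 77 and whose education;degree;id is 20;
- 2. Retrieve all Persons whose gender attribute is 78 and whose education;degree;id is 20 or 22;
- 3. Add an age attribute to Person; the values are up to you and may be randomized, but must lie between 18 and 30 and be distributed as evenly as possible;
- 4. Retrieve the number of friends of each Person;
- 5. Retrieve the set of Persons whose friends' average age is below 25;
- 6. Retrieve the 10 oldest Persons;
- 7. Delete all Persons aged 18 or 19;
- 8. Retrieve all friends of a given Person and all friends of those friends;
- 9. Retrieve the set of all friends of a given Person and the set of all Persons in the circles that Person belongs to;
- 10. Pick any three pairs of Persons and find the shortest relationship chain between each pair (that is, the shortest path in the graph model);
- 11. For circles with fewer than two members, delete the attribute that represents circle information from the Persons in those circles;
- 12. Sort all Persons by age in ascending order, then by the string value of the hometown;id attribute in descending order, and return the Persons ranked 5th through 10th; some nodes may have no hometown;id (the attribute is missing), and nodes with a null value must be excluded from the sorted list;
- 13. Retrieve the sets of second-level and third-level friends of a given Person (A's direct friends, i.e. those connected by an edge, are first-level friends; friends of A's N-level friends are (N+1)-level friends; the levels are distinguished mainly by path length, i.e. among all paths between A and an N-level friend, one has length N);
- 14. Get the list of education;school;id attribute values of all friends of a given Person;
- 15. Pick any three pairs of Persons, find the relationship paths between each pair whose length is less than 10, and retrieve the set of Persons older than 22 on those paths; because of the data volume and the choice of Persons this query may be too expensive to finish, so the bound of 10 may be lowered to something computable (your choice, but keep it >= 2), or different Person pairs may be chosen;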
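Notes
Some query conditions in this lab may need to change, because the ages are assigned at random and the set of nodes deleted later therefore differs from run to run. Adjust the conditions as needed.
Environment
java: jdk11
neo4j: neo4j-community-4.4.20-windows
Python dependency: py2neo==4.3.0
The dependency can be installed with:
pip install py2neo==4.3.0 -i https://pypi.douban.com/simple
Note: the versions in this environment matter a great deal; be sure to pick the right ones.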
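Basic neo4j configuration
Installing neo4j: installation is simple. Unpack the neo4j archive and add its bin directory to the PATH environment variable.
Starting neo4j:
neo4j.bat console
On first start the neo4j user name is neo4j and the password is also neo4j; you are then asked to set a password of your own. Be sure to remember it, otherwise you will have to reinstall (not that this is difficult).
Querying all nodes: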
MATCH (n) RETURN (n)
Deleting all nodes:
match (n) detach delete (n)
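Making all relationships usable in both directions (neo4j always stores a relationship with a direction, and a pattern written without an arrow, such as (a)-[r]-(b), already matches either direction, so the reverse copies created below are optional):
MATCH (a)-[r]->(b)
CREATE (b)-[:UNDIRECTED {type: type(r)}]->(a)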
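File descriptions (the name after the equals sign is the file's name after renaming)
0.featnames=data1_featnames.txt
Attribute names and attribute information; mainly used to map the columns of the one-hot encoding back to features.
0.feat=data1_feat.txt
One-hot encoding of each node's features; converted into node properties during import.
0.edges=data1_edges.txt
Edge information for the friend relationships, which are undirected. Note that a neo4j id is an immutable auto-incrementing integer, so the ids in the edge and circle files are 1 greater than the actual id.
0.circles=data1_circles.txt
Circle relationships. Originally this called for a ring of relationships linking each circle's members head to tail (I later changed the circle into a property value, which makes querying easier).
0.egofeat=data1_egofeat.txt
Attribute information of the center node (the person this node represents knows everyone).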
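To make the featnames format described above concrete, here is a minimal sketch that splits one line into its column index, property name, and anonymized value. The two sample lines and the function name parse_featname_line are assumptions made for illustration; they are not copied from the dataset files.

```python
def parse_featname_line(line):
    """Split a line such as '1 education;classes;id;anonymized feature 1'
    into (column index, property name, anonymized value)."""
    index_text, rest = line.strip().split(" ", 1)
    *name_parts, value_part = rest.split(";")
    name = "_".join(name_parts)                # e.g. 'education_classes_id'
    value = int(value_part.split(" ")[-1])     # trailing number of 'anonymized feature N'
    return int(index_text), name, value


# Hypothetical sample lines in the assumed format
print(parse_featname_line("0 birthday;anonymized feature 0"))
print(parse_featname_line("1 education;classes;id;anonymized feature 1"))
```

Joining the name parts with an underscore mirrors the property names used by the import code (for example education_degree_id).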
Importing node data (including the circle property)
The code is as follows:
import pandas as pd
from py2neo import Node, Relationship, Graph

def main():
    # Separator used to join the parts of a feature name
    join_str = '_'
    count1 = len(open("data1_featnames.txt", 'r').readlines())
    featnames = open("data1_featnames.txt", "r")
    # DataFrame.append (used below) was removed in pandas 2.0, so this needs pandas < 2.0
    feature_data = pd.DataFrame(columns=['num', 'fea_name', 'fea_data'])
    fea_count = []
    # Walk through all node attributes
    for i in range(count1):
        contents = featnames.readline()
        contents_list = contents.split(';')
        if len(contents_list) == 2:
            first = contents_list[0].split(' ')
            second = contents_list[1].split(' ')
            feature_data = feature_data.append({'num': int(first[0]), 'fea_name': first[1], 'fea_data': int(second[2])},
                                               ignore_index=True)
            if first[1] not in fea_count:
                fea_count.append(first[1])
        elif len(contents_list) == 3:
            first = contents_list[0].split(' ')
            third = contents_list[2].split(' ')
            feature_data = feature_data.append({'num': int(first[0]), 'fea_name': first[1] + join_str + contents_list[1],
                                                'fea_data': int(third[2])}, ignore_index=True)
            if (first[1] + join_str + contents_list[1]) not in fea_count:
                fea_count.append(first[1] + join_str + contents_list[1])
        elif len(contents_list) == 4:
            first = contents_list[0].split(' ')
            fourth = contents_list[3].split(' ')
            feature_data = feature_data.append({'num': int(first[0]), 'fea_name': first[1] + join_str + contents_list[1] + join_str + contents_list[2],
                                                'fea_data': int(fourth[2])}, ignore_index=True)
            if (first[1] + join_str + contents_list[1] + join_str + contents_list[2]) not in fea_count:
                fea_count.append(first[1] + join_str + contents_list[1] + join_str + contents_list[2])
    count2 = len(open("data1_feat.txt", 'r').readlines())
    feaboolean = open("data1_feat.txt", "r")
    graph = Graph('http://localhost:7474/', username='neo4j', password='your_password')
    # Record the circle membership of each node
    node_circle_list = []
    for i in range(count2):
        node_circle = []
        node_circle_list.append(node_circle)
    circles = open("data1_circles.txt", 'r')
    count4 = len(open("data1_circles.txt", 'r').readlines())
    for i in range(count4):
        contents = circles.readline()
        contents_list = contents.split('\t')
        content_len = len(contents_list)
        for j in range(1, content_len):
            # ids in the file are 1 greater than the list index
            node_circle_list[int(contents_list[j])-1].append(contents_list[0])
    # List of nodes, used later when creating edges
    node_list = []
    # Map each row to properties and create the node; properties whose value is None are dropped automatically
    for i in range(count2):
        fea_dict = {key: None for key in fea_count}
        contents = feaboolean.readline()
        contents_list = contents.split(' ')
        for j in range(1, len(contents_list)):
            # contents_list[0] is the node id, so column j holds the flag for feature j-1
            if contents_list[j].strip() == '1':
                fea_dict[feature_data.loc[j-1, "fea_name"]] = feature_data.loc[j-1, "fea_data"]
        myNode = Node("Person", birthday=fea_dict['birthday'], education_classes_id=fea_dict['education_classes_id'],
                      education_concentration_id=fea_dict['education_concentration_id'],
                      education_degree_id=fea_dict['education_degree_id'], education_school_id=fea_dict['education_school_id'],
                      education_type=fea_dict['education_type'],
                      education_with_id=fea_dict['education_with_id'], education_year_id=fea_dict['education_year_id'],
                      first_name=fea_dict['first_name'], gender=fea_dict['gender'],
                      hometown_id=fea_dict['hometown_id'], languages_id=fea_dict['languages_id'],
                      last_name=fea_dict['last_name'], locale=fea_dict['locale'], location_id=fea_dict['location_id'],
                      work_employer_id=fea_dict['work_employer_id'], work_end_date=fea_dict['work_end_date'],
                      work_location_id=fea_dict['work_location_id'], work_position_id=fea_dict['work_position_id'],
                      work_start_date=fea_dict['work_start_date'], work_with_id=fea_dict['work_with_id'],
                      circle=node_circle_list[i])
        node_list.append(myNode)
        graph.create(myNode)
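This code, including the graph.create calls, was written against the older py2neo version listed above, so pay attention to the version. In the code we first build a DataFrame that serves as a lookup index from one-hot column to feature name and value. We then record circle membership: every node gets its own list of circle names, filled from the circle file. Next, the one-hot encoding of each node selects the values to use, and the node is created and added to the database. Every node carries the label Person, and its property values come from the dataset (see the lab guide for the exact requirements). Note the node list we keep; it is needed later when creating relationships.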
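The one-hot decoding step described above can be sketched without a database. This is a minimal illustration, assuming that the first field of each feat row is the node id and the remaining fields are 0/1 flags; the function name decode_feat_row and the sample feature table are hypothetical.

```python
def decode_feat_row(line, features):
    """features[k] is the (name, value) pair for one-hot column k.
    The first field of the row is taken to be the node id."""
    fields = line.split()
    node_id, flags = int(fields[0]), fields[1:]
    props = {}
    for k, flag in enumerate(flags):
        if flag == "1":
            name, value = features[k]
            props[name] = value        # a later column with the same name overwrites
    return node_id, props


# Hypothetical lookup table: column index -> (property name, anonymized value)
features = [("birthday", 0), ("gender", 77), ("gender", 78), ("locale", 126)]
print(decode_feat_row("5 0 0 1 1\n", features))
```

Properties that never get a flag are simply absent from the result, which corresponds to the None-valued properties being dropped when the node is created.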
Importing edge data
With node import in place, importing the edges is straightforward. The code is as follows:
    # Edge import (still inside main())
    edges = open("data1_edges.txt", "r")
    count3 = len(open("data1_edges.txt", 'r').readlines())
    for i in range(count3):
        contents = edges.readline()
        contents_list = contents.split(' ')
        first = int(contents_list[0])
        second = int(contents_list[1])
        # ids in the file are 1 greater than the index in node_list
        relationship = Relationship(node_list[first-1], "Be_Friend_With", node_list[second-1])
        relationship["undirected"] = True
        graph.create(relationship)
We read each edge, create the relationship, flag it as undirected through the undirected property, and then add it to the database.
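If the edge file happens to list a friendship in both directions (an assumption about the data, not something stated above), the import loop would create two relationships per friendship. The sketch below shows one way to normalise the pairs before creating relationships; the function name load_undirected_edges is hypothetical.

```python
def load_undirected_edges(lines):
    """Return a sorted list of unique (i, j) index pairs with i < j."""
    pairs = set()
    for line in lines:
        fields = line.split()
        if len(fields) != 2:
            continue                                   # skip blank or malformed lines
        a, b = int(fields[0]) - 1, int(fields[1]) - 1  # file id -> list index
        if a != b:
            pairs.add((min(a, b), max(a, b)))
    return sorted(pairs)


print(load_undirected_edges(["1 2\n", "2 1\n", "3 1\n", "\n"]))
```

Each resulting pair could then be passed to Relationship(node_list[i], "Be_Friend_With", node_list[j]) exactly as in the import code.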
Importing the center node
    # Import the center node (still inside main())
    center_data = open("data1_egofeat.txt", 'r').read()
    count5 = len(open("data1_feat.txt", 'r').readlines())
    feaboolean = open("data1_feat.txt", "r")
    # split() without arguments also drops the trailing newline
    center_node = center_data.split()
    center_id = 0
    # Check whether the center node is already among the imported nodes
    for i in range(count5):
        contents = feaboolean.readline()
        contents_list = contents.split(' ')
        flag = True
        for j in range(1, len(contents_list)):
            if int(contents_list[j]) != int(center_node[j-1]):
                flag = False
                break
        if flag:
            center_id = i
            print("The center_id is ", i)
            break
    fea_dict = {key: None for key in fea_count}
    for j in range(len(center_node)):
        if center_node[j] == '1':
            fea_dict[feature_data.loc[j, "fea_name"]] = feature_data.loc[j, "fea_data"]
    CenterNode = Node("Person", birthday=fea_dict['birthday'], education_classes_id=fea_dict['education_classes_id'],
                      education_concentration_id=fea_dict['education_concentration_id'],
                      education_degree_id=fea_dict['education_degree_id'],
                      education_school_id=fea_dict['education_school_id'],
                      education_type=fea_dict['education_type'],
                      education_with_id=fea_dict['education_with_id'], education_year_id=fea_dict['education_year_id'],
                      first_name=fea_dict['first_name'], gender=fea_dict['gender'],
                      hometown_id=fea_dict['hometown_id'], languages_id=fea_dict['languages_id'],
                      last_name=fea_dict['last_name'], locale=fea_dict['locale'], location_id=fea_dict['location_id'],
                      work_employer_id=fea_dict['work_employer_id'], work_end_date=fea_dict['work_end_date'],
                      work_location_id=fea_dict['work_location_id'], work_position_id=fea_dict['work_position_id'],
                      work_start_date=fea_dict['work_start_date'], work_with_id=fea_dict['work_with_id'])
    graph.create(CenterNode)
    # The center node is a friend of every other node
    for i in range(count2):
        relationship = Relationship(node_list[i], "Be_Friend_With", CenterNode)
        relationship["undirected"] = True
        graph.create(relationship)

main()
Once the center node's properties are imported, all that remains is to create a friend relationship between it and everyone else.
Complete data import code
import pandas as pd
import os
from py2neo import Node, Relationship, Graph, NodeMatcher, RelationshipMatcher

def main():
    # Separator used to join the parts of a feature name
    join_str = '_'
    count1 = len(open("data1_featnames.txt", 'r').readlines())
    featnames = open("data1_featnames.txt", "r")
    # DataFrame.append (used below) was removed in pandas 2.0, so this needs pandas < 2.0
    feature_data = pd.DataFrame(columns=['num', 'fea_name', 'fea_data'])
    fea_count = []
    # Walk through all node attributes
    for i in range(count1):
        contents = featnames.readline()
        contents_list = contents.split(';')
        if len(contents_list) == 2:
            first = contents_list[0].split(' ')
            second = contents_list[1].split(' ')
            feature_data = feature_data.append({'num': int(first[0]), 'fea_name': first[1], 'fea_data': int(second[2])},
                                               ignore_index=True)
            if first[1] not in fea_count:
                fea_count.append(first[1])
        elif len(contents_list) == 3:
            first = contents_list[0].split(' ')
            third = contents_list[2].split(' ')
            feature_data = feature_data.append({'num': int(first[0]), 'fea_name': first[1] + join_str + contents_list[1],
                                                'fea_data': int(third[2])}, ignore_index=True)
            if (first[1] + join_str + contents_list[1]) not in fea_count:
                fea_count.append(first[1] + join_str + contents_list[1])
        elif len(contents_list) == 4:
            first = contents_list[0].split(' ')
            fourth = contents_list[3].split(' ')
            feature_data = feature_data.append({'num': int(first[0]), 'fea_name': first[1] + join_str + contents_list[1] + join_str + contents_list[2],
                                                'fea_data': int(fourth[2])}, ignore_index=True)
            if (first[1] + join_str + contents_list[1] + join_str + contents_list[2]) not in fea_count:
                fea_count.append(first[1] + join_str + contents_list[1] + join_str + contents_list[2])
    count2 = len(open("data1_feat.txt", 'r').readlines())
    feaboolean = open("data1_feat.txt", "r")
    graph = Graph('http://localhost:7474/', username='neo4j', password='your_password')
    # Record the circle membership of each node
    node_circle_list = []
    for i in range(count2):
        node_circle = []
        node_circle_list.append(node_circle)
    circles = open("data1_circles.txt", 'r')
    count4 = len(open("data1_circles.txt", 'r').readlines())
    for i in range(count4):
        contents = circles.readline()
        contents_list = contents.split('\t')
        content_len = len(contents_list)
        for j in range(1, content_len):
            # ids in the file are 1 greater than the list index
            node_circle_list[int(contents_list[j])-1].append(contents_list[0])
    # List of nodes, used later when creating edges
    node_list = []
    # Map each row to properties and create the node; properties whose value is None are dropped automatically
    for i in range(count2):
        fea_dict = {key: None for key in fea_count}
        contents = feaboolean.readline()
        contents_list = contents.split(' ')
        for j in range(1, len(contents_list)):
            # contents_list[0] is the node id, so column j holds the flag for feature j-1
            if contents_list[j].strip() == '1':
                fea_dict[feature_data.loc[j-1, "fea_name"]] = feature_data.loc[j-1, "fea_data"]
        myNode = Node("Person", birthday=fea_dict['birthday'], education_classes_id=fea_dict['education_classes_id'],
                      education_concentration_id=fea_dict['education_concentration_id'],
                      education_degree_id=fea_dict['education_degree_id'], education_school_id=fea_dict['education_school_id'],
                      education_type=fea_dict['education_type'],
                      education_with_id=fea_dict['education_with_id'], education_year_id=fea_dict['education_year_id'],
                      first_name=fea_dict['first_name'], gender=fea_dict['gender'],
                      hometown_id=fea_dict['hometown_id'], languages_id=fea_dict['languages_id'],
                      last_name=fea_dict['last_name'], locale=fea_dict['locale'], location_id=fea_dict['location_id'],
                      work_employer_id=fea_dict['work_employer_id'], work_end_date=fea_dict['work_end_date'],
                      work_location_id=fea_dict['work_location_id'], work_position_id=fea_dict['work_position_id'],
                      work_start_date=fea_dict['work_start_date'], work_with_id=fea_dict['work_with_id'],
                      circle=node_circle_list[i])
        node_list.append(myNode)
        graph.create(myNode)
    # Edge import
    edges = open("data1_edges.txt", "r")
    count3 = len(open("data1_edges.txt", 'r').readlines())
    for i in range(count3):
        contents = edges.readline()
        contents_list = contents.split(' ')
        first = int(contents_list[0])
        second = int(contents_list[1])
        # ids in the file are 1 greater than the index in node_list
        relationship = Relationship(node_list[first-1], "Be_Friend_With", node_list[second-1])
        relationship["undirected"] = True
        graph.create(relationship)
    # Import the center node
    center_data = open("data1_egofeat.txt", 'r').read()
    count5 = len(open("data1_feat.txt", 'r').readlines())
    feaboolean = open("data1_feat.txt", "r")
    # split() without arguments also drops the trailing newline
    center_node = center_data.split()
    center_id = 0
    # Check whether the center node is already among the imported nodes
    for i in range(count5):
        contents = feaboolean.readline()
        contents_list = contents.split(' ')
        flag = True
        for j in range(1, len(contents_list)):
            if int(contents_list[j]) != int(center_node[j-1]):
                flag = False
                break
        if flag:
            center_id = i
            print("The center_id is ", i)
            break
    fea_dict = {key: None for key in fea_count}
    for j in range(len(center_node)):
        if center_node[j] == '1':
            fea_dict[feature_data.loc[j, "fea_name"]] = feature_data.loc[j, "fea_data"]
    CenterNode = Node("Person", birthday=fea_dict['birthday'], education_classes_id=fea_dict['education_classes_id'],
                      education_concentration_id=fea_dict['education_concentration_id'],
                      education_degree_id=fea_dict['education_degree_id'],
                      education_school_id=fea_dict['education_school_id'],
                      education_type=fea_dict['education_type'],
                      education_with_id=fea_dict['education_with_id'], education_year_id=fea_dict['education_year_id'],
                      first_name=fea_dict['first_name'], gender=fea_dict['gender'],
                      hometown_id=fea_dict['hometown_id'], languages_id=fea_dict['languages_id'],
                      last_name=fea_dict['last_name'], locale=fea_dict['locale'], location_id=fea_dict['location_id'],
                      work_employer_id=fea_dict['work_employer_id'], work_end_date=fea_dict['work_end_date'],
                      work_location_id=fea_dict['work_location_id'], work_position_id=fea_dict['work_position_id'],
                      work_start_date=fea_dict['work_start_date'], work_with_id=fea_dict['work_with_id'])
    graph.create(CenterNode)
    # The center node is a friend of every other node
    for i in range(count2):
        relationship = Relationship(node_list[i], "Be_Friend_With", CenterNode)
        relationship["undirected"] = True
        graph.create(relationship)

main()
Queries
1. Retrieve all Persons whose gender attribute is 77 and whose education;degree;id is 20;
MATCH (p:Person)
WHERE p.gender = 77 AND p.education_degree_id = 20
RETURN p
2. Retrieve all Persons whose gender attribute is 78 and whose education;degree;id is 20 or 22;
MATCH (p:Person)
WHERE p.gender = 78 AND p.education_degree_id IN [20, 22]
RETURN p
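3. Add an age attribute to Person; the values are up to you and may be randomized, but must lie between 18 and 30 and be distributed as evenly as possible;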
MATCH (p:Person)
SET p.age = toInteger(rand() * 13) + 18
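To see why the expression above covers 18 to 30 inclusive, the same arithmetic can be simulated in Python. This is only an illustration with Python's own random generator standing in for Cypher's rand(); the function name random_age and the sample size are arbitrary choices.

```python
import random
from collections import Counter


def random_age(rng):
    # Same arithmetic as the Cypher expression toInteger(rand() * 13) + 18:
    # rand() lies in [0, 1), so the truncated product is an integer in 0..12.
    return int(rng.random() * 13) + 18


rng = random.Random(0)
counts = Counter(random_age(rng) for _ in range(13000))
for age in sorted(counts):
    print(age, counts[age])
```

Each of the 13 ages should appear roughly 1000 times, which is what "distributed as evenly as possible" asks for.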
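4. Retrieve the number of friends of each Person;
MATCH (p:Person)
OPTIONAL MATCH (p)-[:Be_Friend_With]-(friend)
// DISTINCT keeps a friend from being counted twice if two relationships connect the same pair
RETURN p, COUNT(DISTINCT friend) AS friendCount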
5. Retrieve the set of Persons whose friends' average age is below 25;
MATCH (p:Person)-[:Be_Friend_With]-(friend)
WITH p, avg(friend.age) AS avgFriendAge
WHERE avgFriendAge < 25
RETURN p, avgFriendAge
6. Retrieve the 10 oldest Persons;
MATCH (p:Person)
RETURN p
ORDER BY p.age DESC
LIMIT 10
7. Delete all Persons aged 18 or 19;
MATCH (p:Person)
WHERE p.age = 18 OR p.age = 19
DETACH DELETE p
8. Retrieve all friends of a given Person and all friends of those friends;
MATCH (p:Person {education_school_id:51,work_employer_id:148})-[:Be_Friend_With]-(friend)
OPTIONAL MATCH (friend)-[:Be_Friend_With]-(fof)
RETURN friend, COLLECT(fof) AS friendsOfFriends
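9. Retrieve the set of all friends of a given Person and the set of all Persons in the circles that Person belongs to;
MATCH (p:Person {education_school_id: 51, work_employer_id: 148})
OPTIONAL MATCH (p)-[:Be_Friend_With]-(friend:Person)
WITH p, COLLECT(DISTINCT friend) AS friends
// circle is a list of circle names, so look for members that share at least one name with p
OPTIONAL MATCH (member:Person)
WHERE any(c IN p.circle WHERE c IN member.circle)
RETURN friends, COLLECT(DISTINCT member) AS circleMembers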
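10. Pick any three pairs of Persons and find the shortest relationship chain between each pair (that is, the shortest path in the graph model);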
MATCH path = shortestPath((p1:Person)-[:Be_Friend_With*]-(p2:Person))
WHERE p1.age = 25 AND p2.age = 21
RETURN p1, p2, length(path) AS shortestPathLength
LIMIT 1
UNION
MATCH path = shortestPath((p1:Person)-[:Be_Friend_With*]-(p2:Person))
WHERE p1.age = 25 AND p2.age = 24
RETURN p1, p2, length(path) AS shortestPathLength
LIMIT 1
UNION
MATCH path = shortestPath((p1:Person)-[:Be_Friend_With*]-(p2:Person))
WHERE p1.age = 21 AND p2.age = 24
RETURN p1, p2, length(path) AS shortestPathLength
LIMIT 1
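For readers who want to see what a shortest-path search does on an unweighted friendship graph, here is a minimal breadth-first search sketch in plain Python. It illustrates the general technique only and is not how neo4j implements shortestPath; the adjacency dictionary friends is made-up sample data.

```python
from collections import deque


def shortest_path(adjacency, start, goal):
    """Return one shortest path from start to goal as a list of nodes, or None."""
    previous = {start: None}          # also serves as the visited set
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            path = []
            while node is not None:   # walk back through the predecessors
                path.append(node)
                node = previous[node]
            return path[::-1]
        for neighbour in adjacency.get(node, ()):
            if neighbour not in previous:
                previous[neighbour] = node
                queue.append(neighbour)
    return None


# Made-up undirected friendship graph
friends = {"a": ["b", "c"], "b": ["a", "d"], "c": ["a"], "d": ["b", "e"], "e": ["d"]}
print(shortest_path(friends, "a", "e"))   # ['a', 'b', 'd', 'e']
```

The number of edges on the returned path (its length minus one) corresponds to length(path) in the Cypher query.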
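11. For circles with fewer than two members, delete the attribute that represents circle information from the Persons in those circles;
MATCH (p:Person)
// circle is a list of circle names, so expand it to count the members of each circle
UNWIND p.circle AS circleName
WITH circleName, COLLECT(p) AS members
WHERE size(members) < 2
UNWIND members AS member
REMOVE member.circle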
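Because the circle property holds a list of circle names per Person, counting the members of each circle means counting names across all lists. The sketch below shows that counting step on made-up data, assuming the rule is "a Person in any circle with fewer than two members loses the whole property"; the function name and sample mapping are hypothetical.

```python
from collections import Counter


def remove_small_circle_property(person_circles, minimum=2):
    """person_circles maps a person id to its list of circle names.
    Returns a new mapping in which people belonging to a circle with fewer
    than `minimum` members have their circle value replaced by None."""
    sizes = Counter(name for circles in person_circles.values() for name in circles)
    small = {name for name, size in sizes.items() if size < minimum}
    return {
        person: (None if any(name in small for name in circles) else circles)
        for person, circles in person_circles.items()
    }


sample = {1: ["circle0"], 2: ["circle0", "circle1"], 3: []}
print(remove_small_circle_property(sample))
```

Here circle1 has a single member (person 2), so only person 2 loses the property; a person with an empty list is left untouched.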
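12. Sort all Persons by age in ascending order, then by the string value of the hometown;id attribute in descending order, and return the Persons ranked 5th through 10th; some nodes may have no hometown;id (the attribute is missing), and nodes with a null value must be excluded from the sorted list;
MATCH (p:Person)
WHERE p.hometown_id IS NOT NULL
RETURN p
ORDER BY p.age ASC, toString(p.hometown_id) DESC
SKIP 4 LIMIT 6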
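The ordering asked for in this task (age ascending, then the string value of hometown_id descending, nulls dropped, ranks 5 to 10) can be mirrored in plain Python to check one's understanding. The function name rank_people and the sample records are made up for this sketch.

```python
def rank_people(people):
    """people is a list of dicts with 'age' and an optional 'hometown_id'.
    Returns those with a hometown_id, ordered by age ascending and then by
    the string form of hometown_id descending."""
    ranked = [p for p in people if p.get("hometown_id") is not None]
    # list.sort is stable: apply the secondary key first, then the primary key
    ranked.sort(key=lambda p: str(p["hometown_id"]), reverse=True)
    ranked.sort(key=lambda p: p["age"])
    return ranked


people = [{"age": 20 + i // 2, "hometown_id": i} for i in range(12)] + [{"age": 18}]
top = rank_people(people)[4:10]          # ranks 5..10, like SKIP 4 LIMIT 6
print([p["hometown_id"] for p in top])
```

Comparing string values matters: as strings "9" sorts after "10", whereas as numbers 10 is larger than 9.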
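13. Retrieve the sets of second-level and third-level friends of a given Person (A's direct friends, i.e. those connected by an edge, are first-level friends; friends of A's N-level friends are (N+1)-level friends; the levels are distinguished mainly by path length, i.e. among all paths between A and an N-level friend, one has length N);
MATCH (p:Person {age:25,work_employer_id:148})-[:Be_Friend_With]-(f1:Person) // first-level friends
WITH DISTINCT f1, p
MATCH (f1)-[:Be_Friend_With]-(f2:Person) // second-level friends
WHERE f2 <> f1 AND f2 <> p
WITH DISTINCT f2, f1, p
MATCH (f2)-[:Be_Friend_With]-(f3:Person) // third-level friends
WHERE f3 <> f2 AND f3 <> f1 AND f3 <> p
RETURN DISTINCT f2,f3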
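The level definition in this task ("some path between A and the friend has length N") can be made concrete with a small depth-limited search. The sketch below assumes that a path may not visit the same Person twice, which is one reasonable reading of the definition; the function name friends_at_level and the sample graph are hypothetical.

```python
def friends_at_level(adjacency, start, level):
    """Nodes reachable from start by a path of exactly `level` edges
    that never visits the same node twice."""
    found = set()

    def walk(node, visited, depth):
        if depth == level:
            found.add(node)
            return
        for neighbour in adjacency.get(node, ()):
            if neighbour not in visited:
                walk(neighbour, visited | {neighbour}, depth + 1)

    walk(start, {start}, 0)
    return found


# Made-up undirected friendship graph
friends = {"a": ["b", "c"], "b": ["a", "c", "d"], "c": ["a", "b"], "d": ["b", "e"], "e": ["d"]}
print(sorted(friends_at_level(friends, "a", 2)))
print(sorted(friends_at_level(friends, "a", 3)))
```

Note that under this definition one Person can sit at several levels at once: in the sample graph b is a first-level friend of a and also a second-level friend through c.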
14. Get the list of education;school;id attribute values of all friends of a given Person;
MATCH (p:Person {education_school_id:51 ,work_employer_id:148})-[:Be_Friend_With]-(friend:Person)
RETURN COLLECT(friend.education_school_id) AS educationSchoolIds
15. Pick any three pairs of Persons, find the relationship paths between each pair whose length is less than 10, and retrieve the set of Persons older than 22 on those paths; because of the data volume and the choice of Persons this query may be too expensive to finish, so the bound of 10 may be lowered to something computable (your choice, but keep it >= 2), or different Person pairs may be chosen;
// Pick three arbitrary pairs of distinct Persons
MATCH (p1:Person), (p2:Person)
WHERE id(p1) < id(p2)
WITH p1, p2 LIMIT 3
// The length bound is lowered to 3 here so that the query stays computable
MATCH path = (p1)-[:Be_Friend_With*..3]-(p2)
UNWIND nodes(path) AS person
WITH p1, p2, person
WHERE person.age > 22
RETURN p1, p2, COLLECT(DISTINCT person) AS persons
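The idea behind this task (enumerate the bounded-length paths between two Persons, then keep the people on them who are older than 22) can be sketched in plain Python on a toy graph. The sketch assumes that paths do not revisit a Person and that the two endpoints count as being on the path; the function name, the graph, and the ages are made up.

```python
def people_on_short_paths(adjacency, ages, start, goal, max_length, min_age=22):
    """Collect people older than min_age on paths from start to goal that
    use at most max_length edges and never visit a node twice."""
    collected = set()

    def extend(path):
        node = path[-1]
        if node == goal:
            collected.update(p for p in path if ages[p] > min_age)
            return
        if len(path) - 1 == max_length:   # edge budget used up
            return
        for neighbour in adjacency.get(node, ()):
            if neighbour not in path:
                extend(path + [neighbour])

    extend([start])
    return collected


# Made-up undirected friendship graph and ages
friends = {"a": ["b", "c"], "b": ["a", "c", "d"], "c": ["a", "b"], "d": ["b", "e"], "e": ["d"]}
ages = {"a": 25, "b": 21, "c": 23, "d": 30, "e": 19}
print(sorted(people_on_short_paths(friends, ages, "a", "d", max_length=3)))
```

The number of paths grows quickly with the length bound, which is why the task allows the bound to be lowered.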