HIT Big Data Analysis Lab 3: Graph Data Analysis

北境旅客

Last modified: 2023-06-06 08:57:23

Tags: data analysis, neo4j, data mining, linux, big data

First published: 2023-05-29 20:21:03

Article link: https://blog.csdn.net/m0_52311811/article/details/130935743

Big Data Analysis Lab 3

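A very capable classmate wrote a version based on the newer release of the library: https://github.com/WuZhenqing/HIT-BigDataAnalysisLab3
What follows is my own version, based on the older release of the library.
Author: lzq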

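Notes

Because the age values are assigned randomly, and the deletions in this lab therefore remove a random set of nodes, some of the query conditions below may stop matching your data. Adjust them as needed.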

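Environment

Java: JDK 11

neo4j: neo4j-community-4.4.20-windows

Python dependency: py2neo==4.3.0

The dependency can be installed with:

pip install py2neo==4.3.0 -i https://pypi.douban.com/simple

Note: the versions matter a great deal here, so make sure you install exactly these.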

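Basic neo4j setup

Installing neo4j: installation is simple. Unpack the neo4j archive and add its bin directory to the PATH environment variable.

Starting neo4j:

neo4j.bat console

On first start the username is neo4j and the password is also neo4j. You are then asked to set a new password. Be sure to remember it, otherwise you will have to reinstall (not that this is hard).

Query all nodes: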

MATCH (n) RETURN (n)

Delete all nodes:

match (n) detach delete (n)

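Make every relationship traversable in both directions by adding a reverse copy:

// neo4j always stores a relationship with a direction, so a reverse UNDIRECTED copy is created for each one
MATCH ()-[r]->()
WITH startNode(r) AS src, endNode(r) AS dst, type(r) AS type
CREATE (dst)-[:UNDIRECTED {type: type}]->(src)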

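File descriptions (the name after the equals sign is the file name after renaming)

0.featnames=data1_featnames.txt

Feature names and feature information. Its main purpose is to tell us which feature each position of the one-hot encoding refers to.

0.feat=data1_feat.txt

The one-hot encoded features of each node. During import these are converted into node properties.

0.edges=data1_edges.txt

The edges of the friend relationship, which is undirected. Note that a neo4j id is an immutable auto-incrementing integer, so the ids in the edge and circle files are 1 greater than the actual ids.

0.circles=data1_circles.txt

Circle membership. Originally we had to loop over each circle and create a chain of circle relationships linked head to tail (I later changed circle membership into a property value, which makes querying easier).

0.egofeat=data1_egofeat.txt

The features of the center node (the person this node represents knows everyone else).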
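
To make the relationship between the feature-name file and the one-hot rows concrete, here is a minimal sketch. The line format is inferred from the import code in the next section, the sample lines are made up for illustration, and the helper names parse_featname_line and decode_one_hot are hypothetical rather than part of the lab code.

```python
def parse_featname_line(line, join_str="_"):
    """Split '1 education;classes;id;anonymized feature 12' into (1, 'education_classes_id', 12)."""
    index_text, rest = line.strip().split(" ", 1)
    parts = rest.split(";")
    # The last ';'-separated part is assumed to look like 'anonymized feature 12'.
    value = int(parts[-1].split()[-1])
    name = join_str.join(parts[:-1])
    return int(index_text), name, value


def decode_one_hot(flags, features):
    """Map a row of '0'/'1' flags to {property_name: value}, keeping only the set flags."""
    props = {}
    for flag, (_, name, value) in zip(flags, features):
        if flag == "1":
            props[name] = value  # a later flag for the same name overwrites an earlier one
    return props


# Illustrative lines only, not copied from the real data file.
lines = [
    "0 birthday;anonymized feature 0",
    "1 education;classes;id;anonymized feature 12",
    "2 gender;anonymized feature 77",
]
features = [parse_featname_line(line) for line in lines]
print(features[1])
print(decode_one_hot(["0", "1", "1"], features))
```

Properties whose flag is 0 never enter the dictionary, which matches the idea that missing values should not become node properties.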

Importing node data (including the circle property)

The code is as follows:

import pandas as pd
from py2neo import Node, Relationship, Graph

def main():

    join_str = '_'
    # Separator used to join the parts of a feature name
    count1 = len(open("data1_featnames.txt", 'r').readlines())
    featnames = open("data1_featnames.txt", "r")
    feature_data = pd.DataFrame(columns=['num', 'fea_name', 'fea_data'])
    fea_count = []

    # Iterate over every feature definition
    for i in range(count1):
        contents = featnames.readline()
        contents_list = contents.split(';')
        if len(contents_list) == 2:
            first = contents_list[0].split(' ')
            second = contents_list[1].split(' ')
            feature_data = feature_data.append({'num': int(first[0]), 'fea_name': first[1], 'fea_data': int(second[2])},
                                               ignore_index=True)

            if first[1] not in fea_count:
                fea_count.append(first[1])

        elif len(contents_list) == 3:
            first = contents_list[0].split(' ')
            third = contents_list[2].split(' ')
            feature_data = feature_data.append({'num': int(first[0]), 'fea_name': first[1] + join_str + contents_list[1],
                                                'fea_data': int(third[2])}, ignore_index=True)

            if (first[1] + join_str + contents_list[1]) not in fea_count:
                fea_count.append(first[1] + join_str + contents_list[1])

        elif len(contents_list) == 4:
            first = contents_list[0].split(' ')
            fourth = contents_list[3].split(' ')
            feature_data = feature_data.append({'num': int(first[0]), 'fea_name': first[1] + join_str + contents_list[1] + join_str + contents_list[2],
                                                'fea_data': int(fourth[2])}, ignore_index=True)

            if (first[1] + join_str + contents_list[1] + join_str + contents_list[2]) not in fea_count:
                fea_count.append(first[1] + join_str + contents_list[1] + join_str + contents_list[2])

    count2 = len(open("data1_feat.txt", 'r').readlines())
    feaboolean = open("data1_feat.txt", "r")
    graph = Graph('http://localhost:7474/', username='neo4j', password='your_password')
    
    # Record the circles each node belongs to
    node_circle_list = []
    for i in range(count2):
        node_circle = []
        node_circle_list.append(node_circle)

    circles = open("data1_circles.txt",'r')
    count4 = len(open("data1_circles.txt",'r').readlines())

    for i in range(count4):
        contents = circles.readline()
        contents_list = contents.split('\t')
        content_len = len(contents_list)
        for j in range(1, content_len):
            node_circle_list[int(contents_list[j])-1].append(contents_list[0])

    # List of created nodes, used later when creating edges
    node_list = []

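    # Map each one-hot row to property values and create the node;
    # properties whose value is None are dropped automatically
    for i in range(count2):
        fea_dict = {key: None for key in fea_count}
        contents = feaboolean.readline()
        # split() without arguments also strips the trailing newline
        contents_list = contents.split()
        # Column 0 holds the node id; column j (j >= 1) is the one-hot flag of feature j-1
        for j in range(1, len(contents_list)):
            if contents_list[j] == '1':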
                fea_dict[feature_data.loc[j-1, "fea_name"]] = feature_data.loc[j-1, "fea_data"]
        myNode = Node("Person", birthday=fea_dict['birthday'], education_classes_id=fea_dict['education_classes_id'],
                      education_concentration_id=fea_dict['education_concentration_id'],
                      education_degree_id=fea_dict['education_degree_id'],education_school_id=fea_dict['education_school_id'],
                      education_type=fea_dict['education_type'],
                      education_with_id=fea_dict['education_with_id'],education_year_id=fea_dict['education_year_id'],
                      first_name=fea_dict['first_name'],gender=fea_dict['gender'],
                      hometown_id=fea_dict['hometown_id'],languages_id=fea_dict['languages_id'],
                      last_name=fea_dict['last_name'],locale=fea_dict['locale'],location_id=fea_dict['location_id'],
                      work_employer_id=fea_dict['work_employer_id'],work_end_date=fea_dict['work_end_date'],
                      work_location_id=fea_dict['work_location_id'],work_position_id=fea_dict['work_position_id'],
                      work_start_date=fea_dict['work_start_date'],work_with_id=fea_dict['work_with_id'],
                      circle = node_circle_list[i])

        node_list.append(myNode)
        graph.create(myNode)

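As far as I can tell, graph.create as used here is only supported by the old version of the library, so pay attention to the version. In the code we first build a DataFrame that serves as a lookup table from feature position to property name and value. We then record circle membership: every node gets its own list, and each circle the node appears in is appended to that list. Next, the one-hot encoding of each node selects the values to use, and the node is created and written to the database. Every node carries the label Person, and its property values come from the dataset (see the lab guide for the exact requirements). Note that we also keep a list of the created nodes, which is needed later when creating relationships.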
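
The circle bookkeeping described above can be sketched on its own, without a database. This is only an illustration under the same assumptions as the import code (tab-separated lines, circle name first, file ids 1 greater than the list index); the function name circle_memberships and the sample lines are hypothetical.

```python
def circle_memberships(circle_lines, node_count):
    """Return, for each node index, the list of circle names that node belongs to."""
    memberships = [[] for _ in range(node_count)]
    for line in circle_lines:
        fields = line.split()  # circle name followed by the member ids
        if not fields:
            continue  # ignore blank lines
        name, member_ids = fields[0], fields[1:]
        for member_id in member_ids:
            # File ids are 1 greater than the list index, as in the import code.
            memberships[int(member_id) - 1].append(name)
    return memberships


sample = ["circle0\t1\t3\n", "circle1\t3\n"]
print(circle_memberships(sample, 3))
```

A node that appears in several circles ends up with several names in its list, which is why the circle property on a Person is a list rather than a single value.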

Importing edge data

With node import in place, importing the edges is straightforward. The code is shown below.

    edges = open("data1_edges.txt", "r")
    count3 = len(open("data1_edges.txt", 'r').readlines())

    for i in range(count3):
        contents = edges.readline()
        contents_list = contents.split(' ')
        first = int(contents_list[0])
        second = int(contents_list[1])
        relationship = Relationship(node_list[first-1],"Be_Friend_With",node_list[second-1])
        relationship["undirected"] = True
        graph.create(relationship)

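For each line we read the two endpoint ids, create the relationship, mark it as undirected by setting its undirected property to True, and write it to the database.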
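
The import above creates one relationship per line of the edge file. If the file happens to list an edge in both directions (an assumption, the lab material does not say), the graph would contain two relationships for the same friendship. A minimal sketch of normalising the lines into unique undirected pairs before creating relationships could look like this; parse_edges is a hypothetical helper, not part of the lab code.

```python
def parse_edges(edge_lines):
    """Return sorted unique undirected edges as (index_a, index_b) pairs with index_a < index_b."""
    edges = set()
    for line in edge_lines:
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or malformed lines
        # File ids are 1 greater than the index into node_list.
        a, b = int(fields[0]) - 1, int(fields[1]) - 1
        if a != b:
            edges.add((min(a, b), max(a, b)))
    return sorted(edges)


print(parse_edges(["1 2\n", "2 1\n", "3 1\n", "\n"]))
```

Each resulting pair could then be passed to Relationship(node_list[a], "Be_Friend_With", node_list[b]) exactly as in the loop above.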

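Importing the center node

    # Import the center node's features
    center_data = open("data1_egofeat.txt",'r').read()
    count5 = len(open("data1_feat.txt", 'r').readlines())
    feaboolean = open("data1_feat.txt", "r")
    # split() without arguments also strips the trailing newline
    center_node = center_data.split()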
    center_id = 0

    # Check whether the center node already appears in the node list
    for i in range(count5):
        contents = feaboolean.readline()
        contents_list = contents.split(' ')
        flag = True
        for j in range(1, len(contents_list)):
            if int(contents_list[j]) != int(center_node[j-1]):
                flag = False
                break
        if flag == True:
            center_id = i
            print("The center_id is ",i)
            break

    fea_dict = {key: None for key in fea_count}
    for j in range(len(center_node)):
        if center_node[j] == '1':
            fea_dict[feature_data.loc[j, "fea_name"]] = feature_data.loc[j, "fea_data"]

    CenterNode = Node("Person", birthday=fea_dict['birthday'], education_classes_id=fea_dict['education_classes_id'],
                  education_concentration_id=fea_dict['education_concentration_id'],
                  education_degree_id=fea_dict['education_degree_id'],
                  education_school_id=fea_dict['education_school_id'],
                  education_type=fea_dict['education_type'],
                  education_with_id=fea_dict['education_with_id'], education_year_id=fea_dict['education_year_id'],
                  first_name=fea_dict['first_name'], gender=fea_dict['gender'],
                  hometown_id=fea_dict['hometown_id'], languages_id=fea_dict['languages_id'],
                  last_name=fea_dict['last_name'], locale=fea_dict['locale'], location_id=fea_dict['location_id'],
                  work_employer_id=fea_dict['work_employer_id'], work_end_date=fea_dict['work_end_date'],
                  work_location_id=fea_dict['work_location_id'], work_position_id=fea_dict['work_position_id'],
                  work_start_date=fea_dict['work_start_date'], work_with_id=fea_dict['work_with_id'])

    graph.create(CenterNode)

    for i in range(count2):
        relationship = Relationship(node_list[i], "Be_Friend_With", CenterNode)
        relationship["undirected"] = True
        graph.create(relationship)

main()

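Once the center node's properties have been imported, all that remains is to create a friend relationship between it and every other person.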

Full data import code

import pandas as pd
import os
from py2neo import Node, Relationship, Graph, NodeMatcher, RelationshipMatcher

def main():

    join_str = '_'
    # Separator used to join the parts of a feature name
    count1 = len(open("data1_featnames.txt", 'r').readlines())
    featnames = open("data1_featnames.txt", "r")
    feature_data = pd.DataFrame(columns=['num', 'fea_name', 'fea_data'])
    fea_count = []

    # Iterate over every feature definition
    for i in range(count1):
        contents = featnames.readline()
        contents_list = contents.split(';')
        if len(contents_list) == 2:
            first = contents_list[0].split(' ')
            second = contents_list[1].split(' ')
            feature_data = feature_data.append({'num': int(first[0]), 'fea_name': first[1], 'fea_data': int(second[2])},
                                               ignore_index=True)

            if first[1] not in fea_count:
                fea_count.append(first[1])

        elif len(contents_list) == 3:
            first = contents_list[0].split(' ')
            third = contents_list[2].split(' ')
            feature_data = feature_data.append({'num': int(first[0]), 'fea_name': first[1] + join_str + contents_list[1],
                                                'fea_data': int(third[2])}, ignore_index=True)

            if (first[1] + join_str + contents_list[1]) not in fea_count:
                fea_count.append(first[1] + join_str + contents_list[1])

        elif len(contents_list) == 4:
            first = contents_list[0].split(' ')
            fourth = contents_list[3].split(' ')
            feature_data = feature_data.append({'num': int(first[0]), 'fea_name': first[1] + join_str + contents_list[1] + join_str + contents_list[2],
                                                'fea_data': int(fourth[2])}, ignore_index=True)

            if (first[1] + join_str + contents_list[1] + join_str + contents_list[2]) not in fea_count:
                fea_count.append(first[1] + join_str + contents_list[1] + join_str + contents_list[2])

    count2 = len(open("data1_feat.txt", 'r').readlines())
    feaboolean = open("data1_feat.txt", "r")
    graph = Graph('http://localhost:7474/', username='neo4j', password='your_password')

    # Record the circles each node belongs to
    node_circle_list = []
    for i in range(count2):
        node_circle = []
        node_circle_list.append(node_circle)

    circles = open("data1_circles.txt",'r')
    count4 = len(open("data1_circles.txt",'r').readlines())

    for i in range(count4):
        contents = circles.readline()
        contents_list = contents.split('\t')
        content_len = len(contents_list)
        for j in range(1, content_len):
            node_circle_list[int(contents_list[j])-1].append(contents_list[0])

    # List of created nodes, used later when creating edges
    node_list = []

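    # Map each one-hot row to property values and create the node;
    # properties whose value is None are dropped automatically
    for i in range(count2):
        fea_dict = {key: None for key in fea_count}
        contents = feaboolean.readline()
        # split() without arguments also strips the trailing newline
        contents_list = contents.split()
        # Column 0 holds the node id; column j (j >= 1) is the one-hot flag of feature j-1
        for j in range(1, len(contents_list)):
            if contents_list[j] == '1':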
                fea_dict[feature_data.loc[j-1, "fea_name"]] = feature_data.loc[j-1, "fea_data"]
        myNode = Node("Person", birthday=fea_dict['birthday'], education_classes_id=fea_dict['education_classes_id'],
                      education_concentration_id=fea_dict['education_concentration_id'],
                      education_degree_id=fea_dict['education_degree_id'],education_school_id=fea_dict['education_school_id'],
                      education_type=fea_dict['education_type'],
                      education_with_id=fea_dict['education_with_id'],education_year_id=fea_dict['education_year_id'],
                      first_name=fea_dict['first_name'],gender=fea_dict['gender'],
                      hometown_id=fea_dict['hometown_id'],languages_id=fea_dict['languages_id'],
                      last_name=fea_dict['last_name'],locale=fea_dict['locale'],location_id=fea_dict['location_id'],
                      work_employer_id=fea_dict['work_employer_id'],work_end_date=fea_dict['work_end_date'],
                      work_location_id=fea_dict['work_location_id'],work_position_id=fea_dict['work_position_id'],
                      work_start_date=fea_dict['work_start_date'],work_with_id=fea_dict['work_with_id'],
                      circle = node_circle_list[i])

        node_list.append(myNode)
        graph.create(myNode)

    edges = open("data1_edges.txt", "r")
    count3 = len(open("data1_edges.txt", 'r').readlines())

    for i in range(count3):
        contents = edges.readline()
        contents_list = contents.split(' ')
        first = int(contents_list[0])
        second = int(contents_list[1])
        relationship = Relationship(node_list[first-1],"Be_Friend_With",node_list[second-1])
        relationship["undirected"] = True
        graph.create(relationship)

    # Import the center node's features
    center_data = open("data1_egofeat.txt",'r').read()
    count5 = len(open("data1_feat.txt", 'r').readlines())
    feaboolean = open("data1_feat.txt", "r")
    # split() without arguments also strips the trailing newline
    center_node = center_data.split()
    center_id = 0

    # Check whether the center node already appears in the node list
    for i in range(count5):
        contents = feaboolean.readline()
        contents_list = contents.split(' ')
        flag = True
        for j in range(1, len(contents_list)):
            if int(contents_list[j]) != int(center_node[j-1]):
                flag = False
                break
        if flag == True:
            center_id = i
            print("The center_id is ",i)
            break

    fea_dict = {key: None for key in fea_count}
    for j in range(len(center_node)):
        if center_node[j] == '1':
            fea_dict[feature_data.loc[j, "fea_name"]] = feature_data.loc[j, "fea_data"]

    CenterNode = Node("Person", birthday=fea_dict['birthday'], education_classes_id=fea_dict['education_classes_id'],
                  education_concentration_id=fea_dict['education_concentration_id'],
                  education_degree_id=fea_dict['education_degree_id'],
                  education_school_id=fea_dict['education_school_id'],
                  education_type=fea_dict['education_type'],
                  education_with_id=fea_dict['education_with_id'], education_year_id=fea_dict['education_year_id'],
                  first_name=fea_dict['first_name'], gender=fea_dict['gender'],
                  hometown_id=fea_dict['hometown_id'], languages_id=fea_dict['languages_id'],
                  last_name=fea_dict['last_name'], locale=fea_dict['locale'], location_id=fea_dict['location_id'],
                  work_employer_id=fea_dict['work_employer_id'], work_end_date=fea_dict['work_end_date'],
                  work_location_id=fea_dict['work_location_id'], work_position_id=fea_dict['work_position_id'],
                  work_start_date=fea_dict['work_start_date'], work_with_id=fea_dict['work_with_id'])

    graph.create(CenterNode)

    for i in range(count2):
        relationship = Relationship(node_list[i], "Be_Friend_With", CenterNode)
        relationship["undirected"] = True
        graph.create(relationship)

main()

Queries

1. Retrieve all Persons whose gender property is 77 and whose education;degree;id is 20.

MATCH (p:Person)
WHERE p.gender = 77 AND p.education_degree_id = 20
RETURN p

2. Retrieve all Persons whose gender property is 78 and whose education;degree;id is 20 or 22.

MATCH (p:Person)
WHERE p.gender = 78 AND p.education_degree_id IN [20, 22]
RETURN p

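3. Add an age property to every Person. The values are up to you and may be random, but each age must lie between 18 and 30 and the distribution should be as even as possible.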

MATCH (p:Person)
SET p.age = toInteger(rand() * 13) + 18
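
To see why the expression above covers 18 to 30 evenly, the same arithmetic can be reproduced outside the database. This is a small sketch that assumes Python's random.random() behaves like Cypher's rand(), in that both return a value in [0, 1); the function name random_age is made up for the example.

```python
import random
from collections import Counter


def random_age(rng):
    # rng.random() lies in [0, 1), so int(x * 13) is one of 0..12 and the result is one of 18..30.
    return int(rng.random() * 13) + 18


rng = random.Random(0)  # fixed seed so the run is repeatable
counts = Counter(random_age(rng) for _ in range(13000))
print(min(counts), max(counts))
print(sorted(counts.items()))
```

Each of the 13 ages corresponds to an interval of width 1/13, so with 13000 samples every count should land near 1000.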

4. Retrieve the number of friends of every Person.

MATCH (p:Person)
OPTIONAL MATCH (p)-[:Be_Friend_With]-(friend)
RETURN p, COUNT(friend) AS friendCount

5. Retrieve the set of Persons whose friends have an average age below 25.

MATCH (p:Person)-[:Be_Friend_With]-(friend)
WITH p, avg(friend.age) AS avgFriendAge
WHERE avgFriendAge < 25
RETURN p, avgFriendAge
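
The grouping done by WITH p, avg(friend.age) can be mirrored in plain Python for a quick sanity check on a toy graph. This is a sketch only: the adjacency dictionary, the ages, and the function name people_with_young_friends are invented for the illustration.

```python
def people_with_young_friends(adjacency, ages, threshold=25):
    """Return {person: average friend age} for persons whose friends average below threshold."""
    result = {}
    for person, friends in adjacency.items():
        friend_ages = [ages[f] for f in friends if f in ages]
        if not friend_ages:
            continue  # a person without friends has no average, as in the MATCH above
        average = sum(friend_ages) / len(friend_ages)
        if average < threshold:
            result[person] = average
    return result


adjacency = {"a": ["b", "c"], "b": ["a"], "c": ["a"], "d": []}
ages = {"a": 30, "b": 20, "c": 24, "d": 19}
print(people_with_young_friends(adjacency, ages))
```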

6. Retrieve the 10 oldest Persons.

MATCH (p:Person)
RETURN p
ORDER BY p.age DESC
LIMIT 10

7. Delete all Persons aged 18 or 19.

MATCH (p:Person)
WHERE p.age = 18 OR p.age = 19
DETACH DELETE p

8. Retrieve all friends of a given Person and all friends of those friends.

MATCH (p:Person {education_school_id:51,work_employer_id:148})-[:Be_Friend_With]-(friend)
OPTIONAL MATCH (friend)-[:Be_Friend_With]-(fof)
RETURN friend, COLLECT(fof) AS friendsOfFriends

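9. Retrieve the set of all friends of a given Person and the set of all Persons in the circles that Person belongs to.

MATCH (p:Person {education_school_id:51, work_employer_id:148})
OPTIONAL MATCH (p)-[:Be_Friend_With]-(friend)
WITH p, COLLECT(DISTINCT friend) AS friends
// circle is a list property, so membership is tested element by element
OPTIONAL MATCH (person:Person)
WHERE any(c IN p.circle WHERE c IN person.circle)
RETURN p, friends, COLLECT(DISTINCT person) AS circleMembers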

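10. Pick any three pairs of Persons and find the shortest relationship chain between each pair (that is, the shortest path in the graph model).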

MATCH path = shortestPath((p1:Person)-[:Be_Friend_With*]-(p2:Person))
WHERE p1.age = 25 AND p2.age = 21 
RETURN p1, p2, length(path) AS shortestPathLength
LIMIT 1

UNION

MATCH path = shortestPath((p1:Person)-[:Be_Friend_With*]-(p2:Person))
WHERE p1.age = 25 AND p2.age = 24 
RETURN p1, p2, length(path) AS shortestPathLength
LIMIT 1

UNION

MATCH path = shortestPath((p1:Person)-[:Be_Friend_With*]-(p2:Person))
WHERE p1.age = 21 AND p2.age = 24 
RETURN p1, p2, length(path) AS shortestPathLength
LIMIT 1
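
For readers who want to check a result by hand, the idea behind shortestPath on an unweighted graph is a breadth-first search. The sketch below runs on a small made-up adjacency dictionary; it illustrates the general technique and is not a description of how neo4j implements shortestPath internally.

```python
from collections import deque


def shortest_path(adjacency, start, goal):
    """Return one shortest path from start to goal as a list of nodes, or None if unreachable."""
    if start == goal:
        return [start]
    parents = {start: None}  # also serves as the visited set
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in adjacency.get(node, ()):
            if neighbour in parents:
                continue
            parents[neighbour] = node
            if neighbour == goal:
                # Walk the parent links back to the start, then reverse.
                path = [goal]
                while parents[path[-1]] is not None:
                    path.append(parents[path[-1]])
                return path[::-1]
            queue.append(neighbour)
    return None


adjacency = {
    "a": ["b", "c"],
    "b": ["a", "d"],
    "c": ["a", "d"],
    "d": ["b", "c", "e"],
    "e": ["d"],
}
path = shortest_path(adjacency, "a", "e")
print(path, len(path) - 1)
```

The number of edges, len(path) - 1, corresponds to length(path) in the Cypher queries above.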

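11. For circles with fewer than two members, remove the circle property from the Persons in those circles.

MATCH (p:Person)
// circle is a list property, so members are counted per circle name
UNWIND p.circle AS c
WITH c, COLLECT(p) AS members
WHERE size(members) < 2
UNWIND members AS m
REMOVE m.circle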

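12. Sort all Persons by age in ascending order, then by the string value of the hometown;id property in descending order, and return the Persons ranked 5th through 10th. Some nodes may have no hometown;id (the property is missing), and nodes with a null value must be dropped from the sorted list.

MATCH (p:Person)
WHERE p.hometown_id IS NOT NULL
RETURN p
ORDER BY p.age ASC, toString(p.hometown_id) DESC
SKIP 4 LIMIT 6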
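
The two-key ordering, the null filter, and the SKIP/LIMIT window can be tried out on a few in-memory records. The records and the function name rank below are invented for the illustration. Note how comparing hometown ids as strings puts "9" ahead of "80", which differs from numeric order.

```python
people = [
    {"name": "p1", "age": 20, "hometown_id": 9},
    {"name": "p2", "age": 20, "hometown_id": 80},
    {"name": "p3", "age": 21, "hometown_id": None},
    {"name": "p4", "age": 19, "hometown_id": 5},
]


def rank(records, skip, limit):
    """Drop records without hometown_id, sort by age ascending then hometown string descending."""
    kept = [p for p in records if p["hometown_id"] is not None]
    # Python's sort is stable, so sort by the secondary key first and the primary key last.
    kept.sort(key=lambda p: str(p["hometown_id"]), reverse=True)
    kept.sort(key=lambda p: p["age"])
    return kept[skip:skip + limit]


print([p["name"] for p in rank(people, 0, 3)])
print([p["name"] for p in rank(people, 1, 2)])
```

With skip=4 and limit=6 the same function would return ranks 5 through 10, matching SKIP 4 LIMIT 6.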

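13. Retrieve the sets of second-level and third-level friends of a given Person. (A's direct friends, i.e. those connected to A by an edge, are first-level friends, and a friend of an N-level friend of A is an (N+1)-level friend. Levels are distinguished by path length: among all paths between A and an N-level friend, there is one of length N.)

MATCH (p:Person {age:25,work_employer_id:148})-[:Be_Friend_With]-(f1:Person) // first-level friends
WITH DISTINCT f1, p
MATCH (f1)-[:Be_Friend_With]-(f2:Person) // second-level friends
WHERE f2 <> f1 AND f2 <> p
WITH DISTINCT f2, f1, p
MATCH (f2)-[:Be_Friend_With]-(f3:Person) // third-level friends
WHERE f3 <> f2 AND f3 <> f1 AND f3 <> p
RETURN DISTINCT f2,f3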
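
The level definition can be checked on a toy graph with a short depth-first enumeration. The sketch below collects every node that ends a path of exactly N edges from the start without revisiting a node, which is what the f2 <> ... and f3 <> ... conditions enforce in the query. The graph and the function name friends_at_level are made up for the example.

```python
def friends_at_level(adjacency, start, level):
    """Return the nodes that end a simple path (no repeated node) of exactly `level` edges from start."""
    found = set()

    def walk(node, visited, remaining):
        if remaining == 0:
            found.add(node)
            return
        for neighbour in adjacency.get(node, ()):
            if neighbour not in visited:
                walk(neighbour, visited | {neighbour}, remaining - 1)

    walk(start, {start}, level)
    return found


adjacency = {
    "a": ["b", "c"],
    "b": ["a", "c", "d"],
    "c": ["a", "b"],
    "d": ["b", "e"],
    "e": ["d"],
}
print(sorted(friends_at_level(adjacency, "a", 2)))
print(sorted(friends_at_level(adjacency, "a", 3)))
```

In this toy graph b is both a first-level and a second-level friend of a (via a-c-b), which is consistent with the definition: a node belongs to level N as soon as one path of length N exists.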

14. Get the list of education;school;id values of all friends of a given Person.

MATCH (p:Person {education_school_id:51 ,work_employer_id:148})-[:Be_Friend_With]-(friend:Person)
RETURN COLLECT(friend.education_school_id) AS educationSchoolIds

15. Pick any three pairs of Persons, find the paths between each pair whose length is less than 10, and retrieve the set of Persons older than 22 on those paths. Depending on the data volume and the Persons chosen, this query may be too expensive to finish, so the number 10 may be lowered to something computable (your choice, but keep it >= 2), or the pairs may be changed.

// Three pairs are picked by age, as in query 10; the bound 10 is lowered to 4 here (path length < 4)
UNWIND [[25, 21], [25, 24], [21, 24]] AS ages
CALL {
  WITH ages
  MATCH (p1:Person {age: ages[0]}), (p2:Person {age: ages[1]})
  RETURN p1, p2 LIMIT 1
}
MATCH path = (p1)-[:Be_Friend_With*..3]-(p2)
UNWIND nodes(path) AS person
WITH p1, p2, person
WHERE person.age > 22
RETURN p1, p2, COLLECT(DISTINCT person) AS persons