图数据库hugegraph如何快速导入10亿+数据

最新推荐文章于 2024-09-20 05:00:00 发布

enjoy编程

最新推荐文章于 2024-09-20 05:00:00 发布

阅读量2.7k

点赞数 2

分类专栏：知识图谱文章标签：知识图谱图数据库

本文链接：https://blog.csdn.net/penriver/article/details/115124350

版权

知识图谱专栏收录该内容

22 篇文章 75 订阅 ¥19.90 ¥99.00

订阅专栏

超级会员免费看

本文详细介绍了如何使用Hugegraph快速导入10亿+的关系数据，涉及数据预处理、图模型构建、数据源映射及执行导入的过程。在一台配备2.40GHz CPU和256GB内存的服务器上，经过处理和导入，最终成功构建了知识图谱，数据占用约80GB磁盘空间。

摘要由CSDN通过智能技术生成

随着社交、电商、金融、零售、物联网等行业的快速发展，现实社会织起了了一张庞大而复杂的关系网，亟需一种支持海量复杂数据关系运算的数据库即图数据库。本系列文章是学习知识图谱以及图数据库相关的知识梳理与总结

本文会包含如下内容：

如何快速导入10亿+数据到图数据库

本篇文章适合人群：架构师、技术专家、对知识图谱与图数据库感兴趣的高级工程师

1. hugegraph环境

hugegraph server版本0.11.2，后端存储使用的是RocksDB，单机版本。服务器版本是：CPU: 2 * Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz，内存：256GB，硬盘：SAS盘

loader版本0.11.1

测试时loader在Server所在的服务器上执行。如果下载、安装部署Server,loader请参考：https://hugegraph.github.io/hugegraph-doc/quickstart/hugegraph-server.html

2. 数据准备

2.1 原始数据说明

测试导入的数据是Stanford的公开数据集Friendster，该数据约31G左右。

Friendster是一个在线游戏网络。在作为游戏网站重新推出之前，Friendster是一个用户可以建立友谊的社交网站，该数据就是用户之间的关系数据。该数据由Web Archive项目提供，其中有完整的图表。详见：http://snap.stanford.edu/data/com-Friendster.html。

此数据集的统计信息如下：共有65608366个顶点，1806067135条边，约18亿+的数据量。

请自行下载com-friendster.ungraph.txt.gz 压缩包，大小约8.7GB，经过gunzip解压后，大小约31GB

解压后读取前10行数据，如下图，文件结构很简单，每一行代表一条边，其包含两列，第一列是源顶点Id，第二列是目标顶点Id，两列之间以\t分隔。另外，文件最上面几行是一些概要信息，它说明了文件共有65608366个顶点，1806067135条边（行）.

因为是记录用户之间的关系数据，所以文件中的顶点数据有很多是重复的。为了能正确的导入顶点数据，需要针对顶点数据进行去重处理。

2.2 数据处理

因为顶点数据量很大，需要采用在处理海量数据领域颇为常见的处理手段：分而治之，具体步骤如下：

将原始文件中的全部顶点Id利用hash算法，分成较均匀的若干份，保证在每份之间没有重复的，在每份内部允许有重复的；
对每一份文件，定义一个内存的set容器用于判断某个Id是否存在，针对处理的每个顶点Id，如果不包含，则加入到容器，并写入到新文件中，否则跳过不处理
脚本运行命令如下： python friendster-process.py com-friendster.ungraph.txt com-friendster.vertex.txt，注意：在原始文件的所在的目录执行，第一个参数是原始文件名称，第二个参数是处理后的文件名，默认使用取余算法，分成10份，并将结果写到当前路径。

脚本运行完成后，可以看到merge file com-friendster.vertex.txt contain 65608366 lines，与文件描述相符的。

具体源代码如下：

#!/usr/bin/python
# coding=utf-8
import os
import sys

class BigFileProcess:
    '''
    大文件处理类，处理策略是：将大文件按hash策略拆分为小文件，然后再将小文件合并
    '''
    shard_file_dict = {}
    shard_prefix = 'tmp_shard_'

    def __init__(self, raw_file_path, out_file_name, out_dir='./', shard_num=10):
        self.raw_file_path = raw_file_path
        self.out_dir = out_dir
        self.shard_num = shard_num
        self.out_file_name = out_file_name

    def create_shard_files(self):
        for i in range(self.shard_num):
            shard_file = open(os.path.join(self.out_dir, self.shard_prefix + str(i)), "w")
            self.shard_file_dict[i] = shard_file

    def close_shard_files(self):
        for file in self.shard_file_dict.values():
            file.close()

    def del_shard_files(self):
        for key in self.shard_file_dict.keys():
            shard_file = os.path.join(self.out_dir, self.shard_prefix + str(key))
            if os.path.exists(shard_file):
                os.remove(shard_file)
        print("删除{}个shard files完成".format(self.shard_num))

    def split_shard_file(self):
        print("创建并打开{}个shard files".format(self.shard_num))
        self.create_shard_files()
        line_count = 0
        with open(self.raw_file_path, "r") as lines:
            for line in lines:
                line_count += 1
                # Skip comment line
                if line.startswith('#'):
                    continue
                # 不设置分隔符means split according to any whitespace,包括空格、换行(\n)、制表符(\t)等
                parts = line.strip().split(maxsplit=1)
                if not len(parts) == 2:
                    print("line {} format error".format(parts))

                # Python3 整型是没有限制大小的，可以当作 Long 类型使用
                # 将源顶点、目标顶点写到对应的shard file中
                for i in range(2):
                    node_id = int(parts[i])
                    index = node_id % self.shard_num
                    self.shard_file_dict[index].write(str(node_id) + '\n')

                if line_count % 1000000:
                    print("已经处理{}行".format(line_count))
        print("共处理{}行".format(line_count))
        print("Split original file:{} info {} shard files".format(self.raw_file_path, self.shard_num))
        self.close_shard_files()
        print("关闭{}个shard files完成".format(self.shard_num))

    def merge_shard_file(self):
        output_file_path = os.path.join(self.out_dir, self.out_file_name)
        print("Prepare duplicate and merge {} shard files into %s".format(self.shard_num, output_file_path))
        merge_file = open(output_file_path, "w")
        line_count = 0

        # 基于单个文件去重，并合并到单个文件中
        for key in self.shard_file_dict.keys():
            shard_file = os.path.join(self.out_dir, self.shard_prefix + str(key))
            with open(shard_file, "r+") as lines:
                elems = {}
                # 处理每一行
                for line in lines:
                    # Filter duplicate id
                    if not (line in elems):
                        elems[line] = ""
                        merge_file.write(line)
                        line_count += 1
            print("Processed shard file {} completed".format(shard_file))
        merge_file.close()
        print("all {} shard files and merge into {} completed".format(self.shard_num, output_file_path));
        print("merge file {} contain {} lines".format(output_file_path, line_count))


if __name__ == '__main__':
    #file_process = BigFileProcess("com-friendster.ungraph.txt", "com-friendster.vertex.txt")
    print('参数个数为:{}个参数'.format(len(sys.argv)))
    print('参数列表:{}'.format(str(sys.argv)))
    file_process = BigFileProcess(sys.argv[1], sys.argv[2])

    file_process.split_shard_file()
    file_process.merge_shard_file()
    file_process.del_shard_files()

3. 导入准备

在安装loader程序的example目录下创建com-friendster目录，此目录中的文件说明如下：

com-friendster.ungraph.txt 是下载的friendster数据压缩包的解压后的文件

com-friendster.vertex.txt 是处理后只包含顶点的文件

schema.groovy 是图模型描述文件

struct.json 是数据源映射文件

3.1 编写图模型

这一步是建模的过程，用户需要对自己已有的数据和想要创建的图模型有一个清晰的构想，然后编写 schema 建立图模型

针对friendster数据，顶点和边除了Id外，都没有其他的属性，所以图的schema其实很简单

schema.propertyKey("id").asInt().ifNotExist().create();
schema.vertexLabel("person").useCustomizeNumberId().properties("id").ifNotExist().create();
schema.indexLabel("personByAge").onV("person").by("id").range().ifNotExist().create();
schema.indexLabel("personById").onV("person").by("id").unique().ifNotExist().create();
schema.edgeLabel("friend").sourceLabel("person").targetLabel("person").ifNotExist().create();

3.1 编写数据源映射文件

输入源的映射文件用于描述如何将输入源数据与图的顶点类型/边类型建立映射关系，以JSON格式组织，由多个映射块组成，其中每一个映射块都负责将一个输入源映射为顶点和边。

针对friendster数据，只有一个顶点文件和边文件，且文件的分隔符都是\t，所以将input.format指定为TEXT，input.delimiter使用默认即可。

顶点有一个属性id，而顶点文件头没有指明列名，所以需要显式地指定input.header为["id"]，input.header的作用是告诉loader文件的每一列的列名是什么，但要注意：列名并不一定就是顶点或边的属性名，描述文件中有一个field_mapping域用来将列名映射为属性名。

边没有任何属性，边文件中只有源顶点和目标顶点的Id，需要先将input.header指定为["source_id", "target_id"]，这样就给两个列取了不同的名字，然后通过field_mapping 映射为源顶点和目标顶点的主键列即id列

{
  "vertices": [
    {
      "label": "person",
      "input": {
        "type": "file",
        "path": "example/com-friendster/com-friendster.vertex.txt",
        "format": "TEXT",
        "header": ["id"],
        "charset": "UTF-8",
        "skipped_line": {
          "regex": "(^#|^//).*"
        }
      },
      "null_values": ["NULL", "null", ""],
      "id": "id"
    }
  ],
  "edges": [
    {
      "label": "friend",
      "source": ["source_id"],
      "target": ["target_id"],
      "input": {
        "type": "file",
        "path": "example/com-friendster/com-friendster.ungraph.txt",
        "format": "TEXT",
        "header": ["source_id", "target_id"],
        "skipped_line": {
          "regex": "(^#|^//).*"
        }
      },
      "field_mapping": {
        "source_id": "id",
        "target_id": "id"
      }
    }
  ]
}

3.3 执行导入

执行命令后，loader就会开始导入数据，并会打印进度到控制台上，等所有顶点和边导入完成后，会看到以下统计信息。

导入数据约花费50分钟，其中顶点的导入速率是2.2万/秒，边的导入速率是：60.8万/秒，且导入的点、边数据量与friendster的描述一致

hugegraph的图数据占用磁盘空间约80GB，相较于原始数据大小，扩大了约2倍

并且hugegraph-hubble无法使用，因为图 testgraph 已占用磁盘 585,728MB 超过上限 102,400MB 【此空间比实际磁盘占用空间要大很多，不知道原因】，提示不能在hubble上面使用。

nohup bin/hugegraph-loader.sh -h 12.25.12.19 -p 8080 -g testgraph -f example/com-friendster/struct.json -s example/com-friendster/schema.groovy &

3. 简单查询

在hugegraph-studio中执行

3.1 统计点边的数量

g.V().count() 可以正确统计出数量为： 65608366

g.E().count() 可能因为边的数量太大，导致Script evaluation exceeded the configured 'scriptEvaluationTimeout' threshold of 30000 ms 执行超时而报错

3.2 查询点边数据

g.V().has("id",101) 或 g.V(101) 查询顶点ID是101的顶点

g.V(101).outE() 查询ID是101的顶点的OUT方向的所有邻接边

enjoy编程

关注

2
点赞
踩
13

收藏

觉得还不错? 一键收藏
打赏
7
评论
复制链接

分享到 QQ

分享到新浪微博

扫一扫

专栏目录