Bulk Importing into a Graph Database (HugeGraph-Loader)
Preface
HugeGraph-Loader is the data import component of HugeGraph. It converts data from a variety of sources into graph vertices and edges and bulk-loads them into the graph database.
Currently supported data sources include:
- Local disk files or directories, in TEXT, CSV, or JSON format; compressed files are supported
- HDFS files or directories; compressed files are supported
- Mainstream relational databases, such as MySQL, PostgreSQL, Oracle, and SQL Server
For local disk files and HDFS files, resumable (checkpoint) loading is supported.
I. Loader Execution Flow
The basic workflow of HugeGraph-Loader consists of the following steps:
- Write the graph schema
- Prepare the data files
- Write the input-source mapping file
- Run the import command
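The first step, writing the graph schema, is done in a schema file. Below is a minimal sketch for the "ry" vertex label used in the CSV mapping later in this article, with property names taken from that mapping (nl = age, xb = sex); the property types and the customized-string-ID strategy are assumptions, not taken from the original setup:

```groovy
// Hypothetical schema.groovy -- the property keys and the "ry" vertex label
// match the CSV mapping example below; the ID strategy is an assumption.
schema.propertyKey("nl").asText().ifNotExist().create();
schema.propertyKey("xb").asText().ifNotExist().create();
schema.vertexLabel("ry")
      .useCustomizeStringId()
      .properties("nl", "xb")
      .ifNotExist().create();
```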
II. CSV File Import
1. Data mapping file
A sample data mapping file is shown below. If the CSV file has a header row, the header field under input must be left null; if it is assigned a value, the first row will be parsed as data.
{
"version": "2.0",
"structs": [{
"id": "1",
"skip": false,
"input": {
"type": "FILE",
"path": "/mnt/parastor/aimind/kg-resources/Oakcsys1/d2r/job-63c3b6727701166100cd7426/file-mapping-7f19ceeea95a417495bc33bd54fa1bf9/人员列表1.csv",
"file_filter": {
"extensions": ["*"]
},
"format": "CSV",
"delimiter": ",",
"date_format": "yyyy-MM-dd HH:mm:ss",
"time_zone": "GMT+8",
"skipped_line": {
"regex": "(^#|^//).*|"
},
"compression": "NONE",
"batch_size": 500,
"header": null,
"charset": "GBK",
"list_format": {
"start_symbol": "",
"end_symbol": "",
"elem_delimiter": "|",
"ignored_elems": [""]
}
},
"vertices": [{
"label": "ry",
"skip": false,
"id": "姓名",
"unfold": true,
"field_mapping": {
"年龄": "nl",
"性别": "xb"
},
"value_mapping": {},
"selected": ["姓名", "年龄", "性别"],
"ignored": [],
"null_values": ["Null"],
"update_strategies": {},
"field_formats": []
}],
"edges": []
}]
}
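The skipped_line regex in the mapping above controls which input lines the loader ignores. As a quick check of what "(^#|^//).*|" actually matches (lines starting with # or //, plus empty lines), here is a small standalone demo:

```java
import java.util.regex.Pattern;

public class SkippedLineDemo {
    public static void main(String[] args) {
        // The same pattern as "skipped_line.regex" in the mapping file:
        // an alternation of comment lines (# ... or // ...) and the empty string.
        Pattern skipped = Pattern.compile("(^#|^//).*|");
        System.out.println(skipped.matcher("# a comment line").matches()); // true
        System.out.println(skipped.matcher("// also skipped").matches());  // true
        System.out.println(skipped.matcher("").matches());                 // true
        System.out.println(skipped.matcher("张三,20,男").matches());        // false
    }
}
```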
III. JSON File Import
1. Data mapping file
A sample data mapping file is shown below. A JSON file has no header row, so header must be assigned. The expected format is one JSON object per line, for example:
{"name": "marko", "sex": "男", "age": "12", "weight": "0.4"}
{"name": "josh", "sex": "女", "age": "16", "weight": "0.4"}
{
"version": "2.0",
"structs": [{
"id": "1",
"skip": false,
"input": {
"type": "FILE",
"path": "C:\\Users\\kmliu\\Desktop\\上传文件2\\t_user3.json",
"file_filter": {
"extensions": ["*"]
},
"format": "JSON",
"delimiter": ",",
"date_format": "yyyy-MM-dd HH:mm:ss",
"time_zone": "GMT+8",
"skipped_line": {
"regex": "(^#|^//).*|"
},
"compression": "NONE",
"batch_size": 500,
"header": ["sex", "name", "weight", "age"],
"charset": "UTF-8",
"list_format": {
"start_symbol": "",
"end_symbol": "",
"elem_delimiter": "|",
"ignored_elems": [""]
}
},
"vertices": [{
"label": "ry2",
"skip": false,
"id": "name",
"unfold": true,
"field_mapping": {
"sex": "sex",
"age": "age"
},
"value_mapping": {},
"selected": ["sex", "name", "age"],
"ignored": [],
"null_values": ["Null"],
"update_strategies": {},
"field_formats": []
}],
"edges": []
}]
}
IV. MySQL Data Import
1. Data mapping file
A sample data mapping file is shown below:
{
"version": "2.0",
"structs": [{
"id": "1",
"skip": false,
"input": {
"type": "JDBC",
"vendor": "MYSQL",
"header": ["id", "name", "age", "sex"],
"charset": "UTF-8",
"list_format": {
"start_symbol": "",
"end_symbol": "",
"elem_delimiter": "|",
"ignored_elems": [""]
},
"driver": "com.mysql.cj.jdbc.Driver",
"url": "jdbc:mysql://xxx.xxx.xxx.xxx:3306",
"database": "baseName",
"schema": null,
"table": "user3",
"username": "root",
"password": "root",
"batch_size": 500,
"primary_key": "name"
},
"vertices": [{
"label": "ry",
"skip": false,
"id": "name",
"unfold": true,
"field_mapping": {
"age": "nl",
"sex": "xb"
},
"value_mapping": {},
"selected": ["sex", "name", "age"],
"ignored": [],
"null_values": ["Null"],
"update_strategies": {},
"field_formats": []
}],
"edges": []
}]
}
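With a JDBC input the loader reads table rows rather than file lines: header lists the columns to pull, and primary_key together with batch_size suggests keyset-style paging through the table. The sketch below builds the kind of query those fields imply; it is an illustration of the config fields' roles, not the loader's actual SQL:

```java
public class JdbcSourceSql {
    public static void main(String[] args) {
        // Columns come from "header"; the table comes from "database"/"table".
        // Paging by "primary_key" in chunks of "batch_size" is an assumption
        // about how the table is walked, shown only to explain the fields.
        String[] header = {"id", "name", "age", "sex"};
        String sql = "SELECT " + String.join(", ", header)
                   + " FROM baseName.user3"
                   + " WHERE name > ? ORDER BY name LIMIT 500";
        System.out.println(sql);
        // prints: SELECT id, name, age, sex FROM baseName.user3 WHERE name > ? ORDER BY name LIMIT 500
    }
}
```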
V. Calling HugeGraph
1. Entry method
In the code below, the argument Oakcsys1 is the graph (collection) name in the graph database; json1.json is the data mapping file, whose generation rules were described above; "xxx.xx.xx.xx" and "-p", "18081" are the host address and port of the graph database.
// Requires hugegraph-loader on the classpath; HugeGraphLoader and Printer
// come from the loader's packages (the exact package prefix depends on the
// loader version, so the imports are omitted here).
public static void main(String[] args) {
// -g {GRAPH_NAME} -f ${INPUT_DESC_FILE} -s ${SCHEMA_FILE} -h {HOST} -p {PORT}
if (args.length == 0) {
args = new String[]{"-g", "Oakcsys1",
"-f", "C:\\Users\\kmliu\\Desktop\\上传文件2\\json1.json",
"-h", "xxx.xx.xx.xx", "-p", "18081"
};
}
HugeGraphLoader loader;
try {
loader = new HugeGraphLoader(args);
} catch (Throwable e) {
Printer.printError("Failed to start loading", e);
return;
}
loader.load();
}
VI. Import Log
1. Sample log
: ----- mapping task running: log output -----
: --------------------------------------------------
: detail metrics
: input-struct '1'
: read success : 4
: read failure : 0
: vertex 'ry'
: parse success : 4
: parse failure : 0
: insert success : 4
: insert failure : 0
: --------------------------------------------------
: count metrics
: input read success : 4
: input read failure : 0
: vertex parse success : 4
: vertex parse failure : 0
: vertex insert success : 4
: vertex insert failure : 0
: edge parse success : 0
: edge parse failure : 0
: edge insert success : 0
: edge insert failure : 0
: --------------------------------------------------
: meter metrics
: total time : 5.549s
: vertex load rate(vertices/s) : 0
: edge load rate(edges/s) : 0
: ----- mapping task running: end of log output -----
Summary
The above covers the basic usage of HugeGraph-Loader, which is mainly used to import datasets into a graph database. It supports importing from CSV, JSON, TXT, MySQL, Hive, and other sources, and loading is fast.