JanusGraph Reindexing (reindex)

31.1. Reindexing

Section 9.1, “Graph Index” and Section 9.2, “Vertex-centric Indexes” described how to build global graph indexes and vertex-centric indexes to improve query performance. If the indexed keys and labels are created in the same transaction as the index itself, the index takes effect immediately and no reindex is needed. If the keys or labels already existed before the index was built, the whole graph must be reindexed so that the index also covers the elements written earlier. This chapter describes that reindex process.
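For example, if a property key is brand new, it can be indexed in the same management transaction that creates it, and the resulting index is usable right away. A minimal sketch using the same API calls as the examples below (the key and index names here are only placeholders):

// Key and index created in the same management transaction -- no reindex needed
mgmt = graph.openManagement()
brand = mgmt.makePropertyKey('brand').dataType(String.class).make()
mgmt.buildIndex('byBrand', Vertex.class).addKey(brand).buildCompositeIndex()
mgmt.commit()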

Warning: Reindexing is a manual process made up of multiple steps. These steps must be carefully executed in the correct order; otherwise, the index may end up in an inconsistent state.

31.1.1. Overview

Once an index has been defined, JanusGraph immediately begins writing incremental index updates for new data. However, before the index is complete and usable, JanusGraph must also take a one-time read pass over all existing elements associated with the newly created index. Once this reindex job has completed, the index is fully populated with the pre-existing data and can be enabled for use in queries.

31.1.2. Prior to Reindexing

The starting point of the reindexing process is the creation of an index. Refer to Chapter 9 for details on how to create a global graph index or a vertex-centric index. Note that a global graph index is uniquely identified by its name. A vertex-centric index is uniquely identified by the combination of its name and the edge label or property key on which it is defined; the name of that edge label or property key is referred to as the index type later in this chapter and only applies to vertex-centric indexes.

After creating a new index against existing schema elements, it is necessary to wait a few minutes for the new index to be announced to the other instances in the cluster. Note the index name (and, for a vertex-centric index, its index type), since they are required when reindexing.
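The ManagementSystem utility used in the examples below can also be used to block until that announcement has completed; a quick sketch, assuming a global graph index named 'byName':

import org.janusgraph.graphdb.database.management.ManagementSystem

// Wait until every instance has acknowledged the index, i.e. until its
// SchemaStatus has transitioned from INSTALLED to REGISTERED
ManagementSystem.awaitGraphIndexStatus(graph, 'byName').status(SchemaStatus.REGISTERED).call()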

31.1.3. Preparing to Reindex

There is a choice between two execution frameworks for running a reindex job:

  • MapReduce
  • JanusGraphManagement

Reindexing on MapReduce supports large, horizontally-distributed databases. Reindexing with JanusGraphManagement is a single-machine OLAP job, intended for convenience and speed on databases small enough to be handled by one machine.

Reindexing requires (a short sketch follows the list):

  • The name of the index
  • The index type (the name of the edge label or property key for a vertex-centric index; not needed for other index types)
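As a sketch of how these two pieces of information are used, reusing the 'byName' and 'battlesByTime' index names that appear later in this chapter:

mgmt = graph.openManagement()
// a global graph index is retrieved by its name alone
gindex = mgmt.getGraphIndex('byName')
// a vertex-centric index is retrieved by its name plus the relation type it is defined on (its index type)
rindex = mgmt.getRelationIndex(mgmt.getRelationType('battled'), 'battlesByTime')
mgmt.rollback()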

31.1.4. Executing a Reindex Job on MapReduce

The recommended way to generate and run a reindex job on MapReduce is through the MapReduceIndexManagement class. Here is a rough outline of the steps for running a reindex job with this class:

  • Open a JanusGraph instance
  • Pass the graph instance into MapReduceIndexManagement's constructor
  • Call updateIndex(index, SchemaAction.REINDEX) on the MapReduceIndexManagement instance
  • If the index has not yet been enabled, enable it through JanusGraphManagement

The MapReduceIndexManagement class implements an updateIndex method that supports only the REINDEX and REMOVE_INDEX actions of SchemaAction. It starts a Hadoop MapReduce job using the Hadoop configuration and jars found on the classpath. Both Hadoop 1 and 2 are supported. The class reads the index metadata and storage backend information (e.g. the Cassandra partitioner) from the JanusGraph instance passed to its constructor.

graph = JanusGraphFactory.open(...)
mgmt = graph.openManagement()
mr = new MapReduceIndexManagement(graph)
mr.updateIndex(mgmt.getRelationIndex(mgmt.getRelationType("battled"), "battlesByTime"), SchemaAction.REINDEX).get()
mgmt.commit()

31.1.4.1. Example for MapReduce

The following Gremlin console session walks through all steps of the MapReduce reindex process on a single JanusGraph instance backed by Cassandra:

// Open a graph
graph = JanusGraphFactory.open("conf/janusgraph-cassandra-es.properties")
g = graph.traversal()

// Define a property
mgmt = graph.openManagement()
desc = mgmt.makePropertyKey("desc").dataType(String.class).make()
mgmt.commit()

// Insert some data
graph.addVertex("desc", "foo bar")
graph.addVertex("desc", "foo baz")
graph.tx().commit()

// Run a query -- note the planner warning recommending the use of an index
g.V().has("desc", containsText("baz"))

// Create an index
mgmt = graph.openManagement()

desc = mgmt.getPropertyKey("desc")
mixedIndex = mgmt.buildIndex("mixedExample", Vertex.class).addKey(desc).buildMixedIndex("search")
mgmt.commit()

// Rollback or commit transactions on the graph which predate the index definition
graph.tx().rollback()

// Block until the SchemaStatus transitions from INSTALLED to REGISTERED
report = mgmt.awaitGraphIndexStatus(graph, "mixedExample").call()

// Run a JanusGraph-Hadoop job to reindex
mgmt = graph.openManagement()
mr = new MapReduceIndexManagement(graph)
mr.updateIndex(mgmt.getGraphIndex("mixedExample"), SchemaAction.REINDEX).get()

// Enable the index
mgmt = graph.openManagement()
mgmt.updateIndex(mgmt.getGraphIndex("mixedExample"), SchemaAction.ENABLE_INDEX).get()
mgmt.commit()

// Block until the SchemaStatus is ENABLED
mgmt = graph.openManagement()
report = mgmt.awaitGraphIndexStatus(graph, "mixedExample").status(SchemaStatus.ENABLED).call()
mgmt.rollback()

// Run a query -- JanusGraph will use the new index, no planner warning
g.V().has("desc", containsText("baz"))

// Concerned that JanusGraph could have read cache in that last query, instead of relying on the index?
// Start a new instance to rule out cache hits.  Now we're definitely using the index.
graph.close()
graph = JanusGraphFactory.open("conf/janusgraph-cassandra-es.properties")
g.V().has("desc", containsText("baz"))

31.1.5. Executing a Reindex Job on JanusGraphManagement

To run a reindex job with JanusGraphManagement, invoke JanusGraphManagement.updateIndex with the SchemaAction.REINDEX argument. For example:

m = graph.openManagement()
i = m.getGraphIndex('indexName')
m.updateIndex(i, SchemaAction.REINDEX).get()
m.commit()

31.1.5.1. Example for JanusGraphManagement

The following example loads some data into a JanusGraph database backed by BerkeleyDB, defines an index after the fact, reindexes it with JanusGraphManagement, and finally enables and uses the index:

import org.janusgraph.graphdb.database.management.ManagementSystem

// Load some data from a file without any predefined schema
graph = JanusGraphFactory.open('conf/janusgraph-berkeleyje.properties')
g = graph.traversal()
m = graph.openManagement()
m.makePropertyKey('name').dataType(String.class).cardinality(Cardinality.LIST).make()
m.makePropertyKey('lang').dataType(String.class).cardinality(Cardinality.LIST).make()
m.makePropertyKey('age').dataType(Integer.class).cardinality(Cardinality.LIST).make()
m.commit()
graph.io(IoCore.gryo()).readGraph('data/tinkerpop-modern.gio')
graph.tx().commit()

// Run a query -- note the planner warning recommending the use of an index
g.V().has('name', 'lop')
graph.tx().rollback()

// Create an index
m = graph.openManagement()
m.buildIndex('names', Vertex.class).addKey(m.getPropertyKey('name')).buildCompositeIndex()
m.commit()
graph.tx().commit()

// Block until the SchemaStatus transitions from INSTALLED to REGISTERED
ManagementSystem.awaitGraphIndexStatus(graph, 'names').status(SchemaStatus.REGISTERED).call()

// Reindex using JanusGraphManagement
m = graph.openManagement()
i = m.getGraphIndex('names')
m.updateIndex(i, SchemaAction.REINDEX).get()
m.commit()

// Enable the index
ManagementSystem.awaitGraphIndexStatus(graph, 'names').status(SchemaStatus.ENABLED).call()

// Run a query -- JanusGraph will use the new index, no planner warning
g.V().has('name', 'lop')
graph.tx().rollback()

// Concerned that JanusGraph could have read cache in that last query, instead of relying on the index?
// Start a new instance to rule out cache hits.  Now we're definitely using the index.
graph.close()
graph = JanusGraphFactory.open("conf/janusgraph-berkeleyje.properties")
g = graph.traversal()
g.V().has('name', 'lop')

31.2. Index Removal

Warning: Index removal is a manual process composed of multiple steps. These steps must be carefully executed in the correct order to avoid causing index inconsistencies.

31.2.1. Overview

Index removal is a two-stage process. In the first stage, one JanusGraph instance signals to all other instances, via the storage backend, that the index is about to be deleted. This puts the index into the DISABLED state. At that point, JanusGraph stops using the index to answer queries and stops incrementally updating it. The index-related data in the storage backend remains present but is ignored.

The second stage depends on whether the index is a composite or a mixed index. A composite index can be deleted by JanusGraph directly. As with reindexing, the deletion can be run either on MapReduce or with JanusGraphManagement. A mixed index, however, must be dropped manually in the indexing backend; JanusGraph does not provide an automated mechanism to remove a mixed index from its indexing backend.

Index removal deletes everything associated with the index except its schema definition and its DISABLED state. This schema stub remains even after removal, although its storage footprint is negligible and fixed. Personal observation: after JanusGraph disables an index and marks it DISABLED, the index can still be looked up through JanusGraph. Even after deleting the index data in Elasticsearch, the index is still visible in JanusGraph with status DISABLED. Why the schema stub is kept after removal is a question for the JanusGraph maintainers.
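This behavior can be checked directly in the Gremlin console; a small sketch, assuming the 'name' composite index from the removal example below:

m = graph.openManagement()
idx = m.getGraphIndex('name')
// the schema stub is still returned after removal, and the key reports DISABLED
idx.getIndexStatus(m.getPropertyKey('name'))
m.rollback()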

31.2.2. Preparing for Index Removal

If the index is currently ENABLED, it should first be disabled, i.e. put into the DISABLED state. This is done through the ManagementSystem:

mgmt = graph.openManagement()
rindex = mgmt.getRelationIndex(mgmt.getRelationType("battled"), "battlesByTime")
mgmt.updateIndex(rindex, SchemaAction.DISABLE_INDEX).get()
gindex = mgmt.getGraphIndex("byName")
mgmt.updateIndex(gindex, SchemaAction.DISABLE_INDEX).get()
mgmt.commit()

Once the status of all keys on the index reaches DISABLED, the index is ready to be removed. A utility in ManagementSystem can automate the wait-for-DISABLED step:

ManagementSystem.awaitGraphIndexStatus(graph, 'byName').status(SchemaStatus.DISABLED).call()

After a composite index has been DISABLED, there is a choice between two execution frameworks for its removal:

  • MapReduce
  • JanusGraphManagement

Index removal on MapReduce supports large, horizontally-distributed databases. Index removal with JanusGraphManagement is a single-machine OLAP job, intended for convenience and speed on databases small enough to be handled by one machine.

Index removal requires:

  • The name of the index
  • The index type (the name of the edge label or property key on which the vertex-centric index is defined); this only applies to vertex-centric indexes and is omitted for other index types

As described in the overview, a mixed index must be removed manually from the indexing backend; neither the MapReduce framework nor JanusGraphManagement will delete a mixed index from the indexing backend.

31.2.3. Executing an Index Removal Job on MapReduce

As with reindexing, the recommended way to generate and run an index removal job on MapReduce is through the MapReduceIndexManagement class. Here is a rough outline of the steps for running an index removal job with this class:

  • Open a JanusGraph instance
  • If the index is not already DISABLED, disable it through JanusGraphManagement
  • Pass the graph instance into MapReduceIndexManagement's constructor
  • Call updateIndex(index, SchemaAction.REMOVE_INDEX) on the MapReduceIndexManagement instance

A commented code example follows in the next subsection.

31.2.3.1. Example for MapReduce

// Note: this example only removes a composite index
import org.janusgraph.graphdb.database.management.ManagementSystem

// Load the "Graph of the Gods" sample data
graph = JanusGraphFactory.open('conf/janusgraph-cassandra-es.properties')
g = graph.traversal()
GraphOfTheGodsFactory.load(graph)

g.V().has('name', 'jupiter')

// Disable the "name" composite index  
m = graph.openManagement()
nameIndex = m.getGraphIndex('name')
m.updateIndex(nameIndex, SchemaAction.DISABLE_INDEX).get()
m.commit()
graph.tx().commit()

// Block until the SchemaStatus transitions from INSTALLED to REGISTERED
ManagementSystem.awaitGraphIndexStatus(graph, 'name').status(SchemaStatus.DISABLED).call()

// Delete the index using MapReduceIndexManagement
m = graph.openManagement()
mr = new MapReduceIndexManagement(graph)
future = mr.updateIndex(m.getGraphIndex('name'), SchemaAction.REMOVE_INDEX)
m.commit()
graph.tx().commit()
future.get()

// Index still shows up in management interface as DISABLED -- this is normal
m = graph.openManagement()
idx = m.getGraphIndex('name')
idx.getIndexStatus(m.getPropertyKey('name'))
m.rollback()

// JanusGraph should issue a warning about this query requiring a full scan
g.V().has('name', 'jupiter')

31.2.4. Executing an Index Removal Job on JanusGraphManagement

To run an index removal job with JanusGraphManagement, invoke JanusGraphManagement.updateIndex with the SchemaAction.REMOVE_INDEX argument. For example:

m = graph.openManagement()
i = m.getGraphIndex('indexName')
m.updateIndex(i, SchemaAction.REMOVE_INDEX).get()
m.commit()

31.2.4.1. Example for JanusGraphManagement

The following loads the "Graph of the Gods" sample data into a JanusGraph database, then disables and removes the 'name' composite index through JanusGraphManagement:

import org.janusgraph.graphdb.database.management.ManagementSystem

// Load the "Graph of the Gods" sample data
graph = JanusGraphFactory.open('conf/janusgraph-cassandra-es.properties')
g = graph.traversal()
GraphOfTheGodsFactory.load(graph)

g.V().has('name', 'jupiter')

// Disable the "name" composite index
m = graph.openManagement()
nameIndex = m.getGraphIndex('name')
m.updateIndex(nameIndex, SchemaAction.DISABLE_INDEX).get()
m.commit()
graph.tx().commit()

// Block until the SchemaStatus transitions to DISABLED
ManagementSystem.awaitGraphIndexStatus(graph, 'name').status(SchemaStatus.DISABLED).call()

// Delete the index using JanusGraphManagement
m = graph.openManagement()
nameIndex = m.getGraphIndex('name')
future = m.updateIndex(nameIndex, SchemaAction.REMOVE_INDEX)
m.commit()
graph.tx().commit()

future.get()

m = graph.openManagement()
nameIndex = m.getGraphIndex('name')
// The schema stub is still present and reports DISABLED for the indexed key
nameIndex.getIndexStatus(m.getPropertyKey('name'))
m.rollback()

// JanusGraph should issue a warning about this query requiring a full scan
g.V().has('name', 'jupiter')

31.3. Common Problems with Index Management

31.3.1. IllegalArgumentException when starting job

When a reindexing job is started shortly after the index has been built, the job may fail with one of the following exceptions:

The index mixedExample is in an invalid state and cannot be indexed.
The following index keys have invalid status: desc has status INSTALLED
(status must be one of [REGISTERED, ENABLED])
The index mixedExample is in an invalid state and cannot be indexed.
The index has status INSTALLED, but one of [REGISTERED, ENABLED] is required

When an index is built, its existence is broadcast to all other JanusGraph instances in the cluster. Those instances must acknowledge the existence of the index before the reindexing process can be started. Depending on the cluster size and connection speed, the broadcast may take a while to complete; hence, one should wait a few minutes after building the index and before starting the reindex process. Note that an instance may fail to acknowledge the index's existence if it has itself failed. In other words, the cluster may wait indefinitely on the acknowledgement of a failed instance. In this case, the user must manually remove the failed instance from the cluster registry as described in Chapter 30, Failure & Recovery. After the cluster state has been restored, the acknowledgement process must be reinitiated by manually registering the index again through the management system:

mgmt = graph.openManagement()
rindex = mgmt.getRelationIndex(mgmt.getRelationType("battled"),"battlesByTime")
mgmt.updateIndex(rindex, SchemaAction.REGISTER_INDEX).get()
gindex = mgmt.getGraphIndex("byName")
mgmt.updateIndex(gindex, SchemaAction.REGISTER_INDEX).get()
mgmt.commit()
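If the reindex still cannot start because a failed instance never acknowledges the index, that instance first has to be evicted from the cluster registry as described in Chapter 30. A minimal sketch using the standard management calls, with a hypothetical instance id:

mgmt = graph.openManagement()
// list all JanusGraph instances currently registered in the cluster
mgmt.getOpenInstances()
// force-remove the failed instance (hypothetical id -- use the value reported above)
mgmt.forceCloseInstance('7f0001012f5a-failed-host1')
mgmt.commit()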

After waiting a few minutes for the acknowledgements to arrive, the reindex job should start successfully.

31.3.2. Could not find index

This exception in a reindexing job indicates that an index with the given name does not exist, or that the name has not been specified correctly. When reindexing a global graph index, only the name of the index as defined when building it should be specified. When reindexing a vertex-centric index, the name of the edge label or property key on which the index is defined must be given in addition to the index name.

31.3.3. Cassandra Mappers Fail with “Too many open files”

The tail of the exception stacktrace may look like this:

java.net.SocketException: Too many open files
       at java.net.Socket.createImpl(Socket.java:447)
       at java.net.Socket.getImpl(Socket.java:510)
       at java.net.Socket.setSoLinger(Socket.java:988)
       at org.apache.thrift.transport.TSocket.initSocket(TSocket.java:118)
       at org.apache.thrift.transport.TSocket.<init>(TSocket.java:109)

When running Cassandra with virtual nodes enabled, the number of virtual nodes seems to set a floor under the number of mappers. Cassandra may generate more mappers than virtual nodes for clusters with lots of data, but it seems to generate at least as many mappers as there are virtual nodes even though the cluster might be empty or close to empty. The default is 256 as of this writing.

Each mapper opens and quickly closes several sockets to Cassandra. The kernel on the client side of those closed sockets goes into asynchronous TIME_WAIT, since Thrift uses SO_LINGER. Only a small number of sockets are open at any one time — usually low single digits — but potentially many lingering sockets can accumulate in TIME_WAIT. This accumulation is most pronounced when running a reindex job locally (not on a distributed MapReduce cluster), since all of those client-side TIME_WAIT sockets are lingering on a single client machine instead of being spread out across many machines in a cluster. Combined with the floor of 256 mappers, a reindex job can open thousands of sockets over the course of its execution. When these sockets all linger in TIME_WAIT on the same client, they have the potential to reach the open-files ulimit, which also controls the number of open sockets. The open-files ulimit is often set to 1024.

Here are a few suggestions for dealing with the “Too many open files” problem during reindexing on a single machine:

  • Reduce the maximum size of the Cassandra connection pool. For example, consider setting the cassandrathrift storage backend’s max-active and max-idle options to 1 each, and setting max-total to -1. See Chapter 13, Configuration Reference for full listings of connection pool settings on the Cassandra storage backends.
  • Increase the nofile ulimit. The ideal value depends on the size of the Cassandra dataset and the throughput of the reindex mappers; if starting at 1024, try an order of magnitude larger: 10000. This is just necessary to sustain lingering TIME_WAIT sockets. The reindex job won’t try to open nearly that many sockets at once.
  • Run the reindex task on a multi-node MapReduce cluster to spread out the socket load.
