Notes on "A Distributed Graph Engine for Web Scale RDF Data" (2013)

水木-刘

Original link: https://blog.csdn.net/u013319237/article/details/70209905

ABSTRACT

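Existing systems cannot process web-scale RDF data efficiently, and they do not support many useful, general-purpose graph-based operations on RDF data. The paper presents Trinity.RDF, which stores RDF data in its native graph form rather than as triples or bitmap matrices.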

Introduction

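The volume of RDF data keeps growing.
Database management systems face two challenges: systems' scalability and generality.
1. Current systems store RDF as triples and use an RDBMS for storage, indexing, and query processing. This scales poorly, because query processing often involves join operations over large intermediate results.
2. Existing systems do not support general-purpose queries on RDF data; most are optimized only for SPARQL. SPARQL alone falls short for operations such as reachability queries and random walks.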
Overview of Our Approach
Trinity.RDF——a distributed in-memory RDF system
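It is built on top of a memory cloud and stores the data as a graph.
This not only opens up new optimization opportunities for SPARQL query processing, but also supports more advanced graph analytics on RDF data.
Most graph operations rely entirely on random access, so storing the data as triples on disk is a poor fit: random access on disk is slow, and although indexes speed up access, they also introduce many additional join operations.
Advantage 1: keeping RDF data as an in-memory graph makes random access efficient.
Advantage 2: in-memory graph exploration reduces both the number of join operations and the volume of intermediate results.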
Specifically, we decompose a SPARQL query into a set of triple patterns, and conduct a sequence of graph explorations to generate bindings for each of the triple pattern. The exploration-based approach uses the binding information of the explored subgraphs to prune candidate matches in a greedy manner.
Advantage 3: a wide range of advanced graph analytics can be run on the RDF data, such as random walks, regular expression queries, reachability queries, distance oracles, and community searches.
Contributions of the paper:
1. We introduce a novel graph-based scheme for managing RDF data. Trinity.RDF has the potential to support efficient graph-based queries, as well as advanced graph analytics, on RDF.
2. We leverage graph exploration for SPARQL processing. The new query paradigm greatly reduces the volume of intermediate results, which in turn boosts query performance and system scalability.
3. We introduce a new cost model, novel cardinality estimation techniques, and optimization algorithms for distributed query plan generation. These approaches ensure excellent performance on web scale RDF data.

2 Join vs. Graph Exploration

2.1 RDF and SPARQL

2.2 Using Join Operations

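Join-based processing has two phases: a scan phase and a join phase. The paper illustrates this with its Figure 1 and Table 1.
Despite existing optimizations, two problems remain unavoidable: (1) a large number of join operations; (2) large intermediate results.
Sideways Information Passing (SIP) generates filters to cut down such intermediate results.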

2.3 Using Graph Explorations

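The paper uses an example to show the benefit of graph exploration: starting from one edge of the graph, the match is extended step by step, and unnecessary results can be pruned along the way.
However, exploration over data stored in relational tables, triple stores, or disk-based key-value stores is no more efficient than joins.
The order of exploration also matters.
The section closes by explaining how graph exploration differs from an index-nested-loops join.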
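The contrast between the two strategies can be made concrete with a toy example. The sketch below is not the paper's implementation: the triples, the two-pattern query ("who works at an organization located in a given city"), and the row counter used as a crude proxy for intermediate-result volume are all assumptions made for illustration.

```python
# Toy data: (subject, predicate, object) triples. All names are hypothetical.
TRIPLES = [
    ("alice", "worksAt", "acme"),
    ("bob", "worksAt", "acme"),
    ("carol", "worksAt", "initech"),
    ("dave", "worksAt", "globex"),
    ("acme", "locatedIn", "paris"),
    ("initech", "locatedIn", "austin"),
    ("globex", "locatedIn", "tokyo"),
]

def scan(predicate):
    """Scan phase: all (subject, object) pairs for one predicate."""
    return [(s, o) for s, p, o in TRIPLES if p == predicate]

def join_plan(city):
    """Match each pattern independently, then join on the shared variable."""
    works = scan("worksAt")                                   # ?x worksAt ?y
    located = [(y, c) for y, c in scan("locatedIn") if c == city]
    orgs = {y for y, _ in located}
    people = sorted(x for x, y in works if y in orgs)
    return people, len(works) + len(located)                  # rows materialized

def exploration_plan(city):
    """Start from the bound node and follow edges backwards."""
    incoming = {}
    for s, p, o in TRIPLES:
        incoming.setdefault((o, p), []).append(s)
    orgs = incoming.get((city, "locatedIn"), [])
    people = [x for y in orgs for x in incoming.get((y, "worksAt"), [])]
    return sorted(people), len(orgs) + len(people)            # nodes touched

join_result, join_rows = join_plan("paris")
explore_result, explore_rows = exploration_plan("paris")
```

Both plans return the same answer, but on this data the join plan materializes every worksAt row before filtering, while the exploration plan only touches nodes reachable from the bound city.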

3 System Architecture

Trinity.RDF is based on Trinity [30], which is a distributed in-memory key-value store. Trinity.RDF builds a graph interface on top of the key-value store. It randomly partitions an RDF graph across a cluster of commodity machines by hashing on the nodes.
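Queries are then executed in parallel on every server, and the servers exchange data to assemble the complete result.
The query flow is as follows: a user submits a query to the proxy, which generates a query plan and hands it to the other servers for execution; the proxy then gathers the intermediate results and returns the answer to the user.
The proxy has two roles: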
First, it generates a query plan based on available statistics and indices. Second, it keeps track of the status of each Trinity machine in query processing by, for example, synchronizing the execution of each query step.
String indexing server: stores the mapping between strings and ids.

4 Data Modeling

4.1 Modeling Graphs

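Each RDF entity is represented as a key-value pair:

(node-id, <in-adjacency-list, out-adjacency-list>) (1)

Given any node, we can find the node-id of any of its neighbors, and the underlying Trinity memory cloud retrieves the key-value pair stored under that node-id. This lets us explore the graph from any vertex by accessing its adjacency lists.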
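A minimal single-process sketch of model (1), assuming a plain Python dict stands in for the Trinity memory cloud. The node ids, predicates, and the helper names `neighbors` and `reachable` are made up for illustration.

```python
# Stand-in for the key-value store: node-id -> (in-adjacency-list, out-adjacency-list).
# Each adjacency entry is (predicate, neighbor-id). All ids are hypothetical.
store = {
    1: ([], [("knows", 2), ("worksAt", 3)]),
    2: ([("knows", 1)], [("worksAt", 3)]),
    3: ([("worksAt", 1), ("worksAt", 2)], []),
}

def neighbors(node_id, direction):
    """One key lookup returns both adjacency lists of a node."""
    in_list, out_list = store[node_id]
    return in_list if direction == "in" else out_list

def reachable(start):
    """Explore along outgoing edges through repeated key lookups."""
    seen, frontier = {start}, [start]
    while frontier:
        node = frontier.pop()
        for _, nxt in neighbors(node, "out"):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return seen
```

Every hop is a random access by key, which is why the model depends on the data being held in memory.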

4.2 Graph Partitioning

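Two factors affect the network overhead of graph exploration.
The first is how the graph is partitioned: node ids are hashed, which amounts to random partitioning.
The second is how the graph is modeled on top of the key-value store. Some high-degree vertices have a very large number of neighbors, so shipping their whole adjacency lists over the network would be very expensive.
Node x is therefore modeled with the following key-value pairs: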

(node-id, < $in_1,...,in_k,out_1,...,out_k$ >) (2)
where $in_i$ and $out_i$ are keys to some other key-value pairs:
( $in_i$ , $\text{in-adjacency-list}_i$ ) ( $out_i$ , $\text{out-adjacency-list}_i$ ) (3)
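The following sketch shows one way to lay out models (2) and (3). It assumes that k is the number of machines and that the i-th partial list holds the neighbors stored on machine i, as the paper describes. The modulo "hash", the tuple-shaped keys, and the name `build_cluster` are assumptions for illustration, not Trinity's API.

```python
K = 3  # number of machines (an assumed value)

def machine_of(node_id):
    # Stand-in for hashing a node id to a machine.
    return node_id % K

def build_cluster(edges):
    """edges: list of (src, predicate, dst) with integer node ids.
    Returns one dict per machine, playing the role of its local key-value store."""
    machines = [{} for _ in range(K)]
    for s, p, o in edges:
        i = machine_of(o)  # out-neighbor o lives on machine i
        machines[i].setdefault(("out", s, i), []).append((p, o))
        j = machine_of(s)  # in-neighbor s lives on machine j
        machines[j].setdefault(("in", o, j), []).append((p, s))
    # Node record (2): only keys, never the adjacency lists themselves.
    # A key is listed even when the partial list on that machine is empty.
    for n in {n for s, _, o in edges for n in (s, o)}:
        machines[machine_of(n)][("node", n)] = (
            [("in", n, i) for i in range(K)],
            [("out", n, i) for i in range(K)],
        )
    return machines

cluster = build_cluster([(0, "p", 1), (0, "p", 2), (0, "p", 4)])
```

With this layout, exploring from node 0 only needs to send small keys to machines 1 and 2; each partial adjacency list is already local to the machine that stores the neighbors it names.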

4.3 Indexing Predicates

Graph exploration relies on retrieving nodes connected by an edge of a given predicate.
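Local predicate indexing: corresponds to the SPO or OPS index of traditional RDF stores; an aggregate index is added as well.
Global predicate indexing: The global predicate index enables us to find all nodes that have incoming or outgoing neighbors labeled by a given predicate. It corresponds to the PSO or POS index. There is also one key-value pair per predicate: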
(predicate,

4.4 Basic Graph Operators

There are three graph operators (dir is the direction, incoming or outgoing):
1. LoadNodes(predicate, dir): Return nodes that have an incoming or outgoing edge labeled as predicate. It uses the global predicate index to locate these nodes on every machine.
2. LoadNeighborsOnMachine(node, dir, i): For a given node, return its incoming or outgoing neighbors that reside on machine i. The result is returned as the id of that partial adjacency list.
3. SelectByPredicate(nid, predicate): From a given partial adjacency list specified by nid, return nodes that are labeled with the given predicate.

5 Query Processing

5.1 Overview

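A SPARQL query is treated as a subgraph matching problem.
Query processing relies on fast lookups in the in-memory key-value store.
Steps:
1. Decompose the query Q into a series of triple patterns $q_1, ... ,q_n$ .
2. Find the matches of each $q_i$ and extend them to $q_{i+1}$ , and so on.
3. Gather the intermediate results at the proxy.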

5.2 Single Triple Pattern Matching

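Matching starts from a single triple pattern.
For a triple pattern q, we find all of its matches R(q). Let P denote the predicate in q, V a variable in q, and B(V) the binding of V (its set of possible values).
There are two ways to explore the graph:
$\overrightarrow{q}$ from subject to object
$\overleftarrow{q}$ from object to subject
Algorithm 1 in the paper performs the matching with the operators defined in Section 4.4.
First, all source nodes (src) are determined from the predicate.
Then, for each source node s, the algorithm iterates over all machines to obtain every (s, $nid_i$ ) pair and sends these pairs to the machines.
Finally, each machine iterates over the (s, $nid_i$ ) pairs it received and, for each candidate predicate, finds the possible values of the target and adds them to the result.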
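The steps above can be sketched on a simulated cluster. This is a simplified reading of Algorithm 1 and the Section 4.4 operators, not the paper's code: the predicate is a constant rather than a variable, `load_nodes` scans a global triple list instead of a per-machine predicate index, and message passing is modeled with in-process lists. The Python function names and the toy triples are assumptions.

```python
K = 2  # number of simulated machines (assumed)
TRIPLES = [(1, "knows", 2), (1, "knows", 3), (4, "knows", 3), (1, "likes", 4)]

def machine_of(node):
    return node % K  # stand-in for hashing a node id

# partial[i][(dir, node)] = entries of node's adjacency list whose neighbors live on machine i
partial = [{} for _ in range(K)]
for s, p, o in TRIPLES:
    partial[machine_of(o)].setdefault(("out", s), []).append((p, o))
    partial[machine_of(s)].setdefault(("in", o), []).append((p, s))

def load_nodes(predicate, direction):
    """LoadNodes: nodes with an outgoing ("out") or incoming ("in") edge labeled predicate."""
    pos = 0 if direction == "out" else 2
    return {t[pos] for t in TRIPLES if t[1] == predicate}

def load_neighbors_on_machine(node, direction, i):
    """LoadNeighborsOnMachine: id of node's partial adjacency list on machine i, or None."""
    return (i, direction, node) if (direction, node) in partial[i] else None

def select_by_predicate(nid, predicate):
    """SelectByPredicate: neighbors in the partial list nid reached through predicate."""
    i, direction, node = nid
    return [n for p, n in partial[i][(direction, node)] if p == predicate]

def match_pattern(predicate, direction="out", src_bindings=None, tgt_bindings=None):
    """Match one triple pattern; "out" explores subject to object, "in" the reverse."""
    src = load_nodes(predicate, direction)
    if src_bindings is not None:
        src &= src_bindings
    inbox = [[] for _ in range(K)]  # messages "sent" to each machine
    for s in src:
        for i in range(K):
            nid = load_neighbors_on_machine(s, direction, i)
            if nid is not None:
                inbox[i].append((s, nid))
    results = set()
    for i in range(K):  # in a real cluster each machine would do this in parallel
        for s, nid in inbox[i]:
            for n in select_by_predicate(nid, predicate):
                if tgt_bindings is None or n in tgt_bindings:
                    results.add((s, n))
    return results
```

Only (source, list id) pairs cross machine boundaries in this sketch; the adjacency entries themselves are read where they are stored.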

5.3 Multiple Pattern Matching by Exploration

Instead of matching single patterns independently, we treat the query as a sequence of patterns. The matching of the current pattern is based on the matches of previous patterns, i.e., we “explore” the RDF graph from the matches of previous patterns to find matches for the current pattern. In other words, we eagerly prune invalid matchings by exploration to avoid the cost of joining large sets of results later.
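The triple patterns are matched in sequence, and invalid matches can be pruned at every step.
In the first case, the source of exploration is bound. Except in the first step, the values of the source are determined by the results of the previous step.
In the second case, the target of exploration is bound.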
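A single-machine sketch of exploration over a sequence of patterns, assuming two in-memory indexes (`OUT`, `IN`) in place of the distributed operators. The plan format, variable names, and toy triples are hypothetical; the point is only to show how bindings from one step restrict the sources or targets of the next.

```python
TRIPLES = [
    ("alice", "worksAt", "acme"),
    ("bob", "worksAt", "acme"),
    ("carol", "worksAt", "initech"),
    ("acme", "locatedIn", "paris"),
    ("initech", "locatedIn", "austin"),
]
OUT, IN = {}, {}
for s, p, o in TRIPLES:
    OUT.setdefault((s, p), set()).add(o)
    IN.setdefault((o, p), set()).add(s)

def explore(plan, bindings):
    """plan: list of (src_var, predicate, tgt_var, direction) steps.
    bindings: variable -> set of candidate values; unbound variables are absent.
    Returns the bindings after exploration and the (src, tgt) matches per step."""
    bindings = {var: set(vals) for var, vals in bindings.items()}
    matches = []
    for src, pred, tgt, direction in plan:
        index = OUT if direction == "out" else IN
        sources = bindings.get(src)
        if sources is None:  # unbound source: start from every node with this predicate
            sources = {n for (n, p) in index if p == pred}
        pairs = {
            (s, t)
            for s in sources
            for t in index.get((s, pred), ())
            if tgt not in bindings or t in bindings[tgt]  # bound target prunes here
        }
        bindings[src] = {s for s, _ in pairs}
        bindings[tgt] = {t for _, t in pairs}
        matches.append(pairs)
    return bindings, matches

plan = [("?city", "locatedIn", "?org", "in"), ("?org", "worksAt", "?person", "in")]
bindings, matches = explore(plan, {"?city": {"paris"}})
```

In this run the second step starts only from the organizations found by the first step, so "carol" and "initech" are never visited.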

5.4 Final Join after Exploration

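Once source nodes are pruned, the corresponding target nodes are pruned too, which greatly reduces the size of the intermediate results.
However, the pruned results may still contain invalid intermediate results: eliminating them during exploration would require carrying all possible values of the previously visited nodes, which would raise the communication cost. A final join is therefore still needed.
Our join phase is light-weight compared with traditional RDF systems that intensively rely on joins, and we simply adopt the left-deep join for this purpose.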
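To see why a final join is still required, consider a hedged toy case: after exploring two patterns, the matches of the first pattern can retain a row whose shared variable was pruned later. The sketch below implements a plain left-deep join over per-pattern matches; the data layout and the name `left_deep_join` are assumptions for illustration.

```python
def left_deep_join(pattern_results):
    """pattern_results: list of (variables, rows), where variables is a tuple of
    names and rows is a set of value tuples in the same order.
    Joins the patterns from left to right on their shared variables."""
    first_vars, first_rows = pattern_results[0]
    acc = [dict(zip(first_vars, row)) for row in first_rows]
    for variables, rows in pattern_results[1:]:
        joined = []
        for left in acc:
            for row in rows:
                right = dict(zip(variables, row))
                # Keep the pair only if every shared variable agrees.
                if all(left.get(v, right[v]) == right[v] for v in variables):
                    joined.append({**left, **right})
        acc = joined
    return acc

# R(q1) still holds ("b", "d") although "d" has no match in q2.
r_q1 = (("?x", "?y"), {("a", "c"), ("b", "d")})
r_q2 = (("?y", "?z"), {("c", "e")})
answers = left_deep_join([r_q1, r_q2])
```

Because exploration has already shrunk each R(q), the inputs to this join are small, which is the sense in which the join phase is light-weight.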

5.5 Exploration Plan Optimization

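Graph exploration is formalized as an exploration plan, i.e. a sequence $\langle e_1, ..., e_n \rangle$ where each $e_i=\overrightarrow{q_i}$ or $e_i=\overleftarrow{q_i}$ ; the total cost depends on the order of the $e_i$ .
This resembles how a relational query optimizer chooses the join order, but it is not the same problem.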
In the relational optimizer, later joins depend on previous intermediary join results, while for us, later explorations depend on previous intermediary bindings. The intermediary join results do not depend on the order of join, while the intermediary bindings do depend on the order of exploration.
In other words, the intermediate results here change with the order of exploration.
There are two ways to grow a subgraph: expansion and combination.
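$\epsilon$ denotes a subgraph of the query graph
$R(\epsilon)$ denotes its intermediate join results
$B(\epsilon)$ denotes the bindings (possible values) of the variables in it
exploration point: a node whose bindings contain no redundant values.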

Heuristic 1. We expand a subgraph from its exploration point. We combine two subgraphs by connecting their exploration points.

Heuristic 1 leads to the following property:

Property 1. We expand a subgraph or combine two subgraphs through an edge. The two nodes on both ends of the edge are valid exploration points in the new graph.

Using dynamic programming, the initial states are subgraphs of size 1, i.e. single edges.
If we expand through an edge $q=c\rightsquigarrow v$ , we obtain two states:

$(\epsilon \cup \{q\},v)$ and $(\epsilon \cup \{q\},c)$

Let $C$ be the cost before the expansion and $C^{'}$ the cost after it; the recurrence is:

$C^{'}=\min\{C^{'},C+cost(\overrightarrow{q})\}$

If we connect two states $(\epsilon_1,c_1)$ and $(\epsilon_2,c_2)$ through an edge $q=c_1\rightsquigarrow c_2$ , we obtain two states:

$(\epsilon_1 \cup \epsilon_2 \cup \{q\},c_1)$ and $(\epsilon_1 \cup \epsilon_2 \cup \{q\},c_2)$

Let $C_1$ and $C_2$ be the costs of the two states before the combination and $C^{'}$ the cost after it; the recurrence is:

$C^{'}=\min\{C^{'},C_1+C_2+cost(\overrightarrow{q})\}$

Theorem 1. For a query graph G(V, E), the DP has time complexity O(n·|V|·|E|) where n is the number of connected subgraphs in G.
Theorem 2. Any acyclic query Q with query graph G is guaranteed to have an exploration plan.
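The two recurrences (expansion and combination) can be sketched as a small dynamic program over states (edge set, exploration point). This is a simplification under a strong assumption: each exploration has a fixed cost supplied by a hypothetical `cost(edge_id, start_node)` function, whereas in the paper the cost depends on estimated binding sizes. The function name `best_plan` and the example costs are made up, and the enumeration is brute force rather than tuned to the stated complexity bound.

```python
def best_plan(edges, cost):
    """edges: dict edge_id -> (u, v). cost(edge_id, start_node) -> number.
    Returns (total_cost, plan), plan being a list of (edge_id, start_node)."""
    best = {}  # (frozenset of edge ids, exploration point) -> (cost, plan)

    def relax(edge_set, points, c, plan):
        # Property 1: both ends of the newly explored edge are exploration points.
        for point in points:
            key = (edge_set, point)
            if key not in best or c < best[key][0]:
                best[key] = (c, plan)

    # Initial states: subgraphs made of a single edge.
    for q, (u, v) in edges.items():
        for start in (u, v):
            relax(frozenset([q]), (u, v), cost(q, start), [(q, start)])

    for size in range(2, len(edges) + 1):
        states = list(best.items())  # snapshot: all smaller states are final
        # Expansion: C' = min(C', C + cost(q)) from exploration point c.
        for (es, c), (cst, plan) in states:
            if len(es) != size - 1:
                continue
            for q, (u, v) in edges.items():
                if q not in es and c in (u, v):
                    relax(es | {q}, (u, v), cst + cost(q, c), plan + [(q, c)])
        # Combination: C' = min(C', C1 + C2 + cost(q)) joining c1 and c2.
        for (e1, c1), (cost1, plan1) in states:
            for (e2, c2), (cost2, plan2) in states:
                if len(e1) + len(e2) + 1 != size or e1 & e2:
                    continue
                for q, (u, v) in edges.items():
                    if q not in e1 and q not in e2 and {u, v} == {c1, c2}:
                        relax(e1 | e2 | {q}, (c1, c2),
                              cost1 + cost2 + cost(q, c1), plan1 + plan2 + [(q, c1)])

    full = frozenset(edges)
    candidates = [val for (es, _), val in best.items() if es == full]
    return min(candidates, key=lambda cv: cv[0])

# Path query x - y - z - w with made-up directional costs.
EDGES = {"q1": ("x", "y"), "q2": ("y", "z"), "q3": ("z", "w")}
COSTS = {("q1", "x"): 1, ("q1", "y"): 4, ("q2", "y"): 2,
         ("q2", "z"): 2, ("q3", "z"): 4, ("q3", "w"): 1}
total, plan = best_plan(EDGES, lambda q, start: COSTS[(q, start)])
```

On this example the cheapest plan explores q1 and q3 independently from their cheap ends and then combines them through q2; a plan built only by expansion would cost more.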

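Discussion: two cases are not covered:
1. The query graph is cyclic.
This can be handled by duplicating some variables to break the cycles.
2. The query graph contains joins on predicates.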

5.6 Cost Estimation

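The section first reviews the method of Stocker et al., which assumes that s, p, and o are independent, so the selectivity of a triple pattern is the product of the three individual selectivities.
RDF-3X proposes two approaches: (1) assume that triple patterns are independent, with the cost driven by the join operations; (2) mine frequently used join paths and keep statistics for them.
This paper estimates $cost(e)$ , i.e. $cost(\overrightarrow q)$ , from the size of the results, |R(q)|.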
During exploration, we send bindings and ids of adjacency lists across network, so we measure communication cost as the binding size of the source node of the exploration, i.e. |B(src)|.

$|R(q)|=|B(src)|\frac{C_p}{C_p(src)},|R(tgt)|=|B(src)|\frac{C_p(tgt)}{C_p(src)}$
For expansion, assume we expand through a new edge p1 to p2 from variable x which is already connected with p1. Assume the original binding size of x is $N_x$ . We have the new binding size $N^{'}_x$ as

$N^{'}_x=N_x\frac{C_{P_1P_2}}{C_{P_1}}$
The second case is combining two edges p1 and p2 on x. Assume the original binding sizes of x with predicate p1 and predicate p2 are $N_{x,1}$ and $N_{x,2}$ respectively. We have the new binding size $N^{'}_x$ as

$N^{'}_x=N_{x,1}N_{x,2}\frac{C_{P_1P_2}}{C_{P_1}C_{P_2}}$
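A worked numeric example of the four estimates may help. The notes do not define the statistics, so the reading used here is an assumption: C_p is the number of triples with predicate p, C_p(src) and C_p(tgt) are the numbers of distinct source and target nodes for p, and C_{P1P2} is the number of nodes that carry both predicates. All numbers and Python function names are made up.

```python
def result_size(b_src, c_p, c_p_src):
    """|R(q)| = |B(src)| * C_p / C_p(src)"""
    return b_src * c_p / c_p_src

def target_size(b_src, c_p_tgt, c_p_src):
    """Target-side estimate: |B(src)| * C_p(tgt) / C_p(src)"""
    return b_src * c_p_tgt / c_p_src

def expanded_binding(n_x, c_p1p2, c_p1):
    """Expansion: N'_x = N_x * C_{P1P2} / C_{P1}"""
    return n_x * c_p1p2 / c_p1

def combined_binding(n_x1, n_x2, c_p1p2, c_p1, c_p2):
    """Combination: N'_x = N_{x,1} * N_{x,2} * C_{P1P2} / (C_{P1} * C_{P2})"""
    return n_x1 * n_x2 * c_p1p2 / (c_p1 * c_p2)

# 100 bound sources; predicate p has 5000 triples, 1000 distinct sources, 200 distinct targets.
r_q = result_size(100, 5000, 1000)
r_tgt = target_size(100, 200, 1000)
# x has 400 bindings via p1; 50 of the 1000 nodes with p1 also have p2.
n_expand = expanded_binding(400, 50, 1000)
# x has 400 bindings via p1 and 300 via p2; 1000 nodes have p1, 600 have p2, 50 have both.
n_combine = combined_binding(400, 300, 50, 1000, 600)
```

Under these made-up statistics the estimates come out to 500, 20, 20, and 10 respectively, which shows how a rare predicate pair shrinks the binding size sharply.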

6 Evaluation

Systems We compare Trinity.RDF with centralized RDF-3X [27] and BitMat [8], as well as distributed MapReduce-RDF-3X (a Hadoop-based RDF-3X solution [20]).
Datasets [dataset table image not preserved]
Join vs. Exploration We compare graph exploration (Trinity.RDF) with scan-join (RDF-3X and BitMat) on DBPSB and LUBM-160 datasets. The experiment results show that Trinity.RDF outperforms RDF-3X and BitMat; and more importantly, its superiority does not just come from its in-memory architecture, but from the fact that graph exploration itself is more efficient than join.
Performance on Large Datasets
