《Scalable SPARQL Querying using Path Partitioning》

水木-刘

Category: Paper Notes

Article link: https://blog.csdn.net/u013319237/article/details/71214585

ABSTRACT

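Complex queries over large RDF graphs demand scalable query processing, and joins across partitions are expensive. The paper proposes a new data partitioning method that exploits the rich structural information in RDF datasets to reduce cross-partition joins, and it reports good results.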

INTRODUCTION

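RDF data keeps growing and now exceeds what a single machine can process.
RDF data in table form can be viewed as a graph; see the example in Figure 1(a).
A SPARQL query can likewise be modeled as a graph; see the example in Figure 1(b).
In a scale-out RDF data processing system, the RDF data is partitioned across compute nodes, so a good partitioning strategy is needed to reduce cross-partition joins.
The most common approach is hash partitioning. A query is split into star-shaped subqueries that can be evaluated quickly in parallel, but joining the intermediate results is expensive.
"Scalable SPARQL querying of large RDF graphs" partitions the graph and replicates the n-hop neighborhood, so that a query can be decomposed into subqueries of length 2*n.
If the replicated region contains high-degree vertices, the result is heavy data skew. "Scaling queries over big rdf graphs with semantic hash partitioning" replaces the graph partitioning step with hashing to address the skew, but because of vertex expansion the data replication problem remains.
This paper considers the structure of both the RDF graph and the query graph. It introduces the notion of an end-to-end path, so that complex queries with long paths need not be decomposed into many subqueries. It also proposes a vertex merging technique that reduces redundancy and the number of inner-partition subqueries.
Contributions of the paper (quoted from the original, since paraphrasing them is tedious):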

We propose a new RDF data partitioning framework, which adopts the end-to-end path as the basic partition element and employs vertex merging to combine paths into partitions. We formally formulate the balanced path partitioning problem under this new framework and prove the problem is NP-Hard and APX-Hard.
In view of the hardness of the problem, we introduce a new version of the problem with a relaxed balance constraint. Then we propose an approximate algorithm that provides a performance guarantee.
To enhance the efficiency, we also present two bottom-up path merging algorithms to partition the paths. The resulting data partitioning can localize many queries with complex structures, such as star, chain, cycle and tree, while maintaining low data duplication and data skewness.
We propose a partition-aware query decomposition method to decompose a complex SPARQL query to minimize the possible cross-node communication. Our data partitioning method allows a complex query to be decomposed into fewer number of subqueries and hence be evaluated more efficiently.
To perform a fair comparison with the state-of-the-art approaches [15], [19], we implement an experimental system by adopting a similar architecture as proposed in [15], [19], where each single node RDF store is powered by RDF-3X [22] and cross-node communication is implemented on a Hadoop platform. We conduct an extensive experimental study on LUBM [11] and SP2Bench [25] benchmarks as well as a large real-world RDF dataset UniProt [6]. The results show that our method outperforms the previous approaches by up to two orders of magnitude, especially for complex queries.

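Hadoop-based RDF systems store RDF data as HDFS files and rely on vanilla Hadoop's file splitting and placement policies to distribute them. Data partitioning algorithms and data locality strategies should be designed to lower I/O cost and communication overhead.
Hash partitioning is the most popular partitioning algorithm. It suits star queries but is inefficient for chain and other query shapes, while graph-partitioning-based methods suffer from skewed data distribution and heavy data replication.
Another direction is dynamic, run-time data partitioning. The algorithm in this paper can serve as the initial partitioning for such methods and improve their performance.
It can also be used in Trinity.RDF and TriAD to lower communication cost and improve performance.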

PRELIMINARIES

A. RDF Graph and SPARQL Query

Definition 1 (RDF Graph) $G=(V,E,L_E)$. A vertex with no incoming edges is a source vertex, and a vertex with no outgoing edges is a sink vertex.
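To make the source/sink terminology concrete, here is a minimal sketch that finds both sets in a small graph. It assumes the graph is given as a list of (subject, predicate, object) triples; the function name `sources_and_sinks` and the `ex:` identifiers are made up for illustration.

```python
def sources_and_sinks(triples):
    """Return (sources, sinks) for a graph given as (subject, predicate, object) triples."""
    subjects = {s for s, _, _ in triples}  # vertices with at least one outgoing edge
    objects = {o for _, _, o in triples}   # vertices with at least one incoming edge
    vertices = subjects | objects
    # Sources never appear as an object; sinks never appear as a subject.
    return vertices - objects, vertices - subjects


# Hypothetical toy data, not taken from the paper.
triples = [
    ("ex:alice", "ex:advisor", "ex:bob"),
    ("ex:bob", "ex:worksFor", "ex:dept1"),
    ("ex:alice", "ex:memberOf", "ex:dept1"),
]
sources, sinks = sources_and_sinks(triples)
print(sources, sinks)
```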

B. End-To-End Path

Definition 2 (End-to-End Path)
Let G be an RDF graph. A path $v_0e_1v_1e_2v_2...e_mv_m$ is called an end-to-end path if it satisfies both of the following conditions: (i) $v_0$ is a source vertex or one of the vertices in a directed cycle that does not contain any vertex with incoming edges from vertices outside of the cycle; (ii) $v_m$ is a sink vertex or there exists a vertex $v_i$ in this path with $v_m=v_i$. We call $v_0$ the start vertex and $v_m$ the end vertex.

From here on, an end-to-end path is simply called a path.

Theorem 1
Given an RDF graph G, if it is decomposed into a set of paths according to Definition 2, then every vertex v and every edge (u,w) in G exist in at least one path.
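Definition 2 and Theorem 1 can be tried out on a toy graph. The sketch below is one possible reading, not the paper's code: it ignores edge labels, assumes no parallel edges, represents a path as a tuple of vertices, and picks the smallest vertex ID as the start vertex of each cycle that has no incoming edges from outside. The names `reachable` and `end_to_end_paths` are hypothetical.

```python
from collections import defaultdict


def reachable(adj, start):
    """Vertices reachable from `start` by following at least one edge."""
    seen, stack = set(), list(adj[start])
    while stack:
        v = stack.pop()
        if v not in seen:
            seen.add(v)
            stack.extend(adj[v])
    return seen


def end_to_end_paths(edges):
    """Enumerate end-to-end paths of a directed graph given as (u, w) edge pairs."""
    adj, radj, vertices = defaultdict(set), defaultdict(set), set()
    for u, w in edges:
        adj[u].add(w)
        radj[w].add(u)
        vertices.update((u, w))

    starts = {v for v in vertices if not radj[v]}  # source vertices
    for v in vertices:
        desc, anc = reachable(adj, v), reachable(radj, v)
        # v is on a cycle, and everything that reaches v is reachable from v,
        # so the cycle has no incoming edges from outside.
        if v in desc and anc <= desc:
            starts.add(min(desc & anc))  # one representative per such cycle

    paths = []

    def extend(path):
        last = path[-1]
        if not adj[last]:            # reached a sink
            paths.append(path)
            return
        for w in sorted(adj[last]):
            if w in path:            # revisits a vertex: the path ends here
                paths.append(path + (w,))
            else:
                extend(path + (w,))

    for s in sorted(starts):
        extend((s,))
    return paths


edges = [("a", "b"), ("b", "d"), ("a", "c"), ("c", "d"), ("x", "y"), ("y", "x")]
paths = end_to_end_paths(edges)
covered = {(p[i], p[i + 1]) for p in paths for i in range(len(p) - 1)}
print(paths)
```

On this toy graph every edge appears in at least one enumerated path, which is what Theorem 1 states. The reachability test is quadratic and only meant for small examples.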

PROBLEM FORMULATION

A. Path Partitioning Model

Definition 3 (k-way Path Partitioning Plan)
Given an RDF graph G=(V,E), a k-way path partitioning plan P over G divides all the end-to-end paths of G into k nonempty and disjoint partitions $\{P_1, ..., P_k\}$, where each $P_i$ contains an exclusive subset of the end-to-end paths.

Theorem 2
Given a path partitioning plan P, all queries that only contain S-O joins (i.e. chain and directed cycle queries) are inner-partition queries.

Definition 4 (Merged Vertex)
A vertex v is called merged if all paths that contain v are in the same partition.
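Definition 4 can be checked mechanically. The following minimal sketch assumes a plan is represented as a list of partitions, each a list of paths, and each path a tuple of vertices; this representation and the name `merged_vertices` are assumptions made for illustration.

```python
def merged_vertices(plan):
    """Return the vertices whose containing paths all sit in a single partition."""
    home = {}  # vertex -> set of partition indices that contain it
    for index, partition in enumerate(plan):
        for path in partition:
            for v in path:
                home.setdefault(v, set()).add(index)
    return {v for v, parts in home.items() if len(parts) == 1}


# Two paths that share "a" and "d", placed in different partitions.
split_plan = [[("a", "b", "d")], [("a", "c", "d")]]
# The same two paths placed together.
joint_plan = [[("a", "b", "d"), ("a", "c", "d")], []]

print(merged_vertices(split_plan))
print(merged_vertices(joint_plan))
```

In the split plan only "b" and "c" are merged, because "a" and "d" occur in both partitions. Placing both paths together merges all four vertices.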

B. Query Decomposition Model

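A query is decomposed into subqueries that are chains or contain only S-O joins.
If two subqueries join on shared vertices, they can be combined into a single subquery; Theorem 3 gives the condition.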

Theorem 3
Given two inner-partition subqueries $SQ_i$ and $SQ_j$ that share a set of vertices $V_{i,j}$ , the join between $SQ_i$ and $SQ_j$ can be evaluated locally, if there exists one vertex $v\in V_{i,j}$ such that all matching vertices of v in the RDF graph are merged.
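In other words, two subqueries share a set of vertices, and if at least one of those shared vertices matches only merged vertices in the RDF graph, the two subqueries can be combined.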

C. Metrics for Path Partitioning

Balance

Balance is measured by $|L(P_i)|$, the number of paths in partition $P_i$.

Data Duplication

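Some paths share common edges and vertices. If such paths are assigned to different partitions, the shared triples are duplicated. The duplication ratio is defined as:

$Dup(\mathscr{P})=\frac{1}{\vert E\vert }\sum_{e\in E}(\vert \mathscr{P}(e)\vert -1) \tag{1}$
where $|\mathscr{P}(e)|$ is the number of copies of edge e under the path partitioning plan $\mathscr{P}$.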
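Equation (1) is easy to evaluate on a toy plan. This is a minimal sketch: it assumes a plan is a list of partitions, each a list of paths given as vertex tuples, that the edges of a path are its consecutive vertex pairs, and that every edge of the graph occurs in some path (Theorem 1). The name `duplication` is hypothetical.

```python
def duplication(plan):
    """Evaluate Dup(P) from Equation (1) for a plan given as lists of vertex-tuple paths."""
    copies = {}  # edge -> set of partition indices that store a copy of it
    for index, partition in enumerate(plan):
        for path in partition:
            for edge in zip(path, path[1:]):
                copies.setdefault(edge, set()).add(index)
    # |P(e)| - 1 is the number of extra copies of edge e.
    return sum(len(parts) - 1 for parts in copies.values()) / len(copies)


together = [[("a", "b", "d"), ("a", "b", "e")], []]
apart = [[("a", "b", "d")], [("a", "b", "e")]]
dup_together = duplication(together)
dup_apart = duplication(apart)
print(dup_together, dup_apart)
```

The graph has three edges. Separating the two paths stores the shared edge (a, b) twice, so the ratio rises from 0 to 1/3.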

Query-Efficiency

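$V_+$ denotes the set of merged vertices. The more vertices are merged, the more subqueries can be combined, so query efficiency is measured by the number of merged vertices:

$QE(\mathscr P)=|V_+| \tag{2}$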

D. Problem Statement

Theorem 4
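$Dup(\mathscr P)$ satisfies the following property:

$Dup(\mathscr P)\le \frac {(|V|-|V_+|)^2}{|E|}(k-1)$

Theorem 4 shows that the more vertices are merged, the smaller the upper bound on the duplication ratio. Maximizing $|V_+|$ therefore serves both query efficiency and low duplication, which leads to the (k, 1)-balanced path partitioning problem, denoted the (k, 1)-BPP problem.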

Definition 5 ((k, 1)-BPP Problem)
Given an RDF graph G, find a k-way path partitioning plan P with the following objective function:

$\mathrm{Maximize}\ |V_+| \quad \mathrm{s.t.}\ \vert L(P_{i})\vert \leq\lceil\frac{n}{k}\rceil,\ 1\leq i\leq k$

where n is the total number of end-to-end paths.

Theorem 5
The (k,1)-BPP problem is NP-hard and APX-hard.

APPROXIMATE ALGORITHM

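The (k, 1)-BPP problem is hard to approximate, so it is relaxed to (k, 2)-BPP: find a k-way path partitioning plan that maximizes $|V_+|$ subject to each partition containing at most $\lceil \frac{2n}k \rceil$ paths.
Assuming every path has the same fixed length l, the paper gives an approximation algorithm for (k, 2)-BPP. It produces a k-way path partitioning plan in which no partition holds more than $\lceil \frac{2n}k \rceil$ paths and the number of merged vertices is at least $(1-e^{-\frac1{kl}})|V^*_+|$.
The algorithm has two phases. The first phase creates as many merged vertices as possible while keeping every path group at no more than $\lceil \frac{n}k \rceil$ paths. The second phase distributes these path groups evenly into k partitions so that no partition contains more than $\lceil \frac{2n}k \rceil$ paths.
Algorithm 1
Notation first: $l_{ep}$ is the length of path $ep$, and $\mathscr E(v)$ is the set of all paths passing through vertex v. Merging a vertex yields a profit of 1, and the weight of merging vertex v is $w(v)=\sum_{ep\in\mathscr E(v)}\frac 1 {l_{ep}}$. A larger weight means that merging the vertex produces a larger path group. The profit-to-weight ratio of merging v is therefore $\frac 1{w(v)}$.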
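The weight $w(v)$ and the resulting profit-to-weight ratio can be computed directly from a list of paths. This is a minimal sketch under two assumptions that the notes do not state: the length of a path is taken to be its number of edges, and a vertex that appears twice in a cyclic path is counted once for that path. The name `vertex_weights` is hypothetical.

```python
def vertex_weights(paths):
    """Compute w(v) = sum over paths through v of 1 / (path length in edges)."""
    weights = {}
    for path in paths:
        length = len(path) - 1  # assumption: length counts edges
        if length < 1:
            continue            # skip degenerate single-vertex paths
        for v in set(path):     # count each path once per vertex
            weights[v] = weights.get(v, 0.0) + 1.0 / length
    return weights


paths = [("a", "b", "d"), ("a", "c", "d")]
w = vertex_weights(paths)
ratio = {v: 1.0 / w[v] for v in w}  # profit (1) divided by weight
print(w, ratio)
```

Here "a" and "d" lie on both paths and get weight 1.0, while "b" and "c" get 0.5, so a greedy pass that prefers a high ratio would consider "b" and "c" first.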
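The algorithm begins by generating all paths in the RDF graph. The key is building the set of start vertices: first collect all source vertices, then for each cycle that qualifies under Definition 2 add its vertex with the smallest ID. With the start set in hand, a depth-first search yields all paths.
Initially no vertex is merged, so each path group contains a single path (line 3 of Algorithm 1). The vertices in V are then processed with a greedy heuristic: a vertex is merged whenever the largest resulting path group stays within the size limit.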
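The notes do not say how the second phase distributes the path groups, so the following is only a sketch of one plausible strategy, not the paper's Algorithm 1: place each group on the currently least-loaded partition. Because that partition's load cannot exceed the average n/k and a group has at most $\lceil \frac{n}k \rceil$ paths, every partition stays within $\lceil \frac{2n}k \rceil$ paths. The name `pack_groups` and the path labels are made up.

```python
import heapq
import math


def pack_groups(groups, k):
    """Assign path groups to k partitions, always choosing the least-loaded one."""
    partitions = [[] for _ in range(k)]
    heap = [(0, i) for i in range(k)]  # (current load, partition index)
    for group in sorted(groups, key=len, reverse=True):
        load, i = heapq.heappop(heap)
        partitions[i].extend(group)
        heapq.heappush(heap, (load + len(group), i))
    return partitions


# n = 10 paths in groups of at most ceil(10 / 3) = 4 paths each.
groups = [["p0", "p1", "p2", "p3"], ["p4", "p5", "p6"], ["p7", "p8"], ["p9"]]
n, k = 10, 3
partitions = pack_groups(groups, k)
loads = [len(p) for p in partitions]
print(loads, math.ceil(2 * n / k))
```

Note that a partition can remain empty if there are fewer than k groups, which Definition 3 does not allow; a real implementation would have to handle that case.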

Theorem 6:
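Algorithm 1 achieves an approximation factor of $(1-e^{-\frac 1{kl}})$, that is, $|V^{'}_+|\ge(1-e^{-\frac 1{kl}})|V^{*}_+|$, where $V^{'}_+$ is the set of merged vertices produced by Algorithm 1 and $V^{*}_+$ is the set of merged vertices in an optimal solution of (k, 1)-BPP.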

BOTTOM-UP PATH MERGING ALGORITHM

A. Merging Start vertices

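S-S joins and S-O joins are the most common, which covers star, chain, cycle and tree queries.
The bottom-up path merging algorithm has two main steps: (1) merge all start vertices, so that every start vertex is in $V_+$; (2) rank the remaining vertices by a new weighting measure and merge them in that order.
The first step has two benefits:
1. All star, chain, cycle and tree queries become inner-partition queries.
2. Space complexity is reduced.

If all start vertices are merged, then all star, chain, cycle and tree queries can be evaluated as inner-partition queries (this is benefit 1).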

B. Vertex Weighting

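After all start vertices are merged, the remaining vertices are ranked by their profit-to-weight ratio. The profit is still 1, but the weight differs from the previous algorithm. Merging a vertex shared by many paths tends to produce a large path group, so $w^{'}(v)=N_p(v)$, where $N_p(v)$ is the number of paths containing v. This value can be estimated without generating all paths, as follows.

Theorem 8:
Given an RDF graph G=(V,E), $N_p(v)=I_p(v)\times O_p(v)$,
where $I_p(v)$ is the number of paths from start vertices to v (ignoring cycles) and $O_p(v)$ is the number of paths from v to end vertices.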

From this theorem we obtain:

$I_p(v)=\sum_{u\in I(v)}I_p(u) \quad \mathrm{and} \quad O_p(v)=\sum_{u\in O(v)}O_p(u) \tag{4}$
where $I(v)$ is the set of in-neighbors of v and $O(v)$ is the set of out-neighbors of v.
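On an acyclic graph, Theorem 8 and Equation (4) can be evaluated exactly with memoized recursion. The sketch below makes two assumptions the notes do not spell out: the graph has no cycles, and the recursion bottoms out at $I_p=1$ for source vertices and $O_p=1$ for sink vertices. The name `path_counts` is hypothetical.

```python
from collections import defaultdict


def path_counts(edges):
    """Return N_p(v) = I_p(v) * O_p(v) for every vertex of a DAG given as (u, w) edges."""
    ins, outs, vertices = defaultdict(list), defaultdict(list), set()
    for u, w in edges:
        outs[u].append(w)
        ins[w].append(u)
        vertices.update((u, w))

    def count(v, nbrs, memo):
        # Equation (4): sum over neighbours; assumed base case 1 when there are none.
        if v not in memo:
            memo[v] = sum(count(u, nbrs, memo) for u in nbrs[v]) if nbrs[v] else 1
        return memo[v]

    i_memo, o_memo = {}, {}
    return {v: count(v, ins, i_memo) * count(v, outs, o_memo) for v in vertices}


# Diamond a->b->d, a->c->d, followed by d->e: two end-to-end paths in total.
diamond = [("a", "b"), ("a", "c"), ("b", "d"), ("c", "d"), ("d", "e")]
n_p = path_counts(diamond)
print(n_p)
```

The two paths are a-b-d-e and a-c-d-e, so "a", "d" and "e" lie on two paths and "b" and "c" on one each, matching the product $I_p(v)\times O_p(v)$.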
Next we discuss how to compute $I_p(v)$; $O_p(v)$ is handled analogously.

$I_{p}(v)_{k}=(1-\alpha)+\alpha\cdot\frac{\sum_{u\in I(v)}I_{p}(u)_{k-1}}{\sqrt{\sum_{u\in I(v)}(I_{p}(u)_{k-1})^{2}}} \tag{5}$
where $I_p(v)_0=1$, $\alpha$ is the damping factor, and $\sqrt{\sum_{u\in I(v)}(I_{p}(u)_{k-1})^{2}}$ is used for normalization.
Let A denote the adjacency matrix of the RDF graph G and $I_p$ the vector of estimates for all vertices. Then

$I_p=(1-\alpha) \cdot1+\alpha \cdot \frac{A\cdot I_p}{||A\cdot I_p||_2} \tag{6}$

Theorem 9: The iteration in Equation (6) converges.
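The fixed-point iteration of Equation (6) can be sketched in a few lines of plain Python. This is an illustration under assumptions: the product $A\cdot I_p$ is read as "sum the scores of the in-neighbors", the normalization uses the global 2-norm as in Equation (6), and the value `alpha=0.5`, the iteration cap and the tolerance are arbitrary choices, not values from the paper. The name `estimate_in_paths` is hypothetical.

```python
import math


def estimate_in_paths(edges, alpha=0.5, max_iters=200, tol=1e-12):
    """Iterate I_p = (1 - alpha) * 1 + alpha * A.I_p / ||A.I_p||_2 until it stabilizes."""
    vertices = sorted({v for edge in edges for v in edge})
    in_nbrs = {v: [] for v in vertices}
    for u, w in edges:
        in_nbrs[w].append(u)

    score = {v: 1.0 for v in vertices}  # I_p(v)_0 = 1
    for _ in range(max_iters):
        raw = {v: sum(score[u] for u in in_nbrs[v]) for v in vertices}
        norm = math.sqrt(sum(x * x for x in raw.values()))
        if norm == 0.0:                 # graph without edges: nothing to propagate
            break
        new = {v: (1 - alpha) + alpha * raw[v] / norm for v in vertices}
        delta = max(abs(new[v] - score[v]) for v in vertices)
        score = new
        if delta < tol:
            break
    return score


diamond = [("a", "b"), ("a", "c"), ("b", "d"), ("c", "d"), ("d", "e")]
est = estimate_in_paths(diamond, alpha=0.5)
print(est)
```

The estimates are relative scores rather than exact path counts: the source "a" stays at the floor value 1 - alpha, and "d", which collects two incoming paths, scores higher than "b" or "c".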

C. Class-based Vertex Weighting

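This section targets RDF graphs that carry rdf:type labels and computes a weight for every vertex.
The class information provided by rdf:type enables more effective query decomposition.
Combining two subqueries requires that one of their shared vertices v be a merged vertex. If v is a constant, it corresponds to exactly one vertex in the unpartitioned RDF graph, so it is easy to check whether it is merged. If v is a variable, we cannot know without executing the query how many vertices match it or whether they are merged. But if we know that all vertices of v's class in the RDF graph are merged, the two subqueries can be combined.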

Definition 6 (Class-based Vertex Weighting): Given an RDF graph G=(V,E), let $C=\{T(v)|v\in V\}$ be the set of classes of the vertices, where T(v) represents the class of v. We use the average weight score of all the vertices in class $C_i(C_i\in C)$ as the weight of $C_i$, denoted as $w_{class}(C_i)$.

All vertices v of the same class are then assigned the same weight:

$w_{class}(v)=w_{class}(T(v)), T(v)\in C \tag{7}$
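Definition 6 and Equation (7) amount to averaging per class and writing the average back to every member. This minimal sketch assumes each vertex has exactly one class and that per-vertex weights are already known; the name `class_based_weights` and the sample data are hypothetical.

```python
def class_based_weights(weight, vertex_class):
    """Replace each vertex weight by the average weight of its class (Equation 7)."""
    totals, counts = {}, {}
    for v, w in weight.items():
        c = vertex_class[v]
        totals[c] = totals.get(c, 0.0) + w
        counts[c] = counts.get(c, 0) + 1
    class_avg = {c: totals[c] / counts[c] for c in totals}  # w_class(C_i)
    return {v: class_avg[vertex_class[v]] for v in weight}


weight = {"s1": 2.0, "s2": 4.0, "c1": 1.0}
vertex_class = {"s1": "Student", "s2": "Student", "c1": "Course"}
w_class = class_based_weights(weight, vertex_class)
print(w_class)
```

Because all vertices of a class share one weight, a ranking by weight tends to merge a class as a whole, which is the property the query decomposition relies on for variables.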

D. The Complete Algorithm

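This section gives the details of the bottom-up path merging algorithm.
The algorithm has two phases. In the first phase, Algorithm 2 finds, for each vertex v, its set of start vertices S(v).
Algorithm 2
Algorithm 3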

QUERY DECOMPOSITION

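A given query is first decomposed into $SQ=\{SQ_1, ...,SQ_m\}$, where each $SQ_i$ contains one start vertex and the edges belonging to it. To reduce the number of subqueries further, subqueries are then combined. Two subqueries can be combined if one of their shared vertices is a constant that a lookup shows to be a merged vertex, or if a shared vertex is a variable and all vertices of its class are merged.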
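The combining rule can be sketched with a small union-find. This is an illustration rather than the paper's procedure: each subquery is reduced to its set of vertex terms, variables are strings starting with "?", and the lookup structures `merged_constants`, `var_class` and `merged_classes` are hypothetical inputs assumed to come from the partitioning step. Combining transitively across more than two subqueries is also an assumption of this sketch.

```python
def merge_subqueries(subqueries, merged_constants, var_class, merged_classes):
    """Group subquery indices that can be combined through a qualifying shared vertex."""
    parent = list(range(len(subqueries)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    def qualifies(term):
        if term.startswith("?"):           # variable: its whole class must be merged
            return var_class.get(term) in merged_classes
        return term in merged_constants    # constant: the vertex itself must be merged

    for i in range(len(subqueries)):
        for j in range(i + 1, len(subqueries)):
            if any(qualifies(t) for t in subqueries[i] & subqueries[j]):
                parent[find(j)] = find(i)

    groups = {}
    for i in range(len(subqueries)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


subqueries = [{"?x", "?y"}, {"?x", "?z"}, {"?z", "<Univ0>"}]
var_class = {"?x": "Student", "?z": "Course"}
result = merge_subqueries(subqueries, set(), var_class, {"Student"})
print(result)
```

Subqueries 0 and 1 share ?x, whose class Student is fully merged in this toy setup, so they are combined. Subqueries 1 and 2 share only ?z of the unmerged class Course, so subquery 2 stays separate.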

EXPERIMENT

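The cluster has 20 machines, each with a dual-core 2.4GHz CPU, 6GB of memory and a 500GB hard disk.
Methods compared in the experiments:
Path-AX: the approximation algorithm of this paper
Path-BM: the bottom-up method of this paper with vertex weights
Path-BMC: the bottom-up method of this paper with class-based weights
Undirected one-hop and undirected two-hop from [J. Huang, D. Abadi, and K. Ren. Scalable SPARQL querying of large RDF graphs. PVLDB, 4(11):1123–1134.]
Forward two-hop from [K. Lee and L. Liu. Scaling queries over big RDF graphs with semantic hash partitioning. PVLDB, 6(14):1894–1905, 2013.]
Nine datasets are used, as listed in Table 1:

The queries are listed in Table 2:

Table 3 shows the differences between the path partitioning algorithms (the experiments below consider only Path-BMC):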