Paper introduction: Machine Learning-based Selection of Graph Partitioning Strategy Using the Characteristics of Graph Data and Algorithm (Regular Papers)






Analyzing large graph data is an essential part of many modern applications, such as social networks. Because such analysis is computationally expensive, distributed processing is frequently employed. This requires graph data to be divided across nodes, and the choice of partitioning strategy has a great impact on the execution time of the task. Yet there is no one-size-fits-all partitioning strategy that performs well on arbitrary graph data and algorithms: the performance of a strategy depends on the characteristics of both. Moreover, due to the complexity of graph data and algorithms, manually identifying the best partitioning strategy is infeasible. In this work, we propose a machine learning-based approach to select the most appropriate partitioning strategy for a given graph and processing algorithm. Our approach enumerates viable partitioning strategies, predicts the execution time of the target algorithm for each, and selects the partitioning strategy with the fastest estimated execution time. Our machine learning model is trained on features extracted from graph data and algorithm pseudo-code. We also propose a method that augments real execution logs of graph tasks to create a large synthetic dataset. Evaluation results show that the strategies selected by our approach lead to 1.46× faster execution time on average compared with the mean execution time of the partitioning strategies, and about 0.95× the performance of the best partitioning strategy.
AIDB Workshop Reference Format: YoungJoon Park, DongKyu Lee, Tien-Cuong Bui. Machine Learning-based Selection of Graph Partitioning Strategy Using the Characteristics of Graph Data and Algorithm. AIDB 2021.


Graph data are prevalent in various fields, such as social networks [41], protein structures [20], web structures [24], textual structures [31], and e-commerce [46]. As the volume of graph data grows rapidly, distributed graph analysis can be an effective approach for large-scale graph data. For example, computing the local clustering coefficient of every vertex of the Clueweb12 data [4], which has about 6.3 billion vertices and about 66.8 billion edges, takes more than 10,000 seconds using 25 machines [21]. Research on distributed graph processing follows several lines. First, partitioning strategies [38, 6, 15, 49, 43, 33] were proposed to partition graph data across a cluster. Second, distributed graph processing engines [6, 11, 30, 12, 50, 35] emerged to analyze distributed graph data. Finally, parallel algorithms [22, 16, 28, 29] emerged to exploit the distributed environment. We focus on selecting the best partitioning strategy.
A partitioning strategy determines how vertices and edges are divided across the workers of a cluster. Strategies differ mainly in communication cost, computation time, and replication factor, i.e., the ratio of the number of replicated vertices to the number of original vertices. Existing partitioning strategies can be categorized into model-agnostic, edge-cut, and vertex-cut partitioning [1], where each may consider locality and/or load balancing. In this paper, we define a task as a job that runs a specific algorithm on a specific graph, and the performance of a partitioning strategy as the execution time of a task under that strategy after partitioning has finished. The motivation for our research is that the performance of a partitioning strategy varies from task to task.
Figure 1 shows the execution times of several tasks under different partitioning strategies. The best partitioning strategy is drawn as a dotted bar and the worst as a diagonally striped bar. In Figure 1a, the best strategy for running the All-Pair Common Neighborhood (APCN) algorithm on the Web-Stanford graph data is ‘2D Edge Partition’, while the worst is ‘Hybrid’. For the different algorithms PageRank and TriangleCount on the same graph data, however, the best strategies are ‘Hybrid’ and ‘Ginger’ respectively, and the worst strategies also differ (Figure 1b, 1d). In addition, the same algorithm, APCN, on a different graph, Gemsec-HU, shows a different performance order (Figure 1c). The best partitioning strategy for one task can be the worst strategy for another, as seen in Figure 1a, 1b, 1c, and 1e.


How, then, can we find the partitioning strategy that performs best for a given task? We assume that understanding the graph data and the algorithm helps select the best partitioning strategy. Several studies [45, 36] compare the performance of partitioning strategies and propose decision trees for selecting the best one. However, they do not state clear conditions for choosing decision paths in their trees, and their heuristic decision trees do not cover the variety of graph data and algorithms. Instead of empirical, heuristic selection of partitioning strategies, we take a machine learning approach that can be applied generally to various graph data and algorithms. Other work [40, 51] considers graph data when selecting the best partitioning strategy, but it compares partitioning strategies on only one algorithm, PageRank, and does not consider algorithm characteristics. We instead extract features of both the graph data and the algorithm by carefully analyzing execution behavior, and use them to propose the most suitable strategy for dividing data across workers. Figure 2-①, ② show the extraction of the task features. The task feature is the concatenation of graph data statistics and algorithm execution pattern features. Figure 2-③ shows the prediction of the performance of each partitioning strategy from the task feature. We use a machine learning technique in this part, and our approach is similar to the concept of software 2.0, in which systems are supported by data-driven methods. Related research includes database configuration tuning [52, 44], relational table partitioning [14], and cardinality estimation [47]. In Figure 2-④, we select the strategy with the fastest expected execution time. Figure 2-⑤ depicts the training process of the Execution Time Regression Model (ETRM). We use the augmented synthetic training dataset as the execution logs and train the model using the loss between these logs and the model’s outputs.
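The selection step (enumerate strategies, predict an execution time for each, pick the fastest) can be sketched as follows. This is a minimal illustration, not the authors' implementation: `select_strategy`, `toy_predictor`, and the predicted times are hypothetical, and the predictor stands in for the trained ETRM.

```python
from typing import Callable, Dict, List, Sequence, Tuple


def select_strategy(
    task_feature: Sequence[float],
    strategies: List[str],
    predict_time: Callable[[Sequence[float], str], float],
) -> Tuple[str, Dict[str, float]]:
    """Predict an execution time per strategy and return the fastest one."""
    estimates = {s: predict_time(task_feature, s) for s in strategies}
    best = min(estimates, key=estimates.get)
    return best, estimates


def toy_predictor(task_feature: Sequence[float], strategy: str) -> float:
    """Hypothetical stand-in for the regression model (made-up numbers)."""
    made_up_times = {"2D Edge Partition": 12.0, "Hybrid": 30.5, "Ginger": 18.2}
    return made_up_times[strategy]


strategies = ["2D Edge Partition", "Hybrid", "Ginger"]
best, estimates = select_strategy([0.0], strategies, toy_predictor)
print(best)
```

Because the model is a regressor rather than a classifier, the same call also yields the full ranking of strategies, not only the winner.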
We faced several challenges in designing and implementing the proposed model. First, we have to predict the execution time of a task from its extracted features, without actually running it. Next, we need to carefully analyze both algorithms and graph data to extract features that are useful for the strategy selection model. In addition, we need a large dataset to train the machine learning model. Collecting enough real execution logs for the training dataset consumes a great deal of computing power, so we construct a synthetic training dataset by augmenting real execution logs. Finally, to exclude the engine-specific characteristics of existing distributed graph engines, we have to implement an experimental distributed graph engine on which all the graph algorithms run and which covers the various partitioning strategies.
We performed several experiments to answer the following questions: i) how well our model selects the best partitioning strategy for test cases, ii) how much better the selected strategy performs compared with other strategies, and iii) how much performance benefit our approach provides. The main contributions of our research are the following:
∙ We propose a method to choose the best partitioning strategy using features extracted from graph data and algorithms.
∙ We construct an experimental distributed graph engine, so that the graph data, the algorithm, and the partitioning strategy are the only experimental factors.
∙ We propose a method to generate synthetic training data to train our model.
The rest of this paper is organized as follows. In Section 2, we summarize notations. In Section 3, we describe the distributed graph computation engine that we implemented for experimental purposes. In Section 4, we describe the set of features we extracted and how we extracted them. In Section 5, we evaluate our method. In Section 6, we review related work. We conclude and propose future work in Section 7.






We summarize the notations used in this paper in Table 1. These notations include the vertex set 𝑉, the edge set 𝐸, and the neighbor vertices 𝑁(𝑢) of the graph data 𝐺. In addition, Table 1 includes the worker set 𝑊, the partitioned vertex and edge sets used in distributed processing, and the notations used in the execution time regression model.


This section describes our graph computation engine, which serves as a test bed for comparing how tasks execute under different partitioning strategies. This paper focuses on how partitioning strategies relate to the data and the algorithm. Therefore, our graph engine contains the essential functions and operators that serve this purpose.

3.1 Graph Representation

As we focus only on the task’s performance, we simplified the implementation. Our graph engine uses an edge list to represent the graph data. The edge list consists of vertex tuples (𝑢, 𝑣), and an inverted edge list is also maintained. Finding a vertex takes 𝑂(log |𝑉|) time. Searching for an edge connected to an arbitrary vertex 𝑣 takes 𝑂(degree(𝑣)) time, because the engine manages a key-value hash map with the vertex ID as the key and the starting point of the edges connected to this vertex in the edge list as the value. The edge list is sorted by source vertex ID, so insertion and deletion are not considered. In addition, vertex and edge properties are each stored in a key-value map.
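A minimal sketch of such a representation is shown below, assuming an in-memory Python list and dict; the class and method names (`EdgeListGraph`, `out_neighbors`) are illustrative and not taken from the engine. The inverted edge list is obtained here by building a second instance over the reversed tuples.

```python
from typing import Dict, List, Tuple


class EdgeListGraph:
    """Sorted edge list plus a hash map from vertex ID to its first edge."""

    def __init__(self, edges: List[Tuple[int, int]]) -> None:
        # Sort by source vertex ID so each vertex's edges are contiguous.
        self.edges: List[Tuple[int, int]] = sorted(edges)
        self.start: Dict[int, int] = {}
        for pos, (src, _) in enumerate(self.edges):
            # Keep only the first position seen for each source vertex.
            self.start.setdefault(src, pos)

    def out_neighbors(self, v: int) -> List[int]:
        """Scan the contiguous run of edges starting at v: O(degree(v))."""
        pos = self.start.get(v)
        if pos is None:
            return []
        result = []
        while pos < len(self.edges) and self.edges[pos][0] == v:
            result.append(self.edges[pos][1])
            pos += 1
        return result


edges = [(3, 5), (1, 2), (1, 3), (2, 3)]
graph = EdgeListGraph(edges)
# Inverted edge list: the same structure over reversed tuples.
inverted = EdgeListGraph([(v, u) for (u, v) in edges])

print(graph.out_neighbors(1))     # out-neighbors of vertex 1
print(inverted.out_neighbors(3))  # in-neighbors of vertex 3
```

Because the list is sorted once at construction time, adding or removing an edge would require re-sorting and rebuilding the index, which is consistent with the engine leaving insertion and deletion out of scope.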

3.2 Distributed Computation Model

3.2.1 GAS Model for Distributed Computing

Among several distributed graph computation models, we selected the GAS model [11]. Hadoop MapReduce [42] is general and can be used in various applications, but it is not suitable here because it uses HDFS [3], which can cause excessive I/O, and it may run unnecessary shuffle operations. TUX2 [48] proposed the MEGA model, which is optimized for graph machine learning algorithms. We did not adopt the MEGA model because we target more general graph processing tasks rather than specific graph ML tasks. The GAS model is a vertex-centric model [32], and ‘GAS’ stands for Gather, Apply, and Scatter. When edges are partitioned across workers, a vertex that appears in the edges of more than one partition is replicated. The GAS model sets one replica as the master vertex and the others, which share the same vertex ID, as mirror vertices. Each worker has a queue of the vertices that will be processed locally. A worker pops a vertex from the queue and propagates it to the workers that hold its mirror vertices. In the Gather phase, the engine collects the local results of all mirrors of the vertex and aggregates them. In the Apply phase, the master vertex’s aggregated result is transmitted to the mirror vertices. In the Scatter phase, the vertex’s aggregated result is used to update its adjacent edges, and neighbor vertices that need to be computed are enqueued. This activation is decided from the local neighbors, and the result is shared between workers. The GAS step of vertex 3 is illustrated in Figure 3. In this example, the partial result of 𝑣3 is computed in each worker and aggregated at worker 0. This aggregated value is then propagated to the mirror vertices. Lastly, the out-neighbor 𝑣5 is activated and enqueued.
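The three phases can be illustrated with a single-process simulation. This is only a sketch under assumptions: the vertex program (summing the values of in-neighbors), the function name `gas_step`, the edge placement, and the vertex values are hypothetical, and a shared dictionary stands in for the master-to-mirror transmission of the Apply phase.

```python
from collections import deque
from typing import Dict, List, Tuple

Edge = Tuple[int, int]


def gas_step(v: int, worker_edges: Dict[int, List[Edge]], values: Dict[int, float]):
    """Run one Gather-Apply-Scatter step for vertex v."""
    # Gather: every worker holding a replica of v computes a partial result
    # from its local in-edges of v.
    partials: Dict[int, float] = {}
    for worker, edges in worker_edges.items():
        if any(v in edge for edge in edges):
            partials[worker] = sum(values[src] for src, dst in edges if dst == v)
    aggregated = sum(partials.values())

    # Apply: the aggregated result becomes the vertex value. The shared dict
    # plays the role of sending the master's value to the mirrors.
    changed = aggregated != values[v]
    values[v] = aggregated

    # Scatter: each worker activates the local out-neighbors of v.
    activated: List[int] = []
    if changed:
        for edges in worker_edges.values():
            activated.extend(dst for src, dst in edges if src == v)
    return partials, activated


# Hypothetical placement: vertex 3 is replicated on workers 0, 1, and 2.
worker_edges = {0: [(1, 3), (3, 5)], 1: [(2, 3)], 2: [(4, 3)]}
values = {1: 1.0, 2: 2.0, 3: 0.0, 4: 4.0, 5: 0.0}

queue = deque([3])
v = queue.popleft()
partials, activated = gas_step(v, worker_edges, values)
queue.extend(activated)
print(partials, values[3], list(queue))
```

The number of entries in `partials` equals the number of replicas of the vertex, which is why the replication factor of a partitioning strategy directly affects the communication volume of the Gather and Apply phases.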

3.2.2 Scalability

We tested the scalability of our engine to show that the implementation is scalable enough to conduct our experiments. The result is shown in Figure 4. The experiment used 4, 8, 16, 32, and 64 workers on four identical machines. Each machine has 32 cores (Xeon X7560, 2.27 GHz) and 500 GB of RAM. The machines communicate over 10 Gbps NICs. The PageRank and TriangleCount algorithms were run on the Web-Stanford data. Execution time decreased for both algorithms up to 64 workers.


3.3 Partitioning Method
The partitioning methods used in our test bed were selected based on the following criteria: i) they are commonly used in many systems, and ii) they suit the processing model. We selected GAS as the distributed graph processing model, and accordingly, we employed partitioning methods supported by representative GAS systems such as GraphX, PowerGraph, and PowerLyra. The following describes the partitioning methods supported by our engine. Table 2 shows a brief summary of each partitioning strategy.
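To make the notion of a vertex-cut and its replication factor concrete, here is a minimal sketch. It assumes a simple hash-based edge placement chosen only for illustration (it is not claimed to be one of the strategies in Table 2), and it computes the replication factor as the total number of vertex replicas divided by the number of original vertices. The function names and the toy graph are hypothetical.

```python
from typing import Dict, List, Set, Tuple

Edge = Tuple[int, int]


def hash_edge_partition(edges: List[Edge], num_workers: int) -> Dict[int, List[Edge]]:
    """Assign each edge to a worker with a deterministic toy hash (vertex-cut)."""
    parts: Dict[int, List[Edge]] = {w: [] for w in range(num_workers)}
    for u, v in edges:
        parts[(u * 31 + v) % num_workers].append((u, v))
    return parts


def replication_factor(parts: Dict[int, List[Edge]]) -> float:
    """Total vertex replicas across workers divided by distinct vertices."""
    all_vertices: Set[int] = set()
    replicas = 0
    for edges in parts.values():
        local = {x for edge in edges for x in edge}
        replicas += len(local)
        all_vertices |= local
    return replicas / len(all_vertices)


edges = [(1, 2), (2, 3), (3, 1), (3, 4)]
parts = hash_edge_partition(edges, num_workers=2)
rf = replication_factor(parts)
print(parts, rf)
```

A factor of 1.0 would mean no vertex is replicated; larger values mean more mirrors to synchronize, which is one of the ways strategies trade communication cost against load balance.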




To test our hypothesis that we can choose a better partitioning strategy by analyzing the data and the algorithm, we extract features from both. The structural properties of graph data are summarized by statistics, and symbolic code analysis of the algorithm’s pseudo-code is conducted to obtain the algorithm features. Finally, a machine learning model is constructed to predict how long a given task takes. The overall process can be seen in Figure 2. Section 4.1 explains how the features are extracted from the graph data and the algorithm. Section 4.2 explains the Execution Time Regression Model, which predicts the performance of a partitioning strategy using machine learning.

4.1 Feature Extractor

4.1.1 Data Feature

Various inherent features of the graph data were selected for the following reasons and are summarized in Table 3. The numbers of vertices and edges help in analyzing iteration over the entire graph and in predicting the iteration’s execution time. Graph data always represent the relationships between vertices as edges, and graph data analysis commonly involves access to edges. Therefore, it is necessary to understand the graph topology, because the degrees of vertices and their distribution vary with the topology. We extracted the mean, standard deviation, skewness, and kurtosis of the in-degree and out-degree of the vertices. Since skewness and kurtosis can be negative, each is split into a sign and an absolute value, which are used as input features. Furthermore, it is essential to consider whether the graph is directed, because some operators behave differently (e.g., getting the in-neighbors of a vertex, or the inverted edge list).
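The degree statistics described above could be computed along these lines. This is a sketch under assumptions: population moments are used, kurtosis is taken as excess kurtosis (which can be negative), vertices are discovered from the edge list only, and the feature names are hypothetical rather than those of Table 3.

```python
import math
from typing import Dict, List, Tuple


def moments(xs: List[float]) -> Tuple[float, float, float, float]:
    """Mean, standard deviation, skewness, and excess kurtosis (population)."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m3 = sum((x - mean) ** 3 for x in xs) / n
    m4 = sum((x - mean) ** 4 for x in xs) / n
    skew = m3 / m2 ** 1.5 if m2 > 0 else 0.0
    kurt = m4 / m2 ** 2 - 3.0 if m2 > 0 else 0.0
    return mean, math.sqrt(m2), skew, kurt


def data_features(edges: List[Tuple[int, int]], directed: bool = True) -> Dict[str, float]:
    vertices = {v for edge in edges for v in edge}
    in_deg = {v: 0 for v in vertices}
    out_deg = {v: 0 for v in vertices}
    for u, v in edges:
        out_deg[u] += 1
        in_deg[v] += 1
    features = {
        "num_vertices": float(len(vertices)),
        "num_edges": float(len(edges)),
        "directed": 1.0 if directed else 0.0,
    }
    for name, deg in (("in", in_deg), ("out", out_deg)):
        mean, std, skew, kurt = moments([float(d) for d in deg.values()])
        features[f"{name}_mean"] = mean
        features[f"{name}_std"] = std
        # Skewness and kurtosis may be negative: store sign and magnitude.
        for stat, value in (("skewness", skew), ("kurtosis", kurt)):
            features[f"{name}_{stat}_sign"] = 1.0 if value >= 0 else -1.0
            features[f"{name}_{stat}_abs"] = abs(value)
    return features


features = data_features([(1, 2), (1, 3), (1, 4), (2, 3)])
print(features)
```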

4.1.2 Algorithm Feature

The frequency of graph operations is evaluated to capture the pattern and scale of data access when executing the graph algorithm. We wrote a code analyzer with JavaCC, a compiler-compiler tool like YACC. It analyzes pseudo-code that consists of the graph operators supported by our engine. The operators are listed in Table 4. The pseudo-code contains symbols representing graph elements, such as ALL VERTEX LIST and GET IN VERTEX TO, as seen in Listing 1. While parsing the code, the analyzer counts each graph and arithmetic operation. As a result, operation-count key-value pairs are generated, as seen in line 1 of Listing 2. When a count cannot be evaluated to a real value immediately, it is represented by a symbolic expression, and the data features are then used to fill in the symbols with real values. For example, the graph operation GET IN VERTEX TO(v) appears in the condition of the for loop in line 10 of Listing 1. The number of times this operation runs is the product of the outer loop variables, |ALL VERTEX LIST|*iterator num. The number of iterations is given in line 1 of Listing 1, so it is immediately evaluated as 20.0. The size of the all-vertex list is trivially |𝑉|, the cardinality of the vertex set of the graph, so it can be taken from the data feature 𝐷𝐹. The graph data in this example is Ego-Facebook [25], for which |𝑉| is 4039, so the final count of GET IN VERTEX TO becomes 4039 * 20.0 = 80780.0. The accesses to variables and arithmetic operators are also counted to precisely
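The symbol-substitution step in this example can be sketched as below. The representation of a symbolic count as a list of factors and the helper `evaluate_count` are assumptions made for illustration; the actual analyzer is written with JavaCC and its internal representation is not shown in the text.

```python
from typing import Dict, List, Union

Factor = Union[str, float]


def evaluate_count(factors: List[Factor], bindings: Dict[str, float]) -> float:
    """Multiply numeric factors, replacing symbols with values from bindings."""
    result = 1.0
    for factor in factors:
        result *= bindings[factor] if isinstance(factor, str) else float(factor)
    return result


# Symbolic count of GET IN VERTEX TO: |ALL VERTEX LIST| * iterator num.
symbolic_counts = {"GET IN VERTEX TO": ["|ALL VERTEX LIST|", "iterator num"]}

# |ALL VERTEX LIST| comes from the data feature (|V| = 4039 for Ego-Facebook);
# iterator num is read from the pseudo-code (20.0).
bindings = {"|ALL VERTEX LIST|": 4039, "iterator num": 20.0}

counts = {op: evaluate_count(fs, bindings) for op, fs in symbolic_counts.items()}
print(counts["GET IN VERTEX TO"])  # 80780.0
```

Keeping the counts symbolic until the data features are known lets the same analysis of the pseudo-code be reused for every graph.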

4.2 Execution Time Regression Model

We implemented a prediction model to select the best partitioning strategy for a given task. We tried several machine learning models, such as linear regression, XGBoost [7], LightGBM [20], a multi-layer perceptron, and a mixture of experts [17]. The best model was the XGBoost regression model. The training process and model structure of the Execution Time Regression Model are as follows.

4.2.1 Data Preparation

We executed tasks covering the graph data, algorithms, and partitioning strategies on our engine and recorded execution logs. The graph data and the algorithm in each execution log are mapped to the data feature and the algorithm feature, respectively, by the feature extractor. In total, we prepared execution logs using 12 graph datasets, 8 algorithms, and 11 partitioning strategies (all strategies except Oblivious). To train our machine learning model, we needed a very large training set. Among those execution logs, the 528 logs produced by 8 graph datasets, 6 algorithms, and 11 partitioning strategies were used to create the augmented training dataset.
We generated synthetic data by aggregating multiple real data records. Aggregation sums the algorithm features and the execution times of logs grouped by the same graph data and the same partitioning strategy. The task of a synthetic tuple is interpreted as one large algorithm made of several algorithms performed sequentially. Therefore, the algorithm features, which predict the number of calls to low-level functions, and the execution time can both be aggregated by summation. For example, if a synthetic tuple $s$ is created by aggregating real tuples $r_1, r_2, \ldots, r_n$, then its algorithm feature is $AF(s) = \sum_{i=1}^{n} AF(r_i)$, its data feature is $DF(s) = DF(r_1) = \ldots = DF(r_n)$, and its execution time is $ET(s) = \sum_{i=1}^{n} ET(r_i)$. We used combinations with replacement to make the synthetic algorithms. The formula is as follows.
$$CR(n, r) = \frac{(n + r - 1)!}{r!\,(n - 1)!} \qquad (3)$$
We created synthetic algorithms from the 6 original algorithms, varying $r$ from 2 to 9. The number of synthetic algorithms is $\sum_{r=2}^{9} CR(6, r) = 4998$. Multiplying the 4998 synthetic algorithms by 8 graph datasets and 11 partitioning strategies gives a synthetic training dataset of about 0.43 million tuples. Since each sample of the synthetic dataset has a different combination of algorithms, the entire synthetic dataset can be interpreted as a record of running many distinct algorithms. The augmented training dataset does not include the original 528 real records. In the test phase, we used these 528 records together with records from the other 4 graph datasets and 2 algorithms.
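The augmentation and the count of synthetic algorithms can be checked with a short sketch. The data layout (`real_logs` keyed by graph and strategy), the function names, and the toy feature vectors are hypothetical; only the summation rule and the combinations-with-replacement count come from the text.

```python
from itertools import combinations_with_replacement
from math import comb
from typing import Dict, List, Tuple

# (graph, strategy) -> algorithm name -> (algorithm feature vector, execution time)
RealLogs = Dict[Tuple[str, str], Dict[str, Tuple[List[float], float]]]


def cr(n: int, r: int) -> int:
    """Combinations with replacement: (n + r - 1)! / (r! (n - 1)!)."""
    return comb(n + r - 1, r)


def augment(real_logs: RealLogs, r_values: List[int]):
    """Sum algorithm features and execution times over each multiset of algorithms."""
    synthetic = []
    for (graph, strategy), per_algorithm in real_logs.items():
        for r in r_values:
            for combo in combinations_with_replacement(sorted(per_algorithm), r):
                af = [sum(col) for col in zip(*(per_algorithm[a][0] for a in combo))]
                et = sum(per_algorithm[a][1] for a in combo)
                synthetic.append((graph, strategy, combo, af, et))
    return synthetic


total_algorithms = sum(cr(6, r) for r in range(2, 10))
print(total_algorithms, total_algorithms * 8 * 11)  # 4998 439824

# Toy example with two algorithms on one (graph, strategy) pair.
toy_logs: RealLogs = {("g", "s"): {"A": ([1.0, 2.0], 3.0), "B": ([10.0, 20.0], 5.0)}}
synthetic = augment(toy_logs, r_values=[2])
print(synthetic[1])
```

Grouping by graph and strategy before combining keeps the data feature identical across the summed records, matching the condition $DF(s) = DF(r_1) = \ldots = DF(r_n)$.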

4.2.2 Model Architecture

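Our model captures the features of the data and the algorithm to predict the execution time of a task under a given partitioning strategy. The XGBoost regression model includes a regularization term and uses an ensemble of Classification And Regression Trees (CART). The ensemble decides whether to split a branch so as to maximize the Gain and minimize the objective function. The notation and formulas, following the standard XGBoost formulation [7], are as follows.
$\hat{y}_i^{(t)}$ = prediction of the i-th instance at the t-th iteration (4)
$g_i = \partial_{\hat{y}_i^{(t-1)}}\, loss(y_i, \hat{y}_i^{(t-1)})$ and $h_i = \partial^2_{\hat{y}_i^{(t-1)}}\, loss(y_i, \hat{y}_i^{(t-1)})$ = first- and second-order gradients of the loss
$I_L$ = the instance set of the left node after the split (8)
$I_R$ = the instance set of the right node after the split (9)
$I = I_L \cup I_R$ (10)
$\lambda$ = L2 regularization term on weights (11)
$\gamma$ = minimum loss reduction required to make a further partition on a leaf node of the tree (12)
$Gain = \frac{1}{2}\left[\frac{\left(\sum_{i \in I_L} g_i\right)^2}{\sum_{i \in I_L} h_i + \lambda} + \frac{\left(\sum_{i \in I_R} g_i\right)^2}{\sum_{i \in I_R} h_i + \lambda} - \frac{\left(\sum_{i \in I} g_i\right)^2}{\sum_{i \in I} h_i + \lambda}\right] - \gamma$ (13)
$\Omega$ = regularization term (14)
$f_k$ = k-th decision tree in function space $F$ (15)
$objective = \sum_{i=1}^{|Samples|} loss(y_i, \hat{y}_i^{(t-1)}) + \sum_{k=1}^{K} \Omega(f_k)$ (16)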
We used the XGBRegressor model, and its detailed parameters are as follows.
∙ colsample_bytree = 0.4603
∙ gamma = 0.0468
∙ learning_rate = 0.05
∙ max_depth = 15
∙ min_child_weight = 1.7817
∙ n_estimators = 1000
∙ reg_alpha = 0.4640
∙ reg_lambda = 0.8571
∙ subsample = 0.5213
∙ objective = squared error
The input of the model is expressed as 𝑋. 𝑋 has both graph data features 𝑋𝐺 and algorithm features 𝑋𝐴 pre-processed with scaling and one-hot encoding. Figure 5 describes the model’s input data.
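A sketch of how such an input row might be assembled is given below. Several points are assumptions, since the text does not spell them out: min-max scaling is used as the scaling method, the one-hot encoding is applied to the partitioning strategy, and the helper names are hypothetical. The parameter dictionary restates the values listed above using XGBoost's keyword names, with `reg:squarederror` as XGBoost's name for the squared-error objective.

```python
from typing import Dict, List

# Hyperparameters from the list above, keyed by XGBoost's parameter names.
# With the xgboost package installed they could be passed as
# xgboost.XGBRegressor(**XGB_PARAMS); that call is not executed here.
XGB_PARAMS: Dict[str, object] = {
    "colsample_bytree": 0.4603,
    "gamma": 0.0468,
    "learning_rate": 0.05,
    "max_depth": 15,
    "min_child_weight": 1.7817,
    "n_estimators": 1000,
    "reg_alpha": 0.4640,
    "reg_lambda": 0.8571,
    "subsample": 0.5213,
    "objective": "reg:squarederror",
}


def min_max_scale(rows: List[List[float]]) -> List[List[float]]:
    """Scale every column to [0, 1]; constant columns become 0.0."""
    columns = list(zip(*rows))
    lows = [min(c) for c in columns]
    spans = [max(c) - min(c) for c in columns]
    return [
        [(x - lo) / span if span > 0 else 0.0 for x, lo, span in zip(row, lows, spans)]
        for row in rows
    ]


def one_hot(value: str, categories: List[str]) -> List[float]:
    return [1.0 if value == c else 0.0 for c in categories]


def build_input(scaled_features: List[float], strategy: str, strategies: List[str]) -> List[float]:
    """Concatenate scaled graph/algorithm features with the strategy one-hot."""
    return list(scaled_features) + one_hot(strategy, strategies)


strategies = ["2D Edge Partition", "Hybrid", "Ginger"]
# Toy rows: [number of vertices, count of one graph operation].
scaled = min_max_scale([[4039.0, 80780.0], [8078.0, 0.0]])
x = build_input(scaled[0], "Hybrid", strategies)
print(x)
```

Encoding the strategy as part of the input lets a single regressor score every candidate strategy for the same task, which is what the selection step needs.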

