Kylin - By-layer Spark Cubing

Before v2.0, Apache Kylin used Hadoop MapReduce as the framework to build Cubes over huge datasets. The MapReduce framework is simple and stable, and it fulfills Kylin's needs well, except for performance. To get better performance, we introduced the "fast cubing" algorithm in Kylin v1.5, which tries to do as many aggregations as possible on the map side, in memory, so as to avoid disk and network I/O; but not all data models can benefit from it, and it still runs on MapReduce, which means on-disk sorting and shuffling.

Now comes Spark: Apache Spark is an open-source cluster-computing framework that provides programmers with an application programming interface centered on a data structure called the RDD. It runs in memory on the cluster, which makes repeated access to the same data much faster. Spark provides flexible and expressive APIs, and you are not tied to Hadoop MapReduce's two-stage paradigm.

Before introducing how to calculate a Cube with Spark, let's see how Kylin does it with MapReduce. Figure 1 illustrates how a 4-dimension Cube is calculated with the classic "by-layer" algorithm: the first MapReduce round aggregates the base (4-D) cuboid from the source data; the second round aggregates on the base cuboid to get the 3-D cuboids; and so on. With N+1 rounds of MapReduce, all layers' cuboids get calculated.

Figure 1: MapReduce Cubing by Layer
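
To make the layering concrete, here is a minimal, self-contained Scala sketch (not Kylin's actual code) that enumerates the cuboid layers of a 4-dimension Cube. Each cuboid is represented as a bitmask over the dimensions; every cuboid in layer k-1 is obtained by dropping one dimension from a cuboid in layer k:

    object CuboidLayers {
      // Children of a cuboid: drop one present dimension at a time.
      def children(cuboid: Int, nDims: Int): Seq[Int] =
        (0 until nDims).filter(d => (cuboid & (1 << d)) != 0)
                       .map(d => cuboid & ~(1 << d))

      def main(args: Array[String]): Unit = {
        val nDims = 4
        var layer = Set((1 << nDims) - 1)   // the base (4-D) cuboid: 1111
        for (level <- nDims to 0 by -1) {
          println(s"$level-D cuboids: " + layer.map(_.toBinaryString).mkString(", "))
          layer = layer.flatMap(c => children(c, nDims))  // next layer down
        }
      }
    }

Running it prints 1, 4, 6, 4, and 1 cuboids per layer, respectively: the N+1 layers that the N+1 MapReduce rounds compute.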

The "by-layer" Cubing divides a big task into a few steps, and each step is based on the previous step's output, so it can reuse the previous calculation and also avoid starting over from the very beginning when there is a failure in between. This makes it a reliable algorithm. When moving to Spark, we decided to keep this algorithm; that's why we call this feature "By-layer Spark Cubing".

As we know, the RDD (Resilient Distributed Dataset) is a basic concept in Spark. A collection of N-dimension cuboids can be well described as an RDD, so an N-dimension Cube will have N+1 RDDs. These RDDs have a parent/child relationship, as the parent can be used to generate the children. With the parent RDD cached in memory, generating the child RDD can be much more efficient than reading from disk. Figure 2 describes this process.

Figure 2: Spark Cubing by Layer
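
The following is a minimal sketch of that parent/child relationship, again not Kylin's real code: a toy 2-D "base cuboid" RDD is cached, its two 1-D children are derived from it in memory, and the parent is released once the children exist:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.storage.StorageLevel

    object LayerCacheSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setAppName("layer-cache").setMaster("local[2]"))

        // A toy 2-D base cuboid: ((dimA, dimB), measure), kept in memory.
        val base = sc.parallelize(Seq(
            (("a1", "b1"), 1L), (("a1", "b2"), 2L), (("a2", "b1"), 3L)))
          .reduceByKey(_ + _)
          .persist(StorageLevel.MEMORY_AND_DISK_SER)

        // Child layer: drop one dimension at a time (one-to-many), then
        // re-aggregate; the "A"/"B" tags just mark which dimension is kept.
        val children = base
          .flatMap { case ((a, b), m) => Seq((("A", a), m), (("B", b), m)) }
          .reduceByKey(_ + _)

        children.collect().foreach(println)
        base.unpersist()   // the children exist now; free the cached parent
        sc.stop()
      }
    }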

Figure 3 is the DAG of Cubing in Spark; it illustrates the process in detail. In "Stage 5", Kylin uses a HiveContext to read the intermediate Hive table and then does a "map" operation, which is a one-to-one map, to encode the original values into key-value bytes. On completion, Kylin gets an intermediate encoded RDD. In "Stage 6", the intermediate RDD is aggregated with a "reduceByKey" operation to get RDD-1, which is the base cuboid. Next, a "flatMap" (one-to-many map) is done on RDD-1, because the base cuboid has N children cuboids. And so on, all levels' RDDs get calculated. These RDDs are persisted to the distributed file system on completion, but also cached in memory for the next level's calculation; once a parent's children have been generated, it is removed from the cache.

Figure 3: DAG of Spark Cubing
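
Putting the stages together, here is a simplified, self-contained Scala sketch of the whole by-layer loop. It is an illustration under assumptions, not Kylin's source: the real job reads the flat table through a HiveContext and encodes values with Kylin's dictionaries, while this sketch uses an in-memory source and plain string keys; the parent-selection rule in the flatMap (drop only dimensions before the first already-aggregated one, so each child has exactly one parent) is a simplification of Kylin's cuboid spanning tree:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.rdd.RDD
    import org.apache.spark.storage.StorageLevel

    object ByLayerCubingSketch {
      // One slot per dimension; None means the dimension is aggregated away.
      type Key = Vector[Option[String]]

      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setAppName("by-layer-cubing").setMaster("local[2]"))
        val nDims = 3

        // "Stage 5": read source rows (stand-in for the intermediate Hive
        // table) and "map" them one-to-one into encoded (key, measure) pairs.
        val source = sc.parallelize(Seq(
          ("2017-02-23", "CN", "web", 10L),
          ("2017-02-23", "US", "app", 20L),
          ("2017-02-24", "CN", "web", 5L)))
        val encoded = source.map { case (d, c, ch, m) =>
          (Vector(Some(d), Some(c), Some(ch)): Key, m)
        }

        // "Stage 6": reduceByKey yields RDD-1, the base (N-D) cuboid.
        var parent: RDD[(Key, Long)] =
          encoded.reduceByKey(_ + _).persist(StorageLevel.MEMORY_AND_DISK_SER)

        for (level <- nDims to 1 by -1) {
          parent.saveAsObjectFile(s"/tmp/cube/level_$level") // persist layer
          // flatMap (one-to-many): null out one dimension per child. Only
          // dimensions before the first hole are dropped, so every child
          // cuboid is generated from exactly one parent (no double counting).
          val child = parent.flatMap { case (key, m) =>
            val firstHole = key.indexWhere(_.isEmpty)
            val limit = if (firstHole < 0) key.length else firstHole
            (0 until limit).map(i => (key.updated(i, None), m))
          }.reduceByKey(_ + _).persist(StorageLevel.MEMORY_AND_DISK_SER)
          parent.unpersist()   // children generated; evict the parent
          parent = child
        }
        parent.saveAsObjectFile("/tmp/cube/level_0") // the apex (0-D) cuboid
        sc.stop()
      }
    }

Using MEMORY_AND_DISK_SER keeps each parent layer available in memory (spilling to disk if needed) while its children are computed, mirroring the caching behavior described above.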

We did a test to see how much performance improvement we could gain from Spark:

Environment

  • 4-node Hadoop cluster; each node has 28 GB RAM and 12 cores;
  • YARN has 48 GB RAM and 30 cores in total;
  • CDH 5.8, Apache Kylin 2.0 beta.

Spark

  • Spark 1.6.3 on YARN
  • 6 executors, each with 4 cores and 4 GB + 1 GB (overhead) memory (see the configuration sketch below)
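
For illustration only, the executor settings above might be expressed as the following Spark 1.6-era properties. This is a hypothetical sketch: Kylin submits the job through its own engine configuration, and the exact property names it uses are not shown in this post.

    import org.apache.spark.SparkConf

    // Assumed translation of the test setup into standard Spark properties.
    val conf = new SparkConf()
      .setMaster("yarn-client")                           // Spark 1.6-style YARN master
      .set("spark.executor.instances", "6")               // 6 executors
      .set("spark.executor.cores", "4")                   // 4 cores each
      .set("spark.executor.memory", "4g")                 // 4 GB heap each
      .set("spark.yarn.executor.memoryOverhead", "1024")  // +1 GB overhead, in MB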

Test Data

  • Airline data, 160 million rows in total
  • Cube: 10 dimensions, 5 measures (SUM)

Test Scenarios

  • Build the cube at different source data sizes: 3 million, 50 million, and 160 million source rows; compare the build time between MapReduce (by layer) and Spark. No compression enabled.
    The time covers only the build-cube step, not data preparation or subsequent steps.

Figure 4: Spark vs. MR performance

Spark is faster than MR in all three scenarios, and overall it can cut the cubing time roughly in half.

Now you can download a 2.0.0 beta build from Kylin's download page and follow this post to build a cube with the Spark engine. If you have any comments or input, please discuss in the community.

Reference:

http://kylin.apache.org/blog/2017/02/23/by-layer-spark-cubing/
