Background
At present, hundreds of terabytes of data are processed in Momo's big data cluster every day. However, most of this data is read from and written to disk repeatedly, which is inefficient. To speed up data processing and provide a better user experience, we investigated possible solutions and found that Alluxio may fit our needs. Alluxio provides a unified, memory-speed distributed storage layer for various jobs. Since I/O in memory is much faster than on disk, hot data in Alluxio can be served at memory speed, much like a memory cache. So the more frequently data is read and written through Alluxio, the greater the benefit will be. To better understand the value Alluxio brings to our ad-hoc query service, which uses Spark SQL as its execution engine, we designed a series of experiments with Alluxio and Spark SQL.
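The benefit described above comes from read-through caching: the first read of a hot block pays the disk/network cost, and every subsequent read is served from memory. The following is a minimal sketch of that idea; the class names (`SlowStore`, `ReadThroughCache`) are illustrative stand-ins, not Alluxio APIs.

```python
class SlowStore:
    """Stands in for remote HDFS: every read counts as a slow I/O."""
    def __init__(self, data):
        self.data = data
        self.reads = 0  # number of reads that actually hit slow storage

    def read(self, key):
        self.reads += 1
        return self.data[key]


class ReadThroughCache:
    """Stands in for the Alluxio memory tier in front of HDFS."""
    def __init__(self, store):
        self.store = store
        self.cache = {}

    def read(self, key):
        if key not in self.cache:              # cache miss: fetch from slow storage
            self.cache[key] = self.store.read(key)
        return self.cache[key]                 # cache hit: memory-speed read


store = SlowStore({"part-00000": b"hot data"})
cache = ReadThroughCache(store)
for _ in range(100):                           # 100 reads of the same hot block
    cache.read("part-00000")
print(store.reads)                             # only 1 read hit the slow store
```

The more often the same data is re-read, the more reads are absorbed by the memory tier, which is exactly the access pattern of repeated ad-hoc queries over hot tables.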
Experiment Design
There are a few design decisions that aim to take advantage of Alluxio:
- Firstly, we use a decoupled compute and storage architecture, because a mixed deployment would place a heavy I/O burden on Alluxio; therefore the DataNodes are not co-located with the Alluxio workers. The Alluxio cluster is decoupled from HDFS storage, so it reads data from remote HDFS nodes on the first execution.
- Secondly, in order to mock the online environment, we use the YARN node label feature to carve an Alluxio cluster out of the production cluster, which means the Alluxio cluster shares the same NameNode and ResourceManager as the production cluster and may be affected by production load.
- Thirdly, only one copy of the data is stored in Alluxio, which means it cannot guarantee high availability. Moreover, persisting data to a second storage tier such as HDFS is inefficient and wastes space. Considering stability and efficiency, we chose to use Alluxio as a read-only cache in our experiment.
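The read-only cache setup in the last point can be sketched with Alluxio's mount command, which supports a `--readonly` flag so that Alluxio only caches data and never writes back to the under store. The hostnames, ports, and paths below are placeholders for illustration, not our actual cluster addresses.

```shell
# Mount the production HDFS namespace into Alluxio as read-only
# (namenode host/port and paths are placeholders):
alluxio fs mount --readonly /warehouse hdfs://namenode.example.com:8020/warehouse

# Spark SQL then reads through the cache by using an alluxio:// path
# instead of the hdfs:// path, e.g.:
#   alluxio://alluxio-master.example.com:19998/warehouse/...
```

With a read-only mount, a failed Alluxio worker only costs a re-read from HDFS rather than data loss, which matches the single-copy, stability-first design above.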
The figure below shows the deployment of the Alluxio cluster alongside the production cluster.
Figure 1. Alluxio with Spark SQL Architecture
The experiment environment of Alluxio cluster is the sam