Leveraging Alluxio with Spark SQL to Speed Up Ad-hoc Analysis

Background

At present, hundreds of TB of data is processed in Momo bigdata cluster every day. However, most of the data will be read/write through disk repeatedly, which is ineffective. In order to speed up data processing and provide better user experience, after some investigation we found that Alluxio may fit our need. Alluxio works by providing a unified memory speed distributed storage system for various jobs. I/O speed in memory is faster than in hard disk. Hot data in Alluxio could server memory speed I/O just like memory cache. So the more frequent data read/written over Alluxio, the greater the benefit will have. In order to better understand the value Alluxio have to our ad-hoc service which uses Spark SQL as executing engine, we designed a series experiments of Alluxio with Spark SQL.

Experiment Design

There are a few designs which aims to take advantage of Alluxio:

  • Firstly, we use decoupled computer and storage architecture for the reason that mixed deployment will leave a heavy I/O burden to Alluxio, so DataNode is not deployed with Alluxio worker. The Alluxio cluster is decoupled from HDFS storage, it will read data from remote HDFS nodes for the first execution.
  • Secondly, in order to mock the online environment, we use YARN node label feature to divide an Alluxio cluster from production cluster, which means the Alluxio cluster share the same NameNode and ResourceManager from production cluster and may be affected by the pressure of production cluster.
  • Thirdly, there is only one copy of data stored in Alluxio, which means it can not guarantee high availability. What’s more, persisting data to second tier storage such as HDFS is low efficient and space wasteful. Considering about stability and efficiency, we choose to use Alluxio as a read-only cache in our experiment .

The figure below shows the deployment of Alluxio cluster with production cluster.

这里写图片描述
Figure 1. Alluxio with Spark SQL Architecture

The experiment environment of Alluxio cluster is the sam

  • 0
    点赞
  • 0
    收藏
    觉得还不错? 一键收藏
  • 0
    评论

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值