Background
At present, hundreds of terabytes of data are processed in Momo's big data cluster every day. However, most of this data is read from and written to disk repeatedly, which is inefficient. To speed up data processing and provide a better user experience, we investigated possible solutions and found that Alluxio may fit our needs. Alluxio provides a unified, memory-speed distributed storage layer for various jobs. Since I/O in memory is much faster than on disk, hot data in Alluxio can be served at memory speed, much like a memory cache. So the more frequently data is read and written through Alluxio, the greater the benefit will be. To better understand the value Alluxio brings to our ad-hoc query service, which uses Spark SQL as its execution engine, we designed a series of experiments with Alluxio and Spark SQL.
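The benefit described above comes from read-through caching: the first read of a hot block pays the disk/network cost, and every subsequent read is served from memory. The following is a minimal sketch of that idea; the class names (`SlowStore`, `ReadThroughCache`) are illustrative stand-ins, not Alluxio APIs.

```python
class SlowStore:
    """Stands in for remote HDFS: every read counts as a slow I/O."""
    def __init__(self, data):
        self.data = data
        self.reads = 0  # number of reads that actually hit slow storage

    def read(self, key):
        self.reads += 1
        return self.data[key]


class ReadThroughCache:
    """Stands in for the Alluxio memory tier in front of HDFS."""
    def __init__(self, store):
        self.store = store
        self.cache = {}

    def read(self, key):
        if key not in self.cache:              # cache miss: fetch from slow storage
            self.cache[key] = self.store.read(key)
        return self.cache[key]                 # cache hit: memory-speed read


store = SlowStore({"part-00000": b"hot data"})
cache = ReadThroughCache(store)
for _ in range(100):                           # 100 reads of the same hot block
    cache.read("part-00000")
print(store.reads)                             # only 1 read hit the slow store
```

The more often the same data is re-read, the more reads are absorbed by the memory tier, which is exactly the access pattern of repeated ad-hoc queries over hot tables.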
Experiment Design
There are a few design decisions that aim to take advantage of Alluxio:
- Firstly, we use a decoupled compute and storage architecture, because a mixed deployment would place a heavy I/O burden on Alluxio; therefore the DataNodes are not co-located with the Alluxio workers. The Alluxio cluster is decoupled from HDFS storage, so it reads data from remote HDFS nodes on the first execution.
- Secondly, in order to mock the online environment, we use the YARN node label feature to carve an Alluxio cluster out of the production cluster, which means the Alluxio cluster shares the same NameNode and ResourceManager as the production cluster and may be affected by production load.
- Thirdly, only one copy of the data is stored in Alluxio, which means it cannot guarantee high availability. Moreover, persisting data to a second storage tier such as HDFS is inefficient and wastes space. Considering stability and efficiency, we chose to use Alluxio as a read-only cache in our experiment.
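The read-only cache setup in the last point can be sketched with Alluxio's mount command, which supports a `--readonly` flag so that Alluxio only caches data and never writes back to the under store. The hostnames, ports, and paths below are placeholders for illustration, not our actual cluster addresses.

```shell
# Mount the production HDFS namespace into Alluxio as read-only
# (namenode host/port and paths are placeholders):
alluxio fs mount --readonly /warehouse hdfs://namenode.example.com:8020/warehouse

# Spark SQL then reads through the cache by using an alluxio:// path
# instead of the hdfs:// path, e.g.:
#   alluxio://alluxio-master.example.com:19998/warehouse/...
```

With a read-only mount, a failed Alluxio worker only costs a re-read from HDFS rather than data loss, which matches the single-copy, stability-first design above.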
The figure below shows the deployment of the Alluxio cluster alongside the production cluster.
Figure 1. Alluxio with Spark SQL Architecture
The experiment environment of Alluxio cluster is the sam