Using Presto to combine data from Hive and MySQL in one SQL-like query

presto-og

Tutorial: Using Presto to combine data from Hive and MySQL in one SQL-like query

 Tutorials •  Comments (4)

Presto (originated at Facebook) is a yet another distributed SQL query engine for Hadoop that has recently generated huge excitement. What makes Presto so interesting, especially, in comparison to other existing engines like Cloudera Impala or Hive on Tez? The list of key differentiators includes:

  • Presto seems to be great for ad-hoc analysis – it runs queries quickly and scales from gigabytes to petabytes.
  • A single Presto query can process data from multiple sources e.g. HDFS, MySQL, Cassandra or even Kafka.
  • Large-scale companies like FacebookNetflix, Airbnb or Dropbox like it and actively contribute to it.

Because (a) the excitement about Presto is evident, (b) running own benchmarks is time-consuming, in this tutorial we focus on the feature that allow us to process data from HDFS and MySQL in a single Presto query. Such a feature is quite unique, because it’s hasn’t been provided by other open-source projects like Hive and Impala (Impala, however, can process Hive and HBase table in a single query).

Presto Connectors

First, a bit of technical background. Presto supports pluggable connectors that provide metadata and data for queries. There is Hive connector (currently supports Text, SequenceFile, RCFile, ORC and, in a limited way, Parquet formats), MySQL connector and several others. You can even implement a custom one.

A coordinator (a master daemon) uses connectors to get metadata (such as table schema) that is needed to build a query plan. Workers use connectors to get actual data that will be processed by them.

Presto Plugins

Our use-case

To demonstrate this feature, we use the Last.fm Dataset – 1K users dataset provided by Last.fm. It contains profile information (such as gender, age, country) about 1K users plus tracks that these users had been listening to.

In our tutorial, user profile information (40 kilobytes) will be kept in a MySQL table (something that actually makes sense in real-world scenarios when user base is small enough), while information about streamed tracks (2.4 gigabytes) will be uploaded to a Hive table.

Training environment

For simplicity and quick iterations, we use an already pre-configured single-node Hadoop YARN cluster with Hive and MySQL. An easy way to get it is to download Hortonworks Sandbox (we used Hortonworks Sandbox 2.2).

Presto installation

Because the installation of Presto is very smooth and well documented, we don’t duplicate these steps here. Should you follow the instructions carefully, you have a running Presto server in less then 10 minutes.

Please note that Presto requires Java 8.

Unfortunately, Presto on YARN is not “officially” supported yet. This means that a Presto daemon should be installed on each node of your cluster (and share resources with NodeManager and YARN containers, if you happen to run YARN on the same nodes). There is, however, an interesting blog post that describes how to run Presto on YARN over Apache Twill.

Integration between Presto and MySQL and Hive

Please follow the instructions needed to integrate Presto with MySQL and Hive. In case of Hortonworks Sandbox 2.2, files shown below contain exemplary (and working) configuration settings.

When these files are created, you can start Presto server.

Uploading input datasets into MySQL and Hive

First we create a MySQL database and table. Then, we upload information about users into it.

Next, we create a Hive database and table. Then we upload streamed tracks into it.

Using Presto to query both MySQL and Hive tables

Finally, we are ready to process data! 

  • 0
    点赞
  • 0
    收藏
    觉得还不错? 一键收藏
  • 0
    评论

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值