First impressions of Presto 0.180

诸葛云长true

Category: presto. Tags: big data, presto, hive

Article link: https://blog.csdn.net/xyf123/article/details/75627560

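Presto is a distributed system that runs on multiple servers. A full installation consists of one coordinator (the scheduling node) and multiple workers. Clients, such as the Presto command-line interface (CLI), submit queries to the coordinator. The coordinator parses and analyzes each query, plans its execution, and then distributes the processing to the workers. The current version is 0.180; the full documentation is on the official English site: https://prestodb.io/docs/current/index.html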

Environment

CentOS 6.6
JDK 1.8+
Python 2.7

Cluster layout
192.168.11.1 (hostname presto1, coordinator node)
192.168.11.2 (hostname presto2, worker node)
192.168.11.3 (hostname presto3, worker node)

Installation steps

1. Log in to the presto1 node (the coordinator node) as root

2. Download the latest Presto release; the current version is 0.180

wget https://repo1.maven.org/maven2/com/facebook/presto/presto-server/0.180/presto-server-0.180.tar.gz

3. Extract presto-server-0.180.tar.gz into /usr, rename the presto-server-0.180 directory to presto, and enter it

tar -zxvf presto-server-0.180.tar.gz -C /usr
cd /usr
mv presto-server-0.180 presto
cd presto

4. Configure the coordinator node presto1
1) Create the etc directory and enter it

mkdir etc
cd etc

2) Configure node.properties. Run:

vim node.properties

Add the following to the file:

node.environment=production
node.id=ffffffff-ffff-ffff-ffff-fffffffffff1
node.data-dir=/data/presto

node.id must be unique within the cluster, and the node.data-dir directory must be created on every node beforehand.
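Because every node needs its own node.id, the file can be generated instead of edited by hand. The following is a minimal sketch, assuming a random UUID is an acceptable identifier; `render_node_properties` is a hypothetical helper, not part of Presto, and it should run on Python 2.7 or 3.

```python
import uuid


def render_node_properties(environment, data_dir, node_id=None):
    """Return the text of a node.properties file.

    Hypothetical helper, not part of Presto. If node_id is omitted,
    a random UUID is generated so that each node gets its own value.
    """
    if node_id is None:
        node_id = str(uuid.uuid4())
    lines = [
        "node.environment=%s" % environment,
        "node.id=%s" % node_id,
        "node.data-dir=%s" % data_dir,
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Print the file content; redirect it into etc/node.properties.
    print(render_node_properties("production", "/data/presto"))
```

Generate the file once per node and keep it afterwards, so the identifier stays the same across restarts.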

3) Configure jvm.config (the launcher reads JVM options from etc/jvm.config, not from a .properties file). Run:
vim jvm.config

Add the following to the file:

-server
-Xmx16G
-XX:+UseG1GC
-XX:G1HeapRegionSize=32M
-XX:+UseGCOverheadLimit
-XX:+ExplicitGCInvokesConcurrent
-XX:+HeapDumpOnOutOfMemoryError
-XX:+ExitOnOutOfMemoryError

4) Configure config.properties. Run:

vim config.properties

Add the following to the file:

coordinator=true
node-scheduler.include-coordinator=false
http-server.http.port=8090
query.max-memory=50GB
query.max-memory-per-node=2GB
discovery-server.enabled=true
discovery.uri=http://presto1:8090

Here coordinator=true marks this node as the coordinator node, and node-scheduler.include-coordinator=false means that no query work is scheduled on this node.
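The coordinator and worker versions of config.properties differ only in a few lines, so both can be produced from one template. A minimal sketch, assuming the property names and values used in this article; `render_config_properties` is a hypothetical helper, not a Presto tool.

```python
def render_config_properties(is_coordinator, discovery_host, port=8090,
                             max_memory="50GB", max_memory_per_node="2GB"):
    """Return config.properties text for a coordinator or a worker.

    Hypothetical helper; property names and defaults are the ones
    used in this article.
    """
    lines = ["coordinator=%s" % ("true" if is_coordinator else "false")]
    if is_coordinator:
        # Do not schedule query work on the coordinator itself.
        lines.append("node-scheduler.include-coordinator=false")
    lines.append("http-server.http.port=%d" % port)
    lines.append("query.max-memory=%s" % max_memory)
    lines.append("query.max-memory-per-node=%s" % max_memory_per_node)
    if is_coordinator:
        # The coordinator also runs the embedded discovery service.
        lines.append("discovery-server.enabled=true")
    # Every node points at the discovery service on the coordinator.
    lines.append("discovery.uri=http://%s:%d" % (discovery_host, port))
    return "\n".join(lines) + "\n"
```

Calling `render_config_properties(True, "presto1")` yields the coordinator file shown above, while `render_config_properties(False, "presto1", max_memory="20GB")` yields the worker file used in step 6.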

5) Configure log.properties. Run:

vim log.properties

Add the following to the file:

com.facebook.presto=INFO

6) Configure the catalogs
a. Create the catalog subdirectory and enter it

mkdir catalog
cd catalog

b. Configure jmx.properties. Run:
vim jmx.properties

Add the following to the file:

connector.name=jmx

c. Configure hive.properties. Run:

vim hive.properties

Add the following to the file:

connector.name=hive-hadoop2
hive.metastore.uri=thrift://hive-address:9083
hive.config.resources=/etc/hadoop/2.6.0.3-8/0/core-site.xml, /etc/hadoop/2.6.0.3-8/0/hdfs-site.xml

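Here thrift://hive-address:9083 is the Thrift address exposed by the Hive metastore, and the core-site.xml and hdfs-site.xml entries must point to the actual Hadoop configuration files on this machine.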
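A wrong path in hive.config.resources is easy to miss, so it can help to check the listed files before starting the server. A minimal sketch, assuming a simplified key=value reading of the file (not a full Java properties parser); `parse_properties` and `missing_config_resources` are hypothetical helpers.

```python
import os


def parse_properties(text):
    """Parse simple key=value lines, skipping blank lines and # comments."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def missing_config_resources(props):
    """Return the hive.config.resources paths that are not existing files."""
    resources = props.get("hive.config.resources", "")
    paths = [p.strip() for p in resources.split(",") if p.strip()]
    return [p for p in paths if not os.path.isfile(p)]
```

For example, read etc/catalog/hive.properties, pass its text to `parse_properties`, and treat a non-empty result from `missing_config_resources` as a configuration error.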

5. Start presto1 and check the log to confirm that the coordinator node started successfully
Start Presto:

cd /usr/presto
bin/launcher start

Check the log:

tail -f /data/presto/var/log/server.log
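Instead of watching the log by eye, a script can look for the startup message. A minimal sketch, assuming the server writes a line containing "SERVER STARTED" to server.log once startup completes (adjust `marker` if your release prints something else); `server_started` is a hypothetical helper.

```python
def server_started(log_path, marker="SERVER STARTED"):
    """Return True if any line of the log file contains the marker.

    Hypothetical helper; the default marker is an assumption about
    the message the server logs when it has finished starting.
    """
    with open(log_path) as handle:
        return any(marker in line for line in handle)
```

A deployment script could call `server_started("/data/presto/var/log/server.log")` in a retry loop before moving on to the worker nodes.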

6. Configure the worker nodes: install and configure presto2 and presto3 in turn
1) Copy the whole presto directory to /usr on presto2

scp -r /usr/presto presto2:/usr/

2) Log in to presto2 as root, create /data/presto, and go to /usr/presto

ssh presto2
mkdir -p /data/presto
cd /usr/presto

3) Adjust the configuration
a. Edit etc/node.properties. Run:

vim etc/node.properties

Change node.id to:

node.id=ffffffff-ffff-ffff-ffff-fffffffffff2

b. Edit etc/config.properties. Run:

vim etc/config.properties

Replace the entire previous content with:

coordinator=false
http-server.http.port=8090
query.max-memory=20GB
query.max-memory-per-node=2GB
discovery.uri=http://presto1:8090

4) Start presto2 (worker node) and check the log to make sure it started successfully
Start presto2:

cd /usr/presto
bin/launcher start

Check the log:

tail -f /data/presto/var/log/server.log

5) Configure presto3 by repeating the steps above, using a node.id that is different from those of presto1 and presto2
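Once all workers are running, the coordinator's view of the cluster can be read from the system.runtime.nodes table. A minimal sketch, assuming a DB-API cursor connected to the coordinator, for example one from the PyHive client installed in step 8; `list_cluster_nodes` is a hypothetical helper.

```python
def list_cluster_nodes(cursor):
    """Return one dict per node currently known to the coordinator.

    `cursor` is any DB-API cursor connected to the coordinator, e.g.
    pyhive.presto.connect(host='192.168.11.1', port=8090).cursor().
    """
    cursor.execute(
        "SELECT node_id, http_uri, node_version, coordinator, state "
        "FROM system.runtime.nodes"
    )
    columns = [col[0] for col in cursor.description]
    return [dict(zip(columns, row)) for row in cursor.fetchall()]
```

With the three-node layout in this article, the result should contain three entries, one of which has coordinator set to true. The same query can also be typed directly into the Presto CLI.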

7. Test on the presto1 node
1) Log in to presto1, download presto-cli-0.181-executable.jar, and rename it to presto-cli

ssh presto1
wget https://repo1.maven.org/maven2/com/facebook/presto/presto-cli/0.181/presto-cli-0.181-executable.jar
mv presto-cli-0.181-executable.jar presto-cli

2) Make presto-cli executable

chmod +x presto-cli

3) Connect to the Presto server

./presto-cli --server localhost:8090 --catalog hive --schema default

4) Run a SQL query; the tables it references must already be defined in Hive

select * from test_db.test_table limit 10;

8. Call Presto from Python
1) Calling Presto from Python requires PyHive, currently at version 0.4.0

pip install pyhive[hive]
pip install pyhive[presto]

2) Write a Python test script and run it

from pyhive import presto
import pandas as pd

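# Connect to Presto and run the SQL
conn = presto.connect(host='192.168.11.1', port=8090)
cursor = conn.cursor()
cursor.execute('select * from test_db.test_table limit 10')

# Display the result with pandas
cols = [a[0] for a in cursor.description]
df = pd.DataFrame.from_records(cursor.fetchall(), columns=cols)
# print() is needed when running as a script; a bare df.head() only displays in an interactive session
print(df.head(20))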

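9. Test results
a. Queries against LZO-format Hive tables failed, presumably because the format is not compatible

b. Queries against Parquet-format Hive tables succeeded; the earlier 0.100 release had problems here, and 0.180 has improved

c. Queries against ORC-format Hive tables succeeded and performed better than Parquet, whereas with Spark SQL the performance ranking was the reverse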

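10. Presto compared with Spark SQL (personal opinion)

To me, Presto feels more like a relational database: you connect, send SQL, and get the result back directly. Spark SQL carries a lot of Spark's DNA: for every query you have to think about how many nodes to use and how much memory and how many cores to give each node, which undoubtedly makes it much harder for analysts to use.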