Presto query memory optimization: mitigating out-of-memory symptoms

Latest recommended article published 2023-06-27 15:10:17

ArchonGum

Categories: presto, Java, big data, data warehouse. Tags: presto, hive, oom, big data, prestosql

Article link: https://blog.csdn.net/c13232906050/article/details/99959188


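Prerequisites

Hive v1 bucketing table: the tables must be bucketed with Hive bucketing v1 (v2 was not tested; Presto support for Hive 3.x is still in progress).

Connectors for other data sources that support bucketing must implement the Presto-specific methods: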
@david: Assuming it’s hashing as in Hive, and two tables bucketed the same way are compatible, then that could in theory be implemented in the Kudu connector.
The connector needs to expose the bucketing and splits to the engine in a specific way.

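How it works

This relies on Presto's Grouped Execution feature.

Take two tables (orders, orders_item) that are bucketed on the same column (orderid) with the same number of buckets.
When they are joined on orderid, rows with the same orderid fall into the bucket with the same id in both tables, so each pair of matching buckets can be joined and aggregated independently of the others (compare the partition phase of MapReduce).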
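
To make the bucket-by-bucket idea concrete, here is a minimal Python sketch, not Presto code. It assumes a toy bucketing function (key modulo bucket count, which is not Hive's actual hash) and made-up rows. It joins two lists one bucket pair at a time and checks that the result matches a single join over all rows. The names bucket_of, bucketize, hash_join and grouped_join are hypothetical helpers invented for this illustration.

```python
from collections import defaultdict

def bucket_of(key, bucket_count):
    # Toy stand-in for a bucketing hash (assumption: NOT Hive's real hash function).
    return key % bucket_count

def bucketize(rows, bucket_count):
    # Group (key, value) rows by bucket id.
    buckets = defaultdict(list)
    for key, value in rows:
        buckets[bucket_of(key, bucket_count)].append((key, value))
    return buckets

def hash_join(left, right):
    # Build a hash table on the left side, then probe it with the right side.
    table = defaultdict(list)
    for key, value in left:
        table[key].append(value)
    return [(key, lv, rv) for key, rv in right for lv in table.get(key, ())]

def grouped_join(left, right, bucket_count):
    # Join one pair of matching buckets at a time; only that pair's
    # build-side hash table needs to exist at any moment.
    left_buckets = bucketize(left, bucket_count)
    right_buckets = bucketize(right, bucket_count)
    result = []
    for bucket_id in range(bucket_count):
        result.extend(hash_join(left_buckets.get(bucket_id, []),
                                right_buckets.get(bucket_id, [])))
    return result

# Made-up data: 100 orders, 2 items per order.
orders = [(k, f"order-{k}") for k in range(1, 101)]
orders_item = [(k, f"item-{k}-{i}") for k in range(1, 101) for i in range(2)]

full = sorted(hash_join(orders, orders_item))
grouped = sorted(grouped_join(orders, orders_item, 13))
largest_bucket = max(len(rows) for rows in bucketize(orders, 13).values())
```

With these inputs the two results are identical, while the largest build side in the grouped variant holds 8 rows instead of 100. That is the effect grouped execution exploits.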

Memory usage is then bounded by controlling how many buckets are processed in parallel.

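Theoretical memory usage: optimized memory usage = original memory usage / number of buckets in the table * number of buckets processed in parallel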
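
As a worked example of the formula, the short Python sketch below plugs in numbers. The 13 GB starting figure is purely hypothetical and not a measurement from this test. The bucket count of 13 matches the tables created in the test steps. The function name estimated_memory is an illustrative helper, not a Presto API.

```python
def estimated_memory(original, bucket_count, concurrent_buckets):
    # Theoretical estimate from the formula:
    # optimized = original / bucket_count * concurrent_buckets
    if bucket_count <= 0:
        raise ValueError("bucket_count must be positive")
    if not 1 <= concurrent_buckets <= bucket_count:
        raise ValueError("concurrent_buckets must be between 1 and bucket_count")
    return original / bucket_count * concurrent_buckets

# Hypothetical query that needs 13 GB without grouped execution.
one_at_a_time = estimated_memory(13.0, 13, 1)    # one bucket in flight
four_at_a_time = estimated_memory(13.0, 13, 4)   # four buckets in flight
all_at_once = estimated_memory(13.0, 13, 13)     # no reduction
```

The estimate is idealized. It assumes keys are spread evenly across buckets and ignores fixed per-query overhead, so real savings will typically be smaller.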

Test environment

Ubuntu 14.04
PrestoSQL-317
Hive connector (Hive 3.1)
TPCH connector

Test steps

Hive is used as the default data source connection, so the hive prefix can be omitted.

1 Create the tables

-- Copy the data into Hive
create table orders as select * from tpch.sf1.orders;

-- drop table test_grouped_join1;
CREATE TABLE test_grouped_join1
WITH (bucket_count = 13, bucketed_by = ARRAY['key1']) as
SELECT orderkey key1, comment value1 FROM orders;

-- drop table test_grouped_join2;
CREATE TABLE test_grouped_join2
WITH (bucket_count = 13, bucketed_by = ARRAY['key2']) as
SELECT o
