1 概述
1.1 编写目的
本文主要描述基于Hive+Tez计算框架,执行大数据基准测试TPC-DS案例出现内存溢出问题,通过对案例的分析定位,找出导致内存溢出的根本原因,并给出合理的解决方案。
通过本案例的研究报告,旨在协助开发人员后续在遇到Hive+Tez内存溢出问题时,能够起到一些指导和借鉴意义。
1.2 适用范围
Hive性能测试、JVM调优。
1.3 名称解释
TPC-DS:TPC(TransactionProcessing Performance)是一个非赢利的标准化组织,TPC中有个Decision Support(DS)子集,即TPC-DS,是用于评测决策支持系统(或数据仓库)的标准SQL测试集。
1.4 参考资料
1.《bk_hive-performance-tuning.pdf》
2.《bk_yarn-resource-management.pdf》
3. https://docs.cloudera.com/HDPDocuments/HDP3/HDP-3.1.5/index.html
4.https://docs.cloudera.com/HDPDocuments/HDP2/HDP-2.1.1/bk_installing_manually_book/content/rpm-chap1-11.html
2 案例介绍
2.1 案例简述
本案例是采用Hive+Tez计算框架,通过Hive SQL进行离线数据计算分析的任务,计算五张表包含and 和 or等多个条件复杂组合的join和sum聚合操作。
Hive+Tez执行流程示意图
Hive SQL五张表的计算任务如下:
select sum (ss_quantity) from store_sales, store, customer_demographics, customer_address, date_dim where s_store_sk = ss_store_sk and ss_sold_date_sk = d_date_sk and d_year = 1998 and ( ( cd_demo_sk = ss_cdemo_sk and cd_marital_status = 'M' and cd_education_status = 'Unknown' and ss_sales_price between 100.00 and 150.00 ) or ( cd_demo_sk = ss_cdemo_sk and cd_marital_status = 'W' and cd_education_status = 'College' and ss_sales_price between 50.00 and 100.00 ) or ( cd_demo_sk = ss_cdemo_sk and cd_marital_status = 'D' and cd_education_status = 'Primary' and ss_sales_price between 150.00 and 200.00 ) ) and ( ( ss_addr_sk = ca_address_sk and ca_country = 'United States' and ca_state in ('MI', 'GA', 'NH') and ss_net_profit between 0 and 2000 ) or (ss_addr_sk = ca_address_sk and ca_country = 'United States' and ca_state in ('TX', 'KY', 'SD') and ss_net_profit between 150 and 3000 ) or (ss_addr_sk = ca_address_sk and ca_country = 'United States' and ca_state in ('NY', 'OH', 'FL') and ss_net_profit between 50 and 25000 ) ) ; |
2.2 问题现象
使用Hive On Tez引擎提交 Apache Hive 查询作业后,任务一直处于“卡顿”状态,在 Tez 视图中作业状态一直处于“正在运行”,但没有任何进展。
使用Spark On Hive提交该查询作业后,可以很快跑出结果。
Spark On Hive执行查询任务命令:
kinit -kt /home/spTest1/spTest1User.keytab spTest1 /usr/hdp/2.6.2.14-5/spark2/bin/spark-sql --master yarn --deploy-mode client --executor-memory 2G --driver-memory 2G use tpcds_bin_partitioned_orc_2; source /home/hdfs/hive-testbench-hdp3/sample-queries-tpcds/query48.sql; |
下图为Spark On Hive执行作业完成图:
2.3 告警信息
通过Tez UI查看详细任务状态,发现始终有一个Map任务处于Running状态,迟迟不能结束。任务进度状态如下图所示:
点击查看该Map任务的详细信息,发现该Map任务被查分成多个子任务并分发到不同机器上执行查询。Map任务的子任务信息如下图所示:
选择该Map任务的一条子任务,查看详细日志。发现TotalRequested/TotalJVMHeap的值为1.09。该参数的意思是请求的JVM内存与JVM总内存比例超出了1.0,即系统现有的Java内存任务不能满足当前任务所需内存。
2.4 监控信息
如下图所示,Hive查询任务执行过程中,被分为五个Map阶段和一个Reduce阶段。其中四个Map任务会很快的完成,而Map1任务一致处于阻塞状态。
3 TPCDS模型
3.1 TPCDS介绍
TPCDS是模拟OLAP应用的典型场景,因其贴近真实场景,且有工具直接生成数据,而成为SQL引擎测试组最常用的测试模型。然而实际手工用例编写过程中,并没有使用TPCDS完整模型,而只是选择部分表作为测试表,同时测试用例编写中主要考虑语法、及与相关特性结合,而没有结合实际应用场景。为了提高手工用例的质量,更贴近用户使用,同时结合TPCDS场景,构建集成测试和系统测试的测试用例、测试场景,深入分析TPCDS是非常有必要的,本文将从几个角度分别进行分析:模型、场景、语句、数据。
3.2 TPCDS场景概述
TPCDS场景中,包含了一个成熟的决策支持系统所具有的两项最重要的因素:
1. 用户查询语句,将数据操作转化为商业智能。TPCDS的语句可以分为以下四类决策支持语句:
- l报告型,这类语句通常是预定义的,会周期性地运行,来判断企业的财务和运行状况,比如月度、季度、年度财务报表。通常为静态的,改变的只是时间段、地理位置或商品名称。
- l即席查询(Ad hoc),这类语句主要用于即时、特定的商业问题。它与报告型语句的主要差别在于预知程度(报告型都是计划中的,而Ad hoc通常是临时性的)。
- l迭代OLAP,这类语句主要通过对数据进行研究分析来判断事物之间的联系和发展趋势。与Ad hoc语句相似,区别为该类语句基于一个具体的用户场景,且会提交一系列的复杂或简单语句。比如,通过某类客户过去几年的购买力分析,预测将来的购买力趋势。
- l数据挖掘,这类语句是对大量数据进行分析,挖掘潜在的、无法直观得出的结论,从而支持商业决策。比如购物网站的购买推荐。
说明:报告型语句通常可预先优化,因此执行时间短,而即席查询正好相反。TPCDS中,catalog渠道相关语句模拟报告型语句,而store和web渠道相关语句模拟即席查询部分。
2.数据维护,对于决策分析所依赖的源数据需要同步管理。在实际生产环境中,数据维护的过程通常包括三个步骤:数据抽取、数据转换、数据加载(ETL)。TPCDS数据维护过程为:
- l加载要更新的数据集,包括insert/delete/update的数据
- l加载要更新的数据集到数据仓库,进行数据转换
- l插入新的事实表记录,并根据时间删除某些记录
举例说明TPCDS模拟的商业活动:
1.记录各个销售渠道的所有客户购买信息与退货信息
2.根据促销信息修改价格
3.维护仓库存货信息
4.创建动态网页
5.维护客户信息
3.3 TPCDS模型介绍
3.3.1 概念
在多维分析的商业智能解决方案中,根据事实表和维度表的关系,常见的模型分为星型模型和雪花型模型。
星型模型:
中央表包含事实数据,多个表以中央表为中心呈放射状分布,它们通过数据库的主键和外键相互连接,是一种使用关系数据库实现多维分析空间的模式,其基本形式必须实现多维空间,以使用关系数据库的基本功能。同时星型模型也是一种非正规化的模型,多维数据集的每一个维度直接与事实表连接,没有渐变维度,所以存在冗余数据。在星型模型中,只需要扫描事实表就可以进行查询,主要的数据都在庞大的事实表中,所以查询效率较高,同时每个维度表和事实表关联,非常直观,很容易组合出各种查询。
雪花型模型:
雪花模型在星型模型的基础上,维度表进一步规范化为子维度表,这些子维度表没有直接与事实表连接,而是通过其他维度表连接到事实表上,看起来就像一片雪花,故称雪花模型。也就是说雪花模型是星型模型的进一步扩展,将其维度表扩展为更小的维度表,形成一种层次。这样就通过最大限度的减少数据存储量以及联合较小的维度表来改善查询性能,且去除了星型模型中的冗余数据。
两种模型比较:
星型模型 | 雪花型模型 | |
特点 | 1. 非正规化; 2. 多维数据集中的每一个维度都与事实表连接(通过主键和外键); 3. 不存在渐变维度; 4. 有冗余数据; 5. 查询效率可能会比较高; 6. 不用过多考虑正规化因素,设计维护较为简单 | 1. 正规化; 2. 数据冗余少; 3. 有些数据需要连接才能获取,可能效率较低; 4. 规范化操作较复杂,导致设计及后期维护复杂; |
3.3.2 TPCDS模型概述
TPCDS模型模拟一个全国连锁的大型零售商的销售系统,其中含有三种销售渠道:store(实体店)、web(网店)、catalog(电话订购),每种渠道使用两张表分别模拟销售记录和退货记录,同时包含商品信息和促销信息的相关表结构。
TPCDS采用雪花型数据模型,三种渠道的销售、退货表、及总体的存货清单作为事实表,其他商品相关信息、用户相关信息、时间信息等其他信息等都作为维度表,同时各表命名达到见名识义,详细如下表所示:
事实表 | 维度表 |
Store Sales、Store Returns Catalog Sales、 Catalog Returns Web Sales、 Web Returns Inventory | Store、Call Center、Web_site Warehouse 、Catalog_page、Web_page Customer、Customer_address、Customer_demographics Household_demographics、Income_band Date_dim、Time_dim Item、Promotion、Reason、Ship_mode |
上述维度表具体分为三类:
1.静态的:导入数据库以后,数据将永远保持不变,如date_dim、time_dim
2.历史相关的:即数据记录跟时间相关,每条记录只在对应的时间段内有效。比如:item(如一月份有商品1,二月份有商品2)
3.非历史相关的:即数据记录跟时间无关,不需要保留历史数据,比如:customer
它包含几组模型关系图,详细如下:
Store_sales 的ER图
Store_returns 的ER图
Catalog Sales 的ER图
Catalog Returns 的ER图
Web Sales 的ER图
Web Returns的ER图
Inventory的ER图
Schema说明:
1) 表、索引可以使用水平和垂直分区
2) 除了基表,可以建立其他的辅助表结构(EADS),如索引、物化视图(只允许事实表与维度表间的主外键关系连接)。
3) 可以使用主外键、NOT NULL约束
3.4 TPCDS语句
TPCDS所有99个Query,覆盖SQL99和2003的核心部分以及OLAP。下面我们从语法和语义两个角度进行分析。
1. 语法
l语句特点,总结如下:
SQL特征 | 查询数量 |
Correlated subquery | 10 |
Non-correlated subquery | 34 |
Group by | 81 |
Order by | 89 |
Rollup | 11 |
Exists | 5 |
Union | 18 |
Intersect | 3 |
Case | 29 |
Substr | 9 |
Having | 7 |
Over | 15 |
CTE | 28 |
- 过滤条件主要用在维度表上,具体集中在如下几张表:date_dim(d_year/d_date/ d_qoy/d_dom)、item(i_category/i_current_price/i_manufact_id/i_class)、customer_address(ca_state/ca_zip/ca_gmt_offset/ca_city/ca_county)、customer_demographics(cd_gender/cd_marital_status/cd_education_status)、houshold_demographics(hd_vehicle_count/hd_buy_potential/hd_dep_count)、store(s_state);
- l 连接条件为表间的PK-FK关系上
l 聚集运算主要集中在事实表上,例如:价格、数量等。
2.语义
TPCDS中每个语句都回答一个商业问题,重点在于各种统计,可以分为从客户、商品两个维度进行,统计通常只针对某个具体的时间、具体的类别,进行环比、同比等各种分析。
3.5 TPCDS测试流程
TPC-DS 测试流程包括:
1) 数据加载测试
数据加载分为两种方式:a)使用生成完的数据进行导入;b)使用dsdgen边生成边导入。
数据加载时间包括:测试基础表创建、数据加载、约束验证、辅助数据结构(如索引)创建、表和辅助数据统计分析等。
l Power测试之前,不应重启系统
2) 查询顺序执行( Power)测试
l 评测数据库对单个查询流的处理能力。
3) 第一次查询并行执行测试
l用于测试DBMS对多个查询流并发查询和操作的处理能力,最小值为4(记为Sq)。例如,模拟20个用户同时进行操作,则同时启动20个session。
l每个session会是一个不同的语句执行顺序(由dsdgen可生成,该顺序是固定的)。
l每个session一次只运行一条语句。
4) 第一次数据维护测试
- 更新次数为并发测试中的并发数除以2,即Sq/2。例如,模拟20个用户同时操作,则其中10个用户会有更新操作。
- 单次更新中的语句可以顺序或并发执行,更新语句只涉及事实表。
- 更新数据由dedgen生成,更新的顺序与dsdgen生成保持顺序一致。
- 第一步中创建的EADS,会随之进行更新维护。
- 数据维护过程中需要有硬件故障注入。
5) 第二次查询并行执行测试
6) 第二次数据维护测试
3.6 TPCDS测试指导
3.6.1 用例
根据上文TPCDS的具体分析,用于手工用例的指导原则如下:
1. 选择的表应为上文中的某个具体的ER图中的全部或部分表,根据使用频率,选取最小集示例如下:
2.分布键主要建在PK键上,分区键主要建在常用的过滤条件上,在常用的过滤条件和PK、FK上建立单列、多列索引
3.实际编写用例时,注意如下几点:
- 表之间的连接条件,主要根据上图中的PK-FK关系建立连接
- 过滤条件主要放在维度表
- 聚集运算主要放在事实表
- 窗口函数、OLAP函数的灵活使用
- 条件结合函数使用(如字符串函数、类型转换函数、数学运算函数)
- 表达式的灵活使用(与条件结合、与聚集运算结合)
- 灵活运用子查询、相关子查询
- 多处使用同样的临时结果集时,考虑使用CTE,即CTE通常在多处被引用
- 集合运算主要用在同一类操作、不同的事实表上
3.6.2 场景
集成测试、系统测试、专项测试等都可以利用TPCDS场景,或根据TPCDS场景构建新的测试场景,构建场景注意:
1.场景测试应包括创建、加载、查询、更新等整体流程,做到一键式测试。测试结果可分析、可重现。
2.查询语句,依据2.1中四类语句进行构建,查询语句包含简单、复杂语句,结果集有大有小,同时查询语句可选择串行、并发运行。
3.更新语句,更新与查询语句交替进行测试,更新以事实表为主,更新的并发度与查询的并发度对应。
4 测试环境
4.1 硬件环境
服务器IP | 主机名 | 操作系统 | 内存(GB) | 磁盘(GB) | CPU(Core) |
172.16.80.41 | sp001 | CentOS7.3 | 32 | 620 | 8 |
172.16.80.42 | sp002 | CentOS7.3 | 32 | 620 | 8 |
172.16.80.43 | sp003 | CentOS7.3 | 32 | 620 | 8 |
172.16.80.44 | sp004 | CentOS7.3 | 32 | 620 | 8 |
4.2 软件环境
- 操作系统:CentOS7.3
- JDK:1.8.0_251
- HDP:2.6.2.14
- Hadoop:2.7.3
4.3 系统组成
大数据平台包括Hadoop、Hive、Hbase、Flume、Kafka、Storm、Spark、Ranger、Atlas、Kerberos等大数据常用组件,相互协调完成大数据应用开发:
4.4 测试对象
- Hive:1.2.1
- Tez:0.7.0
4.5 测试数据
4.5.1 原始数据
原始数据表信息总览:
数据容量 | 2g |
Sql测试数据量 | 20 |
Sql是否经过修改 | 无 |
原始数据表详细信息:
表名 | 存储文件数(份) | 原始数据集大小(MB) |
call_center | 1 | 0.0174 |
catalog_page | 1 | 5.2834 |
catalog_returns | 2076 | |
catalog_sales | 1840 | |
customer | 1 | 121.7009 |
customer_address | 1 | 79.9643 |
customer_demographics | 6 | 718.0725 |
date_dim | 1 | 75.7258 |
household_demographics | 1 | 0.8446 |
income_band | 1 | 0.0003 |
inventory | 2 | 449.8106 |
item | 1 | 44.3652 |
promotion | 1 | 0.4196 |
reason | 1 | 0.0100 |
ship_mode | 1 | 0.0100 |
store | 1 | 0.0436 |
store_returns | 2004 | |
store_sales | 1824 | |
time_dim | 1 | 43.0115 |
warehouse | 1 | 0.0055 |
web_page | 1 | 0.0386 |
web_returns | 2140 | |
web_sales | 1824 | |
web_site | 1 | 0.0590 |
4.5.2 入库数据
SQL集合中只包含一张事实表(store_sales)和9个维度表,生成的TEXT格式数据大小约2GB,实际导入的压缩的orc文件总数据量只有1.5GB。
1、Hive数据表总览:
事实表数量 | 1 |
维度表数量 | 9 |
事实表分区方式 | Unique value dynamic partition |
维度表分区方式 | 无 |
2、Hive数据表详细信息:
表名 | 表类型 | 数据行数(S) |
call_center | 内部表 | 8 |
catalog_page | 内部表 | 11718 |
catalog_returns | 分区表 | 288491 |
catalog_sales | 分区表 | 2865785 |
customer | 内部表 | 144000 |
customer_address | 内部表 | 72000 |
customer_demographics | 内部表 | 1920800 |
date_dim | 内部表 | 73049 |
household_demographics | 内部表 | 7200 |
income_band | 内部表 | 20 |
inventory | 内部表 | 16966305 |
item | 内部表 | 26000 |
promotion | 内部表 | 322 |
reason | 内部表 | 36 |
ship_mode | 内部表 | 20 |
store | 内部表 | 22 |
store_returns | 分区表 | 555309 |
store_sales | 分区表 | 5760749 |
time_dim | 内部表 | 86400 |
warehouse | 内部表 | 5 |
web_page | 内部表 | 74 |
web_returns | 分区表 | 137272 |
web_sales | 分区表 | 1438883 |
web_site | 内部表 | 30 |
5 案例分析
5.1 理论依据
5.1.1 Apache Hive查询计划
依据官网中的Hive介绍,可以得知构建在hadoop MR引擎上的Hive架构,Hive的执行需要经过Driver的处理(如下图),其中包括Compiler、Optimizer、Executor等,将完成HQL查询语句从词法分析、语法分析、编译、优化以及查询计划的生成。生成的查询计划存储在HDFS中,并在随后由MapReduce/Tez调用执行。
因此可以依据对查询计划的分析,找出Hive的执行路径,供实际执行过程分析的依据。
Hive架构图
5.1.2 Apache Tez计算引擎
Tez是Apache最新的支持DAG作业的开源计算框架,它可以将多个有依赖的作业转换为一个作业从而大幅提升DAG作业的性能。它的核心思想是把将Map任务和Reduce任务进一步拆分,Map任务拆分为Input-Processor-Sort-Merge-Output,Reduce任务拆分为Input-Shuffer-Sort-Merge-Process-output,Tez将若干小任务灵活重组,形成一个大的DAG作业。如下图所示:
1、Hive-MR计算框架包含多个MR任务,每个任务都将中间结果存储到HDFS上,前一个步骤中的reducer为下一个步骤中的mapper提供数据。
- Hive-Tez计算框架避免中间数据写回HDFS,仅在一个任务中就能完成同
样的处理过程,减小任务执行时间。
因此,作为框架思路,Tez本身也是MapReduce的改进。而由于多个任务可
能会合并为一个任务,需要不同于MR引擎的资源分配策略,对任务所需的Container Memory、Virtual Memory及JVM heap等资源进行合理分配。
5.1.3 HDP 官方资源分配策略
Apache Hive执行任务过程中,资源管理由YARN组件负责。YARN考虑了集群中每台机器上的所有可用计算资源。YARN根据可用的资源对运行在集群中的应用(如MapReduce)的资源请求进行协商。然后YARN通过分配容器为每个应用程序提供处理能力。容器是YARN中处理能力的基本单元,是资源元素(如内存和CPU)的封装。
在Hadoop集群中,平衡内存(RAM)、处理器(CPU内核)和磁盘的使用至关重要,这样处理就不会受到这些集群资源中的任何一种的限制。作为一个普遍的建议,允许每个磁盘和每个核使用两个Container可以最好地平衡集群的利用率。
在为集群节点确定适当的YARN内存配置时,从可用的硬件资源开始。在每个节点上需要注意以下值:
- RAM (内存)
- CORES (CPU)
- DISKS (磁盘)
- 计算预留内存
YARN的总可用内存应该考虑预留内存。预留内存是系统进程和其他Hadoop进程(如HBase)所需要的RAM。
使用下表来确定每个节点的预留内存。
保留内存建议:
每个节点总内存 | 推荐预留系统内存 | HBase推荐预留内存 |
4 GB | 1 GB | 1 GB |
8 GB | 2 GB | 1 GB |
16 GB | 2 GB | 2 GB |
24 GB | 4 GB | 4 GB |
48 GB | 6 GB | 8 GB |
64 GB | 8 GB | 8 GB |
72 GB | 8 GB | 8 GB |
96 GB | 12 GB | 16 GB |
128 GB | 24 GB | 24 GB |
256 GB | 32 GB | 32 GB |
512 GB | 64 GB | 64GB |
- 计算容器数量
接下来的计算是确定每个节点允许的最大容器数量。可以使用以下公式:
containers = min (2*CORES, 1.8*DISKS, (Total available RAM) / MIN_CONTAINER_SIZE) |
其中“DISKS”为s的值。每台机器的dirs(数据磁盘的数量)。
MIN_CONTAINER_SIZE是最小的容器大小(在内存中)。这个值取决于可用RAM的数量——在较小的内存节点中,最小的容器大小也应该更小。下表列出了推荐值:
每个节点的总内存 | Container大小最小值 |
少于4 GB | 256 MB |
4 GB~8 GB | 512 MB |
8 GB~24 GB | 1024 MB |
大于24 GB | 2048 MB |
最终的计算是确定每个容器的RAM数量:
RAM-per-container = max(MIN_CONTAINER_SIZE, (Total Available RAM) / containers)) |
通过这些计算,可以设置YARN和其他组件的配置。
- 计算内存大小
YARN、MapReduce和Tez配置参数的计算:
配置参数 | 值计算 |
yarn.nodemanager.resource.memory-mb | = containers * RAM-per-container |
yarn.scheduler.minimum-allocation-mb | = RAM-per-container |
yarn.scheduler.maximum-allocation-mb | = containers * RAM-per-container |
mapreduce.map.memory.mb | = RAM-per-container |
mapreduce.reduce.memory.mb | = 2 * RAM-per-container |
mapreduce.map.java.opts | = 0.8 * RAM-per-container |
mapreduce.reduce.java.opts | = 0.8 * 2 * RAM-per-container |
yarn.app.mapreduce.am.resource.mb | = 2 * RAM-per-container |
yarn.app.mapreduce.am.command-opts | = 0.8 * 2 * RAM-per-container |
tez.am.resource.memory.mb | = RAM-per-container |
tez.am.launch.cmd-opts | = 0.8 * tez.am.resource.memory.mb |
hive.tez.container.size | = RAM-per-container or 2 * RAM-per-container |
hive.tez.java.opts | = 0.8 * tez.am.resource.memory.mb |
tez.container.max.java.heap.fraction | 默认0.8,该参数建议调整,需根据具体业务情况修改 |
tez.runtime.io.sort.mb | =0.4 * hive.tez.container.size |
5.2 做减法
对执行任务做减法的目的是通过减少数据量的方式,测试该语句的可执行性,分析出在既定资源的基础上数据量的极限值,对资源分配策略提供依据。经过对原始SQL语句的初步分析,发现任务执行卡顿主要是因为有2张表的原始数据量比较大,而虽然参与计算的2张表数据量不是很大(百万级别),但经过20多个and和or的组合条件,中间结果将会生成数亿及数10亿的数据量,导致内存溢出。因此将原5张表的执行任务做减法,针对性的分析这2张大表的执行过程,目的为找出这2张大表的数据量的执行极限值。具体方法如下:
1、做减法时的资源分配情况:
tez.am.resource.memory.mb = RAM-per-container = 2 GB hive.tez.container.size = RAM-per-container = 2 GB hive.tez.java.opts = 0.8 * RAM-per-container = 1.6 GB tez.container.max.java.heap.fraction = 0.8 tez.am.java.opts = 0.8 * RAM-per-container = 1.6 GB tez.runtime.io.sort.mb = 40% * hive.tez.container.size = 0.4 * 2 GB = 0.8 GB hive.auto.convert.join.noconditionaltask.size = 667MB |
2、将customer_demographics表数据固定在10000条数据,date_dim数据一次做减法测试,测出该语句可查询的极限值。
表名 例子 | customer_demographics 数据量(Row) | date_dim 数据量(Row) | 是否成功 |
Test1 | 10000 | 45000 | 否 |
Test2 | 10000 | 40000 | 否 |
Test3 | 10000 | 38000 | 否 |
Test4 | 10000 | 37000 | 否 |
Test5 | 10000 | 36000 | 否 |
Test6 | 10000 | 35900 | 否 |
Test7 | 10000 | 35800 | 否 |
Test8 | 10000 | 35795 | 否 |
Test9 | 10000 | 35794 | 是 |
Test10 | 10000 | 35793 | 是 |
Test11 | 10000 | 35790 | 是 |
经过做减法测试,可以得出,在既定资源分配策略(上述)基础上,”or组合条件”任务的查询,仅支持2张表数据量在10000条和35794条。一旦超出数据量,将会频繁GC,初步定位为因执行路径不同导致任务每个阶段所需资源不同。
附:Hive缩减数据量分析方法:
with cd as (select * from customer_demographics limit 10000), ca as (select * from customer_address ), dd as (select * from date_dim limit 35795) select sum (ss_quantity) from store_sales, store, cd,ca , dd where s_store_sk = ss_store_sk and ss_sold_date_sk = d_date_sk and d_year = 1998 and ( ( cd_demo_sk = ss_cdemo_sk and cd_marital_status = 'M' and cd_education_status = 'Unknown' and ss_sales_price between 100.00 and 150.00 ) or ( cd_demo_sk = ss_cdemo_sk and cd_marital_status = 'W' and cd_education_status = 'College' and ss_sales_price between 50.00 and 100.00 ) or ( cd_demo_sk = ss_cdemo_sk and cd_marital_status = 'D' and cd_education_status = 'Primary' and ss_sales_price between 150.00 and 200.00 ) ) and ( ( ss_addr_sk = ca_address_sk and ca_country = 'United States' and ca_state in ('MI', 'GA', 'NH') and ss_net_profit between 0 and 2000 ) or (ss_addr_sk = ca_address_sk and ca_country = 'United States' and ca_state in ('TX', 'KY', 'SD') and ss_net_profit between 150 and 3000 ) or (ss_addr_sk = ca_address_sk and ca_country = 'United States' and ca_state in ('NY', 'OH', 'FL') and ss_net_profit between 50 and 25000 ) ) ; |
5.3 资源优化
依据上一步对任务做减法的分析结果得出的数据量极限值,进一步研究是否可以通过资源优化的方式加大资源配比,保证任务正常执行。目标是找出影响任务执行的资源配比的核心参数。
关键参数如下:
tez.am.resource.memory.mb
默认值 | 参数说明 | 详细解释 |
128 | Application Master分配的container大小,单位为M |
tez.am.launch.cmd-opts
默认值 | 参数说明 | 详细解释 |
-Dlog4j.configurationFile= tez-container-log4j2.properties -Dtez.container.log.level= INFO -Dtez.container.root.logger=CLA | Tez AppMaster进程启动期间提供的命令行选项。 不要在这些启动选项中设置任何Xmx或Xms,以便Tez可以自动确定它们 | 不需要主动设置 |
hive.tez.container.size
默认值 | 参数说明 | 详细解释 |
128 | Tez AppMaster向RM申请的container大小,单位M | 不需要主动设置TEZ的AppMaster占用的container大小由TEZ自动跳转,但是向AM申请出来的container大小则需要本参数管理 |
hive.tez.java.opts
默认值 | 参数说明 | 详细解释 |
-Dlog4j.configurationFile =tez-container-log4j2.properties -Dtez.container.log.level=INFO -Dtez.container.root.logger=CLA | container进程启动期间提供的命令行选项。可以在默认参数后续添加内存参数选项,比如:-Xmx7500m -Xms 7500m | 该参数大小一般为hive.tez.container.size的80%,不建议直接在该参数中直接添加Xmx/Xms,而是使用下面的参数调参opts大小 |
tez.container.max.java.heap.fraction
默认值 | 参数说明 | 详细解释 |
0.8 | 如果hive.tez.java.ops参数中没有设置Xmx/Xms指标的话,TEZ将选择该参数来确定Xmx/Xms的值,值得大小为0.8*hive.tez.container.size | 建议使用该值来调整opts |
5.4 根因定位
5.4.1 执行路径问题
通过对不同条件的查询计划的分析,找出相同任务的不同执行路径区别,目的是分析出不同的执行路径对Hive执行过程的影响。
1、任务采用”or”的组合条件查询计划
Vertex dependency in root stage Map 1 <- Map 5 (BROADCAST_EDGE), Map 6 (BROADCAST_EDGE), Map 7 (BROADCAST_EDGE), Map 8 (BROADCAST_EDGE), Union 2 (CONTAINS) Map 14 <- Map 15 (BROADCAST_EDGE), Map 16 (BROADCAST_EDGE), Map 17 (BROADCAST_EDGE), Map 18 (BROADCAST_EDGE), Union 2 (CONTAINS) Map 19 <- Map 20 (BROADCAST_EDGE), Map 21 (BROADCAST_EDGE), Map 22 (BROADCAST_EDGE), Map 23 (BROADCAST_EDGE), Union 2 (CONTAINS) Map 24 <- Map 25 (BROADCAST_EDGE), Map 26 (BROADCAST_EDGE), Map 27 (BROADCAST_EDGE), Map 28 (BROADCAST_EDGE), Union 2 (CONTAINS) Map 29 <- Map 30 (BROADCAST_EDGE), Map 31 (BROADCAST_EDGE), Map 32 (BROADCAST_EDGE), Map 33 (BROADCAST_EDGE), Union 2 (CONTAINS) Map 9 <- Map 10 (BROADCAST_EDGE), Map 11 (BROADCAST_EDGE), Map 12 (BROADCAST_EDGE), Map 13 (BROADCAST_EDGE), Union 2 (CONTAINS) Reducer 3 <- Union 2 (SIMPLE_EDGE) Reducer 4 <- Reducer 3 (SIMPLE_EDGE) Stage-0 Fetch Operator limit:-1 Stage-1 Reducer 4 File Output Operator [FS_159] compressed:false Statistics:Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE table:{"input format:":"org.apache.hadoop.mapred.TextInputFormat","output format:":"org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat","serde:":"org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe"} Group By Operator [GBY_157] | aggregations:["sum(VALUE._col0)"] | outputColumnNames:["_col0"] | Statistics:Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE |<-Reducer 3 [SIMPLE_EDGE] Reduce Output Operator [RS_156] sort order: Statistics:Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE value expressions:_col0 (type: bigint) Group By Operator [GBY_155] aggregations:["sum(_col0)"] outputColumnNames:["_col0"] Statistics:Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE Group By Operator [GBY_152] | keys:KEY._col0 (type: int) | outputColumnNames:["_col0"] | Statistics:Num rows: 110523 Data size: 884190 Basic stats: COMPLETE Column stats: NONE |<-Union 2 [SIMPLE_EDGE] |<-Map 19 [CONTAINS] vectorized | Reduce Output Operator [RS_151] | key expressions:_col0 (type: int) | Map-reduce partition columns:_col0 (type: int) | sort order:+ | Statistics:Num rows: 221046 Data size: 1768380 Basic stats: COMPLETE Column stats: NONE | Group By Operator [OP_1058] | keys:_col0 (type: int) | outputColumnNames:["_col0"] | Statistics:Num rows: 221046 Data size: 1768380 Basic stats: COMPLETE Column stats: NONE | Select Operator [OP_1057] | outputColumnNames:["_col0"] | Statistics:Num rows: 36841 Data size: 294730 Basic stats: COMPLETE Column stats: NONE | Filter Operator [FIL_1056] | predicate:((_col22 = _col86) and (_col5 = _col70) and (_col26 = _col6)) (type: boolean) | Statistics:Num rows: 36841 Data size: 294730 Basic stats: COMPLETE Column stats: NONE | Map Join Operator [MAPJOIN_1055] | | condition map:[{"":"Inner Join 0 to 1"}] | | HybridGraceHashJoin:true | | keys:{"Map 19":"_col22 (type: bigint)","Map 23":"d_date_sk (type: bigint)"} | | outputColumnNames:["_col5","_col6","_col9","_col22","_col26","_col70","_col86"] | | Statistics:Num rows: 294729 Data size: 2357855 Basic stats: COMPLETE Column stats: NONE | |<-Map 23 [BROADCAST_EDGE] vectorized | | Reduce Output Operator [RS_1048] | | key expressions:d_date_sk (type: bigint) | | Map-reduce partition columns:d_date_sk (type: bigint) | | sort order:+ | | Statistics:Num rows: 3954 Data size: 47450 Basic stats: COMPLETE Column stats: NONE | | Filter Operator [FIL_1047] | | predicate:(d_date_sk is not null and (d_year = 1998)) (type: boolean) | | Statistics:Num rows: 3954 Data size: 47450 Basic stats: COMPLETE Column stats: NONE | | TableScan [TS_78] | | alias:date_dim | | Statistics:Num rows: 15816 Data size: 189801 Basic stats: COMPLETE Column stats: NONE | | Dynamic Partitioning Event Operator [EVENT_270] | | Statistics:Num rows: 3954 Data size: 47450 Basic stats: COMPLETE Column stats: NONE | | Group By Operator [OP_1050] | | keys:_col0 (type: bigint) | | outputColumnNames:["_col0"] | | Statistics:Num rows: 3954 Data size: 47450 Basic stats: COMPLETE Column stats: NONE | | Select Operator [OP_1049] | | outputColumnNames:["_col0"] | | Statistics:Num rows: 3954 Data size: 47450 Basic stats: COMPLETE Column stats: NONE | | Please refer to the previous Filter Operator [FIL_1047] | |<-Map Join Operator [MAPJOIN_1054] | | condition map:[{"":"Inner Join 0 to 1"}] | | HybridGraceHashJoin:true | | keys:{"Map 19":"_col5 (type: bigint)","Map 22":"ca_address_sk (type: bigint)"} | | outputColumnNames:["_col5","_col6","_col9","_col22","_col26","_col70"] | | Statistics:Num rows: 267936 Data size: 2143505 Basic stats: COMPLETE Column stats: NONE | |<-Map 22 [BROADCAST_EDGE] | | Reduce Output Operator [RS_90] | | key expressions:ca_address_sk (type: bigint) | | Map-reduce partition columns:ca_address_sk (type: bigint) | | sort order:+ | | Statistics:Num rows: 3843 Data size: 115290 Basic stats: COMPLETE Column stats: NONE | | Filter Operator [FIL_247] | | predicate:(ca_address_sk is not null and (ca_country = 'United States') and (ca_state) IN ('MI', 'GA', 'NH')) (type: boolean) | | Statistics:Num rows: 3843 Data size: 115290 Basic stats: COMPLETE Column stats: NONE | | TableScan [TS_77] | | alias:customer_address | | Statistics:Num rows: 30745 Data size: 922354 Basic stats: COMPLETE Column stats: NONE | |<-Map Join Operator [MAPJOIN_1053] | | condition map:[{"":"Inner Join 0 to 1"}] | | keys:{} | | outputColumnNames:["_col5","_col6","_col9","_col22","_col26"] | | Statistics:Num rows: 243579 Data size: 1948641 Basic stats: COMPLETE Column stats: NONE | |<-Map 21 [BROADCAST_EDGE] vectorized | | Reduce Output Operator [RS_1046] | | sort order: | | Statistics:Num rows: 1 Data size: 56311 Basic stats: COMPLETE Column stats: COMPLETE | | TableScan [TS_76] | | alias:customer_demographics | | Statistics:Num rows: 1 Data size: 56311 Basic stats: COMPLETE Column stats: COMPLETE | |<-Map Join Operator [MAPJOIN_1052] | | condition map:[{"":"Inner Join 0 to 1"}] | | HybridGraceHashJoin:true | | keys:{"Map 19":"ss_store_sk (type: bigint)","Map 20":"s_store_sk (type: bigint)"} | | outputColumnNames:["_col5","_col6","_col9","_col22","_col26"] | | Statistics:Num rows: 221436 Data size: 1771492 Basic stats: COMPLETE Column stats: NONE | |<-Map 20 [BROADCAST_EDGE] vectorized | | Reduce Output Operator [RS_1045] | | key expressions:s_store_sk (type: bigint) | | Map-reduce partition columns:s_store_sk (type: bigint) | | sort order:+ | | Statistics:Num rows: 324 Data size: 2593 Basic stats: COMPLETE Column stats: NONE | | Filter Operator [FIL_1044] | | predicate:s_store_sk is not null (type: boolean) | | Statistics:Num rows: 324 Data size: 2593 Basic stats: COMPLETE Column stats: NONE | | TableScan [TS_75] | | alias:store | | Statistics:Num rows: 648 Data size: 5186 Basic stats: COMPLETE Column stats: NONE | |<-Filter Operator [FIL_1051] | predicate:(ss_store_sk is not null and ss_addr_sk is not null and ss_net_profit BETWEEN 0 AND 2000) (type: boolean) | Statistics:Num rows: 201306 Data size: 1610448 Basic stats: COMPLETE Column stats: PARTIAL | TableScan [TS_74] | alias:store_sales | Statistics:Num rows: 1610447 Data size: 212697869 Basic stats: COMPLETE Column stats: PARTIAL |<-Map 29 [CONTAINS] vectorized | Reduce Output Operator [RS_151] | key expressions:_col0 (type: int) | Map-reduce partition columns:_col0 (type: int) | sort order:+ | Statistics:Num rows: 221046 Data size: 1768380 Basic stats: COMPLETE Column stats: NONE | Group By Operator [OP_1026] | keys:_col0 (type: int) | outputColumnNames:["_col0"] | Statistics:Num rows: 221046 Data size: 1768380 Basic stats: COMPLETE Column stats: NONE | Select Operator [OP_1025] | outputColumnNames:["_col0"] | Statistics:Num rows: 36841 Data size: 294730 Basic stats: COMPLETE Column stats: NONE | Filter Operator [FIL_1024] | predicate:((_col22 = _col86) and (_col5 = _col70) and (_col26 = _col6)) (type: boolean) | Statistics:Num rows: 36841 Data size: 294730 Basic stats: COMPLETE Column stats: NONE | Map Join Operator [MAPJOIN_1023] | | condition map:[{"":"Inner Join 0 to 1"}] | | HybridGraceHashJoin:true | | keys:{"Map 29":"_col22 (type: bigint)","Map 33":"d_date_sk (type: bigint)"} | | outputColumnNames:["_col5","_col6","_col9","_col22","_col26","_col70","_col86"] | | Statistics:Num rows: 294729 Data size: 2357855 Basic stats: COMPLETE Column stats: NONE | |<-Map 33 [BROADCAST_EDGE] vectorized | | Reduce Output Operator [RS_1016] | | key expressions:d_date_sk (type: bigint) | | Map-reduce partition columns:d_date_sk (type: bigint) | | sort order:+ | | Statistics:Num rows: 3954 Data size: 47450 Basic stats: COMPLETE Column stats: NONE | | Filter Operator [FIL_1015] | | predicate:(d_date_sk is not null and (d_year = 1998)) (type: boolean) | | Statistics:Num rows: 3954 Data size: 47450 Basic stats: COMPLETE Column stats: NONE | | TableScan [TS_127] | | alias:date_dim | | Statistics:Num rows: 15816 Data size: 189801 Basic stats: COMPLETE Column stats: NONE | | Dynamic Partitioning Event Operator [EVENT_276] | | Statistics:Num rows: 3954 Data size: 47450 Basic stats: COMPLETE Column stats: NONE | | Group By Operator [OP_1018] | | keys:_col0 (type: bigint) | | outputColumnNames:["_col0"] | | Statistics:Num rows: 3954 Data size: 47450 Basic stats: COMPLETE Column stats: NONE | | Select Operator [OP_1017] | | outputColumnNames:["_col0"] | | Statistics:Num rows: 3954 Data size: 47450 Basic stats: COMPLETE Column stats: NONE | | Please refer to the previous Filter Operator [FIL_1015] | |<-Map Join Operator [MAPJOIN_1022] | | condition map:[{"":"Inner Join 0 to 1"}] | | HybridGraceHashJoin:true | | keys:{"Map 29":"_col5 (type: bigint)","Map 32":"ca_address_sk (type: bigint)"} | | outputColumnNames:["_col5","_col6","_col9","_col22","_col26","_col70"] | | Statistics:Num rows: 267936 Data size: 2143505 Basic stats: COMPLETE Column stats: NONE | |<-Map 32 [BROADCAST_EDGE] | | Reduce Output Operator [RS_139] | | key expressions:ca_address_sk (type: bigint) | | Map-reduce partition columns:ca_address_sk (type: bigint) | | sort order:+ | | Statistics:Num rows: 3843 Data size: 115290 Basic stats: COMPLETE Column stats: NONE | | Filter Operator [FIL_257] | | predicate:(ca_address_sk is not null and (ca_country = 'United States') and (ca_state) IN ('NY', 'OH', 'FL')) (type: boolean) | | Statistics:Num rows: 3843 Data size: 115290 Basic stats: COMPLETE Column stats: NONE | | TableScan [TS_126] | | alias:customer_address | | Statistics:Num rows: 30745 Data size: 922354 Basic stats: COMPLETE Column stats: NONE | |<-Map Join Operator [MAPJOIN_1021] | | condition map:[{"":"Inner Join 0 to 1"}] | | keys:{} | | outputColumnNames:["_col5","_col6","_col9","_col22","_col26"] | | Statistics:Num rows: 243579 Data size: 1948641 Basic stats: COMPLETE Column stats: NONE | |<-Map 31 [BROADCAST_EDGE] vectorized | | Reduce Output Operator [RS_1014] | | sort order: | | Statistics:Num rows: 1 Data size: 56311 Basic stats: COMPLETE Column stats: COMPLETE | | TableScan [TS_125] | | alias:customer_demographics | | Statistics:Num rows: 1 Data size: 56311 Basic stats: COMPLETE Column stats: COMPLETE | |<-Map Join Operator [MAPJOIN_1020] | | condition map:[{"":"Inner Join 0 to 1"}] | | HybridGraceHashJoin:true | | keys:{"Map 29":"ss_store_sk (type: bigint)","Map 30":"s_store_sk (type: bigint)"} | | outputColumnNames:["_col5","_col6","_col9","_col22","_col26"] | | Statistics:Num rows: 221436 Data size: 1771492 Basic stats: COMPLETE Column stats: NONE | |<-Map 30 [BROADCAST_EDGE] vectorized | | Reduce Output Operator [RS_1013] | | key expressions:s_store_sk (type: bigint) | | Map-reduce partition columns:s_store_sk (type: bigint) | | sort order:+ | | Statistics:Num rows: 324 Data size: 2593 Basic stats: COMPLETE Column stats: NONE | | Filter Operator [FIL_1012] | | predicate:s_store_sk is not null (type: boolean) | | Statistics:Num rows: 324 Data size: 2593 Basic stats: COMPLETE Column stats: NONE | | TableScan [TS_124] | | alias:store | | Statistics:Num rows: 648 Data size: 5186 Basic stats: COMPLETE Column stats: NONE | |<-Filter Operator [FIL_1019] | predicate:(ss_store_sk is not null and ss_addr_sk is not null and ss_net_profit BETWEEN 50 AND 25000) (type: boolean) | Statistics:Num rows: 201306 Data size: 1610448 Basic stats: COMPLETE Column stats: PARTIAL | TableScan [TS_123] | alias:store_sales | Statistics:Num rows: 1610447 Data size: 212697869 Basic stats: COMPLETE Column stats: PARTIAL |<-Map 1 [CONTAINS] vectorized | Reduce Output Operator [RS_151] | key expressions:_col0 (type: int) | Map-reduce partition columns:_col0 (type: int) | sort order:+ | Statistics:Num rows: 221046 Data size: 1768380 Basic stats: COMPLETE Column stats: NONE | Group By Operator [OP_1043] | keys:_col0 (type: int) | outputColumnNames:["_col0"] | Statistics:Num rows: 221046 Data size: 1768380 Basic stats: COMPLETE Column stats: NONE | Select Operator [OP_1042] | outputColumnNames:["_col0"] | Statistics:Num rows: 36841 Data size: 294730 Basic stats: COMPLETE Column stats: NONE | Filter Operator [FIL_1041] | predicate:((_col58 = _col3) and (_col22 = _col86) and (_col26 = _col6)) (type: boolean) | Statistics:Num rows: 36841 Data size: 294730 Basic stats: COMPLETE Column stats: NONE | Map Join Operator [MAPJOIN_1040] | | condition map:[{"":"Inner Join 0 to 1"}] | | HybridGraceHashJoin:true | | keys:{"Map 1":"_col22 (type: bigint)","Map 8":"d_date_sk (type: bigint)"} | | outputColumnNames:["_col3","_col6","_col9","_col22","_col26","_col58","_col86"] | | Statistics:Num rows: 294729 Data size: 2357855 Basic stats: COMPLETE Column stats: NONE | |<-Map 8 [BROADCAST_EDGE] vectorized | | Reduce Output Operator [RS_1033] | | key expressions:d_date_sk (type: bigint) | | Map-reduce partition columns:d_date_sk (type: bigint) | | sort order:+ | | Statistics:Num rows: 3954 Data size: 47450 Basic stats: COMPLETE Column stats: NONE | | Filter Operator [FIL_1032] | | predicate:(d_date_sk is not null and (d_year = 1998)) (type: boolean) | | Statistics:Num rows: 3954 Data size: 47450 Basic stats: COMPLETE Column stats: NONE | | TableScan [TS_4] | | alias:date_dim | | Statistics:Num rows: 15816 Data size: 189801 Basic stats: COMPLETE Column stats: NONE | | Dynamic Partitioning Event Operator [EVENT_261] | | Statistics:Num rows: 3954 Data size: 47450 Basic stats: COMPLETE Column stats: NONE | | Group By Operator [OP_1035] | | keys:_col0 (type: bigint) | | outputColumnNames:["_col0"] | | Statistics:Num rows: 3954 Data size: 47450 Basic stats: COMPLETE Column stats: NONE | | Select Operator [OP_1034] | | outputColumnNames:["_col0"] | | Statistics:Num rows: 3954 Data size: 47450 Basic stats: COMPLETE Column stats: NONE | | Please refer to the previous Filter Operator [FIL_1032] | |<-Map Join Operator [MAPJOIN_1039] | | condition map:[{"":"Inner Join 0 to 1"}] | | keys:{} | | outputColumnNames:["_col3","_col6","_col9","_col22","_col26","_col58"] | | Statistics:Num rows: 267936 Data size: 2143505 Basic stats: COMPLETE Column stats: NONE | |<-Map 7 [BROADCAST_EDGE] vectorized | | Reduce Output Operator [RS_1031] | | sort order: | | Statistics:Num rows: 1 Data size: 922354 Basic stats: COMPLETE Column stats: COMPLETE | | TableScan [TS_3] | | alias:customer_address | | Statistics:Num rows: 1 Data size: 922354 Basic stats: COMPLETE Column stats: COMPLETE | |<-Map Join Operator [MAPJOIN_1038] | | condition map:[{"":"Inner Join 0 to 1"}] | | HybridGraceHashJoin:true | | keys:{"Map 1":"_col3 (type: bigint)","Map 6":"cd_demo_sk (type: bigint)"} | | outputColumnNames:["_col3","_col6","_col9","_col22","_col26","_col58"] | | Statistics:Num rows: 243579 Data size: 1948641 Basic stats: COMPLETE Column stats: NONE | |<-Map 6 [BROADCAST_EDGE] vectorized | | Reduce Output Operator [RS_1030] | | key expressions:cd_demo_sk (type: bigint) | | Map-reduce partition columns:cd_demo_sk (type: bigint) | | sort order:+ | | Statistics:Num rows: 242 Data size: 7020 Basic stats: COMPLETE Column stats: NONE | | Filter Operator [FIL_1029] | | predicate:(cd_demo_sk is not null and (cd_marital_status = 'M') and (cd_education_status = 'Unknown ')) (type: boolean) | | Statistics:Num rows: 242 Data size: 7020 Basic stats: COMPLETE Column stats: NONE | | TableScan [TS_2] | | alias:customer_demographics | | Statistics:Num rows: 1941 Data size: 56311 Basic stats: COMPLETE Column stats: NONE | |<-Map Join Operator [MAPJOIN_1037] | | condition map:[{"":"Inner Join 0 to 1"}] | | HybridGraceHashJoin:true | | keys:{"Map 1":"ss_store_sk (type: bigint)","Map 5":"s_store_sk (type: bigint)"} | | outputColumnNames:["_col3","_col6","_col9","_col22","_col26"] | | Statistics:Num rows: 221436 Data size: 1771492 Basic stats: COMPLETE Column stats: NONE | |<-Map 5 [BROADCAST_EDGE] vectorized | | Reduce Output Operator [RS_1028] | | key expressions:s_store_sk (type: bigint) | | Map-reduce partition columns:s_store_sk (type: bigint) | | sort order:+ | | Statistics:Num rows: 324 Data size: 2593 Basic stats: COMPLETE Column stats: NONE | | Filter Operator [FIL_1027] | | predicate:s_store_sk is not null (type: boolean) | | Statistics:Num rows: 324 Data size: 2593 Basic stats: COMPLETE Column stats: NONE | | TableScan [TS_1] | | alias:store | | Statistics:Num rows: 648 Data size: 5186 Basic stats: COMPLETE Column stats: NONE | |<-Filter Operator [FIL_1036] | predicate:(ss_store_sk is not null and ss_cdemo_sk is not null and ss_sales_price BETWEEN 100.0 AND 150.0) (type: boolean) | Statistics:Num rows: 201306 Data size: 1610448 Basic stats: COMPLETE Column stats: PARTIAL | TableScan [TS_0] | alias:store_sales | Statistics:Num rows: 1610447 Data size: 212697869 Basic stats: COMPLETE Column stats: PARTIAL |<-Map 14 [CONTAINS] vectorized | Reduce Output Operator [RS_151] | key expressions:_col0 (type: int) | Map-reduce partition columns:_col0 (type: int) | sort order:+ | Statistics:Num rows: 221046 Data size: 1768380 Basic stats: COMPLETE Column stats: NONE | Group By Operator [OP_1092] | keys:_col0 (type: int) | outputColumnNames:["_col0"] | Statistics:Num rows: 221046 Data size: 1768380 Basic stats: COMPLETE Column stats: NONE | Select Operator [OP_1091] | outputColumnNames:["_col0"] | Statistics:Num rows: 36841 Data size: 294730 Basic stats: COMPLETE Column stats: NONE | Filter Operator [FIL_1090] | predicate:((_col58 = _col3) and (_col22 = _col86) and (_col26 = _col6)) (type: boolean) | Statistics:Num rows: 36841 Data size: 294730 Basic stats: COMPLETE Column stats: NONE | Map Join Operator [MAPJOIN_1089] | | condition map:[{"":"Inner Join 0 to 1"}] | | HybridGraceHashJoin:true | | keys:{"Map 18":"d_date_sk (type: bigint)","Map 14":"_col22 (type: bigint)"} | | outputColumnNames:["_col3","_col6","_col9","_col22","_col26","_col58","_col86"] | | Statistics:Num rows: 294729 Data size: 2357855 Basic stats: COMPLETE Column stats: NONE | |<-Map 18 [BROADCAST_EDGE] vectorized | | Reduce Output Operator [RS_1082] | | key expressions:d_date_sk (type: bigint) | | Map-reduce partition columns:d_date_sk (type: bigint) | | sort order:+ | | Statistics:Num rows: 3954 Data size: 47450 Basic stats: COMPLETE Column stats: NONE | | Filter Operator [FIL_1081] | | predicate:(d_date_sk is not null and (d_year = 1998)) (type: boolean) | | Statistics:Num rows: 3954 Data size: 47450 Basic stats: COMPLETE Column stats: NONE | | TableScan [TS_53] | | alias:date_dim | | Statistics:Num rows: 15816 Data size: 189801 Basic stats: COMPLETE Column stats: NONE | | Dynamic Partitioning Event Operator [EVENT_267] | | Statistics:Num rows: 3954 Data size: 47450 Basic stats: COMPLETE Column stats: NONE | | Group By Operator [OP_1084] | | keys:_col0 (type: bigint) | | outputColumnNames:["_col0"] | | Statistics:Num rows: 3954 Data size: 47450 Basic stats: COMPLETE Column stats: NONE | | Select Operator [OP_1083] | | outputColumnNames:["_col0"] | | Statistics:Num rows: 3954 Data size: 47450 Basic stats: COMPLETE Column stats: NONE | | Please refer to the previous Filter Operator [FIL_1081] | |<-Map Join Operator [MAPJOIN_1088] | | condition map:[{"":"Inner Join 0 to 1"}] | | keys:{} | | outputColumnNames:["_col3","_col6","_col9","_col22","_col26","_col58"] | | Statistics:Num rows: 267936 Data size: 2143505 Basic stats: COMPLETE Column stats: NONE | |<-Map 17 [BROADCAST_EDGE] vectorized | | Reduce Output Operator [RS_1080] | | sort order: | | Statistics:Num rows: 1 Data size: 922354 Basic stats: COMPLETE Column stats: COMPLETE | | TableScan [TS_52] | | alias:customer_address | | Statistics:Num rows: 1 Data size: 922354 Basic stats: COMPLETE Column stats: COMPLETE | |<-Map Join Operator [MAPJOIN_1087] | | condition map:[{"":"Inner Join 0 to 1"}] | | HybridGraceHashJoin:true | | keys:{"Map 16":"cd_demo_sk (type: bigint)","Map 14":"_col3 (type: bigint)"} | | outputColumnNames:["_col3","_col6","_col9","_col22","_col26","_col58"] | | Statistics:Num rows: 243579 Data size: 1948641 Basic stats: COMPLETE Column stats: NONE | |<-Map 16 [BROADCAST_EDGE] vectorized | | Reduce Output Operator [RS_1079] | | key expressions:cd_demo_sk (type: bigint) | | Map-reduce partition columns:cd_demo_sk (type: bigint) | | sort order:+ | | Statistics:Num rows: 242 Data size: 7020 Basic stats: COMPLETE Column stats: NONE | | Filter Operator [FIL_1078] | | predicate:(cd_demo_sk is not null and (cd_marital_status = 'D') and (cd_education_status = 'Primary ')) (type: boolean) | | Statistics:Num rows: 242 Data size: 7020 Basic stats: COMPLETE Column stats: NONE | | TableScan [TS_51] | | alias:customer_demographics | | Statistics:Num rows: 1941 Data size: 56311 Basic stats: COMPLETE Column stats: NONE | |<-Map Join Operator [MAPJOIN_1086] | | condition map:[{"":"Inner Join 0 to 1"}] | | HybridGraceHashJoin:true | | keys:{"Map 15":"s_store_sk (type: bigint)","Map 14":"ss_store_sk (type: bigint)"} | | outputColumnNames:["_col3","_col6","_col9","_col22","_col26"] | | Statistics:Num rows: 221436 Data size: 1771492 Basic stats: COMPLETE Column stats: NONE | |<-Map 15 [BROADCAST_EDGE] vectorized | | Reduce Output Operator [RS_1077] | | key expressions:s_store_sk (type: bigint) | | Map-reduce partition columns:s_store_sk (type: bigint) | | sort order:+ | | Statistics:Num rows: 324 Data size: 2593 Basic stats: COMPLETE Column stats: NONE | | Filter Operator [FIL_1076] | | predicate:s_store_sk is not null (type: boolean) | | Statistics:Num rows: 324 Data size: 2593 Basic stats: COMPLETE Column stats: NONE | | TableScan [TS_50] | | alias:store | | Statistics:Num rows: 648 Data size: 5186 Basic stats: COMPLETE Column stats: NONE | |<-Filter Operator [FIL_1085] | predicate:(ss_store_sk is not null and ss_cdemo_sk is not null and ss_sales_price BETWEEN 150.0 AND 200.0) (type: boolean) | Statistics:Num rows: 201306 Data size: 1610448 Basic stats: COMPLETE Column stats: PARTIAL | TableScan [TS_49] | alias:store_sales | Statistics:Num rows: 1610447 Data size: 212697869 Basic stats: COMPLETE Column stats: PARTIAL |<-Map 24 [CONTAINS] vectorized | Reduce Output Operator [RS_151] | key expressions:_col0 (type: int) | Map-reduce partition columns:_col0 (type: int) | sort order:+ | Statistics:Num rows: 221046 Data size: 1768380 Basic stats: COMPLETE Column stats: NONE | Group By Operator [OP_1011] | keys:_col0 (type: int) | outputColumnNames:["_col0"] | Statistics:Num rows: 221046 Data size: 1768380 Basic stats: COMPLETE Column stats: NONE | Select Operator [OP_1010] | outputColumnNames:["_col0"] | Statistics:Num rows: 36841 Data size: 294730 Basic stats: COMPLETE Column stats: NONE | Filter Operator [FIL_1009] | predicate:((_col22 = _col86) and (_col5 = _col70) and (_col26 = _col6)) (type: boolean) | Statistics:Num rows: 36841 Data size: 294730 Basic stats: COMPLETE Column stats: NONE | Map Join Operator [MAPJOIN_1008] | | condition map:[{"":"Inner Join 0 to 1"}] | | HybridGraceHashJoin:true | | keys:{"Map 28":"d_date_sk (type: bigint)","Map 24":"_col22 (type: bigint)"} | | outputColumnNames:["_col5","_col6","_col9","_col22","_col26","_col70","_col86"] | | Statistics:Num rows: 294729 Data size: 2357855 Basic stats: COMPLETE Column stats: NONE | |<-Map 28 [BROADCAST_EDGE] vectorized | | Reduce Output Operator [RS_1001] | | key expressions:d_date_sk (type: bigint) | | Map-reduce partition columns:d_date_sk (type: bigint) | | sort order:+ | | Statistics:Num rows: 3954 Data size: 47450 Basic stats: COMPLETE Column stats: NONE | | Filter Operator [FIL_1000] | | predicate:(d_date_sk is not null and (d_year = 1998)) (type: boolean) | | Statistics:Num rows: 3954 Data size: 47450 Basic stats: COMPLETE Column stats: NONE | | TableScan [TS_102] | | alias:date_dim | | Statistics:Num rows: 15816 Data size: 189801 Basic stats: COMPLETE Column stats: NONE | | Dynamic Partitioning Event Operator [EVENT_273] | | Statistics:Num rows: 3954 Data size: 47450 Basic stats: COMPLETE Column stats: NONE | | Group By Operator [OP_1003] | | keys:_col0 (type: bigint) | | outputColumnNames:["_col0"] | | Statistics:Num rows: 3954 Data size: 47450 Basic stats: COMPLETE Column stats: NONE | | Select Operator [OP_1002] | | outputColumnNames:["_col0"] | | Statistics:Num rows: 3954 Data size: 47450 Basic stats: COMPLETE Column stats: NONE | | Please refer to the previous Filter Operator [FIL_1000] | |<-Map Join Operator [MAPJOIN_1007] | | condition map:[{"":"Inner Join 0 to 1"}] | | HybridGraceHashJoin:true | | keys:{"Map 27":"ca_address_sk (type: bigint)","Map 24":"_col5 (type: bigint)"} | | outputColumnNames:["_col5","_col6","_col9","_col22","_col26","_col70"] | | Statistics:Num rows: 267936 Data size: 2143505 Basic stats: COMPLETE Column stats: NONE | |<-Map 27 [BROADCAST_EDGE] | | Reduce Output Operator [RS_114] | | key expressions:ca_address_sk (type: bigint) | | Map-reduce partition columns:ca_address_sk (type: bigint) | | sort order:+ | | Statistics:Num rows: 3843 Data size: 115290 Basic stats: COMPLETE Column stats: NONE | | Filter Operator [FIL_252] | | predicate:(ca_address_sk is not null and (ca_country = 'United States') and (ca_state) IN ('TX', 'KY', 'SD')) (type: boolean) | | Statistics:Num rows: 3843 Data size: 115290 Basic stats: COMPLETE Column stats: NONE | | TableScan [TS_101] | | alias:customer_address | | Statistics:Num rows: 30745 Data size: 922354 Basic stats: COMPLETE Column stats: NONE | |<-Map Join Operator [MAPJOIN_1006] | | condition map:[{"":"Inner Join 0 to 1"}] | | keys:{} | | outputColumnNames:["_col5","_col6","_col9","_col22","_col26"] | | Statistics:Num rows: 243579 Data size: 1948641 Basic stats: COMPLETE Column stats: NONE | |<-Map 26 [BROADCAST_EDGE] vectorized | | Reduce Output Operator [RS_999] | | sort order: | | Statistics:Num rows: 1 Data size: 56311 Basic stats: COMPLETE Column stats: COMPLETE | | TableScan [TS_100] | | alias:customer_demographics | | Statistics:Num rows: 1 Data size: 56311 Basic stats: COMPLETE Column stats: COMPLETE | |<-Map Join Operator [MAPJOIN_1005] | | condition map:[{"":"Inner Join 0 to 1"}] | | HybridGraceHashJoin:true | | keys:{"Map 25":"s_store_sk (type: bigint)","Map 24":"ss_store_sk (type: bigint)"} | | outputColumnNames:["_col5","_col6","_col9","_col22","_col26"] | | Statistics:Num rows: 221436 Data size: 1771492 Basic stats: COMPLETE Column stats: NONE | |<-Map 25 [BROADCAST_EDGE] vectorized | | Reduce Output Operator [RS_998] | | key expressions:s_store_sk (type: bigint) | | Map-reduce partition columns:s_store_sk (type: bigint) | | sort order:+ | | Statistics:Num rows: 324 Data size: 2593 Basic stats: COMPLETE Column stats: NONE | | Filter Operator [FIL_997] | | predicate:s_store_sk is not null (type: boolean) | | Statistics:Num rows: 324 Data size: 2593 Basic stats: COMPLETE Column stats: NONE | | TableScan [TS_99] | | alias:store | | Statistics:Num rows: 648 Data size: 5186 Basic stats: COMPLETE Column stats: NONE | |<-Filter Operator [FIL_1004] | predicate:(ss_store_sk is not null and ss_addr_sk is not null and ss_net_profit BETWEEN 150 AND 3000) (type: boolean) | Statistics:Num rows: 201306 Data size: 1610448 Basic stats: COMPLETE Column stats: PARTIAL | TableScan [TS_98] | alias:store_sales | Statistics:Num rows: 1610447 Data size: 212697869 Basic stats: COMPLETE Column stats: PARTIAL |<-Map 9 [CONTAINS] vectorized Reduce Output Operator [RS_151] key expressions:_col0 (type: int) Map-reduce partition columns:_col0 (type: int) sort order:+ Statistics:Num rows: 221046 Data size: 1768380 Basic stats: COMPLETE Column stats: NONE Group By Operator [OP_1075] keys:_col0 (type: int) outputColumnNames:["_col0"] Statistics:Num rows: 221046 Data size: 1768380 Basic stats: COMPLETE Column stats: NONE Select Operator [OP_1074] outputColumnNames:["_col0"] Statistics:Num rows: 36841 Data size: 294730 Basic stats: COMPLETE Column stats: NONE Filter Operator [FIL_1073] predicate:((_col58 = _col3) and (_col22 = _col86) and (_col26 = _col6)) (type: boolean) Statistics:Num rows: 36841 Data size: 294730 Basic stats: COMPLETE Column stats: NONE Map Join Operator [MAPJOIN_1072] | condition map:[{"":"Inner Join 0 to 1"}] | HybridGraceHashJoin:true | keys:{"Map 13":"d_date_sk (type: bigint)","Map 9":"_col22 (type: bigint)"} | outputColumnNames:["_col3","_col6","_col9","_col22","_col26","_col58","_col86"] | Statistics:Num rows: 294729 Data size: 2357855 Basic stats: COMPLETE Column stats: NONE |<-Map 13 [BROADCAST_EDGE] vectorized | Reduce Output Operator [RS_1065] | key expressions:d_date_sk (type: bigint) | Map-reduce partition columns:d_date_sk (type: bigint) | sort order:+ | Statistics:Num rows: 3954 Data size: 47450 Basic stats: COMPLETE Column stats: NONE | Filter Operator [FIL_1064] | predicate:(d_date_sk is not null and (d_year = 1998)) (type: boolean) | Statistics:Num rows: 3954 Data size: 47450 Basic stats: COMPLETE Column stats: NONE | TableScan [TS_28] | alias:date_dim | Statistics:Num rows: 15816 Data size: 189801 Basic stats: COMPLETE Column stats: NONE | Dynamic Partitioning Event Operator [EVENT_264] | Statistics:Num rows: 3954 Data size: 47450 Basic stats: COMPLETE Column stats: NONE | Group By Operator [OP_1067] | keys:_col0 (type: bigint) | outputColumnNames:["_col0"] | Statistics:Num rows: 3954 Data size: 47450 Basic stats: COMPLETE Column stats: NONE | Select Operator [OP_1066] | outputColumnNames:["_col0"] | Statistics:Num rows: 3954 Data size: 47450 Basic stats: COMPLETE Column stats: NONE | Please refer to the previous Filter Operator [FIL_1064] |<-Map Join Operator [MAPJOIN_1071] | condition map:[{"":"Inner Join 0 to 1"}] | keys:{} | outputColumnNames:["_col3","_col6","_col9","_col22","_col26","_col58"] | Statistics:Num rows: 267936 Data size: 2143505 Basic stats: COMPLETE Column stats: NONE |<-Map 12 [BROADCAST_EDGE] vectorized | Reduce Output Operator [RS_1063] | sort order: | Statistics:Num rows: 1 Data size: 922354 Basic stats: COMPLETE Column stats: COMPLETE | TableScan [TS_27] | alias:customer_address | Statistics:Num rows: 1 Data size: 922354 Basic stats: COMPLETE Column stats: COMPLETE |<-Map Join Operator [MAPJOIN_1070] | condition map:[{"":"Inner Join 0 to 1"}] | HybridGraceHashJoin:true | keys:{"Map 11":"cd_demo_sk (type: bigint)","Map 9":"_col3 (type: bigint)"} | outputColumnNames:["_col3","_col6","_col9","_col22","_col26","_col58"] | Statistics:Num rows: 243579 Data size: 1948641 Basic stats: COMPLETE Column stats: NONE |<-Map 11 [BROADCAST_EDGE] vectorized | Reduce Output Operator [RS_1062] | key expressions:cd_demo_sk (type: bigint) | Map-reduce partition columns:cd_demo_sk (type: bigint) | sort order:+ | Statistics:Num rows: 242 Data size: 7020 Basic stats: COMPLETE Column stats: NONE | Filter Operator [FIL_1061] | predicate:(cd_demo_sk is not null and (cd_marital_status = 'W') and (cd_education_status = 'College ')) (type: boolean) | Statistics:Num rows: 242 Data size: 7020 Basic stats: COMPLETE Column stats: NONE | TableScan [TS_26] | alias:customer_demographics | Statistics:Num rows: 1941 Data size: 56311 Basic stats: COMPLETE Column stats: NONE |<-Map Join Operator [MAPJOIN_1069] | condition map:[{"":"Inner Join 0 to 1"}] | HybridGraceHashJoin:true | keys:{"Map 10":"s_store_sk (type: bigint)","Map 9":"ss_store_sk (type: bigint)"} | outputColumnNames:["_col3","_col6","_col9","_col22","_col26"] | Statistics:Num rows: 221436 Data size: 1771492 Basic stats: COMPLETE Column stats: NONE |<-Map 10 [BROADCAST_EDGE] vectorized | Reduce Output Operator [RS_1060] | key expressions:s_store_sk (type: bigint) | Map-reduce partition columns:s_store_sk (type: bigint) | sort order:+ | Statistics:Num rows: 324 Data size: 2593 Basic stats: COMPLETE Column stats: NONE | Filter Operator [FIL_1059] | predicate:s_store_sk is not null (type: boolean) | Statistics:Num rows: 324 Data size: 2593 Basic stats: COMPLETE Column stats: NONE | TableScan [TS_25] | alias:store | Statistics:Num rows: 648 Data size: 5186 Basic stats: COMPLETE Column stats: NONE |<-Filter Operator [FIL_1068] predicate:(ss_store_sk is not null and ss_cdemo_sk is not null and ss_sales_price BETWEEN 50.0 AND 100.0) (type: boolean) Statistics:Num rows: 201306 Data size: 1610448 Basic stats: COMPLETE Column stats: PARTIAL TableScan [TS_24] alias:store_sales Statistics:Num rows: 1610447 Data size: 212697869 Basic stats: COMPLETE Column stats: PARTIAL Time taken: 16.207 seconds, Fetched: 628 row(s) |
执行路径的缩略图
任务执行卡顿图
2、union条件和的查询计划
Vertex dependency in root stage Map 1 <- Map 3 (BROADCAST_EDGE), Map 4 (BROADCAST_EDGE), Map 5 (BROADCAST_EDGE), Map 6 (BROADCAST_EDGE) Reducer 2 <- Map 1 (SIMPLE_EDGE) Stage-0 Fetch Operator limit:-1 Stage-1 Reducer 2 File Output Operator [FS_26] compressed:false Statistics:Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE table:{"input format:":"org.apache.hadoop.mapred.TextInputFormat","output format:":"org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat","serde:":"org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe"} Group By Operator [GBY_24] | aggregations:["sum(VALUE._col0)"] | outputColumnNames:["_col0"] | Statistics:Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE |<-Map 1 [SIMPLE_EDGE] Reduce Output Operator [RS_23] sort order: Statistics:Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE value expressions:_col0 (type: bigint) Group By Operator [GBY_22] aggregations:["sum(_col9)"] outputColumnNames:["_col0"] Statistics:Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE Select Operator [SEL_21] outputColumnNames:["_col9"] Statistics:Num rows: 5424 Data size: 43392 Basic stats: COMPLETE Column stats: NONE Filter Operator [FIL_35] predicate:((((_col58 = _col3) and (_col60 = 'M') and (_col61 = 'Unknown ') and _col12 BETWEEN 100.0 AND 150.0) or ((_col58 = _col3) and (_col60 = 'W') and (_col61 = 'College ') and _col12 BETWEEN 50.0 AND 100.0) or ((_col58 = _col3) and (_col60 = 'D') and (_col61 = 'Primary ') and _col12 BETWEEN 150.0 AND 200.0)) and (((_col5 = _col70) and (_col80 = 'United States') and (_col78) IN ('MI', 'GA', 'NH') and _col21 BETWEEN 0 AND 2000) or ((_col5 = _col70) and (_col80 = 'United States') and (_col78) IN ('TX', 'KY', 'SD') and _col21 BETWEEN 150 AND 3000) or ((_col5 = _col70) and (_col80 = 'United States') and (_col78) IN ('NY', 'OH', 'FL') and _col21 BETWEEN 50 AND 25000)) and (_col22 = _col86) and (_col26 = _col6)) (type: boolean) Statistics:Num rows: 5424 Data size: 43392 Basic stats: COMPLETE Column stats: NONE Map Join Operator [MAPJOIN_45] | condition map:[{"":"Inner Join 0 to 1"}] | HybridGraceHashJoin:true | keys:{"Map 1":"_col22 (type: bigint)","Map 6":"d_date_sk (type: bigint)"} | outputColumnNames:["_col3","_col5","_col6","_col9","_col12","_col21","_col22","_col26","_col58","_col60","_col61","_col70","_col78","_col80","_col86"] | Statistics:Num rows: 617224 Data size: 4937798 Basic stats: COMPLETE Column stats: NONE |<-Map 6 [BROADCAST_EDGE] vectorized | Reduce Output Operator [RS_55] | key expressions:d_date_sk (type: bigint) | Map-reduce partition columns:d_date_sk (type: bigint) | sort order:+ | Statistics:Num rows: 3954 Data size: 47450 Basic stats: COMPLETE Column stats: NONE | Filter Operator [FIL_54] | predicate:(d_date_sk is not null and (d_year = 1998)) (type: boolean) | Statistics:Num rows: 3954 Data size: 47450 Basic stats: COMPLETE Column stats: NONE | TableScan [TS_4] | alias:date_dim | Statistics:Num rows: 15816 Data size: 189801 Basic stats: COMPLETE Column stats: NONE | Dynamic Partitioning Event Operator [EVENT_41] | Statistics:Num rows: 3954 Data size: 47450 Basic stats: COMPLETE Column stats: NONE | Group By Operator [OP_57] | keys:_col0 (type: bigint) | outputColumnNames:["_col0"] | Statistics:Num rows: 3954 Data size: 47450 Basic stats: COMPLETE Column stats: NONE | Select Operator [OP_56] | outputColumnNames:["_col0"] | Statistics:Num rows: 3954 Data size: 47450 Basic stats: COMPLETE Column stats: NONE | Please refer to the previous Filter Operator [FIL_54] |<-Map Join Operator [MAPJOIN_44] | condition map:[{"":"Inner Join 0 to 1"}] | keys:{} | outputColumnNames:["_col3","_col5","_col6","_col9","_col12","_col21","_col22","_col26","_col58","_col60","_col61","_col70","_col78","_col80"] | Statistics:Num rows: 561113 Data size: 4488908 Basic stats: COMPLETE Column stats: NONE |<-Map 5 [BROADCAST_EDGE] vectorized | Reduce Output Operator [RS_53] | sort order: | Statistics:Num rows: 30745 Data size: 922354 Basic stats: COMPLETE Column stats: NONE | value expressions:ca_address_sk (type: bigint), ca_state (type: char(2)), ca_country (type: varchar(20)) | TableScan [TS_3] | alias:customer_address | Statistics:Num rows: 30745 Data size: 922354 Basic stats: COMPLETE Column stats: NONE |<-Map Join Operator [MAPJOIN_43] | condition map:[{"":"Inner Join 0 to 1"}] | keys:{} | outputColumnNames:["_col3","_col5","_col6","_col9","_col12","_col21","_col22","_col26","_col58","_col60","_col61"] | Statistics:Num rows: 510103 Data size: 4080826 Basic stats: COMPLETE Column stats: NONE |<-Map 4 [BROADCAST_EDGE] vectorized | Reduce Output Operator [RS_52] | sort order: | Statistics:Num rows: 1941 Data size: 56311 Basic stats: COMPLETE Column stats: NONE | value expressions:cd_demo_sk (type: bigint), cd_marital_status (type: char(1)), cd_education_status (type: char(20)) | TableScan [TS_2] | alias:customer_demographics | Statistics:Num rows: 1941 Data size: 56311 Basic stats: COMPLETE Column stats: NONE |<-Map Join Operator [MAPJOIN_42] | condition map:[{"":"Inner Join 0 to 1"}] | HybridGraceHashJoin:true | keys:{"Map 3":"s_store_sk (type: bigint)","Map 1":"ss_store_sk (type: bigint)"} | outputColumnNames:["_col3","_col5","_col6","_col9","_col12","_col21","_col22","_col26"] | Statistics:Num rows: 463730 Data size: 3709842 Basic stats: COMPLETE Column stats: NONE |<-Map 3 [BROADCAST_EDGE] vectorized | Reduce Output Operator [RS_51] | key expressions:s_store_sk (type: bigint) | Map-reduce partition columns:s_store_sk (type: bigint) | sort order:+ | Statistics:Num rows: 324 Data size: 2593 Basic stats: COMPLETE Column stats: NONE | Filter Operator [FIL_50] | predicate:s_store_sk is not null (type: boolean) | Statistics:Num rows: 324 Data size: 2593 Basic stats: COMPLETE Column stats: NONE | TableScan [TS_1] | alias:store | Statistics:Num rows: 648 Data size: 5186 Basic stats: COMPLETE Column stats: NONE |<-Filter Operator [FIL_36] predicate:ss_store_sk is not null (type: boolean) Statistics:Num rows: 421573 Data size: 3372584 Basic stats: COMPLETE Column stats: PARTIAL TableScan [TS_0] alias:store_sales Statistics:Num rows: 843146 Data size: 212697869 Basic stats: COMPLETE Column stats: PARTIAL Time taken: 3.062 seconds, Fetched: 113 row(s) |
执行路径的缩略图
5.4.2 内存分配问题
通过对任务的资源优化,分析出影响此任务执行成功的关键因素是JVM堆内存的占比问题。通过调整apache Tez引擎的hive.tez.java.opts参数,将影响任务的成功执行。
1、任务执行失败日志
内存调整前
WeightedScalingMemoryDistributor类的官网说明
WeightedScalingMemoryDistributor类源码片段
2、任务执行成功日志
内存调整后
6 解决方案
6.1 资源内存合理配比
通过案例分析章节得出影响任务执行的根因之一为JVM堆内存分配问题,因此需要依据资源分配整体策略,按比例调整相关服务资源分配,保证调整JVM相关参数的基础上,不影响整体资源分配策略。
- 计算预留内存
每个节点总内存 | 推荐预留系统内存 | HBase推荐预留内存 |
32GB | 10GB | 4GB |
- 每个节点最大容器数量
containers = min (2*CORES, 1.8*DISKS, (Total available RAM) / MIN_CONTAINER_SIZE)
= min(2 * 8 ,1.8 * 10 , (32 - 22 ) / 2) = 5
- 计算内存大小
RAM-per-container = max(MIN_CONTAINER_SIZE, (Total Available RAM) / containers))
= max(2GB ,(32 -22 )/ 2 ) = 2GB
- 计算其他yarn、mapreduce参数
yarn可使用的物理内存总量 yarn.nodemanager.resource.memory-mb = containers * RAM-per-container = 5 * 2 = 10 GB 单个任务可申请的最少物理内存量 yarn.scheduler.minimum-allocation-mb = RAM-per-container= 2 GB 单个任务可申请的最大物理内存量 yarn.scheduler.maximum-allocation-mb = containers * RAM-per-container = 5 * 2 = 10 GB 单个map任务申请内存资源 mapreduce.map.memory.mb = RAM-per-container = 2 GB 根据container的值自动调整 单个Reduce任务可使用的资源上限 mapreduce.reduce.memory.mb = 2 * RAM-per-container = 2 * 2 GB = 4GB 根据container的值自动调整 Map任务的 JVM 参数 小于mapreduce.map.memory.mb mapreduce.map.java.opts = 0.8 * RAM-per-container = 0.8 * 2 GB = 1.6 GB 根据container的值自动调整 Reduce任务的JVM参数 小于mapreduce.reduce.memory.mb mapreduce.reduce.java.opts = 0.8 * 2 * RAM-per-container = 0.8 * 2 * 2 GB = 3.2 GB 根据container的值自动调整 MR运行于YARN上时,为AM分配的内存 yarn.app.mapreduce.am.resource.mb = 2 * RAM-per-container = 2 * 2 GB = 4 GB 根据container的值自动调整 app的JVM参数 yarn.app.mapreduce.am.command-opts = 0.8 * 2 * RAM-per-container = 0.8 * 2* 2 GB = 3.2 GB 根据container的值自动调整 修改了最低最大container的大小值后 下列值会自动调整 问题:为什么公式计算的值与系统推荐值不一样,发现个别参数推荐值是计算值的两倍,这是因为预留内存的问题。 hive.auto.convert.join.noconditionaltask.size 1145324612 自动调整 hive.tez.container.size 4096 自动调整 mapreduce.map.java.opts -Xmx3276m 1.6 GB mapreduce.reduce.java.opts -Xmx6553m 3.2 GB yarn.app.mapreduce.am.command-opts -Xmx3276m -Dhdp.version=${hdp.version} 3.2 GB yarn.app.mapreduce.am.resource.mb 4096 4GB mapreduce.map.memory.mb 4096 2 GB mapreduce.reduce.memory.mb 8192 4 GB tez.runtime.io.sort.mb 1081 自动调整 tez.runtime.unordered.output.buffer.size-mb 3.7 自动调整 tez.task.resource.memory.mb 4096 自动调整 |
6.2 SQL语句改造优化
通过堆案例分析章节得出影响任务执行的另一根因为Hive的执行路径问题。因此,需要对Hive原执行任务中的SQL语句加以改造,并保持语义一致。
由原先的or条件组合改造为union条件组合,如下:
select sum(ss_quantity) from ( select ss_quantity from ( select ss_quantity from store_sales, store, customer_demographics, customer_address, date_dim where s_store_sk = ss_store_sk and ss_sold_date_sk = d_date_sk and d_year = 1998 and ( cd_demo_sk = ss_cdemo_sk and cd_marital_status = 'M' and cd_education_status = 'Unknown' and ss_sales_price between 100.00 and 150.00 ) union all select ss_quantity from store_sales, store, customer_demographics, customer_address, date_dim where s_store_sk = ss_store_sk and ss_sold_date_sk = d_date_sk and d_year = 1998 and ( cd_demo_sk = ss_cdemo_sk and cd_marital_status = 'W' and cd_education_status = 'College' and ss_sales_price between 50.00 and 100.00 ) union all select ss_quantity from store_sales, store, customer_demographics, customer_address, date_dim where s_store_sk = ss_store_sk and ss_sold_date_sk = d_date_sk and d_year = 1998 and ( cd_demo_sk = ss_cdemo_sk and cd_marital_status = 'D' and cd_education_status = 'Primary' and ss_sales_price between 150.00 and 200.00 ) ) t1 union select ss_quantity from ( select ss_quantity from store_sales, store, customer_demographics, customer_address, date_dim where s_store_sk = ss_store_sk and ss_sold_date_sk = d_date_sk and d_year = 1998 and ( ss_addr_sk = ca_address_sk and ca_country = 'United States' and ca_state in ('MI', 'GA', 'NH') and ss_net_profit between 0 and 2000 ) union all select ss_quantity from store_sales, store, customer_demographics, customer_address, date_dim where s_store_sk = ss_store_sk and ss_sold_date_sk = d_date_sk and d_year = 1998 and (ss_addr_sk = ca_address_sk and ca_country = 'United States' and ca_state in ('TX', 'KY', 'SD') and ss_net_profit between 150 and 3000 ) union all select ss_quantity from store_sales, store, customer_demographics, customer_address, date_dim where s_store_sk = ss_store_sk and ss_sold_date_sk = d_date_sk and d_year = 1998 and (ss_addr_sk = ca_address_sk and ca_country = 'United States' and ca_state in ('NY', 'OH', 'FL') and ss_net_profit between 50 and 25000 ) ) t2 )t |
6.3 成功执行
7 案例总结
虽然理论上百万级数据量的Hive离线计算任务一般不会引起内存溢出问题,但由于过于复杂的查询条件,以及“or条件组合”会产生多个笛卡尔积,导致中间结果成指数级数据量的暴增,达到数亿、数十亿甚至百亿级的数据量,从而引起内存不足,频繁GC,呈现的结果是内存溢出,任务执行卡顿。
另外,即使通过资源优化内存调优的方式满足了本案例Hive+Tez离线计算任务的要求,但由于底层依然是基于磁盘读写数据,I/O也是性能瓶颈,本案例经过调优后的成功执行花费了2小时58分39秒。
因此, 通过对本案例的分析,对于复杂查询条件的Hive离线计算任务有如下几点建议:
1、不建议使用“or条件组合”,建议采用替代方案,即“union all条件组合”。如果必须使用“or条件组合”,则建议数据量不宜过大并且查询条件不宜过于复杂。
2、Hive的复杂计算任务建议采用Spark计算引擎,因为Spark计算框架是基于内存进行计算,I/0一般不会是性能瓶颈,另外Spark计算框架大量采用迭代器模式,原始数据基本不会进入内存,因此内存要求相对原始数据会小很多,本案例在同样的环境下,采用Spark引擎没有出现内存溢出,并且成功执行只花费了126秒。