Hive: HiveSQL Tuning

Latest recommended article published 2022-10-17 20:01:05

weixin_34114823

Original link: https://my.oschina.net/osenlin/blog/1603056

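Preface

I wrote a HiveSQL optimization write-up a long time ago, but its perspective was rather narrow. This post tries to look at Hive optimization from a higher level. Without further ado, two things affect HiveSQL performance:

Whether the MapReduce algorithm that the SQL is translated into, and the execution path of that algorithm, are reasonable. The community experts have already iterated on this code many times, so there is limited room for improvement and it is not the main topic here; if you are interested in MR algorithms, I recommend the book "MapReduce Design Patterns". That said, if you know MapReduce well and have some experience, you can work out for yourself which MapReduce algorithm the compiler should produce for a given SQL statement, then use explain to compare the plan you had in mind with the plan actually generated and think about why they differ. Over time you will be able to judge for yourself whether the generated algorithm and its execution path are reasonable, and your SQL and MapReduce skills will improve along the way. Take this example: the first approach in the blog post is probably close to the MapReduce algorithm that the equivalent HiveSQL would generate, and it is just as inefficient. The second approach relies on a third-party data structure and is two orders of magnitude faster. Once you have that experience, you face a choice between HiveSQL and hand-written MapReduce: use HiveSQL when maintainability matters and performance requirements are modest; use MapReduce when you want tight control over algorithm performance and have the coding skill for it. Personally I love the latter, because reading beautiful code is a pleasure, but within a team I choose the former.
How the compute framework runs. The Hadoop framework exposes not only interfaces such as map, reduce, and combine, but also the main runtime parameters used when jobs are scheduled.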

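Why these two points? Let's look at the Hive architecture and the MapReduce execution flow.

Hive Architecture

Hive is a data management and manipulation tool built on top of Hadoop. At its core it is a SQL-to-MapReduce job translator layered over MapReduce, which is why I call it a tool rather than a DBMS. Let's look at the Hive architecture diagram.
The overall structure is simple. The client submits a DML command to the driver, which is responsible for producing the execution plan. Before that, it calls the compiler module to parse the SQL statement into a syntax tree, reads the MetaStore to build a logical execution plan, and converts that into a series of MR tasks. These are handed to the execution engine, which submits them to the cluster and runs them. The official documentation puts it as follows: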

Parse and SemanticAnalysis (ql/parse) - This component contains the code for parsing SQL, converting it into Abstract Syntax Trees, converting the Abstract Syntax Trees into Operator Plans and finally converting the operator plans into a directed graph of tasks which are executed by Driver.java.
Optimizer (ql/optimizer) - This component contains some simple rule based optimizations like pruning non referenced columns from table scans (column pruning) that the Hive Query Processor does while converting SQL to a series of map/reduce tasks .
Plan Components (ql/plan) - This component contains the classes (which are called descriptors), that are used by the compiler (Parser, SemanticAnalysis and Optimizer) to pass the information to operator trees that is used by the execution code.
MetaData Layer (ql/metadata) - This component is used by the query processor to interface with the MetaStore in order to retrieve information about tables, partitions and the columns of the table. This information is used by the compiler to compile SQL to a series of map/reduce tasks.
Map/Reduce Execution Engine (ql/exec) - This component contains all the query operators and the framework that is used to invoke those operators from within the map/reduces tasks.

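That is a lot of text, but it is all about how to generate a SQL execution plan for Hadoop to run, that is, how the SQL-to-MapReduce algorithm and its execution path are produced.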

MapReduce

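A mostly accurate flow diagram found online: [figure: MapReduce execution flow]
Data is split and formatted by the InputFormat and fed to map, which runs the computation. Map output is written to a buffer and its partition is computed (the figure above is annotated incorrectly at this point). When the buffer reaches a threshold it is sorted and spilled to disk, and when the map finishes, the small spill files it wrote are merge-sorted. After the map finishes, the reduce side fetches the map output over HTTP, copies it to the reduce node, merge-sorts it, and feeds it into reduce for computation. Anyone who has done server-side development knows that every node in the diagram affects MapReduce efficiency. The next section covers where the impact lies.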
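The map-side part of this flow (partition, sort, spill, merge) can be sketched in a few lines of Python. This is a toy model rather than Hadoop's implementation: `partition_of`, `map_side_spills`, `merge_spills`, and the record-count `buffer_limit` are hypothetical names chosen for illustration, and CRC32 stands in for the real hash partitioner.

```python
import heapq
import zlib


def partition_of(key: str, num_reducers: int) -> int:
    # Deterministic stand-in for a hash partitioner (assumption, not Hadoop's hash).
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def map_side_spills(records, num_reducers, buffer_limit):
    """Buffer (key, value) pairs and emit one sorted run each time the buffer fills."""
    spills, buffer = [], []
    for key, value in records:
        buffer.append((partition_of(key, num_reducers), key, value))
        if len(buffer) >= buffer_limit:
            spills.append(sorted(buffer))  # sort by partition, then by key
            buffer = []
    if buffer:
        spills.append(sorted(buffer))  # final partial buffer
    return spills


def merge_spills(spills):
    # Merge the sorted runs into a single sorted map output.
    return list(heapq.merge(*spills))


records = [("b", 1), ("a", 2), ("c", 3), ("a", 4), ("b", 5)]
spills = map_side_spills(records, num_reducers=2, buffer_limit=2)
merged = merge_spills(spills)
```

Each spill is already ordered by partition and key, so the final merge only has to interleave runs; it never re-sorts the whole output.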

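Tuning

With the Hive architecture and the basics of MapReduce execution in mind, HiveSQL tuning can be split into several levels.

Data storage

When you do not need random access to rows or columns and the data is only an intermediate step in a pipeline, use Avro. It is a Hadoop-native file type and also a wire format for messaging, which avoids the performance cost of frequent file conversions. Avro stores data in binary, so it serializes and deserializes well, offering a good compression ratio together with good compression and decompression speed.
When data is queried and computed on frequently, store it as ORCFile. The ORC file structure (shown in a figure in the original post) is made up of three parts: index, row data, and footer. Each block can be compressed independently. ORCFile also has two notable features. It supports ACID, but keep in mind that Hive is built on Hadoop: it can handle a single transaction that updates millions of rows, but it struggles with millions of transactions that each update a few rows. It also supports indexes at three levels: file, stripe (the unit in which ORCFile allocates and stores data), and row. In practice these indexes only hold statistics. The official documentation says: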

ORC provides three level of indexes within each file:

file level - statistics about the values in each column across the entire file
stripe level - statistics about the values in each column for each stripe
row level - statistics about the values in each column for each set of 10,000 rows within a stripe

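The statistics stored at each level include the maximum, the minimum, the data length, and so on (this varies by column type; see the official docs). When looking up data, ORC can use the indexes to skip file blocks it does not need, and the filtering only requires decompressing the index portion rather than all of the data, so it is fast. This design is a very useful complement to Hive. Anyone who has actually worked in big-data settings knows that disk and network IO are the core bottlenecks in cluster computing; this structure discards most of the unneeded data and reduces the volume handled by map, shuffle, and reduce. PS (from experience): if your program is CPU-bound, there are only two explanations: either the algorithm is flawed, or Hadoop is being used for something it is not good at.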
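To make the skipping idea concrete, here is a minimal sketch of how min/max statistics let a reader discard stripes without touching their data. It is an illustration under assumptions, not ORC's reader: `StripeStats` and `stripes_to_read` are hypothetical names, and only a single integer column with a range predicate is modeled.

```python
from dataclasses import dataclass


@dataclass
class StripeStats:
    # Hypothetical per-stripe statistics for one integer column.
    stripe_id: int
    min_value: int
    max_value: int


def stripes_to_read(stats, lower, upper):
    """Return ids of stripes whose [min, max] range overlaps [lower, upper]."""
    return [s.stripe_id for s in stats
            if s.max_value >= lower and s.min_value <= upper]


stats = [StripeStats(0, 1, 100), StripeStats(1, 101, 200), StripeStats(2, 201, 300)]
selected = stripes_to_read(stats, lower=150, upper=180)
```

With these made-up statistics, a predicate such as `col BETWEEN 150 AND 180` only needs the middle stripe; the other two are ruled out from the statistics alone.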

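Table design level

Turn off dynamic partitioning. Inserting data with dynamic partitions produces a large number of small files, which increases the number of map tasks; the NameNode also has to store more metadata and look up more small files. There is a subtler problem as well. Suppose you load data from table A into table B and both tables have the same partition columns. If you take the lazy route and insert into B with dynamic partitioning on, Hadoop reports a misleading number of reducers: the reducers that actually process data equal the number of partitions, and all the others run empty. If the load is very large and the number of working reducers is small, this causes severe data skew. Fix: use distribute by plus static partitions.
Turn on bucketing. If you know how bucketing works, you know that rows whose bucketing-column hash values match land in the same bucket. This improves data locality, which helps local map reads and fast bulk loading in offline scenarios (HBase has a different usage pattern, so this is not the same as HBase tuning). It also helps joins. Why? Think about the differences between nested loop join, hash join, and merge join in an RDBMS. To push the analogy further: in a two-table join you can treat the driving table as an index and the driven table as the data table; to make lookups back into the data table efficient, the clustering factor must be low (I described the clustering factor in an earlier post on SQL optimization). By extension, as long as we are running on computers built on the current Turing-machine model, RDBMS optimizations can, with suitable adaptation, be borrowed in NoSQL systems too.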
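The reason bucketing helps joins can be shown with a small sketch: when both tables are bucketed on the join key with the same bucket count, each bucket only needs to be joined against the bucket with the same id. The code below is an assumption-laden toy (CRC32 stands in for Hive's hash, and `bucket_of`, `bucketize`, and `bucket_join` are hypothetical helpers), not how Hive executes a bucket map join.

```python
import zlib
from collections import defaultdict


def bucket_of(key: str, num_buckets: int) -> int:
    # Stand-in hash; Hive uses its own hash function for bucketing.
    return zlib.crc32(key.encode("utf-8")) % num_buckets


def bucketize(rows, num_buckets):
    buckets = defaultdict(list)
    for key, value in rows:
        buckets[bucket_of(key, num_buckets)].append((key, value))
    return buckets


def bucket_join(left_rows, right_rows, num_buckets):
    """Join bucket by bucket; rows are only compared within the same bucket id."""
    left = bucketize(left_rows, num_buckets)
    right = bucketize(right_rows, num_buckets)
    joined = []
    for bucket_id, left_bucket in left.items():
        # Build a small hash table for the matching right-side bucket only.
        lookup = defaultdict(list)
        for key, value in right.get(bucket_id, []):
            lookup[key].append(value)
        for key, value in left_bucket:
            for other in lookup.get(key, []):
                joined.append((key, value, other))
    return sorted(joined)


orders = [("u1", "order-1"), ("u2", "order-2"), ("u1", "order-3")]
users = [("u1", "Alice"), ("u2", "Bob"), ("u3", "Carol")]
result = bucket_join(orders, users, num_buckets=4)
```

Because equal keys always hash to the same bucket, the per-bucket hash tables stay small, which is the same locality argument the paragraph above makes with the clustering factor.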

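SQL level

Borrow from Multi-Group-By Inserts: a single statement can read a table once and transform and insert the result into several different tables. The key is that the table is read only once.
Joins: remember the bucketing from the table-design section. Set hive.optimize.skewjoin=true to keep data skew from slowing execution; the underlying idea is to prepend a random value to the map output key so that the data is spread out. Use MapJoin.
Use a bloom filter to filter data quickly.
group by: set hive.groupby.skewindata=true to prevent data skew caused by a single oversized group.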
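The "random value in front of the key" idea mentioned above can be sketched as a two-stage aggregation: first aggregate on a salted key so a hot key is spread over several groups, then strip the salt and combine. This is only a sketch of the general technique, not Hive's internal implementation; `salted_count`, `num_salts`, and the fixed `seed` are illustrative assumptions.

```python
import random
from collections import Counter


def salted_count(keys, num_salts, seed=0):
    """Two-stage count: aggregate on (salt, key), then drop the salt and combine."""
    rng = random.Random(seed)
    # Stage 1: a random salt spreads one hot key over several partial groups.
    partial = Counter((rng.randrange(num_salts), key) for key in keys)
    # Stage 2: remove the salt and sum the partial counts per original key.
    final = Counter()
    for (_salt, key), count in partial.items():
        final[key] += count
    return partial, final


keys = ["hot"] * 1000 + ["cold"] * 10
partial, final = salted_count(keys, num_salts=4)
```

The final counts match a plain group-by, but no single stage-1 group has to process all of the rows for the hot key.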

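count distinct optimization

-- Before
Select a, sum(b), count(distinct c), count(distinct d) from test group by a
-- After
Select a, sum(b), count(c), count(d) from (
				Select a, b, null c, null d from test
				Union all
				-- one row per distinct (a, c), so count(c) equals count(distinct c)
				Select a, 0 b, c, null d from test group by a, c
				Union all
				-- one row per distinct (a, d), so count(d) equals count(distinct d)
				Select a, 0 b, null c, d from test group by a, d
) t
group by a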

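If you want to know why the second statement is faster, ask yourself this: how would you write the MapReduce job for the first statement, and how would you write it for the second?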
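Before reasoning about performance, it is worth confirming that the rewrite returns the same answer. The check below runs both forms against an in-memory SQLite table from Python; SQLite is used only as a convenient stand-in for Hive, the sample rows are made up, and the rewritten query includes the table name, subquery alias, and outer group by that a runnable version needs.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE test (a TEXT, b INTEGER, c TEXT, d TEXT)")
conn.executemany(
    "INSERT INTO test VALUES (?, ?, ?, ?)",
    [("x", 1, "c1", "d1"), ("x", 2, "c1", "d2"), ("x", 3, "c2", "d2"),
     ("y", 4, "c3", "d3"), ("y", 5, "c3", "d3")],
)

# Original form: two DISTINCT aggregates in one GROUP BY.
before = conn.execute(
    "SELECT a, SUM(b), COUNT(DISTINCT c), COUNT(DISTINCT d) "
    "FROM test GROUP BY a ORDER BY a"
).fetchall()

# Rewritten form: de-duplicate with GROUP BY in each branch, then plain COUNT.
after = conn.execute(
    """
    SELECT a, SUM(b), COUNT(c), COUNT(d) FROM (
        SELECT a, b, NULL AS c, NULL AS d FROM test
        UNION ALL
        SELECT a, 0 AS b, c, NULL AS d FROM test GROUP BY a, c
        UNION ALL
        SELECT a, 0 AS b, NULL AS c, d FROM test GROUP BY a, d
    ) t
    GROUP BY a ORDER BY a
    """
).fetchall()
```

The `0 AS b` in the de-duplicating branches keeps them from contributing to `SUM(b)`, and the `NULL` placeholders keep them from contributing to the other column's `COUNT`.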

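For other rules, see my earlier Hive optimization posts. I am lazy and dislike listing rules, and I suggest you do not memorize them either. One principle is enough: code that does not burn IO is good code. Write enough code and these rules become part of how you think, even if they are hard to recall on the spot when giving a talk. After that, the execution plan, the Hadoop job counters, and the detailed logs will help you pin down specific problems.

Hive job level

Enable parallel execution. If a SQL statement produces several stages that do not depend on each other, those stages can run in parallel, e.g. union all.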
```
Set hive.exec.parallel=true;
Set hive.exec.parallel.thread.num=10;
```
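Merge small input files and merge small output files.
Reuse JVMs. For a large job, tens of thousands of map tasks is normal; JVM reuse cuts the cost of restarting JVMs and of allocating and reclaiming resources.
Compress the intermediate map-stage data to reduce the amount of data going through disk and network IO.
Enable local execution mode for Hive/MapReduce jobs. Three conditions must hold: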

The total input size of the job is lower than: hive.exec.mode.local.auto.inputbytes.max (128MB by default)
The total number of map-tasks is less than: hive.exec.mode.local.auto.tasks.max (4 by default)
The total number of reduce tasks required is 1 or 0.
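The three conditions amount to a simple predicate. The sketch below encodes them as quoted, using the defaults mentioned in the quote; `can_run_locally` and its parameter names are hypothetical and this is not Hive's own check.

```python
def can_run_locally(input_bytes, map_tasks, reduce_tasks,
                    max_input_bytes=128 * 1024 * 1024, max_tasks=4):
    """Return True when all three quoted local-mode conditions hold."""
    return (input_bytes < max_input_bytes      # total input size below the limit
            and map_tasks < max_tasks          # fewer map tasks than the limit
            and reduce_tasks in (0, 1))        # at most one reduce task


MB = 1024 * 1024
small_job = can_run_locally(64 * MB, map_tasks=2, reduce_tasks=1)
big_job = can_run_locally(256 * MB, map_tasks=2, reduce_tasks=1)
```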

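Map level

Control the number of map tasks. Setting mapred.map.tasks directly does not work, because the number of maps is computed by a rule:
map_num = min(split_num, max(default_num, goal_num)). Here default_num is the total data size divided by the block size, as computed by the system; goal_num is the number the user asks for, i.e. mapred.map.tasks; split_num is the total data size divided by the split size split_size, where split_size is the larger of mapred.min.split.size and the block size.
Aggregate and filter in the map phase wherever possible to reduce the amount of data going through network and disk IO.
Enable speculative execution for map tasks (mapred.map.tasks.speculative.execution in Hadoop; Hive's own hive.mapred.reduce.tasks.speculative.execution covers the reduce side).
Keep split.size equal to the block size to avoid the network transfer introduced by splitting large files, and avoid small files, which consume extra map tasks.
Two more important parameters: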

mapreduce.task.io.sort.mb：The cumulative size of the serialization and accounting buffers storing records emitted from the map, in megabytes.
mapreduce.map.sort.spill.percent: The soft limit in the serialization buffer. Once reached, a thread will begin to spill the contents to disk in the background.

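The io.sort.mb property sets the buffer size. When the data in the buffer reaches a threshold (io.sort.mb * io.sort.spill.percent, where io.sort.spill.percent defaults to 0.80), a background thread starts spilling the buffer contents to disk (spill files are stored in the directory given by mapred.local.dir). Reducing the number of spills also speeds up the map-side write.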
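The two pieces of map-side arithmetic in this section, the map-count rule and the spill threshold, are easy to try out with concrete numbers. The sketch below follows the formulas as stated in the text; the function names are hypothetical, and rounding up with `ceil` is an assumption, since the text does not say how partial blocks are counted.

```python
import math


def split_size(min_split_size, block_size):
    # split_size = max(mapred.min.split.size, block size), as stated in the text.
    return max(min_split_size, block_size)


def map_num(total_bytes, block_size, min_split_size, goal_num):
    """map_num = min(split_num, max(default_num, goal_num))."""
    default_num = math.ceil(total_bytes / block_size)
    split_num = math.ceil(total_bytes / split_size(min_split_size, block_size))
    return min(split_num, max(default_num, goal_num))


def spill_threshold_mb(io_sort_mb, spill_percent=0.80):
    # Buffer fill level (in MB) at which the background spill starts.
    return io_sort_mb * spill_percent


MB = 1024 * 1024
maps = map_num(total_bytes=1024 * MB, block_size=128 * MB,
               min_split_size=256 * MB, goal_num=2)
maps_with_big_goal = map_num(total_bytes=1024 * MB, block_size=128 * MB,
                             min_split_size=256 * MB, goal_num=100)
threshold = spill_threshold_mb(100)
```

With 1 GB of input, 128 MB blocks, and a 256 MB minimum split, both calls give 4 maps: raising the goal from 2 to 100 changes nothing, which illustrates why setting mapred.map.tasks alone has no effect. A 100 MB sort buffer starts spilling at about 80 MB.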

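Shuffle/Reduce level

To optimize this stage you have to do your homework on the details, so let's walk through the process.

Before the map-side spill thread writes buffered data to disk, it sorts it with a two-level quicksort: first by the partition each record belongs to, then by key within each partition. Before the map task finishes, all spill files are merge-sorted into one index file and one data file. This is a multi-way merge whose maximum fan-in is controlled by io.sort.factor (default 10). If a Combiner is set and there are at least 3 spill files (controlled by min.num.spills.for.combine), the Combiner runs before the output file is written to disk to shrink the data.

Reducers fetch their data over HTTP. The number of worker threads that serve partition data is controlled by tasktracker.http.threads; this setting applies per TaskTracker, not per map, and defaults to 40. On large clusters running large jobs it can be raised to speed up data transfer.

In the copy phase, a reduce task has several copier threads that fetch map outputs in parallel; the thread count is set with mapred.reduce.parallel.copies. Map outputs that are small enough are copied into memory on the reduce TaskTracker (the buffer size is controlled by mapred.job.shuffle.input.buffer.percent). When the buffer fills up, or when the number of map outputs reaches a threshold (controlled by mapred.inmem.merge.threshold), the buffered data is merged and spilled to disk. As copied data accumulates on disk, a background thread merges it into larger sorted files, which saves time in the later merge. Compressed map outputs are decompressed in memory automatically so they can be merged.

The reduce task then enters the sort phase (more accurately a merge phase, since sorting was already done on the map side), in which all map outputs are merge-sorted; this takes several rounds.

Summary: the above describes the shuffle in detail and introduces the key parameters, all of which are entry points for tuning. If you only write SQL this may be hard to follow, so if you want a deeper understanding of the technology, get more exposure to server-side engineering too. The official documentation lists the following parameters for reference: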

mapreduce.task.io.sort.factor: Specifies the number of segments on disk to be merged at the same time. It limits the number of open files and compression codecs during merge. If the number of files exceeds this limit, the merge will proceed in several passes. Though this limit also applies to the map, most jobs should be configured so that hitting this limit is unlikely there.
mapreduce.reduce.merge.inmem.threshold: The number of sorted map outputs fetched into memory before being merged to disk. Like the spill thresholds in the preceding note, this is not defining a unit of partition, but a trigger. In practice, this is usually set very high (1000) or disabled (0), since merging in-memory segments is often less expensive than merging from disk (see notes following this table). This threshold influences only the frequency of in-memory merges during the shuffle.
mapreduce.reduce.shuffle.merge.percent: The memory threshold for fetched map outputs before an in-memory merge is started, expressed as a percentage of memory allocated to storing map outputs in memory. Since map outputs that can’t fit in memory can be stalled, setting this high may decrease parallelism between the fetch and merge. Conversely, values as high as 1.0 have been effective for reduces whose input can fit entirely in memory. This parameter influences only the frequency of in-memory merges during the shuffle.
mapreduce.reduce.shuffle.input.buffer.percent: The percentage of memory, relative to the maximum heapsize as typically specified in mapreduce.reduce.java.opts, that can be allocated to storing map outputs during the shuffle. Though some memory should be set aside for the framework, in general it is advantageous to set this high enough to store large and numerous map outputs.
mapreduce.reduce.input.buffer.percent: The percentage of memory relative to the maximum heapsize in which map outputs may be retained during the reduce. When the reduce begins, map outputs will be merged to disk until those that remain are under the resource limit this defines. By default, all map outputs are merged to disk before the reduce begins to maximize the memory available to the reduce. For less memory-intensive reduces, this should be increased to avoid trips to disk.
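The effect of the merge factor ("if the number of files exceeds this limit, the merge will proceed in several passes") can be seen with a simplified model. The sketch below merges sorted runs at most `factor` at a time and counts the passes; it is a toy under assumptions, since Hadoop's actual merger chooses its pass sizes differently, and `multi_pass_merge` is a hypothetical name.

```python
import heapq


def multi_pass_merge(runs, factor):
    """Merge sorted runs, at most `factor` per merge; return (result, passes)."""
    if factor < 2:
        raise ValueError("factor must be at least 2")
    runs = [list(r) for r in runs]
    passes = 0
    while len(runs) > 1:
        # One pass: merge consecutive groups of up to `factor` runs.
        runs = [list(heapq.merge(*runs[i:i + factor]))
                for i in range(0, len(runs), factor)]
        passes += 1
    return (runs[0] if runs else []), passes


runs = [[1, 9], [2, 8], [3, 7], [4, 6], [5]]
merged, passes = multi_pass_merge(runs, factor=2)
_, single_pass = multi_pass_merge(runs, factor=10)
```

In this model, five runs need three passes with a fan-in of 2 but only one pass with a fan-in of 10. Each extra pass rereads and rewrites data, which is why the factor and the number of spill files matter for IO.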