Notes on hive.exec.orc.default.stripe.size

凌不了云

Last modified: 2024-09-12 10:10:01

Category: Hive. Tags: hive, hadoop, spark

First published: 2023-12-26 09:52:12

Article link: https://blog.csdn.net/xingyuelhb/article/details/135214489

1. Background:

During routine processing, the following log line showed up:

Warning: Ignoring non-Spark config property: hive.exec.orc.default.stripe.size

The customer asked how to resolve it.

2. Solution:

hive.exec.orc.default.stripe.size, "256*1024*1024" (the default stripe size, in bytes)

hive.exec.orc.split.strategy, "BI"

Set these two parameters together.

3. How it works:

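As the hive.exec.orc prefix indicates, this setting configures ORC.

First, a look at the ORC file format.

ORC File stands for Optimized Row Columnar (ORC) file; it is essentially an optimized successor to RCFile.

The format stores Hive data efficiently.

It was designed to overcome the limitations of the other Hive file formats.

Using ORC files improves Hive's performance when reading, writing, and processing data.

Compared with RCFile, the ORC file format has the following advantages: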

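(1) Each task outputs a single file, which reduces the load on the NameNode.

(2) It supports a wide range of data types, such as datetime and decimal, plus the complex types (struct, list, map, and union).

(3) Lightweight indexes are stored inside the file.

(4) Block-mode compression is chosen by data type:

a. integer columns use run-length encoding;

b. string columns use dictionary encoding.

(5) Multiple independent RecordReaders can read the same file concurrently.

(6) Files can be split without scanning for markers.

(7) The amount of memory needed for reading or writing is bounded.

(8) Metadata is stored with Protocol Buffers, which allows fields to be added and removed.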
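Item (4) names two encodings. The sketch below shows the general idea of each in plain Python. It is a simplified illustration only: ORC's real on-disk run-length encoding is more elaborate, and the function names here are made up for this example.

```python
from itertools import groupby


def run_length_encode(values):
    """Collapse consecutive repeats into (value, run_length) pairs."""
    return [(value, len(list(run))) for value, run in groupby(values)]


def dictionary_encode(strings):
    """Replace each string with an index into a list of distinct values."""
    dictionary = []   # distinct strings, in order of first appearance
    index_of = {}     # string -> position in `dictionary`
    encoded = []      # one small integer per input row
    for s in strings:
        if s not in index_of:
            index_of[s] = len(dictionary)
            dictionary.append(s)
        encoded.append(index_of[s])
    return dictionary, encoded


if __name__ == "__main__":
    # An integer column with long runs shrinks to a few pairs.
    print(run_length_encode([7, 7, 7, 7, 3, 3, 9]))
    # A string column with few distinct values shrinks to small integers.
    print(dictionary_encode(["cn", "us", "cn", "cn", "us"]))
```

Both encodings pay off when a column has many repeated values, which is common once data is stored column by column.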

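An ORC file contains groups of row data called stripes, and its file footer holds additional auxiliary information. At the very end of the file is a section called the postscript, which mainly stores the compression parameters and the size of the compressed footer.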

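By default a stripe is 256 MB (often quoted, rounded, as 250 MB).

That matches the default value of the setting: hive.exec.orc.default.stripe.size, "256*1024*1024", i.e. 268,435,456 bytes. Newer Hive releases lowered this default to 64*1024*1024, so check the value for your version.

(Large stripes make reads from HDFS more efficient.)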
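To make the byte values concrete, the following sketch converts the two stripe sizes that appear in this article into MiB and gives a rough stripe count for a file. The helper names and the 10 GiB file size are assumptions for illustration; the real stripe count also depends on compression and writer memory, so treat the result as an estimate.

```python
import math

MIB = 1024 * 1024


def to_mib(size_bytes):
    """Convert a byte count to MiB."""
    return size_bytes / MIB


def estimated_stripe_count(file_bytes, stripe_bytes):
    """Rough estimate only: ceil(file size / stripe size)."""
    return math.ceil(file_bytes / stripe_bytes)


if __name__ == "__main__":
    default_stripe = 256 * 1024 * 1024   # value quoted in this article
    larger_stripe = 536870912            # value used in the spark-sql example
    print(to_mib(default_stripe), to_mib(larger_stripe))

    file_bytes = 10 * 1024 * MIB         # hypothetical 10 GiB file
    print(estimated_stripe_count(file_bytes, default_stripe))
    print(estimated_stripe_count(file_bytes, larger_stripe))
```

Doubling the stripe size roughly halves the number of stripes, which means fewer, larger sequential reads per file.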

The file footer lists the stripes in the ORC file, the number of rows in each stripe, and the data type of each column. It also holds column-level aggregates such as count, min, max, and sum.
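A toy model can show why this footer is useful. The sketch below is not the ORC Protocol Buffers schema: the class and field names are hypothetical, and the stripe offset and length fields are an assumption about what "stripe information" includes. It only demonstrates that row counts and simple range checks can be answered from footer metadata without touching row data.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class ColumnStats:
    count: int
    minimum: Optional[float] = None
    maximum: Optional[float] = None
    total: Optional[float] = None   # the "sum" aggregate


@dataclass
class StripeInfo:
    offset: int      # assumed field: where the stripe starts in the file
    length: int      # assumed field: stripe size in bytes
    row_count: int


@dataclass
class FileFooter:
    column_types: Dict[str, str]
    stripes: List[StripeInfo] = field(default_factory=list)
    column_stats: Dict[str, ColumnStats] = field(default_factory=dict)

    def total_rows(self) -> int:
        """Row count derived purely from footer metadata."""
        return sum(s.row_count for s in self.stripes)

    def may_contain(self, column: str, value: float) -> bool:
        """False only when min/max prove the value cannot be in the file."""
        stats = self.column_stats.get(column)
        if stats is None or stats.minimum is None or stats.maximum is None:
            return True
        return stats.minimum <= value <= stats.maximum


if __name__ == "__main__":
    footer = FileFooter(
        column_types={"id": "bigint", "price": "double"},
        stripes=[StripeInfo(3, 1000, 500), StripeInfo(1003, 800, 300)],
        column_stats={"price": ColumnStats(800, 1.5, 99.0, 40000.0)},
    )
    print(footer.total_rows())
    print(footer.may_contain("price", 120.0))
```

The same metadata is what makes the footer expensive to load when there are many files or many columns, which matters for the split strategy discussed later.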

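An ORC file is therefore laid out as a sequence of stripes, followed by the file footer and, at the very end, the postscript.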

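1) Moderately increasing hive.exec.orc.default.stripe.size (default "256*1024*1024") can therefore improve efficiency, for much the same reason that a moderately larger block size does.

2) hive.exec.orc.split.strategy, "BI"

2.1) Applicability: reading the Hive source shows that this setting applies to Hive, Spark, and Tez alike.

The official description of the setting is: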

HIVE_ORC_SPLIT_STRATEGY("hive.exec.orc.split.strategy", "HYBRID", new StringSet("HYBRID", "BI", "ETL"),
    "This is not a user level config. BI strategy is used when the requirement is to spend less time in split generation" +
    " as opposed to query execution (split generation does not read or cache file footers)." +
    " ETL strategy is used when spending little more time in split generation is acceptable" +
    " (split generation reads and caches file footers). HYBRID chooses between the above strategies" +
    " based on heuristics.")

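In the customer's scenario the ORC tables are large and so are their footers. With the ETL strategy, split generation pulls a large amount of footer data from HDFS and can even cause an OOM on the driver, so the BI strategy is recommended when reading such tables.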
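A back-of-the-envelope estimate shows how footer reads add up on the driver. The function below is a hypothetical helper, and the file count and average footer size are invented numbers, not measurements. It encodes only what the official description states: BI does not read footers during split generation, while ETL reads and caches them. HYBRID is left out because its heuristics are not described in this article.

```python
def footer_bytes_read(file_count, avg_footer_bytes, strategy):
    """Estimate footer bytes pulled during split generation."""
    strategy = strategy.upper()
    if strategy == "BI":
        # BI: split generation does not read or cache file footers.
        return 0
    if strategy == "ETL":
        # ETL: every file's footer is read and cached.
        return file_count * avg_footer_bytes
    raise ValueError(f"strategy not modelled in this sketch: {strategy}")


if __name__ == "__main__":
    files = 50_000                    # hypothetical number of ORC files
    footer = 2 * 1024 * 1024          # hypothetical 2 MiB average footer
    for name in ("BI", "ETL"):
        gib = footer_bytes_read(files, footer, name) / 1024 ** 3
        print(f"{name}: about {gib:.1f} GiB of footers")
```

With numbers of that order, the ETL estimate can easily exceed a typical driver heap, which is consistent with the OOM described above.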

In summary: moderately increasing hive.exec.orc.default.stripe.size, and using the BI strategy when the ORC table is large, can improve efficiency and help avoid a driver OOM.

hive.exec.orc.default.stripe.size, "256*1024*1024" (the default stripe size, in bytes)

hive.exec.orc.split.strategy, "BI"

Spark engine:

spark-sql \
  --hiveconf hive.exec.orc.default.stripe.size=536870912 \
  --hiveconf hive.exec.orc.default.block.size=536870912 \
  --hiveconf hive.exec.orc.default.buffer.size=16384
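The warning from the background section appears because Spark's --conf option only accepts keys that start with "spark."; other keys are ignored. Besides --hiveconf, a common alternative is to prefix the key with "spark.hadoop." so that Spark copies it into the Hadoop configuration. The sketch below builds either form of the command line. The function name and the `style` parameter are made up for this example, and it does not check whether a particular Spark version's ORC reader or writer actually honours a given Hive key.

```python
import shlex


def spark_sql_command(settings, style="hiveconf"):
    """Build a spark-sql argument list that passes Hive ORC settings."""
    args = ["spark-sql"]
    for key, value in settings.items():
        if style == "hiveconf":
            args += ["--hiveconf", f"{key}={value}"]
        elif style == "conf":
            # --conf silently drops keys that do not start with "spark.",
            # so non-Spark keys get the spark.hadoop. prefix.
            conf_key = key if key.startswith("spark.") else f"spark.hadoop.{key}"
            args += ["--conf", f"{conf_key}={value}"]
        else:
            raise ValueError(f"unknown style: {style}")
    return args


if __name__ == "__main__":
    settings = {
        "hive.exec.orc.default.stripe.size": 536870912,
        "hive.exec.orc.split.strategy": "BI",
    }
    print(shlex.join(spark_sql_command(settings, "hiveconf")))
    print(shlex.join(spark_sql_command(settings, "conf")))
```

Building the argument list in one place keeps the stripe size and split strategy together, in line with the advice to set both.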